python3 gettext.lgettext sometimes returns bytes, not string #73941
Comments
On Debian stable (Python 3.4), with the LANGUAGE environment variable set to "C" or "en_US.UTF-8", the following produces a string:

    import gettext
    d = gettext.textdomain('apt-listchanges')
    print(gettext.lgettext("Informational notes"))

However, when the language is set to, for example, fi_FI.UTF-8, it outputs a bytes object. The same apparently happens with some other languages, too.

Why is this? The discrepancy is not documented anywhere, AFAIK. Is this a bug, or intended behavior that depends on some undocumented circumstances? Both of the examples above specify UTF-8 as the encoding, so the type of the return value does not depend directly on the encoding. The docs say only that lgettext should return the translation in a particular encoding. They do not say that the return type will switch from a string to bytes as well.

I saw this originally in the Debian bug tracker and thought the issue merits at least clarification here as well (link to the Debian bug below for reference).

(https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=818728)

No idea if this happens on Python > 3.4 or on other platforms. I would guess so, but have not had time to confirm.
In Python 2, both gettext() and lgettext() are meant to return 8-bit strings. The only difference between them is that gettext() encodes the translation back to the encoding of the translation file if the output encoding is not explicitly specified, while lgettext() encodes it to the preferred locale encoding. ugettext() returns Unicode strings.

In Python 3, ugettext() is renamed to gettext() and always returns Unicode strings. lgettext() should return a byte string, as in Python 2. The problem is that if the translation is not found, the untranslated message is usually returned as is, and in Python 3 that message is a Unicode string. It should be encoded to a byte string, so that lgettext() always returns the same type: bytes.

PR 2266 fixes lgettext() and related functions, updates the documentation, and adds tests.

Frankly, the usefulness of lgettext() in Python 3 looks questionable to me. gettext() can be used instead, explicitly encoding the result to the desired charset.
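The alternative mentioned above (calling gettext() and encoding the result explicitly) could look roughly like the sketch below. The helper name `gettext_bytes` is hypothetical and not part of the gettext module, and falling back to `locale.getpreferredencoding(False)` is an assumption about what "the desired charset" would be in practice. The sketch avoids lgettext() altogether, which also keeps it runnable on Python 3.11 and later, where the l*gettext() functions no longer exist.

```python
import gettext
import locale


def gettext_bytes(message, encoding=None):
    """Translate `message` and always return bytes (hypothetical helper).

    gettext.gettext() returns str in Python 3 whether or not a translation
    was found, so encoding its result gives a consistent return type.
    """
    if encoding is None:
        # Assumption: the caller wants the preferred locale encoding,
        # which is what lgettext() was documented to use.
        encoding = locale.getpreferredencoding(False)
    translated = gettext.gettext(message)
    # errors="replace" avoids UnicodeEncodeError for characters the
    # target charset cannot represent.
    return translated.encode(encoding, errors="replace")


result = gettext_bytes("Informational notes", encoding="utf-8")
print(type(result).__name__)
```

Because the encoding step is applied to both translated and untranslated messages, the caller gets bytes in either case, which is the consistency the original report found missing.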
I agree with everything @serhiy.storchaka said, including the questionable utility of the l* methods in Python 3. ;) Thanks also for updating the documentation. Reading the existing docs over now, it's shocking how imprecise "the translation is returned in the preferred system encoding" is. I have some suggestions about the PR, so I'll comment over there.
As for the original issue in the Debian bug tracker: in Python 2, lgettext() and ugettext() are both correct ways to do localization (depending on whether you format the output as 8-bit strings or as Unicode strings), but in Python 3 gettext() is the right way.
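For reference, a minimal sketch of the Python 3 approach recommended here, reusing the 'apt-listchanges' domain and message from the original report. The `languages=["fi"]` argument and `fallback=True` are assumptions made for illustration, so that the snippet runs even where no Finnish catalog for that domain is installed.

```python
import gettext

# fallback=True makes translation() return a NullTranslations object
# instead of raising when no matching .mo catalog is found.
translator = gettext.translation(
    "apt-listchanges",   # domain taken from the original report
    languages=["fi"],    # assumption: request Finnish explicitly
    fallback=True,
)

text = translator.gettext("Informational notes")

# In Python 3, gettext() returns str both when a translation exists
# and when the untranslated message is passed through.
print(type(text).__name__)
```

Any conversion to bytes can then be done once, at the point where the text is written out, with an encoding chosen by the caller.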