classification
Title: python3 gettext.lgettext sometimes returns bytes, not string
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 3.7, Python 3.6, Python 3.5
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: serhiy.storchaka
Nosy List: barry, loewis, petri, serhiy.storchaka
Priority: normal
Keywords:

Created on 2017-03-08 09:17 by petri, last changed 2017-06-20 15:13 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL      Status  Linked
PR 2266  merged  serhiy.storchaka, 2017-06-18 11:34
PR 2297  merged  serhiy.storchaka, 2017-06-20 14:17
PR 2298  merged  serhiy.storchaka, 2017-06-20 14:18
Messages (7)
msg289220 - Author: Petri Savolainen (petri) Date: 2017-03-08 09:17
On Debian stable (Python 3.4), with the LANGUAGE environment variable set to "C" or "en_US.UTF-8", the following produces a string:

import gettext

d = gettext.textdomain('apt-listchanges')
print(gettext.lgettext("Informational notes"))

However, when the language is set to, for example, fi_FI.UTF-8, it outputs a bytes object. The same apparently happens with some other languages, too.

Why is this? The discrepancy is not documented anywhere, AFAIK. Is this a bug, or intended behavior that depends on some (undocumented) circumstances? Given that both of the above settings specify UTF-8 as the encoding, the return type does not depend directly on the encoding.

The docs say lgettext should merely return the translation in a particular encoding. They do not say that the return type will switch from a string to bytes as well.
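A caller that has to cope with both return types could normalise the result defensively. This is only a minimal sketch: `as_text` is a hypothetical helper, not part of the gettext module, and falling back to `locale.getpreferredencoding(False)` is an assumption based on the documented wording about the preferred system encoding.

```python
import locale


def as_text(value, encoding=None):
    """Return value as str, decoding it first if it is bytes.

    Hypothetical helper for illustration; not part of the gettext module.
    """
    if isinstance(value, bytes):
        # Assumption: the bytes were encoded with the preferred locale
        # encoding unless the caller says otherwise.
        if encoding is None:
            encoding = locale.getpreferredencoding(False)
        return value.decode(encoding)
    # Already a str (for example, an untranslated message): pass it through.
    return value
```

With such a helper, `as_text(gettext.lgettext(...))` would yield a string in both the translated and the untranslated case on an affected interpreter.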

I saw this originally in the Debian bug tracker and thought the issue merits at least clarification here as well (link to Debian bug below for reference).

(https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=818728)

No idea if this happens on Python > 3.4 or on other platforms. I would guess so, but have not had time to confirm.
msg296268 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-06-18 11:35
In Python 2 both gettext() and lgettext() are meant to return 8-bit strings. The only difference between them is that gettext() encodes the translation back to the encoding of the translation file if the output encoding is not explicitly specified, while lgettext() encodes it to the preferred locale encoding. ugettext() returns Unicode strings.

In Python 3 ugettext() is renamed to gettext() and always returns Unicode strings. lgettext() should return a byte string, as in Python 2. The problem is that if the translation is not found, the untranslated message is usually returned, and in Python 3 that message is a Unicode string. It should be encoded to a byte string, so that lgettext() always returns the same type: bytes.
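The type mismatch described here can be modelled without the gettext module at all. The sketch below is a simplified illustration, not the code from the PR: the two function names are hypothetical, the catalog is a plain dict, and the Finnish string is a made-up placeholder rather than the real apt-listchanges translation.

```python
def lgettext_before_fix(catalog, message, encoding):
    """Simplified model of the old behaviour: bytes on a hit, str on a miss."""
    if message in catalog:
        return catalog[message].encode(encoding)
    # The untranslated message is returned as-is, i.e. as a str.
    return message


def lgettext_after_fix(catalog, message, encoding):
    """Simplified model of the fixed behaviour: the result is always bytes."""
    # Fall back to the message itself, then encode in either case.
    return catalog.get(message, message).encode(encoding)


# Made-up catalog entry, for illustration only.
catalog = {"Informational notes": "Tiedotteet"}

for msg in ("Informational notes", "No such message"):
    before = lgettext_before_fix(catalog, msg, "utf-8")
    after = lgettext_after_fix(catalog, msg, "utf-8")
    print(type(before).__name__, type(after).__name__)
```

In this model the first function returns bytes for the message that is in the catalog and str for the one that is not, while the second returns bytes for both.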

PR 2266 fixes lgettext() and related functions, updates the documentation, and adds tests.

Frankly, the usefulness of lgettext() in Python 3 looks questionable to me. gettext() can be used instead, with the result explicitly encoded to the desired charset.
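A minimal sketch of that alternative follows. It uses `gettext.NullTranslations` only so that the snippet runs without an installed catalog; a real program would presumably obtain its translations object from `gettext.translation()`, and the choice of UTF-8 as the target charset is an assumption.

```python
import gettext

# NullTranslations has no catalog, so gettext() returns the message unchanged.
translations = gettext.NullTranslations()

# In Python 3, gettext() returns str whether or not a translation exists.
text = translations.gettext("Informational notes")

# Encode explicitly, and only at the point where bytes are really needed.
data = text.encode("utf-8")
```

Keeping the value as str until the output boundary means the caller never has to guess which type a lookup returned.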
msg296275 - Author: Barry A. Warsaw (barry) (Python committer) Date: 2017-06-18 14:59
I agree with everything @serhiy.storchaka said, including the questionable utility of the l* methods in Python 3. ;)

Thanks also for updating the documentation.  Reading the existing docs over now, it's shocking how imprecise "the translation is returned in the preferred system encoding" is.

I have some suggestions about the PR, so I'll comment over there.
msg296434 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-06-20 14:13
New changeset 26cb4657bcc9a7adffa95798ececb588dddfeadb by Serhiy Storchaka in branch 'master':
bpo-29755: Fixed the lgettext() family of functions in the gettext module. (#2266)
https://github.com/python/cpython/commit/26cb4657bcc9a7adffa95798ececb588dddfeadb
msg296450 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-06-20 15:06
New changeset a1115e1a0454f0548f96cace6ee97b286dfa1c0d by Serhiy Storchaka in branch '3.6':
[3.6] bpo-29755: Fixed the lgettext() family of functions in the gettext module. (GH-2266) (#2297)
https://github.com/python/cpython/commit/a1115e1a0454f0548f96cace6ee97b286dfa1c0d
msg296451 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-06-20 15:07
New changeset 29c89d00bf4b57c5ee2aafe660002ce1b8cea176 by Serhiy Storchaka in branch '3.5':
[3.5] bpo-29755: Fixed the lgettext() family of functions in the gettext module. (GH-2266) (#2298)
https://github.com/python/cpython/commit/29c89d00bf4b57c5ee2aafe660002ce1b8cea176
msg296452 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-06-20 15:13
As for the original issue in the Debian bug tracker: lgettext() and ugettext() are the two correct ways of doing localization in Python 2 (depending on whether you format the output as 8-bit strings or as Unicode strings), but gettext() is the correct way in Python 3.
History
Date                 User              Action  Args
2017-06-20 15:13:52  serhiy.storchaka  set     status: open -> closed; resolution: fixed; messages: + msg296452; stage: patch review -> resolved
2017-06-20 15:07:01  serhiy.storchaka  set     messages: + msg296451
2017-06-20 15:06:51  serhiy.storchaka  set     messages: + msg296450
2017-06-20 14:18:53  serhiy.storchaka  set     pull_requests: + pull_request2346
2017-06-20 14:17:35  serhiy.storchaka  set     pull_requests: + pull_request2345
2017-06-20 14:13:32  serhiy.storchaka  set     messages: + msg296434
2017-06-18 14:59:09  barry             set     messages: + msg296275
2017-06-18 11:35:52  serhiy.storchaka  set     nosy: + barry; messages: + msg296268; stage: patch review
2017-06-18 11:34:08  serhiy.storchaka  set     pull_requests: + pull_request2317
2017-06-17 19:20:10  serhiy.storchaka  set     assignee: serhiy.storchaka
2017-03-08 09:19:47  serhiy.storchaka  set     nosy: + loewis, serhiy.storchaka; components: + Library (Lib); versions: + Python 3.5, Python 3.6, Python 3.7, - Python 3.4
2017-03-08 09:17:48  petri             create