classification
Title: Clarify text encoding used to enable UTF-8 mode
Type: enhancement Stage: needs patch
Components: Documentation Versions: Python 3.8, Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: docs@python, eric.snow, ncoghlan, vstinner
Priority: low Keywords:

Created on 2018-10-06 08:06 by ncoghlan, last changed 2018-10-18 08:06 by ncoghlan.

Messages (4)
msg327236 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-10-06 08:06
While working on the docs updates for bpo-34589 (clarifying that "PYTHONCOERCECLOCALE=0" and "PYTHONCOERCECLOCALE=warn" need both the environment variable name and the value to be encoded as ASCII in order to have any effect), I realised that it is less clear how to reliably enable UTF-8 mode. UTF-8 mode can be enabled even when the current locale is a nominally ASCII-incompatible one like gb18030, and the command line settings get processed as wchar_t strings rather than 8-bit char strings.

From what I've been able to figure out, the environment variable case is the same as for locale coercion: both the environment variable name and the value need to be encoded as ASCII. This actually happens implicitly, as even encodings like gb18030 still encode ASCII letters and numbers the same way ASCII does - their incompatibilities with ASCII lie elsewhere. Fully incompatible encodings like UTF-16 and UTF-32 don't get used as locale encodings in the first place because they'd break too many applications.
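As a rough illustration of the point above, the following sketch compares how a few codecs encode the "PYTHONUTF8=1" setting. The codec list is only an assumption chosen for illustration (it is not an exhaustive list of locale encodings), and the helper name `encodes_like_ascii` is hypothetical.

```python
# Sketch: check which codecs encode an environment setting exactly as ASCII does.
# The codec list below is an illustrative assumption, not a complete list of
# encodings that are used as locale encodings.

SETTING = "PYTHONUTF8=1"

CODECS = (
    "gb18030", "big5", "euc_jp", "shift_jis", "iso8859-1", "utf-8",
    "utf-16-le", "utf-32-le",
)


def encodes_like_ascii(text, encoding):
    """Return True if `encoding` produces the same bytes as ASCII for `text`."""
    return text.encode(encoding) == text.encode("ascii")


if __name__ == "__main__":
    for codec in CODECS:
        print(f"{codec:10} {encodes_like_ascii(SETTING, codec)}")
```

The UTF-16 and UTF-32 codecs are included only as a contrast: they pad each ASCII character with NUL bytes, so their output cannot match the ASCII bytes.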

I believe the same holds true for the command line arguments, just in the other direction: they get converted to wchar_t* with either mbstowcs or mbrtowc, and then compared using wcscmp or wcsncmp, but for all encodings that actually get used as locale encodings, the ASCII code points that CPython cares about get mapped directly to the corresponding UTF-16-LE or UTF-32 code point at both compile time (in the code) and at runtime (when reading the arg string).
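The decoding direction can be sketched in Python as a loose analogue of "decode with the locale encoding, then compare wide characters". This is not CPython's startup code; the helper name `decodes_to_ascii_code_points` and the choice of codecs are assumptions made for illustration.

```python
# Sketch: decode raw argument bytes with a candidate locale encoding and check
# that each resulting code point equals the original ASCII byte value.
# This only mimics the idea of mbstowcs() followed by wcscmp(); it is not
# how CPython itself parses its command line.

RAW_ARG = b"-X utf8"


def decodes_to_ascii_code_points(raw, encoding):
    """Return True if decoding `raw` yields one code point per byte, equal to that byte."""
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        # The bytes are not valid in this encoding at all.
        return False
    return [ord(ch) for ch in text] == list(raw)


if __name__ == "__main__":
    for codec in ("gb18030", "big5", "euc_jp", "utf-8", "utf-16-le"):
        print(f"{codec:10} {decodes_to_ascii_code_points(RAW_ARG, codec)}")
```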

Given that simply not thinking about the problem will actually do the right thing in all cases, I don't think this needs to be documented prominently, but I do think it would be good to explicitly address the point somewhere.
msg327497 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-10-10 21:43
I'm not sure that I understand your issue. There are 3 ways to enable the UTF-8 Mode:

* if the LC_CTYPE locale is "C" or "POSIX"
* if PYTHONUTF8 env var is equal to "1"
* using -X utf8 or -X utf8=1 command line option
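A minimal way to observe these options is to start a child interpreter and read `sys.flags.utf8_mode` (available since Python 3.7). The helper name `utf8_mode_flag` is hypothetical, and the sketch assumes the child interpreter inherits the parent's environment apart from the overrides passed in.

```python
# Sketch: launch a child interpreter and report its sys.flags.utf8_mode value.
import os
import subprocess
import sys


def utf8_mode_flag(extra_args=(), extra_env=None):
    """Run a child Python with extra options/env vars and return its utf8_mode flag."""
    env = dict(os.environ)
    env.update(extra_env or {})
    cmd = [sys.executable, *extra_args, "-c",
           "import sys; print(sys.flags.utf8_mode)"]
    result = subprocess.run(cmd, env=env, capture_output=True,
                            text=True, check=True)
    return int(result.stdout.strip())


if __name__ == "__main__":
    # Command line option.
    print("-X utf8      ->", utf8_mode_flag(["-X", "utf8"]))
    # Environment variable.
    print("PYTHONUTF8=1 ->", utf8_mode_flag(extra_env={"PYTHONUTF8": "1"}))
    # C locale (POSIX only); the result depends on the platform and on any
    # PYTHONUTF8 value already present in the environment.
    print("LC_ALL=C     ->", utf8_mode_flag(extra_env={"LC_ALL": "C"}))
```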

The first 2 cases are fine if the locale encoding is gb18030.

For the command line argument, first Python decodes the command line from gb18030. If -X utf8 is present, the command line is decoded again from UTF-8 (and the old configuration is removed, to parse the new configuration).

I understand that your question is whether decoding the command line arguments from gb18030 can miss -X utf8 or enable UTF-8 mode by mistake.

It seems like gb18030 encodes the "-X utf8" text the same way as ASCII:

>>> "-X utf8".encode("gb18030")
b'-X utf8'
>>> b'-X utf8'.decode("gb18030")
'-X utf8'

I'm aware of mojibake causing a security issue, but it was for a function checking for a single byte, not a substring:

https://unicodebook.readthedocs.io/issues.html#check-byte-strings-before-decoding-them-to-character-strings

I don't know gb18030 well, so maybe I missed something. To me, using gb18030 with the UTF-8 mode doesn't seem to cause any issue when decoding the command line arguments.
msg327498 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-10-10 21:47
Well, I'm not saying that using gb18030 with UTF-8 will be just fine for everything. Mojibake is likely around the corner :-) C locale coercion and the UTF-8 mode are workarounds for the crappy and wild Unix world :-)
msg327944 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-10-18 08:06
Your explanation is why this is a docs enhancement proposal rather than a
bug report: as far as we're aware, all encodings that get used as locale
encodings have the property that encoding "-X utf8" with the locale
encoding gives the same answer as encoding it with ASCII.

Encodings where this isn't true (like UTF-16-LE) don't get used as locale
encodings.
History
Date                 User      Action  Args
2018-10-18 08:06:11  ncoghlan  set     messages: + msg327944
2018-10-10 21:47:34  vstinner  set     messages: + msg327498
2018-10-10 21:43:38  vstinner  set     messages: + msg327497
2018-10-06 08:06:57  ncoghlan  create