
Author ncoghlan
Recipients docs@python, eric.snow, ncoghlan, vstinner
Date 2018-10-06.08:06:56
Message-id <1538813217.24.0.545547206417.issue34914@psf.upfronthosting.co.za>
In-reply-to
Content
While working on the docs updates for bpo-34589 (clarifying that "PYTHONCOERCECLOCALE=0" and "PYTHONCOERCECLOCALE=warn" need both the environment variable name and the value to be encoded as ASCII in order to have any effect), I realised that it was less clear how to reliably enable UTF-8 mode. UTF-8 mode can be enabled even when the current locale is a nominally ASCII-incompatible one like gb18030, and the command line settings get processed as wchar_t strings rather than 8-bit char strings.
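
For reference, a minimal sketch of the two ways of enabling UTF-8 mode that are in question here, assuming Python 3.7 or later. The helper names are hypothetical and exist only for illustration; `-X utf8`, `PYTHONUTF8=1` and `sys.flags.utf8_mode` are the documented interfaces.

```python
import os
import subprocess
import sys

# Snippet run in the child interpreter: reports whether UTF-8 mode is active.
CHECK = "import sys; print(sys.flags.utf8_mode)"


def utf8_mode_with_option():
    """Start a child interpreter with the -X utf8 command line option."""
    result = subprocess.run(
        [sys.executable, "-X", "utf8", "-c", CHECK],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def utf8_mode_with_env():
    """Start a child interpreter with PYTHONUTF8=1 in its environment."""
    env = dict(os.environ, PYTHONUTF8="1")
    result = subprocess.run(
        [sys.executable, "-c", CHECK],
        capture_output=True, text=True, check=True, env=env,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("-X utf8      ->", utf8_mode_with_option())
    print("PYTHONUTF8=1 ->", utf8_mode_with_env())
```

Both calls spell the setting with ASCII characters only, which is the case discussed in the rest of this message.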

From what I've been able to figure out, the environment variable case is the same as for locale coercion: both the environment variable name and the value need to be encoded as ASCII. This actually happens implicitly, as even encodings like gb18030 still encode ASCII letters and numbers the same way ASCII does - their incompatibilities with ASCII lie elsewhere. Fully incompatible encodings like UTF-16 and UTF-32 don't get used as locale encodings in the first place because they'd break too many applications.
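
The claim about implicit ASCII encoding can be spot-checked with Python's own codecs. This is only a sketch: the list of encodings is an illustrative sample rather than an exhaustive list of locale encodings, and the helper name is hypothetical.

```python
# Names and values that CPython looks for in the environment.
SETTINGS = ["PYTHONUTF8", "PYTHONCOERCECLOCALE", "0", "1", "warn"]

# Illustrative sample of encodings that are used as locale encodings.
LOCALE_ENCODINGS = ["gb18030", "big5", "euc_jp", "euc_kr", "latin-1", "utf-8"]


def encodes_like_ascii(text, encoding):
    """Return True if `encoding` produces the same bytes as ASCII for `text`."""
    return text.encode(encoding) == text.encode("ascii")


# Every sampled locale encoding agrees with ASCII for these strings.
results = {
    (setting, encoding): encodes_like_ascii(setting, encoding)
    for setting in SETTINGS
    for encoding in LOCALE_ENCODINGS
}

# UTF-16 does not, which is one reason it is not used as a locale encoding.
utf16_matches = encodes_like_ascii("PYTHONUTF8", "utf-16-le")

if __name__ == "__main__":
    print("all sampled locale encodings match ASCII:", all(results.values()))
    print("utf-16-le matches ASCII:", utf16_matches)
```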

I believe the same holds true for the command line arguments, just in the other direction: they get converted to wchar_t* with either mbstowcs or mbrtowc, and then compared using wcscmp or wcsncmp. For all encodings that actually get used as locale encodings, the ASCII code points that CPython cares about get mapped directly to the corresponding UTF-16-LE or UTF-32 code point at both compile time (in the code) and at runtime (when reading the arg string).
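
The decoding direction can be sketched the same way. Note the assumption: this uses Python's codecs as a stand-in for what mbstowcs or mbrtowc would do under the corresponding locale, so it illustrates the mapping rather than exercising the actual C code path. The helper name and the sample of encodings are hypothetical.

```python
# Raw bytes as they would appear in argv for "-X utf8".
RAW_ARGS = [b"-X", b"utf8"]

# Illustrative sample of encodings that are used as locale encodings.
LOCALE_ENCODINGS = ["gb18030", "big5", "euc_jp", "euc_kr", "latin-1", "utf-8"]


def decoded_code_points(raw, encoding):
    """Decode `raw` with `encoding` and return the resulting code points."""
    return [ord(ch) for ch in raw.decode(encoding)]


# For ASCII-only arguments, each byte value maps to the identical code point,
# so a comparison against a wide string literal in the source succeeds.
decode_matches = {
    (raw, encoding): decoded_code_points(raw, encoding) == list(raw)
    for raw in RAW_ARGS
    for encoding in LOCALE_ENCODINGS
}

if __name__ == "__main__":
    print("all sampled encodings decode to the same code points:",
          all(decode_matches.values()))
```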

Given that simply not thinking about the problem will actually do the right thing in all cases, I don't think this needs to be documented prominently, but I do think it would be good to explicitly address the point somewhere.
History
Date User Action Args
2018-10-06 08:06:57 ncoghlan set recipients: + ncoghlan, vstinner, docs@python, eric.snow
2018-10-06 08:06:57 ncoghlan set messageid: <1538813217.24.0.545547206417.issue34914@psf.upfronthosting.co.za>
2018-10-06 08:06:57 ncoghlan link issue34914 messages
2018-10-06 08:06:56 ncoghlan create