Message 327497 - Python tracker

Author	vstinner
Recipients	docs@python, eric.snow, ncoghlan, vstinner
Date	2018-10-10.21:43:37
Message-id	<1539207818.03.0.788709270274.issue34914@psf.upfronthosting.co.za>

Content
I'm not sure that I understand your issue. There are 3 ways to enable the UTF-8 Mode:

* if the LC_CTYPE locale is "C" or "POSIX"
* if PYTHONUTF8 env var is equal to "1"
* using -X utf8 or -X utf8=1 command line option
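To check which of these actually turns the mode on for a given setup, one option is to ask a child interpreter for sys.flags.utf8_mode. This is only a sketch: the helper name utf8_mode_of_child and the PROBE string are made up for illustration, and it only exercises the PYTHONUTF8 and -X utf8 cases, since the locale case depends on the platform.

```python
import os
import subprocess
import sys

# One-liner executed by the child interpreter: report its UTF-8 Mode flag.
PROBE = "import sys; print(sys.flags.utf8_mode)"


def utf8_mode_of_child(extra_args=(), extra_env=None):
    """Run a child interpreter and return its sys.flags.utf8_mode as an int.

    Hypothetical helper, not part of CPython.
    """
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a known state
    env.update(extra_env or {})
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", PROBE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# Command line option
print(utf8_mode_of_child(extra_args=["-X", "utf8"]))
# Environment variable
print(utf8_mode_of_child(extra_env={"PYTHONUTF8": "1"}))
```

Both calls should report the mode as enabled. Calling the helper with no arguments shows whatever the current locale and Python version select by default.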

The first 2 cases are fine if the locale encoding is gb18030.

For the command line option, Python first decodes the command line from gb18030. If -X utf8 is present, the command line is decoded again from UTF-8 (and the old configuration is removed, to parse the new configuration).
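A rough Python model of that two-pass decoding, to make the order of operations concrete. This is not the actual C code: decode_argv is a hypothetical name, the option scan is deliberately naive (it does not stop at the script name), and the surrogateescape error handler is my assumption for how undecodable bytes would be kept.

```python
def decode_argv(raw_args, locale_encoding):
    """Simplified model: decode with the locale encoding, then redo the
    decoding from UTF-8 if -X utf8 was found. Hypothetical helper."""
    # Pass 1: decode every argument with the locale encoding.
    args = [a.decode(locale_encoding, "surrogateescape") for a in raw_args]

    # Look for the option in the decoded arguments.
    wants_utf8 = False
    for i, arg in enumerate(args):
        if arg in ("-Xutf8", "-Xutf8=1"):
            wants_utf8 = True
        elif arg == "-X" and i + 1 < len(args) and args[i + 1] in ("utf8", "utf8=1"):
            wants_utf8 = True

    # Pass 2: drop the first result and decode the raw bytes again from UTF-8.
    if wants_utf8:
        args = [a.decode("utf-8", "surrogateescape") for a in raw_args]
    return args


# The last argument is UTF-8 encoded, so only the second pass reads it correctly.
print(decode_argv([b"-X", b"utf8", "é".encode("utf-8")], "gb18030"))
# Without the option, the gb18030 result of the first pass is kept.
print(decode_argv([b"script.py", "中".encode("gb18030")], "gb18030"))
```

The point of the model is that the option itself only has to survive the first pass; the other arguments are re-read in the second pass anyway.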

I understand that your question is whether decoding the command line arguments from gb18030 can miss -X utf8 or enable the UTF-8 Mode by mistake.

It seems like gb18030 encodes the "-X utf8" text the same way as ASCII:

>>> "-X utf8".encode("gb18030")
b'-X utf8'
>>> b'-X utf8'.decode("gb18030")
'-X utf8'
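The same check can be extended from this one string to the whole ASCII range. A minimal sketch using only the codecs shipped with Python (the variable name ascii_text is mine):

```python
# All 128 ASCII characters, U+0000 to U+007F.
ascii_text = "".join(chr(i) for i in range(128))

# Encoding: gb18030 produces the same bytes as the ASCII codec.
assert ascii_text.encode("gb18030") == ascii_text.encode("ascii")

# Decoding: every byte 0x00-0x7F on its own decodes to the same ASCII character.
assert bytes(range(128)).decode("gb18030") == ascii_text

print("gb18030 round-trips the ASCII range unchanged")
```

This only covers pure ASCII input. It says nothing about ASCII byte values showing up inside multi-byte sequences.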

I'm aware of mojibake causing a security issue, but it was for a function checking for a single byte, not a substring:

https://unicodebook.readthedocs.io/issues.html#check-byte-strings-before-decoding-them-to-character-strings
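For the single byte concern, a brute-force scan can tell whether a given ASCII byte ever appears inside the gb18030 encoding of another character. This is a sketch with a hypothetical helper (chars_containing_byte), relying on Python's gb18030 codec. It skips surrogates and anything the codec refuses to encode, and it takes a second or two to run.

```python
import sys


def chars_containing_byte(byte_value, encoding="gb18030"):
    """Return the code points whose encoded form contains byte_value.

    Hypothetical helper: scans every code point, skipping surrogates and
    anything the codec cannot encode.
    """
    hits = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # lone surrogates cannot be encoded
        try:
            encoded = chr(cp).encode(encoding)
        except UnicodeEncodeError:
            continue
        if byte_value in encoded:
            hits.append(cp)
    return hits


dash_hits = chars_containing_byte(ord("-"))
x_hits = chars_containing_byte(ord("X"))
print([hex(cp) for cp in dash_hits])
print(len(x_hits))
```

With Python's codec, the byte 0x2D ("-") only comes from the "-" character itself, while 0x58 ("X") also occurs as a trail byte of two-byte sequences. So a check for the "-" byte cannot be fooled by a multi-byte character, but a check for the "X" byte alone could be.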

I don't know gb18030 well, so maybe I missed something. To me, using gb18030 with the UTF-8 Mode doesn't seem to cause any issue when decoding the command line arguments.
