Message 277925 - Python tracker

Author	Artoria2e5
Recipients	Artoria2e5, ezio.melotti, vstinner
Date	2016-10-03.03:11:28
Message-id	<1475464291.36.0.0934021228231.issue28343@psf.upfronthosting.co.za>

Content
Microsoft's cp936 defines a euro sign at 0x80, but Python fails with a UnicodeEncodeError when asked to do something like `u'\u20ac'.encode('cp936')`. This may break things for zh-Hans-CN Windows users who want to put a euro sign in a file name (if they insist on passing a non-Unicode str to open() in py2, well.)
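
A minimal reproduction sketch, written for Python 3 (the `u''` prefix is still accepted there). The helper name `try_encode` is made up for illustration; it only wraps `str.encode` so that the failure can be inspected instead of aborting the script. The exact outcome depends on the Python version in use, so no expected output is shown for the euro sign.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or the UnicodeEncodeError if encoding fails."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc


# An ordinary CJK character that every GBK variant covers.
ordinary = try_encode(u'\u4e2d', 'cp936')
print(repr(ordinary))

# The euro sign, which Microsoft's cp936 places at the single byte 0x80.
euro = try_encode(u'\u20ac', 'cp936')
print(repr(euro))
```

If `euro` comes back as an exception object rather than `b'\x80'`, the interpreter shows the behaviour described above.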

According to the codecs documentation, 'cp936' appears to be an alias for the gbk codec. "GBK" is itself a very ambiguous name and a frequent source of confusion.

The name "GBK" might refer to any of four commonly known members of the family of EUC-CN (gb2312) extensions that have full coverage of the Unicode 1.1 CJK Unified Ideographs block:
1) The original GBK. Rust-Encoding says that it's in a normative annex of GB13000.1-1993, but the closest thing I can find in my archive.org copy of that standard is an annex on an EUC (GB/T 2311) UCS.
2) IANA GBK, or Microsoft cp936. This is the one with the euro sign I am looking for.
3) GBK 1.0, a recommendation from the official standardization committees based on cp936. It's roughly cp936 without the euro sign but with about 95 additional PUA code points.
4) W3C TR GBK. This GBK is basically gb18030-2005 without four-byte UTF, and with the euro sign. Roughly a union of 2) and 3) with some PUA code points moved into the right place.
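
Which codec actually backs the 'cp936' name can be checked at runtime rather than read off the documentation. The sketch below uses only `codecs.lookup` from the standard library; the variable names are illustrative, and the result may differ between Python versions.

```python
import codecs

# Resolve both names through the codec registry and compare the canonical names.
cp936_name = codecs.lookup('cp936').name
gbk_name = codecs.lookup('gbk').name

same_codec = (cp936_name == gbk_name)
print(cp936_name, gbk_name, same_codec)
```

If `same_codec` is true, 'cp936' is only an alias and inherits whichever GBK variant the gbk codec implements.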

Looking at Modules/cjkcodecs/_codecs_cn.c @ 104259:36b052adf5a7, Python seems to be implementing either 1) or 3). For a quick fix you can just add a separate cp936 codec that wraps the gbk codec and additionally handles U+20AC; for some excitement (and the risk of breaking stuff) you can join the web people and use either 2) or 4).
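
The quick fix could look roughly like the following pure-Python sketch, assuming Python 3 and strict error handling only. The function names `cp936_encode` and `cp936_decode` are hypothetical and are not part of CPython; a real fix would live in the C codec or be registered through `codecs.register`. The decoder has to scan for lead bytes because 0x80 is also a legal trail byte in GBK, so splitting the input on `b'\x80'` would corrupt two-byte characters.

```python
EURO = u'\u20ac'


def cp936_encode(text):
    """Encode with gbk, mapping U+20AC to the single byte 0x80."""
    # Encode the pieces between euro signs with the stock gbk codec,
    # then join them with the cp936 euro byte.
    return b'\x80'.join(part.encode('gbk') for part in text.split(EURO))


def cp936_decode(data):
    """Decode with gbk, mapping a standalone 0x80 byte to U+20AC."""
    pieces = []
    start = i = 0
    while i < len(data):
        byte = data[i]
        if byte == 0x80:
            # Standalone 0x80: flush the pending gbk run, then emit the euro sign.
            pieces.append(data[start:i].decode('gbk'))
            pieces.append(EURO)
            i += 1
            start = i
        elif byte > 0x80:
            # Lead byte of a two-byte sequence: skip the trail byte as well,
            # so a trail byte of 0x80 is not mistaken for the euro sign.
            i += 2
        else:
            # Plain ASCII byte.
            i += 1
    pieces.append(data[start:].decode('gbk'))
    return u''.join(pieces)


encoded = cp936_encode(u'\u4e2d\u20ac')
print(repr(encoded))
print(cp936_decode(encoded) == u'\u4e2d\u20ac')
```

This only covers the euro sign; it does nothing about the PUA differences between variants 2), 3) and 4), and invalid input is simply left to the gbk codec to reject.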

History
Date	User	Action	Args
2016-10-03 03:11:31	Artoria2e5	set	recipients: + Artoria2e5, vstinner, ezio.melotti
2016-10-03 03:11:31	Artoria2e5	set	messageid: <1475464291.36.0.0934021228231.issue28343@psf.upfronthosting.co.za>
2016-10-03 03:11:31	Artoria2e5	link	issue28343 messages
2016-10-03 03:11:28	Artoria2e5	create