Message 241858 - Python tracker


Author	malin
Recipients	ezio.melotti, malin, vstinner
Date	2015-04-23.09:29:39
Message-id	<1429781381.54.0.113233811407.issue24036@psf.upfronthosting.co.za>
In-reply-to

Content

While trying to optimize the GB2312/GBK/GB18030-2000 codecs (three encodings that are widely used in China), I found a bug.

The relation between the three encodings should be: GB2312 ⊂ GBK ⊂ GB18030-2000.
However, in Python's implementation: GB2312 ⊄ GBK ⊂ GB18030-2000.
GBK should be backward compatible with GB2312, but in Python's implementation it is not.
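
The subset claim can be checked against whatever interpreter is at hand. The following is only a sketch, not part of the patch: the helper name `gb2312_gbk_mismatches` is made up for illustration, and it assumes that scanning the two-byte range 0xA1..0xF7 / 0xA1..0xFE covers every GB2312 code.

```python
def gb2312_gbk_mismatches():
    """Return (bytes, gb2312_char, gbk_char) for every two-byte sequence
    that the gb2312 codec decodes differently from the gbk codec."""
    mismatches = []
    for lead in range(0xA1, 0xF8):        # GB2312 lead bytes 0xA1..0xF7
        for trail in range(0xA1, 0xFF):   # GB2312 trail bytes 0xA1..0xFE
            raw = bytes([lead, trail])
            try:
                as_gb2312 = raw.decode("gb2312")
            except UnicodeDecodeError:
                continue                  # not assigned in GB2312, skip
            try:
                as_gbk = raw.decode("gbk")
            except UnicodeDecodeError:
                as_gbk = None             # GBK rejects a valid GB2312 code
            if as_gb2312 != as_gbk:
                mismatches.append((raw, as_gb2312, as_gbk))
    return mismatches


# Hypothetical usage: print each disagreeing code with both decodings.
for raw, from_gb2312, from_gbk in gb2312_gbk_mismatches():
    print(raw.hex().upper(), ascii(from_gb2312), ascii(from_gbk))
```

If GB2312 really were a subset of GBK in the implementation, the returned list would be empty.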

----
I dug into it and found that Python's GB2312 codec is using a wrong conversion table.
The file /Modules/cjkcodecs/_codecs_cn.c contains this comment block, which I paste here:

/* GBK and GB2312 map differently in few code points that are listed below:
 *
 *              gb2312                          gbk
 * A1A4         U+30FB KATAKANA MIDDLE DOT      U+00B7 MIDDLE DOT
 * A1AA         U+2015 HORIZONTAL BAR           U+2014 EM DASH
 * A844         undefined                       U+2015 HORIZONTAL BAR
 */
 
In fact the second column (the gb2312 column) is wrong; this column should be deleted.

The four Unicode code points involved are:
U+30FB  ・     KATAKANA MIDDLE DOT
U+00B7  ·      MIDDLE DOT
U+2015  ―      HORIZONTAL BAR
U+2014  —      EM DASH

So the GB2312 codec decodes b'\xa1\xa4' to U+30FB.
U+30FB is a Japanese symbol, but looks quite similar to U+00B7.
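
To see which character each codec produces for a given code, something like the following can be used. It is a small sketch; `describe` is a hypothetical helper, and it assumes the input is a single two-byte code that decodes to exactly one character.

```python
import unicodedata


def describe(raw, codec):
    """Decode one two-byte code and return 'U+XXXX NAME' for the result."""
    ch = raw.decode(codec)
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


# The two codes from the comment block that both codecs can decode.
for raw in (b"\xa1\xa4", b"\xa1\xaa"):
    for codec in ("gb2312", "gbk"):
        print(raw.hex().upper(), codec, describe(raw, codec))
```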

I searched Google for "GB2312 Unicode Table". Both a right version and a wrong version are on the Internet; unfortunately we are using the wrong version.

libiconv-1.14 also uses the wrong version.

----
Here is an example of the bad behavior.

Encoding U+30FB to bytes with the GBK encoder raises UnicodeEncodeError, because U+30FB is not in GBK.

In the Simplified Chinese version of Microsoft Windows, the console's default encoding is GBK[1].
If the GB2312 decoder is used to decode b'\xa1\xa4' and the resulting U+30FB is then printed to the console, UnicodeEncodeError is raised.
Since the dash is a common character, this bug is annoying.
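
The console failure can be reproduced without a Chinese Windows machine by encoding to GBK by hand. This is only an approximation of what the console does; `can_print_on_gbk_console` is a hypothetical name, and it assumes that printing amounts to a strict encode with the gbk codec.

```python
def can_print_on_gbk_console(text):
    """Return True if text survives a strict encode to GBK."""
    try:
        text.encode("gbk")
    except UnicodeEncodeError:
        return False
    return True


# Decode with gb2312, then try to send the result to a GBK console.
decoded = b"\xa1\xa4".decode("gb2312")
print(ascii(decoded), can_print_on_gbk_console(decoded))
```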

----
If we fix this, I don't know how many disasters will happen.
However, if we don't fix this, it's a bug.

I have already made a patch, but I think we need a discussion first: should we fix this?

-----------------------
Note:
[1] In fact the console's default encoding is cp936, which is almost the same as GBK but not entirely. Using GBK here is not a problem.

History
Date	User	Action	Args
2015-04-23 09:29:41	malin	set	recipients: + malin, vstinner, ezio.melotti
2015-04-23 09:29:41	malin	set	messageid: <1429781381.54.0.113233811407.issue24036@psf.upfronthosting.co.za>
2015-04-23 09:29:41	malin	link	issue24036 messages
2015-04-23 09:29:41	malin	create