Issue 24036: GB2312 codec is using a wrong covert table



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/68224

classification

Title:	GB2312 codec is using a wrong covert table
Type:	behavior	Stage:	patch review
Components:	Unicode	Versions:	Python 3.6, Python 3.5

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	Artoria2e5, ezio.melotti, lemburg, loewis, malin, vstinner
Priority:	normal	Keywords:	patch

Created on 2015-04-23 09:29 by malin, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description
fixgb2312.patch	malin, 2015-04-23 09:29	fix GB2312 codec

Messages (9)
msg241858 - (view)	Author: Ma Lin (malin) *	Date: 2015-04-23 09:29
While trying to optimize the GB2312/GBK/GB18030-2000 codecs (three encodings widely used in China), I found a bug.

The relation between the three encodings should be: GB2312 ⊂ GBK ⊂ GB18030-2000.
However, in Python's implementation it is: GB2312 ⊄ GBK ⊂ GB18030-2000.
GBK should be backward compatible with GB2312, but in Python's implementation it is not.

----

I dug into it and found that Python's GB2312 codec is using a wrong conversion table. The file /Modules/cjkcodecs/_codecs_cn.c contains this comment block:

/* GBK and GB2312 map differently in few code points that are listed below:
 *
 *          gb2312                        gbk
 * A1A4     U+30FB KATAKANA MIDDLE DOT    U+00B7 MIDDLE DOT
 * A1AA     U+2015 HORIZONTAL BAR         U+2014 EM DASH
 * A844     undefined                     U+2015 HORIZONTAL BAR
 */

In fact the second column (the gb2312 column) is wrong and should be deleted.

The four Unicode code points involved are:
U+30FB ・ KATAKANA MIDDLE DOT
U+00B7 · MIDDLE DOT
U+2015 ― HORIZONTAL BAR
U+2014 — EM DASH

So the GB2312 codec decodes b'\xa1\xa4' to U+30FB. U+30FB is a Japanese symbol, but it looks quite similar to U+00B7.

I searched Google for "GB2312 Unicode Table". Both the right version and the wrong version can be found on the Internet; unfortunately we are using the wrong version.
libiconv-1.14 is also using the wrong version.

----

Here is an example of the bad behavior. Encoding U+30FB with the GBK encoder raises UnicodeEncodeError, because U+30FB is not in GBK. On the Simplified Chinese version of Microsoft Windows, the console's default encoding is GBK[1]. If you decode b'\xa1\xa4' with the GB2312 decoder and then print the resulting U+30FB to the console, UnicodeEncodeError is raised.
Since DASH is a common character, this bug is annoying.

----

If we fix this, I don't know how many disasters will happen. However, if we don't fix it, it remains a bug. I have already made a patch, but I think we need a discussion: should we fix this?

-----------------------
Annotation:
[1] In fact the console's default encoding is cp936. cp936 is almost the same as GBK, but not entirely. Treating it as GBK here is not a problem.
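
The behavior described in this message can be reproduced with a short script. This is only a sketch that assumes the stock "gb2312" and "gbk" codecs shipped with CPython; the variable names are illustrative and not taken from the patch.

```python
# Bytes 0xA1 0xA4 are the code point discussed in the report above.
raw = b"\xa1\xa4"

as_gb2312 = raw.decode("gb2312")
as_gbk = raw.decode("gbk")
print(f"gb2312 -> U+{ord(as_gb2312):04X}")
print(f"gbk    -> U+{ord(as_gbk):04X}")

# Re-encoding the gb2312 result with the gbk codec is where the error
# described in the report shows up.
try:
    as_gb2312.encode("gbk")
    gbk_encode_ok = True
except UnicodeEncodeError as exc:
    gbk_encode_ok = False
    print("gbk encode failed:", exc.reason)
```

Only ASCII is printed, so the script itself should not trip over the console encoding it is describing.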
msg241859 - (view)	Author: Ma Lin (malin) *	Date: 2015-04-23 09:43
"Since MIDDLE DOT is a common character, this bug is annoying." Sorry, it's MIDDLE DOT, not DASH.
msg241909 - (view)	Author: Ma Lin (malin) *	Date: 2015-04-24 04:19
Today I checked these popular programming languages and tools, all at their latest versions:

iconv-1.14           wrong version
php-5.6.8            wrong version (php is using iconv)
ActivePerl-5.20.2    wrong version
GoLang-1.4.2         no GB2312, only has GBK/GB18030 (golang.org/x/text/encoding)
Java 1.7.0_79-b15    wrong version (java.nio.charset)
.Net 2013            right version

It seems Python should stay with the wrong version.
Very sorry for wasting your time.

Appendix A:
/* The right version's table should be:
 *
 *          gb2312                 gbk
 * A1A4     U+00B7 MIDDLE DOT      U+00B7 MIDDLE DOT
 * A1AA     U+2014 EM DASH         U+2014 EM DASH
 * A844     undefined              U+2015 HORIZONTAL BAR
 */

Appendix B:
Advice for final user:
1. Use GBK as much as possible.
2. Be careful when exchanging data between GB2312 and GBK/GB18030.
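
One way to follow the advice in Appendix B is sketched below. The helper names are hypothetical and not part of Python or the attached patch; only the codec names "gb2312" and "gbk" are real. The translation table also assumes that any U+30FB or U+2015 in the text came from the gb2312 codec, which may not hold for mixed input.

```python
# Hypothetical helpers illustrating the Appendix B advice.

# Map the two code points that Python's gb2312 codec produces for A1A4 / A1AA
# onto the ones the gbk codec produces for the same bytes.
GB2312_TO_GBK_PUNCTUATION = str.maketrans({
    "\u30fb": "\u00b7",  # KATAKANA MIDDLE DOT -> MIDDLE DOT
    "\u2015": "\u2014",  # HORIZONTAL BAR -> EM DASH
})


def decode_legacy_gb2312(data: bytes) -> str:
    """Decode bytes labelled GB2312 with the gbk codec (advice 1)."""
    return data.decode("gbk")


def normalize_gb2312_text(text: str) -> str:
    """Rewrite text that was already decoded with the gb2312 codec (advice 2)."""
    return text.translate(GB2312_TO_GBK_PUNCTUATION)


sample = b"\xa1\xa4\xa1\xaa"
fixed = normalize_gb2312_text(sample.decode("gb2312"))
```

After normalization the text can be encoded with the gbk codec again, which is the interoperability case the advice is concerned with.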
msg241920 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2015-04-24 07:43
Hi Ma Lin, thank you for your investigation. In order to fix these tables, we'd need an official reference which shows that there is in fact an error. If most programming languages you have tested use the "wrong" version, then maybe it's not wrong after all :-) Adding a new mapping for A844 should not be a problem and the other mappings don't seem to introduce much difference in terms of how the glyphs look. However, by changing such mappings we'd break roundtrip safety.
msg241938 - (view)	Author: Ma Lin (malin) *	Date: 2015-04-24 12:41
Andre Lemburg,

We don't need any modification there: A844 is in GBK but not in GB2312, so there is no need to add it to GB2312.

Your logic is right, it's hard to judge which one is wrong. But U+30FB (・ KATAKANA MIDDLE DOT) and U+2015 (― HORIZONTAL BAR) have no reason to be among these common Chinese punctuation symbols.
A1A2-A1B7:
、。・ ˉ ˇ ¨ 〃々 ― ～ ‖ … ‘ ’“ ” 〔〕〈〉《》
If they were U+00B7 (· MIDDLE DOT) and U+2014 (— EM DASH), this section would look more reasonable.

GB2312 was published in the early 1980s; it seems there was a historical accident. Luckily, most programming languages are on the same side.
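
The A1A2-A1B7 row quoted above can be compared across the two codecs with a few lines of Python. This is a sketch that assumes CPython's built-in "gb2312" and "gbk" codecs; the variable names are illustrative.

```python
import unicodedata

LEAD = 0xA1
differences = []
for trail in range(0xA2, 0xB8):  # A1A2 .. A1B7 inclusive
    raw = bytes([LEAD, trail])
    gb2312_char = raw.decode("gb2312")
    gbk_char = raw.decode("gbk")
    if gb2312_char != gbk_char:
        differences.append((raw.hex().upper(), gb2312_char, gbk_char))

# Print Unicode names rather than glyphs so the output stays ASCII-only.
for code, gb2312_char, gbk_char in differences:
    print(code, unicodedata.name(gb2312_char), "|", unicodedata.name(gbk_char))
```

With the tables discussed in this issue, the only entries reported for this row should be A1A4 and A1AA.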
msg257454 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2016-01-04 10:57
I think we can close this issue as "won't fix". It's a bug, but one which is present in a lot of other systems as well, so we'd potentially make it impossible to write GB2312 data which is supposed to be read back by these other systems. Ma Lin: Do you agree ?
msg257457 - (view)	Author: Ma Lin (malin) *	Date: 2016-01-04 11:30
I agree with you, "won't fix".
msg257458 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2016-01-04 11:37
Thanks, Ma Lin.
msg277926 - (view)	Author: Mingye Wang (Artoria2e5) *	Date: 2016-10-03 03:50
> Advice for final user:

This seems worth adding to the codecs doc as a footnote. Perhaps something like "(deprecated) ... gb2312 is an obsolete encoding from the 1980s. Use gbk or gb18030 instead." will do.

> libiconv-1.14 is also using the wrong version.

Just a side note on the rightness or wrongness of libiconv: I have reported the GB18030 incompatibility as a libiconv bug.[1] From the replies, I learnt that 1) what libiconv currently uses is a then-official mapping published on ftp.unicode.org; 2) vendor implementations of gb2312 differed historically. I have updated the corresponding section[2] on Wikipedia to include these old references.

[1]: https://lists.gnu.org/archive/html/bug-gnu-libiconv/2016-09/msg00004.html
[2]: https://en.wikipedia.org/wiki/GB_2312#Two_implementations_of_GB2312

Still, being old and common does not necessarily mean being correct, as Ma Lin has demonstrated by showing the character semantics. To reflect this in a better-supported manner, I have added names for the glyphs in question from GB2312-80 to [2].

History
Date	User	Action	Args
2022-04-11 14:58:16	admin	set	github: 68224
2016-10-03 03:50:57	Artoria2e5	set	nosy: + Artoria2e5 messages: + msg277926
2016-01-04 11:37:52	lemburg	set	status: open -> closed resolution: wont fix messages: + msg257458
2016-01-04 11:30:55	malin	set	messages: + msg257457
2016-01-04 10:57:27	lemburg	set	messages: + msg257454
2016-01-02 08:38:31	ezio.melotti	set	stage: patch review
2015-04-24 12:41:23	malin	set	messages: + msg241938
2015-04-24 07:43:53	lemburg	set	messages: + msg241920
2015-04-24 04:19:48	malin	set	messages: + msg241909
2015-04-23 13:43:22	serhiy.storchaka	set	nosy: + lemburg, loewis
2015-04-23 09:43:55	malin	set	messages: + msg241859
2015-04-23 09:29:41	malin	create