classification
Title: GB2312 codec is using a wrong conversion table
Type: behavior Stage: patch review
Components: Unicode Versions: Python 3.6, Python 3.5
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: Artoria2e5, Ma Lin, ezio.melotti, lemburg, loewis, vstinner
Priority: normal Keywords: patch

Created on 2015-04-23 09:29 by Ma Lin, last changed 2016-10-03 03:50 by Artoria2e5. This issue is now closed.

Files
File name Uploaded Description
fixgb2312.patch Ma Lin, 2015-04-23 09:29 fix GB2312 codec
Messages (9)
msg241858 - (view) Author: Ma Lin (Ma Lin) * Date: 2015-04-23 09:29
While I was trying to optimize the GB2312/GBK/GB18030-2000 codecs (three encodings that are widely used in China), I found a bug.

The relation among the three encodings should be: GB2312 ⊂ GBK ⊂ GB18030-2000.
However, in Python's implementation: GB2312 ⊄ GBK ⊂ GB18030-2000.
GBK should be backward compatible with GB2312, but in Python's implementation it is not.
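
The subset claim can be checked empirically. Below is a minimal sketch, assuming CPython's built-in `gb2312` and `gbk` codecs; the helper name `gb2312_gbk_mismatches` is made up for illustration. It walks every two-byte sequence in the GB2312 range and collects those that GB2312 decodes but GBK decodes differently (or not at all).

```python
def gb2312_gbk_mismatches():
    """Return (bytes, gb2312_char, gbk_char) for every double-byte
    sequence that the gb2312 codec decodes differently from gbk."""
    mismatches = []
    for lead in range(0xA1, 0xF8):        # GB2312 lead bytes 0xA1..0xF7
        for trail in range(0xA1, 0xFF):   # GB2312 trail bytes 0xA1..0xFE
            raw = bytes([lead, trail])
            try:
                as_gb2312 = raw.decode("gb2312")
            except UnicodeDecodeError:
                continue                  # unassigned in GB2312, skip it
            try:
                as_gbk = raw.decode("gbk")
            except UnicodeDecodeError:
                as_gbk = None             # GBK would be missing this code
            if as_gb2312 != as_gbk:
                mismatches.append((raw, as_gb2312, as_gbk))
    return mismatches


for raw, as_gb2312, as_gbk in gb2312_gbk_mismatches():
    gbk_repr = "undecodable" if as_gbk is None else "U+%04X" % ord(as_gbk)
    print(raw.hex().upper(), "gb2312: U+%04X" % ord(as_gb2312), "gbk:", gbk_repr)
```

With the tables described in this issue, the loop should report only A1A4 and A1AA.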

----
I dug into it and found that Python's GB2312 codec is using a wrong conversion table.
The file /Modules/cjkcodecs/_codecs_cn.c contains this comment block, which I paste here:

/* GBK and GB2312 map differently in few code points that are listed below:
 *
 *              gb2312                          gbk
 * A1A4         U+30FB KATAKANA MIDDLE DOT      U+00B7 MIDDLE DOT
 * A1AA         U+2015 HORIZONTAL BAR           U+2014 EM DASH
 * A844         undefined                       U+2015 HORIZONTAL BAR
 */
 
 In fact, the second column (the GB2312 column) is wrong; this column should be deleted.
 
 The four Unicode code points involved are:
 U+30FB  ・     KATAKANA MIDDLE DOT
 U+00B7  ·      MIDDLE DOT
 U+2015  ―      HORIZONTAL BAR
 U+2014  —      EM DASH

So the GB2312 codec decodes b'\xa1\xa4' to U+30FB.
U+30FB is a Japanese symbol, but it looks quite similar to U+00B7.
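
The three rows of the comment table can be reproduced from Python. This is only a sketch, assuming the built-in `gb2312` and `gbk` codecs; `describe` is a hypothetical helper, not part of the codecs API.

```python
import unicodedata

# The three byte pairs listed in the _codecs_cn.c comment block.
ROWS = [b"\xa1\xa4", b"\xa1\xaa", b"\xa8\x44"]


def describe(raw, codec):
    """Decode one byte pair and return 'U+XXXX NAME', or 'undefined'."""
    try:
        ch = raw.decode(codec)
    except UnicodeDecodeError:
        return "undefined"
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


for raw in ROWS:
    print(raw.hex().upper(), "|", describe(raw, "gb2312"), "|", describe(raw, "gbk"))
```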

I searched for "GB2312 Unicode Table" on Google. Both a right version and a wrong version circulate on the Internet, and unfortunately we are using the wrong version.

libiconv-1.14 is also using the wrong version.

----
Here is an example of the bad behavior.

Encoding U+30FB to bytes with the GBK encoder raises UnicodeEncodeError, because U+30FB is not in GBK.

In the Simplified Chinese version of Microsoft Windows, the console's default encoding is GBK[1].
If you decode b'\xa1\xa4' with the GB2312 decoder and then print the resulting U+30FB to the console, UnicodeEncodeError is raised.
Since DASH is a common character, this bug is annoying.
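
The failure does not need a Windows console to reproduce. The sketch below assumes the built-in codecs, with re-encoding to GBK standing in for the console write; `reencode_as_gbk` is a hypothetical helper name.

```python
def reencode_as_gbk(raw):
    """Decode bytes as GB2312, then re-encode the text as GBK.

    Returns the GBK bytes, or None if the decoded text cannot be
    represented in GBK (the UnicodeEncodeError case described above).
    """
    text = raw.decode("gb2312")
    try:
        return text.encode("gbk")
    except UnicodeEncodeError:
        return None


print(reencode_as_gbk(b"\xa1\xa4"))   # the middle-dot position
print(reencode_as_gbk(b"\xa1\xaa"))   # the dash position
```

Note that the second case is quieter: it does not raise, but the bytes that come back differ from the bytes that went in.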

----
If we fix this, I don't know how many disasters will happen.
However, if we don't fix this, it's a bug.

I have already made a patch, but I think we need to discuss whether we should fix this.

-----------------------
Note:
[1] In fact, the console's default encoding is cp936, which is almost but not entirely the same as GBK. Using GBK here is not a problem.
msg241859 - (view) Author: Ma Lin (Ma Lin) * Date: 2015-04-23 09:43
"Since MIDDLE DOT is a common character, this bug is annoying."
Sorry, it's MIDDLE DOT, not DASH.
msg241909 - (view) Author: Ma Lin (Ma Lin) * Date: 2015-04-24 04:19
Today I investigated these popular programming languages and libraries, all at their latest versions.

iconv-1.14          wrong version
php-5.6.8           wrong version (php is using iconv)
ActivePerl-5.20.2   wrong version
GoLang-1.4.2        no GB2312, only has GBK/GB18030 (golang.org/x/text/encoding)
Java 1.7.0_79-b15   wrong version (java.nio.charset)
.Net 2013           right version

It seems Python should stay with the wrong version.
Very sorry for wasting your time.

Appendix A:
/* The right version's table should be:
 *
 *              gb2312                          gbk
 * A1A4         U+00B7 MIDDLE DOT               U+00B7 MIDDLE DOT
 * A1AA         U+2014 EM DASH                  U+2014 EM DASH
 * A844         undefined                       U+2015 HORIZONTAL BAR
 */
 
Appendix B:
Advice for final user:
1. Use GBK as much as possible.
2. Be careful when exchanging data between GB2312 and GBK/GB18030.
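
One way to follow this advice in code, as a sketch only: either decode GB2312 data with the `gbk` codec, or map the two disputed characters after decoding. Both helper names and the translation table are hypothetical. The second approach is lossy for text that genuinely contains U+30FB or U+2015.

```python
# Assumption: the only disputed positions are A1A4 and A1AA, as listed
# in this issue. Map the gb2312 codec's results to the GBK ones.
_GB2312_QUIRKS = str.maketrans({"\u30fb": "\u00b7", "\u2015": "\u2014"})


def decode_gb2312_as_gbk(raw):
    """Decode GB2312 bytes with the gbk codec (advice 1)."""
    return raw.decode("gbk")


def normalize_gb2312_text(text):
    """Rewrite text already decoded with the gb2312 codec so that it
    re-encodes to the same bytes under GBK (advice 2)."""
    return text.translate(_GB2312_QUIRKS)


raw = b"\xa1\xa4\xa1\xaa"
fixed = normalize_gb2312_text(raw.decode("gb2312"))
print(fixed == decode_gb2312_as_gbk(raw), fixed.encode("gbk") == raw)
```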
msg241920 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2015-04-24 07:43
Hi Ma Lin,

thank you for your investigation. In order to fix these tables, we'd need an official reference which shows that there is in fact an error. If most programming languages you have tested use the "wrong" version, then maybe it's not wrong after all :-)

Adding a new mapping for A844 should not be a problem and the other mappings don't seem to introduce much difference in terms of how the glyphs look. However, by changing such mappings we'd break roundtrip safety.
msg241938 - (view) Author: Ma Lin (Ma Lin) * Date: 2015-04-24 12:41
Andre Lemburg,

We don't need any modification here: A844 is in GBK but not in GB2312, so there is no need to add it to GB2312.

Your logic is right; it's hard to judge which one is wrong.
But U+30FB (・ KATAKANA MIDDLE DOT) and U+2015 (― HORIZONTAL BAR) have no reason to be among these common Chinese punctuation symbols.
A1A2-A1B7:
、	。	・	ˉ	ˇ	¨	〃	々	―	~	‖	…	‘	’	“	”	〔	〕	〈	〉	《	》	

If they were U+00B7 (· MIDDLE DOT) and U+2014 (— EM DASH), this section would look more reasonable.

GB2312 was published in the early 1980s; it seems there was a historical accident.
Luckily, most programming languages are on the same side.
msg257454 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-01-04 10:57
I think we can close this issue as "won't fix".

It's a bug, but one which is present in a lot of other systems as well, so we'd potentially make it impossible to write GB2312 data which is supposed to be read back by these other systems.

Ma Lin: Do you agree ?
msg257457 - (view) Author: Ma Lin (Ma Lin) * Date: 2016-01-04 11:30
I agree with you, "won't fix".
msg257458 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-01-04 11:37
Thanks, Ma Lin.
msg277926 - (view) Author: Mingye Wang (Artoria2e5) * Date: 2016-10-03 03:50
> Advice for final user:

This seems worth adding to the codecs doc as a footnote. Perhaps something like "(deprecated) ... gb2312 is an obsolete encoding from the 1980s. Use gbk or gb18030 instead." will do.

> libiconv-1.14 is also using the wrong version.

Just a side note on the right/wrongfulness of libiconv: I have reported the GB18030 incompatibility as a libiconv bug.[1] From the replies, I learnt that 1) what libiconv is using currently is a then-official mapping published on ftp.unicode.org; 2) vendor implementations of gb2312 differed historically. I have updated the corresponding section[2] on Wikipedia to include these old references.
  [1]: https://lists.gnu.org/archive/html/bug-gnu-libiconv/2016-09/msg00004.html
  [2]: https://en.wikipedia.org/wiki/GB_2312#Two_implementations_of_GB2312

Still, being old and common does not necessarily mean being correct, as Ma Lin has demonstrated by showing the character semantics. To reflect this in a better-supported manner, I have added names for the glyphs in question from GB2312-80 to [2].
History
Date User Action Args
2016-10-03 03:50:57 Artoria2e5 set nosy: + Artoria2e5
messages: + msg277926
2016-01-04 11:37:52 lemburg set status: open -> closed
resolution: wont fix
messages: + msg257458
2016-01-04 11:30:55 Ma Lin set messages: + msg257457
2016-01-04 10:57:27 lemburg set messages: + msg257454
2016-01-02 08:38:31 ezio.melotti set stage: patch review
2015-04-24 12:41:23 Ma Lin set messages: + msg241938
2015-04-24 07:43:53 lemburg set messages: + msg241920
2015-04-24 04:19:48 Ma Lin set messages: + msg241909
2015-04-23 13:43:22 serhiy.storchaka set nosy: + lemburg, loewis
2015-04-23 09:43:55 Ma Lin set messages: + msg241859
2015-04-23 09:29:41 Ma Lin create