classification
Title: ks_c-5601-1987 is used by microsoft when it really means cp949
Type: enhancement Stage:
Components: email, Unicode Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: barry, ezio.melotti, hyeshik.chang, lemburg, r.david.murray
Priority: normal Keywords:

Created on 2013-08-01 23:54 by r.david.murray, last changed 2013-08-02 08:23 by lemburg.

Messages (2)
msg194139 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-08-01 23:54
When Microsoft handles Korean text, it uses its own code page, cp949, which is a superset of ks_c-5601-1987.  But when talking to the rest of the world, it claims that the character set name is ks_c-5601-1987.  This means that text claimed to be in ks_c-5601-1987 in email messages (and probably on web pages) can't always be decoded using the codec that ks_c-5601-1987 maps to (euc_kr). [*]

This problem shows up in the real world in email.  If non-euc_kr characters are used, email will blow up when trying to decode the ostensibly ks_c-5601-1987 text.  (I'm not sure whether it will also blow up when trying to encode it, since I'm not sure which characters the two codecs cover.)
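A minimal sketch of the decode failure described above, using only the standard codecs. It encodes every precomposed Hangul syllable with cp949 and collects those whose bytes the euc_kr codec rejects. The helper name undecodable_as_euc_kr is made up for illustration and is not part of the email package.

```python
import codecs

# The declared charset name resolves to a codec through the alias table.
print(codecs.lookup("ks_c-5601-1987").name)


def undecodable_as_euc_kr():
    """Return Hangul syllables whose cp949 bytes the euc_kr codec rejects.

    Hypothetical helper, for illustration only.
    """
    failures = []
    for code_point in range(0xAC00, 0xD7A4):  # precomposed Hangul syllables
        ch = chr(code_point)
        raw = ch.encode("cp949")
        try:
            raw.decode("euc_kr")
        except UnicodeDecodeError:
            failures.append(ch)
    return failures


failures = undecodable_as_euc_kr()
print(len(failures), "of", 0xD7A4 - 0xAC00, "syllables fail to decode as euc_kr")
```

Any syllable in the returned list is text that a cp949-producing mailer can send under the ks_c-5601-1987 label but that the euc_kr codec cannot decode.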

I'm not sure what the best solution is, but one possibility would be to add a "fixup" table to email that would cause it to decode ostensibly ks_c-5601-1987 text using the cp949 codec.  Since cp949 is a superset, this should at least solve the input side.
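The "fixup" table idea could look roughly like the sketch below. This is an assumption about the shape of such a table, not existing email package code: CHARSET_FIXUPS and decode_declared are hypothetical names, and the table holds only the one mapping discussed here.

```python
# Hypothetical fixup table: normalized declared charset -> codec to use instead.
CHARSET_FIXUPS = {
    "ks_c_5601_1987": "cp949",
}


def decode_declared(raw: bytes, charset: str) -> str:
    """Decode raw bytes, substituting a superset codec for known mislabels."""
    # Normalize the declared name so "ks_c-5601-1987" matches the table key.
    key = charset.lower().replace("-", "_")
    codec = CHARSET_FIXUPS.get(key, charset)
    return raw.decode(codec)


# Bytes produced by a cp949 sender but labelled ks_c-5601-1987.
sample = "똠방각하".encode("cp949")
print(decode_declared(sample, "ks_c-5601-1987"))
```

Because the substitution only applies on input, this addresses decoding; what to emit when encoding outgoing text under that label is a separate question.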

[*] Some relevant standards discussion:
   
    http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html

From this, it isn't clear why we map ks_c-5601-1987 to euc_kr, since they at least appear to be different codecs.  I haven't looked at the relevant RFCs to see what the differences are, though.
msg194163 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-08-02 08:23
The alias was added by Hye-Shik Chang:

http://hg.python.org/cpython-fullhistory/annotate/887ce39f95f2/Lib/encodings/aliases.py#198

I've added him to the nosy list.

If the alias doesn't match, we'd have to add a codec for the mismatching encoding to maintain compatibility (provided the mismatching encoding is still in use).
History
Date User Action Args
2013-08-02 08:23:53 lemburg set nosy: + lemburg, hyeshik.chang; messages: + msg194163
2013-08-01 23:54:04 r.david.murray create