
Author r.david.murray
Recipients barry, ezio.melotti, r.david.murray
Date 2013-08-01.23:54:03
Message-id <1375401244.03.0.758628896825.issue18625@psf.upfronthosting.co.za>
In-reply-to
Content
When Microsoft handles Korean text, it uses its own code page, cp949, which is a superset of ks_c-5601-1987.  But when talking to the rest of the world, it claims that the character set name is ks_c-5601-1987.  This means that text claimed to be in ks_c-5601-1987 in email messages (and probably on web pages) can't always be decoded using the codec that ks_c-5601-1987 maps to (euc_kr). [*]

This problem shows up in the real world in email.  If non-euc_kr characters are used, email will blow up when trying to decode the ostensibly ks_c-5601-1987 text.  (I'm not sure whether it will also blow up trying to encode it; I'm not sure which characters the two codecs cover.)
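
A minimal sketch of the failure, outside of email entirely. It assumes that U+B620 is one of the Hangul syllables that cp949 covers but euc_kr does not; that sample character is my choice for illustration, not something taken from a real message.

```python
import codecs

# The charset name as it appears in the message headers.
declared = "ks_c-5601-1987"

# Python resolves that name to the euc_kr codec.
resolved = codecs.lookup(declared).name

# Assumption: U+B620 is covered by cp949 but is outside the euc_kr repertoire.
text = "\ub620"
raw = text.encode("cp949")

# Decoding the cp949 bytes with the declared charset is what fails.
try:
    raw.decode(declared)
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Decoding the same bytes with cp949 recovers the original text.
roundtrip = raw.decode("cp949")

print(resolved, decode_failed, roundtrip == text)
```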

I'm not sure what the best solution is, but one possibility would be to add a "fixup" table to email that would cause it to decode ostensibly ks_c-5601-1987 text using the cp949 codec.  Since cp949 is a superset, this should at least solve the input side.
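
A rough sketch of what such a fixup table could look like. The names `DECODE_FIXUPS` and `decode_with_fixup` are hypothetical and are not part of the email package; this only covers the decoding (input) side, and it keys on the declared name rather than the resolved codec so that text explicitly labelled euc-kr is left alone.

```python
# Hypothetical fixup table: normalized declared charset -> codec to decode with.
DECODE_FIXUPS = {
    "ks_c_5601_1987": "cp949",
}


def decode_with_fixup(raw, declared_charset):
    """Decode raw bytes, overriding the declared charset if a fixup exists."""
    # Normalize the declared name so that "KS_C-5601-1987" and
    # "ks_c_5601_1987" hit the same table entry.
    key = declared_charset.strip().lower().replace("-", "_")
    codec_name = DECODE_FIXUPS.get(key, declared_charset)
    return raw.decode(codec_name)


# Bytes produced by cp949 but labelled ks_c-5601-1987 now decode.
sample = "\ub620\ubc29"
decoded = decode_with_fixup(sample.encode("cp949"), "ks_c-5601-1987")

# Genuine euc_kr bytes still decode, since cp949 is a superset.
plain = decode_with_fixup("\ud55c\uad6d\uc5b4".encode("euc_kr"), "KS_C-5601-1987")

# Charsets without a fixup entry are decoded as declared.
other = decode_with_fixup("abc".encode("utf-8"), "utf-8")
```

Whether the override should apply only to this one name or to every alias of euc_kr is an open design question; the sketch takes the narrower option.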

[*] Some relevant standards discussion:
   
    http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html

From this, it isn't clear why we map ks_c-5601-1987 to euc_kr, since they at least appear to be different codecs.  I haven't looked at the relevant RFCs to see what the differences are, though.
History
Date                 User            Action  Args
2013-08-01 23:54:04  r.david.murray  set     recipients: + r.david.murray, barry, ezio.melotti
2013-08-01 23:54:04  r.david.murray  set     messageid: <1375401244.03.0.758628896825.issue18625@psf.upfronthosting.co.za>
2013-08-01 23:54:03  r.david.murray  link    issue18625 messages
2013-08-01 23:54:03  r.david.murray  create