Message 194139 - Python tracker


Author	r.david.murray
Recipients	barry, ezio.melotti, r.david.murray
Date	2013-08-01.23:54:03
Message-id	<1375401244.03.0.758628896825.issue18625@psf.upfronthosting.co.za>
In-reply-to

Content
When Microsoft handles Korean text, it uses its own code page, cp949, which is a superset of ks_c-5601-1987.  But when talking to the rest of the world, it claims that the character set name is ks_c-5601-1987.  This means that text claimed to be in ks_c-5601-1987 in email messages (and probably on web pages) can't always be decoded using the codec that ks_c-5601-1987 maps to (euc_kr). [*]

This problem shows up in the real world in email.  If non-euc_kr characters are used, email will blow up when trying to decode the ostensibly ks_c-5601-1987 text.  (I'm not sure whether it will also blow up trying to encode it, since I'm not sure what characters the two codecs cover.)
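
A minimal sketch of the decode failure, using only the standard codecs.  The sample string and the helper name try_decode are illustrative assumptions, not taken from a real message; the first syllable is assumed to be one that cp949 covers but the KS X 1001 set behind euc_kr does not.

```python
# Sample text (an assumption for illustration): the first syllable is
# assumed to exist in cp949 but not in the KS X 1001 repertoire.
SAMPLE = "똠방각하"

# Bytes as a cp949-based sender would produce them, even though the
# message would be labelled ks_c-5601-1987.
RAW = SAMPLE.encode("cp949")


def try_decode(data, codec):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


print(try_decode(RAW, "cp949"))   # decodes with the codec the sender used
print(try_decode(RAW, "euc_kr"))  # rejected by the codec the label maps to
```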

I'm not sure what the best solution is, but one possibility would be to add a "fixup" table to email that would cause it to decode ostensibly ks_c-5601-1987 text using the cp949 codec.  Since cp949 is a superset, this should at least solve the input side.
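
A rough sketch of what such a "fixup" table might look like.  The names DECODE_FIXUPS, _normalize and decode_payload are hypothetical and are not part of the email package; this only illustrates the idea of overriding the declared charset on the input side.

```python
# Hypothetical fixup table: normalized declared charset name -> codec
# to actually use when decoding.  The entries are assumptions.
DECODE_FIXUPS = {
    "ks_c_5601_1987": "cp949",
}


def _normalize(name):
    """Lower-case the charset name and treat '-' and '_' as equivalent."""
    return name.lower().replace("-", "_")


def decode_payload(data, declared_charset):
    """Decode data, substituting the fixup codec when one is registered."""
    codec = DECODE_FIXUPS.get(_normalize(declared_charset), declared_charset)
    return data.decode(codec)


# Bytes produced with cp949 but labelled ks_c-5601-1987 now decode.
raw = "똠방각하".encode("cp949")
print(decode_payload(raw, "ks_c-5601-1987"))
```

Since cp949 is a superset, text that really is euc_kr should pass through this unchanged; the sketch does nothing for the output side.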

[*] Some relevant standards discussion:
   
    http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html

From this, it isn't clear why we map ks_c-5601-1987 to euc_kr, since they at least appear to be different codecs.  I haven't looked at the relevant RFCs to see what the differences are, though.

History
Date	User	Action	Args
2013-08-01 23:54:04	r.david.murray	set	recipients: + r.david.murray, barry, ezio.melotti
2013-08-01 23:54:04	r.david.murray	set	messageid: <1375401244.03.0.758628896825.issue18625@psf.upfronthosting.co.za>
2013-08-01 23:54:03	r.david.murray	link	issue18625 messages
2013-08-01 23:54:03	r.david.murray	create