Issue 18625: ks_c-5601-1987 is used by microsoft when it really means cp949

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/62825

classification

Title:	ks_c-5601-1987 is used by microsoft when it really means cp949
Type:	enhancement	Stage:
Components:	email, Unicode	Versions:	Python 3.4

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	barry, ezio.melotti, hyeshik.chang, lemburg, r.david.murray
Priority:	normal	Keywords:

Created on 2013-08-01 23:54 by r.david.murray, last changed 2022-04-11 14:57 by admin.

Messages (2)
msg194139	Author: R. David Murray (r.david.murray)	Date: 2013-08-01 23:54
When Microsoft handles Korean text, it uses its own code page, cp949, which is a superset of ks_c-5601-1987. But when talking to the rest of the world, it claims that the character set name is ks_c-5601-1987. This means that text claimed to be in ks_c-5601-1987 in email messages (and probably on web pages) can't always be decoded using the codec that ks_c-5601-1987 maps to (euc_kr).

This problem shows up in the real world in email. If non-euc_kr characters are used, email will blow up when trying to decode the ostensibly ks_c-5601-1987 text. (I'm not sure whether it will also blow up when trying to encode it, since I'm not sure what characters the two codecs cover.) I'm not sure what the best solution is, but one possibility would be to add a "fixup" table to email that would cause it to decode ostensibly ks_c-5601-1987 text using the cp949 codec. Since cp949 is a superset, this should at least solve the input side.

Some relevant standards discussion: http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html

From this, it isn't clear why we map ks_c-5601-1987 to euc_kr, since they at least appear to be different codecs. I haven't looked at the relevant RFCs to see what the differences are, though.
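
To make the failure and the proposed "fixup" table concrete, here is a minimal sketch. It is only an illustration of the idea and not how the email package is implemented: the names CHARSET_FIXUPS, normalize and decode_declared are hypothetical, and the sample string was chosen on the assumption that the syllable '똠' (U+B620) is covered by cp949 but lies outside the set that euc_kr decodes.

```python
import codecs

# Hypothetical fixup table: normalized declared charset -> codec to really use.
CHARSET_FIXUPS = {"ks_c_5601_1987": "cp949"}


def normalize(charset):
    """Lowercase the declared name and treat '-' and '_' alike (illustrative only)."""
    return charset.strip().lower().replace("-", "_")


def decode_declared(data, charset):
    """Decode bytes using the declared charset, unless a fixup entry overrides it."""
    codec = CHARSET_FIXUPS.get(normalize(charset), charset)
    return data.decode(codec)


# Bytes as a cp949-based sender would produce them, labelled ks_c-5601-1987.
sample = "똠방각하".encode("cp949")

# Decoding under the declared name goes through the euc_kr codec.
try:
    sample.decode("ks_c-5601-1987")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Decoding through the fixup table uses cp949 instead.
fixed = decode_declared(sample, "ks_c-5601-1987")
print(codecs.lookup("ks_c-5601-1987").name, strict_ok, fixed)
```

Charsets without an entry in the table fall through to the declared codec unchanged, so a table like this would only affect the input side for the mislabelled name.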
msg194163	Author: Marc-Andre Lemburg (lemburg)	Date: 2013-08-02 08:23
The alias was added by Hye-Shik Chang: http://hg.python.org/cpython-fullhistory/annotate/887ce39f95f2/Lib/encodings/aliases.py#198

I've added him to the nosy list. If the alias doesn't match, we'd have to add a codec for the mismatching encoding to maintain compatibility (provided the mismatching encoding is still in use).

History
Date	User	Action	Args
2022-04-11 14:57:48	admin	set	github: 62825
2013-08-02 08:23:53	lemburg	set	nosy: + lemburg, hyeshik.chang messages: + msg194163
2013-08-01 23:54:04	r.david.murray	create