Issue 23050: Add Japanese legacy encodings


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/67239

classification

Title:	Add Japanese legacy encodings
Type:	enhancement	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.5

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	ishimoto, lemburg, loewis, methane, r.david.murray, serhiy.storchaka, t2y, vstinner
Priority:	normal	Keywords:	patch

Created on 2014-12-14 14:34 by t2y, last changed 2022-04-11 14:58 by admin.

Files
File name	Uploaded	Description
add-japanese-legacy-encoding1.patch	t2y, 2014-12-14 14:34	Add Japanese legacy encodings

Repositories containing patches
https://bitbucket.org/t2y/cpython/#japanese-legacy-encoding

Messages (9)
msg232638 - (view)	Author: Tetsuya Morimoto (t2y) *	Date: 2014-12-14 14:34
This patch adds the Japanese legacy encodings listed below.
https://bitbucket.org/t2y/cpython/branches/compare/japanese-legacy-encoding..default

* eucjp_ms (euc-jp compatible with cp932)
* iso2022_jp_ms (yet another iso-2022-jp compatible with cp932, similar to cp50220)
* cp50220 (http://www.iana.org/assignments/charset-reg/CP50220)
* cp50221 (a variant of cp50220)
* cp50222 (a variant of cp50220)
* cp51932 (http://www.iana.org/assignments/charset-reg/CP51932)

The patch for these character encodings was originally created by Masayuki Moriyama in 2005 as the result of an IPA project. The result was contributed to several communities: libiconv, glibc, perl, PHP, Ruby, PostgreSQL, MySQL, nkf. He made a patch for Python 2.4.3 at that time, but somehow no one worked to integrate it. That's a crying shame.

These character encodings are legacy, but they are still used. Many end users don't care about the character encoding. Unfortunately, for historical reasons, e-mails are encoded with these legacy encodings on the Japanese Windows platform. In fact, my customer recently reported Mojibake because their e-mail data appears to be encoded with cp50220 (iso-2022-jp-ms).

References:
* About IPA: http://www.ipa.go.jp/english/about/summary.html
* Mojibake: http://en.wikipedia.org/wiki/Mojibake
* Java encoding names: http://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html

References in Japanese:
* Japanese Legacy Encoding Project: http://legacy-encoding.sourceforge.jp/wiki/
* Project details: http://www.ipa.go.jp/about/jigyoseika/05fy-pro/open/2005-1467d.pdf
msg232639 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-12-14 16:04
In emails these are labeled as, say, iso-2022-jp-ms? See also issue 8898 with regards to email encodings.
msg232640 - (view)	Author: Tetsuya Morimoto (t2y) *	Date: 2014-12-14 16:28
On Mon, Dec 15, 2014 at 1:04 AM, R. David Murray <report@bugs.python.org> wrote:
> In emails these are labeled as, say, iso-2022-jp-ms?

No. These are labeled just 'iso-2022-jp', and we (Japanese) choose the proper charset encoding to decode the encoded text. There are several variants of iso-2022-jp. Yes, that's very strange, but it is so for historical reasons.
http://en.wikipedia.org/wiki/ISO/IEC_2022#ISO.2FIEC_2022_character_sets

> See also issue 8898 with regards to email encodings.

Therefore, this is a different issue.
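The situation described above, where a message is labeled 'iso-2022-jp' but may actually need one of several variants, can be sketched as a fallback decoder. This is only an illustration under stated assumptions: the function name `decode_with_fallback` and the candidate order are hypothetical, and only codecs that already ship with CPython are used (the codecs proposed in this issue are not among them).

```python
# Candidate codecs that ship with CPython. The order is an assumption,
# not a recommendation from this issue.
CANDIDATES = ("iso2022_jp", "iso2022_jp_ext", "iso2022_jp_2004")


def decode_with_fallback(data, candidates=CANDIDATES):
    """Return (text, codec_name) for the first candidate that decodes data."""
    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Undecodable with this variant, or codec not available: try next.
            continue
    # Nothing decoded cleanly: use the first codec with replacement characters.
    return data.decode(candidates[0], errors="replace"), candidates[0]


if __name__ == "__main__":
    raw = "日本語".encode("iso2022_jp")
    print(decode_with_fallback(raw))
```

A codec installed from a third-party package could be added to the candidate tuple; unknown names are skipped because LookupError is caught.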
msg232668 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-12-15 16:45
> These character encodings are legacy, but are still used.

Do you have an idea of how many users still have documents stored or exchanged using these encodings?

The patch is not trivial; the legacy Japanese codecs are complex and therefore error prone :-/

For previous requests to add new codecs, we closed the issues as wontfix and suggested sharing the codecs on the Python Cheeseshop (PyPI). Here it's more complex because C code is modified to implement the new encodings.

$ diffstat issue23050_13417.diff
 Doc/library/codecs.rst                  |   16
 Lib/encodings/aliases.py                |   26
 Lib/test/test_codecencodings_iso2022.py |   59 +
 Lib/test/test_codecs.py                 |    2
 Lib/test/test_multibytecodec.py         |    6
 Lib/test/test_xml_etree.py              |    4
 Modules/cjkcodecs/_codecs_iso2022.c     |  718 ++++++++++++++++++-----
 Modules/cjkcodecs/_codecs_jp.c          |  305 +++++++++
 Modules/cjkcodecs/mappings_jp.h         |  950 ++++++++++++++++++++++---------
 Modules/cjkcodecs/multibytecodec.h      |   11
 Python/importlib.h                      |  860 ++++++++++++++--------------
 b/Lib/encodings/cp50220.py              |   39 +
 b/Lib/encodings/cp50221.py              |   39 +
 b/Lib/encodings/cp50222.py              |   39 +
 b/Lib/encodings/cp51932.py              |   39 +
 b/Lib/encodings/eucjp_ms.py             |   39 +
 b/Lib/encodings/iso2022_jp_ms.py        |   39 +
 b/Lib/test/cjkencodings/cp50220-utf8.txt |   30
 b/Lib/test/cjkencodings/cp50220.txt     |   30
 b/Modules/cjkcodecs/mappings_cp50220_k.h |   31 +
 20 files changed, 2452 insertions(+), 830 deletions(-)
msg232674 - (view)	Author: Tetsuya Morimoto (t2y) *	Date: 2014-12-15 17:21
>> These character encodings are legacy, but are still used.
>
> Do you have an idea of how many users still have documents stored or exchanged using these encodings?

Hmm, I guess the iso-2022-jp codec is still the default charset of MUAs (Mail User Agents) on the Japanese Windows platform. But I'm not sure how many, so I'll investigate; please wait a few days.

> The patch is not trivial, the legacy japanese codecs are complex and so error prone :-/

Yes, this patch includes some refactoring. However, the existing tests pass, and adding new codecs basically shouldn't affect the other codecs. Why do you think it's "error prone"?

> For previous requests to add new codecs, we closed issues as wontfix and we suggested to share the codecs at the Python Cheeseshop (PyPI). Here it's more complex because C code is modified to implement the new encodings.

Could you show me the previous requests? I understand that modified C code costs more to review. However, we have codec tests, and I think it wouldn't affect the other codecs.
msg232684 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-12-15 21:35
I refactored some parts of the CJK codecs for performance after PEP 393 was implemented. A blocking point was that these codecs have very few tests: not for valid data, but for invalid data. It may be a little better now. I tried to write a test for each branch of every if/else, to cover all cases, in the codecs that I modified.

By error prone, I mean that it's easy to introduce a bug or a regression, since the code is complex and almost nobody maintains it.

I'm not strongly opposed to any change. I'm just trying to understand the context.
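To make "tests for invalid data" concrete, here is a small sketch of such a check for the existing `iso2022_jp` codec. It is not taken from CPython's test suite; the helper name `is_rejected` and the two sample inputs are chosen for illustration only.

```python
# Byte strings that a strict iso2022_jp decoder is expected to reject.
INVALID_SAMPLES = [
    b"\xff",        # 8-bit byte in a 7-bit encoding
    b"\x1b$B\x30",  # switch to JIS X 0208, then half of a two-byte character
]


def is_rejected(data, codec="iso2022_jp"):
    """Return True if strict decoding of data raises UnicodeDecodeError."""
    try:
        data.decode(codec)
    except UnicodeDecodeError:
        return True
    return False


def accepted_by_mistake(samples=INVALID_SAMPLES, codec="iso2022_jp"):
    """Return the invalid samples that the codec decoded without an error."""
    return [sample for sample in samples if not is_rejected(sample, codec)]


if __name__ == "__main__":
    print("unexpectedly accepted:", accepted_by_mistake())
    # With a lenient error handler the same input decodes without raising.
    print(ascii(b"\xff".decode("iso2022_jp", errors="replace")))
```

A fuller test would walk every branch of the decoder (each escape sequence, each designated character set, truncated input at every position), which is the coverage this message says was missing.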
msg232685 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2014-12-15 21:48
Another traditional issue with Japanese codecs is that people have different opinions on what the encoding should do. It may be that when we release the codec, somebody comes up and says that the codec is incorrect, and it should do something different for some code points, citing some other applications which he considers right. In particular for the Microsoft ones, people may claim that some version of Windows did things differently.

Now, for this set, the ones that got registered with IANA sound ok (in the sense that it is our bug if they fail to conform to the IANA spec, and IANA's fault if they fail to do what users expect). For the other ones, I wonder whether there is some official source that can be consulted for correctness.

On a different note: why do you claim that the code is written by Perky? (it's not you, is it?)
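The kind of disagreement described here already exists between codecs in the standard library. As a sketch (the helper `compare_decodings` is hypothetical), the same two bytes can be decoded with `shift_jis` and `cp932`; to my knowledge CPython maps 0x8160 to U+301C WAVE DASH in the first and to U+FF5E FULLWIDTH TILDE in the second.

```python
def compare_decodings(data, codec_names):
    """Decode data with each codec; map codec name to text (None on error)."""
    results = {}
    for name in codec_names:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


if __name__ == "__main__":
    sample = b"\x81\x60"  # one double-byte character in Shift_JIS
    for name, text in compare_decodings(sample, ("shift_jis", "cp932")).items():
        print(name, ascii(text))
```

Running the same comparison over a whole mapping table is one way to see where a new codec differs from its nearest existing relative, which helps when checking it against an official source.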
msg232702 - (view)	Author: Tetsuya Morimoto (t2y) *	Date: 2014-12-16 02:49
> By error prone, I mean that it's easy to introduce a bug or a regression,
> since the code is complex and almost nobody maintains it.

Indeed. Actually, I encountered some faults when I migrated the original patch. Character encoding is a kind of specialty area. This patch was written by Masayuki Moriyama, who is an expert in character encoding and has been contributing to various communities for a long time. He is also helping me migrate the original patch (for Python 2.4.3) to Python 3.5. You can see in the commit log that he fixed some bugs.
https://bitbucket.org/t2y/cpython/commits/all

> I'm not strongly opposed to any change. I'm just trying to understand the
> context.

Thanks. I'll help by explaining the context.
msg232707 - (view)	Author: Tetsuya Morimoto (t2y) *	Date: 2014-12-16 04:18
> Another traditional issue with Japanese codecs is that people have different opinions on what the encoding should do. It may be that when we release the codec, somebody comes up and says that the codec is incorrect, and it should do something different for some code points, citing some other applications which he considers right. In particular for the Microsoft ones, people may claim that some version of Windows did things differently.

As for e-mail encoding, Japanese users should use utf-8, which would resolve most problems. However, for historical or compatibility reasons, things are different even today. I don't think these legacy codecs are needed inside an individual application, but we sometimes run into encoding issues when an application works with an external system such as e-mail.

> Now, for this set, the ones that got registered with IANA sound ok (in the sense that it is our bug if they fail to conform to the IANA spec, and IANA's fault if they fail to do what users expect). For the other ones, I wonder whether there is some official source that can be consulted for correctness.

Exactly. I'm now looking for the euc-jp-ms and iso-2022-jp-ms specs in English. Of course, there are volunteer-written documents in Japanese, as follows.
http://www.wdic.org/w/WDIC/eucJP-ms
http://www.wdic.org/w/WDIC/ISO-2022-JP-MS

I could agree to dropping any character encoding for which an official source is difficult to find.

> On a different note: why do you claim that the code is written by Perky? (it's not you, is it?)

Right! Because the credit belongs to him. I'm an assistant.

History
Date	User	Action	Args
2022-04-11 14:58:11	admin	set	github: 67239
2014-12-16 04:18:44	t2y	set	messages: + msg232707
2014-12-16 02:49:37	t2y	set	messages: + msg232702
2014-12-15 21:48:29	loewis	set	messages: + msg232685
2014-12-15 21:35:36	vstinner	set	messages: + msg232684
2014-12-15 17:21:59	t2y	set	messages: + msg232674
2014-12-15 16:45:57	vstinner	set	nosy: + vstinner messages: + msg232668
2014-12-14 16:55:24	serhiy.storchaka	set	nosy: + lemburg, loewis, serhiy.storchaka stage: patch review
2014-12-14 16:28:43	t2y	set	messages: + msg232640
2014-12-14 16:04:39	r.david.murray	set	nosy: + r.david.murray messages: + msg232639
2014-12-14 14:34:49	t2y	create