
classification
Title: Add Japanese legacy encodings
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: ishimoto, lemburg, loewis, methane, r.david.murray, serhiy.storchaka, t2y, vstinner
Priority: normal Keywords: patch

Created on 2014-12-14 14:34 by t2y, last changed 2022-04-11 14:58 by admin.

Files
File name: add-japanese-legacy-encoding1.patch
Uploaded: t2y, 2014-12-14 14:34
Description: Add Japanese legacy encodings
Repositories containing patches
https://bitbucket.org/t2y/cpython/#japanese-legacy-encoding
Messages (9)
msg232638 - (view) Author: Tetsuya Morimoto (t2y) * Date: 2014-12-14 14:34
This patch adds Japanese legacy encodings as below.
https://bitbucket.org/t2y/cpython/branches/compare/japanese-legacy-encoding..default

* eucjp_ms (euc-jp compatible with cp932)
* iso2022_jp_ms (yet another iso-2022-jp compatible with cp932, similar to cp50220)
* cp50220 (http://www.iana.org/assignments/charset-reg/CP50220)
* cp50221 (a variant of cp50220)
* cp50222 (a variant of cp50220)
* cp51932 (http://www.iana.org/assignments/charset-reg/CP51932)
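To check which of these names a given interpreter can already resolve, a small probe with `codecs.lookup` is enough. This is only a sketch: the `PROPOSED` list is copied from the bullets above, `cp932`, `euc_jp`, and `iso2022_jp` are existing standard-library codecs included for comparison, and the helper name `codec_available` is illustrative.

```python
import codecs


def codec_available(name):
    """Return True if this interpreter can resolve the codec name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


# Codecs that already ship with CPython.
EXISTING = ["cp932", "euc_jp", "iso2022_jp"]
# Names proposed by this patch (taken from the list above).
PROPOSED = ["eucjp_ms", "iso2022_jp_ms", "cp50220", "cp50221", "cp50222", "cp51932"]

for name in EXISTING + PROPOSED:
    status = "available" if codec_available(name) else "not available"
    print(f"{name}: {status}")
```

The result for the proposed names depends on whether the patch (or a third-party package providing these codecs) is installed.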

Originally, this character encoding patch was created by Masayuki Moriyama in 2005 as the result of an IPA project. The result was contributed to several communities: libiconv, glibc, perl, PHP, Ruby, PostgreSQL, MySQL, nkf. He also made a patch for Python 2.4.3 at that time, but somehow no one worked on integrating it. That's a crying shame.

These character encodings are legacy, but are still used. Lots of end users don't care about character encodings. Unfortunately, for historical reasons, e-mails are encoded with these legacy encodings on Japanese Windows platforms. In fact, a customer of mine recently reported mojibake because their e-mail data appears to be encoded with cp50220 (iso-2022-jp-ms).

References:

* About IPA: http://www.ipa.go.jp/english/about/summary.html
* Mojibake: http://en.wikipedia.org/wiki/Mojibake
* Java encoding names: http://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html

References in Japanese:

* Japanese Legacy Encoding Project: http://legacy-encoding.sourceforge.jp/wiki/
* Project details: http://www.ipa.go.jp/about/jigyoseika/05fy-pro/open/2005-1467d.pdf
msg232639 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-12-14 16:04
In emails these are labeled as, say, iso-2022-jp-ms?

See also issue 8898 with regards to email encodings.
msg232640 - (view) Author: Tetsuya Morimoto (t2y) * Date: 2014-12-14 16:28
On Mon, Dec 15, 2014 at 1:04 AM, R. David Murray <report@bugs.python.org> wrote:
> In emails these are labeled as, say, iso-2022-jp-ms?

No. These are labeled just 'iso-2022-jp', and we (Japanese) choose the
proper charset to decode the encoded text. There are several variants
of iso-2022-jp. Yes, that's very strange, but it is so for
historical reasons.

http://en.wikipedia.org/wiki/ISO/IEC_2022#ISO.2FIEC_2022_character_sets

> See also issue 8898 with regards to email encodings.

Therefore, this is a different issue.
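The labeling problem described in this message can be sketched as follows: bytes that arrive labeled `iso-2022-jp` may contain vendor-extension characters outside JIS X 0208, so a strict decode with the nominal codec can fail. The helper `decode_with_fallback` is hypothetical, and its candidate list is an assumption that uses only codecs present in the standard library today; it is not what the patch implements.

```python
def decode_with_fallback(data, candidates=("iso2022_jp", "iso2022_jp_ext", "iso2022_jp_2004")):
    """Try each candidate codec in order; return (text, codec_name)."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first candidate and let the "replace"
    # handler substitute U+FFFD for whatever cannot be decoded.
    return data.decode(candidates[0], errors="replace"), candidates[0]


# Plain JIS X 0208 text decodes with the first candidate.
plain = "こんにちは".encode("iso2022_jp")
print(decode_with_fallback(plain))

# Bytes as a cp932-based mailer might send them: ESC $ B, then row 13
# cell 1 (circled digit one in the NEC/Microsoft extension), then ESC ( B.
vendor = b"\x1b$B-!\x1b(B"
print(decode_with_fallback(vendor))
```

With the proposed codecs installed, a name such as `cp50220` could be placed in the candidate list; without them, the second input is likely to end up in the last-resort branch.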
msg232668 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-12-15 16:45
> These character encodings are legacy, but are still used.

Do you have an idea of how many users still have documents stored or exchanged using these encodings? The patch is not trivial: the legacy Japanese codecs are complex and therefore error prone :-/

For previous requests to add new codecs, we closed the issues as wontfix and suggested sharing the codecs on the Python Cheeseshop (PyPI). Here it's more complex because C code is modified to implement the new encodings.

$ diffstat issue23050_13417.diff 
 Doc/library/codecs.rst                   |   16 
 Lib/encodings/aliases.py                 |   26 
 Lib/test/test_codecencodings_iso2022.py  |   59 +
 Lib/test/test_codecs.py                  |    2 
 Lib/test/test_multibytecodec.py          |    6 
 Lib/test/test_xml_etree.py               |    4 
 Modules/cjkcodecs/_codecs_iso2022.c      |  718 ++++++++++++++++++-----
 Modules/cjkcodecs/_codecs_jp.c           |  305 +++++++++
 Modules/cjkcodecs/mappings_jp.h          |  950 ++++++++++++++++++++++---------
 Modules/cjkcodecs/multibytecodec.h       |   11 
 Python/importlib.h                       |  860 ++++++++++++++--------------
 b/Lib/encodings/cp50220.py               |   39 +
 b/Lib/encodings/cp50221.py               |   39 +
 b/Lib/encodings/cp50222.py               |   39 +
 b/Lib/encodings/cp51932.py               |   39 +
 b/Lib/encodings/eucjp_ms.py              |   39 +
 b/Lib/encodings/iso2022_jp_ms.py         |   39 +
 b/Lib/test/cjkencodings/cp50220-utf8.txt |   30 
 b/Lib/test/cjkencodings/cp50220.txt      |   30 
 b/Modules/cjkcodecs/mappings_cp50220_k.h |   31 +
 20 files changed, 2452 insertions(+), 830 deletions(-)
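For the PyPI route mentioned in this message, a third-party package can hook into codec lookup with `codecs.register` without touching the interpreter. The sketch below is an illustration only: the codec name `xdemojpms` and the function `search_demo_codec` are hypothetical, and the existing `iso2022_jp` codec is used as a stand-in where a real package would supply its own conversion tables (possibly in a C extension).

```python
import codecs

# Hypothetical codec name; it avoids hyphens and spaces so that name
# normalization in codecs.lookup() does not alter it.
DEMO_NAME = "xdemojpms"


def search_demo_codec(name):
    """Codec search function: return a CodecInfo for DEMO_NAME, else None."""
    if name != DEMO_NAME:
        return None
    # Stand-in only: reuse the stdlib iso2022_jp codec. A real package
    # would provide its own encode/decode implementations here.
    base = codecs.lookup("iso2022_jp")
    return codecs.CodecInfo(
        name=DEMO_NAME,
        encode=base.encode,
        decode=base.decode,
        incrementalencoder=base.incrementalencoder,
        incrementaldecoder=base.incrementaldecoder,
        streamreader=base.streamreader,
        streamwriter=base.streamwriter,
    )


codecs.register(search_demo_codec)

encoded = "日本語".encode(DEMO_NAME)
print(encoded.decode(DEMO_NAME))
```

Once the search function is registered, `str.encode`, `bytes.decode`, and `open(..., encoding=...)` can all resolve the new name.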
msg232674 - (view) Author: Tetsuya Morimoto (t2y) * Date: 2014-12-15 17:21
>> These character encodings are legacy, but are still used.
>
> Do you have an idea of how many users still have documents stored or exchanged using these encodings?

Hmm, I guess iso-2022-jp is still the default charset of MUAs (Mail
User Agents) on Japanese Windows platforms. But I'm not sure how many
users that is, so I'll investigate. Please wait a few days.

> The patch is not trivial: the legacy Japanese codecs are complex and therefore error prone :-/

Yes, this patch includes some refactoring. However, the existing tests
pass, and adding new codecs basically wouldn't affect the other
codecs. Why do you think it's "error prone"?

> For previous requests to add new codecs, we closed the issues as wontfix and suggested sharing the codecs on the Python Cheeseshop (PyPI). Here it's more complex because C code is modified to implement the new encodings.

Could you show me the previous requests? I understand that modified C
code costs more to review. However, we have codec tests, and I think the
change wouldn't affect other codecs.
msg232684 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-12-15 21:35
I refactored some parts of the CJK codecs for performance after PEP 393
was implemented. A blocking point was that these codecs had very few tests:
not for valid data, but for invalid data. It may be a little bit better now. I
tried to write a test for each path in if/else, to test all cases, in the
codecs that I modified.

By error prone, I mean that it's easy to introduce a bug or a regression,
since the code is complex and almost nobody maintains it.

I'm not strongly opposed to any change. I'm just trying to understand the
context.
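The kind of invalid-data coverage described in this message can be sketched with `unittest`. The example uses the existing `euc_jp` codec as a stand-in, and the byte values are chosen for illustration; these are not the tests from the patch or from CPython's test suite.

```python
import unittest


class InvalidInputTests(unittest.TestCase):
    """Exercise the error paths of a multibyte codec (euc_jp as a stand-in)."""

    def test_illegal_byte_strict(self):
        # 0xFF is not a valid lead byte in EUC-JP.
        with self.assertRaises(UnicodeDecodeError):
            b"a\xffb".decode("euc_jp")

    def test_illegal_byte_replace(self):
        # The "replace" handler substitutes U+FFFD for the bad byte.
        self.assertEqual(b"a\xffb".decode("euc_jp", "replace"), "a\ufffdb")

    def test_illegal_byte_ignore(self):
        # The "ignore" handler drops the bad byte.
        self.assertEqual(b"a\xffb".decode("euc_jp", "ignore"), "ab")

    def test_truncated_sequence(self):
        # Only the first byte of a two-byte character is present.
        data = "あ".encode("euc_jp")[:1]
        with self.assertRaises(UnicodeDecodeError):
            data.decode("euc_jp")

    def test_unencodable_character(self):
        # An emoji has no mapping in EUC-JP.
        with self.assertRaises(UnicodeEncodeError):
            "\U0001F600".encode("euc_jp")


# Run the suite without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(InvalidInputTests)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

Each new codec would need analogous cases for its own escape sequences and extension rows, ideally one per branch in the C decoder.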
msg232685 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2014-12-15 21:48
Another traditional issue with Japanese codecs is that people have different opinions on what the encoding should do. It may be that when we release the codec, somebody comes up and says that the codec is incorrect, and it should do something different for some code points, citing some other applications which he considers right. In particular for the Microsoft ones, people may claim that some version of Windows did things differently.

Now, for this set, the ones that got registered with IANA sound ok (in the sense that it is our bug if they fail to conform to the IANA spec, and IANA's fault if they fail to do what users expect). For the other ones, I wonder whether there is some official source that can be consulted for correctness.

On a different note: why do you claim that the code is written by Perky? (it's not you, is it?)
msg232702 - (view) Author: Tetsuya Morimoto (t2y) * Date: 2014-12-16 02:49
> By error prone, I mean that it's easy to introduce a bug or a regression,
> since the code is complex and almost nobody maintains it.

Indeed. Actually, I encountered some faults when I migrated the original
patch. Character encoding is a specialist area. This patch
was written by Masayuki Moriyama, who is an expert in character
encodings and has been contributing to various communities for a
long time. He is also helping me migrate the original patch (for Python
2.4.3) to Python 3.5. You can see in the commit log that he fixed some bugs.
https://bitbucket.org/t2y/cpython/commits/all

> I'm not strongly opposed to any change. I'm just trying to understand the
> context.

Thanks. I'll help by explaining the context.
msg232707 - (view) Author: Tetsuya Morimoto (t2y) * Date: 2014-12-16 04:18
> Another traditional issue with Japanese codecs is that people have different opinions on what the encoding should do. It may be that when we release the codec, somebody comes up and says that the codec is incorrect, and it should do something different for some code points, citing some other applications which he considers right. In particular for the Microsoft ones, people may claim that some version of Windows did things differently.

With regard to e-mail encoding, Japanese users should use utf-8, which
would resolve most problems. However, for historical or
compatibility reasons, that is still not the case today. I don't think these
legacy codecs are needed within an individual application, but we sometimes
encounter encoding issues when an application interacts with an
external system such as e-mail.

> Now, for this set, the ones that got registered with IANA sound ok (in the sense that it is our bug if they fail to conform to the IANA spec, and IANA's fault if they fail to do what users expect). For the other ones, I wonder whether there is some official source that can be consulted for correctness.

Exactly. I'm now looking for euc-jp-ms and iso-2022-jp-ms specs in
English. There are unofficial, volunteer-written documents in Japanese, as
follows.
http://www.wdic.org/w/WDIC/eucJP-ms
http://www.wdic.org/w/WDIC/ISO-2022-JP-MS

I could agree to dropping any character encoding for which it is
difficult to find an official source.

> On a different note: why do you claim that the code is written by Perky? (it's not you, is it?)

Right! Because the credit belongs to him. I'm an assistant.
History
Date User Action Args
2022-04-11 14:58:11 admin set github: 67239
2014-12-16 04:18:44 t2y set messages: + msg232707
2014-12-16 02:49:37 t2y set messages: + msg232702
2014-12-15 21:48:29 loewis set messages: + msg232685
2014-12-15 21:35:36 vstinner set messages: + msg232684
2014-12-15 17:21:59 t2y set messages: + msg232674
2014-12-15 16:45:57 vstinner set nosy: + vstinner; messages: + msg232668
2014-12-14 16:55:24 serhiy.storchaka set nosy: + lemburg, loewis, serhiy.storchaka; stage: patch review
2014-12-14 16:28:43 t2y set messages: + msg232640
2014-12-14 16:04:39 r.david.murray set nosy: + r.david.murray; messages: + msg232639
2014-12-14 14:34:49 t2y create