Issue23050
Created on 2014-12-14 14:34 by t2y, last changed 2022-04-11 14:58 by admin.
Files | ||
---|---|---|
File name | Uploaded | Description |
add-japanese-legacy-encoding1.patch | t2y, 2014-12-14 14:34 | Add Japanese legacy encodings |
Repositories containing patches | |||
---|---|---|---|
https://bitbucket.org/t2y/cpython/#japanese-legacy-encoding |
Messages (9) | |||
---|---|---|---|
msg232638 | Author: Tetsuya Morimoto (t2y) | Date: 2014-12-14 14:34 | |
This patch adds the Japanese legacy encodings listed below. https://bitbucket.org/t2y/cpython/branches/compare/japanese-legacy-encoding..default * eucjp_ms (euc-jp compatible with cp932) * iso2022_jp_ms (yet another iso-2022-jp compatible with cp932, similar to cp50220) * cp50220 (http://www.iana.org/assignments/charset-reg/CP50220) * cp50221 (a variant of cp50220) * cp50222 (a variant of cp50220) * cp51932 (http://www.iana.org/assignments/charset-reg/CP51932) The original patch for these character encodings was created by Masayuki Moriyama in 2005 as the result of an IPA project. The results were contributed to several communities: libiconv, glibc, perl, PHP, Ruby, PostgreSQL, MySQL, nkf. He made a patch for Python 2.4.3 at that time, but somehow no one worked to integrate it. That's a crying shame. These character encodings are legacy, but they are still used. Many end users don't care about character encodings. Unfortunately, for historical reasons, e-mails are encoded with these legacy encodings on the Japanese Windows platform. In fact, a customer of mine recently reported mojibake because their e-mail data appears to be encoded with cp50220 (iso-2022-jp-ms). References: * About IPA: http://www.ipa.go.jp/english/about/summary.html * Mojibake: http://en.wikipedia.org/wiki/Mojibake * Java encoding names: http://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html References in Japanese: * Japanese Legacy Encoding Project: http://legacy-encoding.sourceforge.jp/wiki/ * Project details: http://www.ipa.go.jp/about/jigyoseika/05fy-pro/open/2005-1467d.pdf |
|||
msg232639 | Author: R. David Murray (r.david.murray) | Date: 2014-12-14 16:04 | |
In emails these are labeled as, say, iso-2022-jp-ms? See also issue 8898 with regards to email encodings. |
|||
msg232640 | Author: Tetsuya Morimoto (t2y) | Date: 2014-12-14 16:28 | |
On Mon, Dec 15, 2014 at 1:04 AM, R. David Murray <report@bugs.python.org> wrote: > In emails these are labeled as, say, iso-2022-jp-ms? No. These are labeled just 'iso-2022-jp', and we (Japanese) choose the proper charset to decode the encoded text. You can see that there are several variants of iso-2022-jp. Yes, that's very strange, but there are historical reasons for it. http://en.wikipedia.org/wiki/ISO/IEC_2022#ISO.2FIEC_2022_character_sets > See also issue 8898 with regards to email encodings. Therefore, this is a different issue. |
|||
msg232668 | Author: STINNER Victor (vstinner) | Date: 2014-12-15 16:45 | |
> These character encodings are legacy, but are still used. Do you have an idea of how many users still have documents stored or exchanged using these encodings? The patch is not trivial, the legacy japanese codecs are complex and so error prone :-/ For previous requests to add new codecs, we closed issues as wontfix and we suggested to share the codecs at the Python Cheeseshop (PyPI). Here it's more complex because C code is modified to implement the new encodings. $ diffstat issue23050_13417.diff Doc/library/codecs.rst | 16 Lib/encodings/aliases.py | 26 Lib/test/test_codecencodings_iso2022.py | 59 + Lib/test/test_codecs.py | 2 Lib/test/test_multibytecodec.py | 6 Lib/test/test_xml_etree.py | 4 Modules/cjkcodecs/_codecs_iso2022.c | 718 ++++++++++++++++++----- Modules/cjkcodecs/_codecs_jp.c | 305 +++++++++ Modules/cjkcodecs/mappings_jp.h | 950 ++++++++++++++++++++++--------- Modules/cjkcodecs/multibytecodec.h | 11 Python/importlib.h | 860 ++++++++++++++-------------- b/Lib/encodings/cp50220.py | 39 + b/Lib/encodings/cp50221.py | 39 + b/Lib/encodings/cp50222.py | 39 + b/Lib/encodings/cp51932.py | 39 + b/Lib/encodings/eucjp_ms.py | 39 + b/Lib/encodings/iso2022_jp_ms.py | 39 + b/Lib/test/cjkencodings/cp50220-utf8.txt | 30 b/Lib/test/cjkencodings/cp50220.txt | 30 b/Modules/cjkcodecs/mappings_cp50220_k.h | 31 + 20 files changed, 2452 insertions(+), 830 deletions(-) |
|||
msg232674 | Author: Tetsuya Morimoto (t2y) | Date: 2014-12-15 17:21 | |
>> These character encodings are legacy, but are still used. > > Do you have an idea of how many users still have documents stored or exchanged using these encodings? Hmm, I guess iso-2022-jp is still the default charset of MUAs (Mail User Agents) on the Japanese Windows platform. But I'm not sure how many users that covers, so I'll investigate; please wait a few days. > The patch is not trivial, the legacy japanese codecs are complex and so error prone :-/ Yes, this patch includes some refactoring. However, the existing tests pass, and adding new codecs basically wouldn't affect the other codecs. Why do you think it's "error prone"? > For previous requests to add new codecs, we closed issues as wontfix and we suggested to share the codecs at the Python Cheeseshop (PyPI). Here it's more complex because C code is modified to implement the new encodings. Could you show me the previous requests? I understand that modified C code is more costly to review. However, we have codec tests, and I think the change wouldn't affect other codecs. |
|||
msg232684 | Author: STINNER Victor (vstinner) | Date: 2014-12-15 21:35 | |
I refactored some parts of the CJK codecs for performance after PEP 393 was implemented. A blocking point was that these codecs have very few tests, not for valid data but for invalid data. It may be a little bit better now. I tried to write a test for each path in if/else, to test all cases, in the codecs that I modified. By error prone, I mean that it's easy to introduce a bug or a regression, since the code is complex and almost nobody maintains it. I'm not strongly opposed to any change. I'm just trying to understand the context. |
|||
msg232685 | Author: Martin v. Löwis (loewis) | Date: 2014-12-15 21:48 | |
Another traditional issue with Japanese codecs is that people have different opinions on what the encoding should do. It may be that when we release the codec, somebody comes up and says that the codec is incorrect, and it should do something different for some code points, citing some other applications which he considers right. In particular for the Microsoft ones, people may claim that some version of Windows did things differently. Now, for this set, the ones that got registered with IANA sound ok (in the sense that it is our bug if they fail to conform to the IANA spec, and IANA's fault if they fail to do what users expect). For the other ones, I wonder whether there is some official source that can be consulted for correctness. On a different note: why do you claim that the code is written by Perky? (it's not you, is it?) |
|||
msg232702 | Author: Tetsuya Morimoto (t2y) | Date: 2014-12-16 02:49 | |
> By error prone, I mean that it's easy to introduce a bug or a regression, > since the code is complex and almost nobody maintains it. Indeed. Actually, I encountered some faults when I migrated the original patch. Character encoding is a specialist area. This patch was written by Masayuki Moriyama, who is an expert in character encoding and has contributed to various communities for a long time. He also helped me migrate the original patch (for Python 2.4.3) to Python 3.5. You can see in the commit log that he fixed some bugs. https://bitbucket.org/t2y/cpython/commits/all > I'm not strongly opposed to any change. I'm just trying to understand the > context. Thanks. I'll help by explaining the context. |
|||
msg232707 | Author: Tetsuya Morimoto (t2y) | Date: 2014-12-16 04:18 | |
> Another traditional issue with Japanese codecs is that people have different opinions on what the encoding should do. It may be that when we release the codec, somebody comes up and says that the codec is incorrect, and it should do something different for some code points, citing some other applications which he considers right. In particular for the Microsoft ones, people may claim that some version of Windows did things differently. As for e-mail encoding, Japanese users should use utf-8, which would resolve most problems. However, for historical or compatibility reasons, the situation is different even today. I don't think these legacy codecs are needed within an individual application, but we sometimes encounter encoding issues when an application interacts with an external system such as e-mail. > Now, for this set, the ones that got registered with IANA sound ok (in the sense that it is our bug if they fail to conform to the IANA spec, and IANA's fault if they fail to do what users expect). For the other ones, I wonder whether there is some official source that can be consulted for correctness. Exactly. I'm now looking for the euc-jp-ms and iso-2022-jp-ms specs in English. Of course, there are volunteer-written documents in Japanese, as follows. http://www.wdic.org/w/WDIC/eucJP-ms http://www.wdic.org/w/WDIC/ISO-2022-JP-MS I may agree to drop any character encoding for which an official source is difficult to find. > On a different note: why do you claim that the code is written by Perky? (it's not you, is it?) Right! Because the credit belongs to him. I'm an assistant. |
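As msg232640 notes, mail labeled just 'iso-2022-jp' may actually need one of several variant codecs, and the receiver has to pick one. The following is only a rough sketch of that idea, not code from the patch: the helper names `available` and `decode_with_fallback` and the candidate order are assumptions made for illustration. The first three codec names ship with CPython; `iso2022_jp_ms` and `cp50220` are the names proposed in this issue and are assumed to be missing unless a patched build or a third-party package registers them, so the sketch skips any name that `codecs.lookup` does not find.

```python
import codecs

# Candidate codecs, tried in order (the order is an assumption for this sketch).
# "iso2022_jp", "iso2022_jp_1" and "iso2022_jp_ext" ship with CPython.
# "iso2022_jp_ms" and "cp50220" are the names proposed in this issue and may
# not be registered in a given interpreter.
CANDIDATES = [
    "iso2022_jp",
    "iso2022_jp_1",
    "iso2022_jp_ext",
    "iso2022_jp_ms",
    "cp50220",
]


def available(names):
    """Return the codec names from `names` that this interpreter can look up."""
    found = []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            continue  # codec not registered here; skip it
        found.append(name)
    return found


def decode_with_fallback(data, names=CANDIDATES):
    """Decode `data` with the first available codec that accepts it.

    Returns a (text, codec_name) tuple. Re-raises the last UnicodeDecodeError
    if every available codec rejects the data.
    """
    last_error = None
    for name in available(names):
        try:
            return data.decode(name), name
        except UnicodeDecodeError as exc:
            last_error = exc
    if last_error is not None:
        raise last_error
    raise LookupError("no candidate codec is available")


if __name__ == "__main__":
    payload = "こんにちは".encode("iso2022_jp")
    text, used = decode_with_fallback(payload)
    print(used, text)
```

Data that uses vendor extensions may be rejected by the stricter standard codecs, in which case the loop moves on to the next candidate. A fallback like this only reports the first codec that accepts the bytes; it cannot tell whether that codec maps every code point the way the sender intended, which is the correctness concern raised in msg232685.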
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:58:11 | admin | set | github: 67239 |
2014-12-16 04:18:44 | t2y | set | messages: + msg232707 |
2014-12-16 02:49:37 | t2y | set | messages: + msg232702 |
2014-12-15 21:48:29 | loewis | set | messages: + msg232685 |
2014-12-15 21:35:36 | vstinner | set | messages: + msg232684 |
2014-12-15 17:21:59 | t2y | set | messages: + msg232674 |
2014-12-15 16:45:57 | vstinner | set | nosy: + vstinner messages: + msg232668 |
2014-12-14 16:55:24 | serhiy.storchaka | set | nosy: + lemburg, loewis, serhiy.storchaka stage: patch review |
2014-12-14 16:28:43 | t2y | set | messages: + msg232640 |
2014-12-14 16:04:39 | r.david.murray | set | nosy: + r.david.murray messages: + msg232639 |
2014-12-14 14:34:49 | t2y | create |