Issue2066
Created on 2008-02-11 11:58 by hyeshik.chang, last changed 2010-08-13 01:05 by haypo. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| cns11643-r1.diff.gz | hyeshik.chang, 2008-02-11 11:58 | initial patch |
| Messages (21) | |||
|---|---|---|---|
| msg62282 - (view) | Author: Hyeshik Chang (hyeshik.chang) | Date: 2008-02-11 11:58 | |
This patch adds CNS11643 support to Python's Unicode codecs. CNS11643 is a huge character set that is used in EUC-TW and ISO-2022-CN. CJKCodecs has had CNS11643 support for at least 4 years, but I dropped it when integrating CJKCodecs into Python because of its huge size. EUC-TW and ISO-2022-CN are not widely used, although they are still regarded as major encodings. In my patch, CNS11643 charset support can be disabled on light platforms by adding -DNO_CNS11643 to CFLAGS. The mapping source code for the charset is 900K, and it adds about 350K to _codecs_tw.so (on POSIX) or python26.dll (on Win32). What do you think about adding this code? |
|||
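Since the patch lets builders leave the charset out with -DNO_CNS11643, code that wants these encodings would have to check for them at run time. The following is a minimal sketch of such a check using the standard `codecs.lookup` call. The codec names `euc_tw` and `iso2022_cn` are assumptions made for illustration; the thread does not say which names the patch would register.

```python
import codecs


def codec_available(name: str) -> bool:
    """Return True if the running interpreter can look up the named codec."""
    try:
        codecs.lookup(name)
    except LookupError:
        # codecs.lookup raises LookupError for unknown encodings
        return False
    return True


if __name__ == "__main__":
    # "euc_tw" and "iso2022_cn" are hypothetical names for the proposed codecs
    for name in ("big5", "euc_tw", "iso2022_cn"):
        print(name, codec_available(name))
```

A caller could use the result to fall back to another encoding, such as Big5, when the CNS11643-based codecs were compiled out.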
| msg62283 - (view) | Author: Marc-Andre Lemburg (lemburg) | Date: 2008-02-11 12:15 | |
How often would this character set be needed? In any case, using a (pre)compiler switch is not a good idea. Please make it possible to enable/disable the support via a configure switch. |
|||
| msg62284 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) | Date: 2008-02-11 12:15 | |
In this case let's put the cjkcodecs modules in their own DLL(s) on win32. |
|||
| msg62295 - (view) | Author: Martin v. Löwis (loewis) | Date: 2008-02-11 22:57 | |
I would like to see whether a compression mechanism of the tables could be found. If all else fails, compressing with raw zlib might improve things, but before that, I think other compression techniques should be studied. I'm still -1 on ad-hoc exclusion of extension modules from pythonxy.dll. If this module is to be excluded, a general policy should be established that determines what modules get compiled separately, and an automation mechanism should be established that automates generation of appropriate build infrastructure for modules built separately under this policy. |
|||
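The "raw zlib" fallback mentioned above can be sketched as follows: a flat mapping table is compressed once and only decompressed on the first lookup. This is an illustration under stated assumptions, not the patch's code. The table contents produced by `build_demo_table` are synthetic, and `LazyTable` is a hypothetical name.

```python
import zlib
from array import array

UNMAPPED = 0xFFFE  # marker used here for "no mapping"; an assumption


def build_demo_table(size: int = 8192) -> array:
    """Build a synthetic 16-bit table: runs of code points with unmapped gaps."""
    table = array("H", [UNMAPPED] * size)
    for i in range(size):
        if i % 94 < 60:
            table[i] = 0x4E00 + i
    return table


class LazyTable:
    """Hold a zlib-compressed table and inflate it on first use."""

    def __init__(self, blob: bytes):
        self._blob = blob
        self._table = None

    def lookup(self, index: int) -> int:
        if self._table is None:
            table = array("H")
            table.frombytes(zlib.decompress(self._blob))
            self._table = table
        return self._table[index]


if __name__ == "__main__":
    raw = build_demo_table().tobytes()
    packed = zlib.compress(raw, 9)
    print(len(raw), "bytes raw,", len(packed), "bytes compressed")
```

The trade-off is that the inflated copy lives in each process's private memory, so it is no longer shared between processes the way static const C data is.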
| msg62298 - (view) | Author: Martin v. Löwis (loewis) | Date: 2008-02-11 23:08 | |
BTW, which version of CNS11643 does that implement? AFAICT, there is CNS 11643-1986 and CNS 11643-1992. Where did you get the Unicode mapping from? |
|||
| msg62300 - (view) | Author: Marc-Andre Lemburg (lemburg) | Date: 2008-02-11 23:57 | |
Some background information: http://www.cns11643.gov.tw/eng/word.jsp The most recent version appears to be: "CNS11643-2004", sometimes also called "CNS11643 version 3" or "CNS11643-3" (http://docs.hp.com/en/5991-7974/5991-7974.pdf). Here's the table for version 1 (1986): ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT Versions 1 and 2 (1992) are also included in the official Unicode Han character database (along with several other mappings): http://www.unicode.org/charts/unihan.html I couldn't find a reference to a version 3 mapping table. |
|||
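A mapping table like the ones linked above has to be parsed before it can be turned into codec data. The sketch below assumes a layout of one `0xPRRCC<TAB>0xUUUU<TAB># comment` entry per line, where the leading hex digit is the plane number. Both that layout and the two sample lines are assumptions for illustration and should be checked against the real file.

```python
from typing import Dict, Iterable, Tuple

# Illustrative lines only, in the assumed layout; not copied from the real file.
SAMPLE = """\
# CNS code<TAB>Unicode<TAB># comment
0x12121\t0x3000\t# IDEOGRAPHIC SPACE
0x14421\t0x4E00\t# <CJK>
"""


def parse_mapping(lines: Iterable[str]) -> Dict[Tuple[int, int], int]:
    """Map (plane, 16-bit code) pairs to Unicode code points."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        cns_hex, ucs_hex = line.split()[:2]
        cns = int(cns_hex, 16)
        plane, code = cns >> 16, cns & 0xFFFF
        table[(plane, code)] = int(ucs_hex, 16)
    return table


if __name__ == "__main__":
    for (plane, code), ucs in sorted(parse_mapping(SAMPLE.splitlines()).items()):
        print("plane %d, code 0x%04X -> U+%04X" % (plane, code, ucs))
```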
| msg62302 - (view) | Author: Kuang-che Wu (kcwu) | Date: 2008-02-12 02:31 | |
FYI, the new spec of CNS11643-2004 (you can find the preview via http://www.cnsonline.com.tw/, at http://www.cnsonline.com.tw/preview/preview.jsp?general_no=1164300&language=C&pagecount=524) mentions a URL, http://www.cnscode.org.tw/, on page 499, and the version 3 mapping table can be found at http://www.cnscode.org.tw/cnscode/csic_ucs.jsp |
|||
| msg62304 - (view) | Author: Hyeshik Chang (hyeshik.chang) | Date: 2008-02-12 03:25 | |
I generated the mapping table from ICU's CNS11643-1992 mapping. I see that CNS11643 is quite rarely used on the Internet, but it's the only national standard character set in Taiwan. When I asked Taiwanese Python users, even they didn't think it was necessary to add it to Python. I'll study how much compression is possible and how efficient it is, then submit a revised patch. Thank you for the comments! |
|||
| msg62384 - (view) | Author: Hyeshik Chang (hyeshik.chang) | Date: 2008-02-14 09:14 | |
I have generated compressed mapping tables in several ways. I extracted the mapping data into individual files and reorganized them by translating them into Python source code or archiving them into a zip file. The following table shows the results, in kilobytes (also available at http://spreadsheets.google.com/pub?key=pWRBaY2ZM7mRgddF0Itd2IA ):

| | none | minimal | MSjk | MSall | current |
|---|---|---|---|---|---|
| Text | 0 | 207 | 312 | 342 | 570 |
| Data | 904 | 696 | 592 | 562 | 333 |
| raw-py | 3006 | 2392 | 2016 | 1932 | 996 |
| zip-py | 720 | 496 | 416 | 384 | 304 |
| raw-pyc | 952 | 734 | 624 | 590 | 346 |
| zip-pyc | 560 | 384 | 336 | 304 | 240 |
| Text+zip-pyc | 560 | 591 | 648 | 646 | 810 |
| raw-both | 3954 | 3124 | 2638 | 2520 | 1340 |
| zip-both | 1248 | 864 | 736 | 672 | 512 |
| zip-bare | 560 | 384 | 336 | 304 | 240 |
| tarbz2-bare | 496 | 352 | 320 | 304 | 240 |

Columns represent which mapping files are separated into external files. In "none", no mapping is left as static const C data, while in "current" only the new cns11643 mappings are extracted. The "minimal" set keeps the major character set for each country in static C data and moves the others out. "MSjk" additionally keeps some MS codepages of Japan and Korea, and "MSall" keeps all MS codepage extensions in static const C data. We may fix the list of character sets that remain as C data, or let users pick the sets using a configure option.

"Text" is the portion that remains in static const C data, which is where all the current mapping tables are. As discussed when CJKCodecs was integrated into Python, static data can be shared across processes in a system and is efficient, but it can't be compressed or reorganized easily by users for redistribution. "Data" is the externally managed mapping tables. The "raw-py" row shows the total volume of the mapping tables as Python source code, and "raw-pyc" shows the compiled (pyc) version. "zip-py" and "zip-pyc" are zip-compressed archives of "raw-py" and "raw-pyc", respectively; those can be imported using the Python zipimport machinery. "zip-bare" and "tarbz2-bare" show the volume of the archived raw mapping table files, as their names suggest.

We have 560KB of mapping tables in the Python CJKCodecs part. If we choose "zip-pyc" of the "minimal" set, the binary distribution will be just as big as before even if we include the CNS11643 character set, and pythonXY.dll will get smaller by 363KB. What do you think about this scheme? Any other ideas for compression? |
|||
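The arithmetic behind the figures in the message above can be checked with a short sketch. It uses only the "Text", "zip-pyc" and "Text+zip-pyc" rows from the table, and it assumes that the quoted 363KB saving is the difference between the "current" and "minimal" Text values. The function names are hypothetical.

```python
COLUMNS = ("none", "minimal", "MSjk", "MSall", "current")

# Sizes in kilobytes, copied from the table in the message above.
TEXT = dict(zip(COLUMNS, (0, 207, 312, 342, 570)))
ZIP_PYC = dict(zip(COLUMNS, (560, 384, 336, 304, 240)))
TEXT_PLUS_ZIP_PYC = dict(zip(COLUMNS, (560, 591, 648, 646, 810)))


def total_footprint(column: str) -> int:
    """Static C data plus the zipped .pyc tables for one configuration."""
    return TEXT[column] + ZIP_PYC[column]


def dll_shrink(column: str, baseline: str = "current") -> int:
    """How much static C data leaves the DLL relative to the baseline."""
    return TEXT[baseline] - TEXT[column]


if __name__ == "__main__":
    for column in COLUMNS:
        print(column, total_footprint(column), dll_shrink(column))
```

With the "minimal" column this gives a total of 591KB and a DLL reduction of 363KB, which matches the numbers quoted in the message.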
| msg62385 - (view) | Author: Marc-Andre Lemburg (lemburg) | Date: 2008-02-14 09:59 | |
I think Martin was looking for other optimizations that still leave the data in a static C const (in order to be shared between processes and only loaded on demand), but do compress the data representation, e.g. using some form of Huffman coding. While I don't see adding a few hundred kB of static C data to a DLL as a major problem (even less so if it's possible to disable support via a configure switch, e.g. for embedded systems), it would be interesting to check whether the lookup tables can be compressed by way of their structure. |
|||
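One way to read "compressed by way of their structure" is a two-level table in which identical blocks are stored only once, so that large unmapped regions cost a single shared block. The sketch below illustrates that general idea in Python; it is not taken from CJKCodecs or from the patch, and the block size, the `UNMAPPED` marker and the function names are all assumptions.

```python
from array import array

UNMAPPED = 0xFFFE  # assumed marker for "no mapping"
BLOCK = 256        # assumed block size: one block per high byte


def build_two_level(mapping, size=0x10000):
    """Split a flat table into blocks and store each distinct block once."""
    blocks = []          # unique blocks of BLOCK entries each
    seen = {}            # block contents -> position in `blocks`
    index = array("H")   # first level: block number for each high byte
    for start in range(0, size, BLOCK):
        chunk = tuple(mapping.get(c, UNMAPPED) for c in range(start, start + BLOCK))
        if chunk not in seen:
            seen[chunk] = len(blocks)
            blocks.append(array("H", chunk))
        index.append(seen[chunk])
    return index, blocks


def lookup(index, blocks, code):
    """Two array accesses per lookup: first level, then the shared block."""
    return blocks[index[code // BLOCK]][code % BLOCK]


if __name__ == "__main__":
    demo = {0x2121: 0x3000, 0x4421: 0x4E00}  # tiny illustrative mapping
    index, blocks = build_two_level(demo)
    print(len(index), "index entries,", len(blocks), "unique blocks")
```

Lookups stay constant-time, which is the property that heavier schemes such as Huffman coding would give up.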
| msg62387 - (view) | Author: Hyeshik Chang (hyeshik.chang) | Date: 2008-02-14 11:30 | |
I couldn't find an appropriate way to implement an in situ compressed mapping table. AFAIK, Python has the smallest mapping table footprint per charset among the major open source transcoding programs. I have thought about compression many times, but every neat method required a severe performance sacrifice. |
|||
| msg62388 - (view) | Author: Marc-Andre Lemburg (lemburg) | Date: 2008-02-14 11:33 | |
In that case, I'm +1 on adding it. The OS won't load those tables unless really needed, so it's more a question of disk space than anything else. |
|||
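The point about the OS not loading the tables unless they are needed refers to demand paging of static data in a shared library. A rough Python analogue, offered only as an illustration, is a read-only memory map of a table file: pages are read when a lookup touches them. The file layout (little-endian 16-bit entries) and the names `write_table` and `MappedTable` are assumptions, not part of the patch.

```python
import mmap
import os
import struct
import tempfile


def write_table(path, values):
    """Write the table as consecutive little-endian 16-bit entries."""
    with open(path, "wb") as f:
        f.write(struct.pack("<%dH" % len(values), *values))


class MappedTable:
    """Read-only view of a table file; pages are faulted in on access."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def lookup(self, index):
        return struct.unpack_from("<H", self._map, index * 2)[0]

    def close(self):
        self._map.close()
        self._file.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "table.bin")
        write_table(path, [0x3000, 0x4E00, 0xFFFE])
        table = MappedTable(path)
        print(hex(table.lookup(1)))
        table.close()
```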
| msg62462 - (view) | Author: Giovanni Bajo (giovannibajo) | Date: 2008-02-16 18:21 | |
Making the standard Windows Python DLL larger is not only a problem of disk size: it will make all packages produced by PyInstaller or py2exe larger, and that means lots of wasted bandwidth. I see that MvL is still -1 on simply splitting the CJK codecs out, and vetoes it by asking for a generalization effort of insane proportions (a hard-to-define PEP, an entirely new build system for Windows, etc.). I understand (and *agree*) that having a general rule would be a much superior solution, but CJK is already almost 50% of the python.dll, so it *is* already a special case by any means. And special cases like these could be handled with special-case decisions. Thus, I still strongly disagree with MvL and would like CJK to be split out of python.dll as soon as possible. I would not really ask this for any module other than CJK, and I understand that further actions would really require a PEP and a new build system for Windows. So, I ask MvL again to soften his position and reconsider the CJK splitting in all its singularity. Please! (In case it's not clear, I would prepare a patch to split CJK out any day if there were hope that it would be accepted.) |
|||
| msg62487 - (view) | Author: Marc-Andre Lemburg (lemburg) | Date: 2008-02-17 11:07 | |
Whether or not to keep placing all builtin modules into the Windows Python DLL is not really a question to be discussed on the tracker. Given the size of the Python DLL (around 2MB) and the extra 350kB that the support for CNS11643 would cost, I think such a discussion is pretty pointless. I'm still +1 on the basis of enhancing the Taiwanese Python experience by adding their standard character set to the default Python install. |
|||
| msg83563 - (view) | Author: Antoine Pitrou (pitrou) | Date: 2009-03-14 01:32 | |
Based on the feedback above, it seems this should be committed, shouldn't it? |
|||
| msg83665 - (view) | Author: Marc-Andre Lemburg (lemburg) | Date: 2009-03-17 10:56 | |
On 2009-03-14 02:32, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> Based on the feedback above, it seems this should be committed, shouldn't it?

+1

As mentioned several times on the ticket: static C data is not really something to worry about these days. |
|||
| msg83671 - (view) | Author: Antoine Pitrou (pitrou) | Date: 2009-03-17 12:15 | |
On Tuesday, 17 March 2009 at 10:56 +0000, Marc-Andre Lemburg wrote:
> +1
>
> As mentioned several times on the ticket: static C data is not really something to worry about these days.

Well, I suggest that someone familiar with the codec-building machinery do the committing, in order to avoid mistakes :-) |
|||
| msg83672 - (view) | Author: Hyeshik Chang (hyeshik.chang) | Date: 2009-03-17 12:30 | |
When I asked Taiwanese developers how often they use these character sets, it appeared that they are almost useless in the usual computing environment in Taiwan. This will only serve for a historical compatibility and literal standard compliance. I'm quite neutral in adding this into python without any user's request from Taiwan (I'm from South Korea :), but I can finish committing it with pleasure if you are still fond of the codec. |
|||
| msg83675 - (view) | Author: Marc-Andre Lemburg (lemburg) | Date: 2009-03-17 12:54 | |
On 2009-03-17 13:30, Hye-Shik Chang wrote:
> Hye-Shik Chang <hyeshik@gmail.com> added the comment:
>
> When I asked Taiwanese developers how often they use these character sets, it appeared that they are almost useless in the usual computing environment in Taiwan. This will only serve for a historical compatibility and literal standard compliance. I'm quite neutral in adding this into python without any user's request from Taiwan (I'm from South Korea :), but I can finish committing it with pleasure if you are still fond of the codec.

If there's no user base for it, then we should not include it. I was under the impression that this charset is essential for the Taiwanese and Chinese (http://www.cns11643.gov.tw/). However, the wiki page http://en.wikipedia.org/wiki/CNS_11643 says "In practice, variants of Big5 are de facto standard.", so perhaps there's no real need for the codec after all.

The German version of the wiki page mentions that CNS11643 is the legal standard charset, but not used much in practice because it needs 3 bytes per glyph instead of just 2 for Big5 variants. The Chinese version of the wiki page says more or less the same: http://translate.google.de/translate?hl=en&sl=zh-TW&u=http://zh.wikipedia.org/wiki/%25E5%259C%258B%25E5%25AE%25B6%25E6%25A8%2599%25E6%25BA%2596%25E4%25B8%25AD%25E6%2596%2587%25E4%25BA%25A4%25E6%258F%259B%25E7%25A2%25BC&ei=C52_SZepPJKTsAbw8PW5DQ&sa=X&oi=translate&resnum=1&ct=result&prev=/search%3Fq%3Dhttp://zh.wikipedia.org/wiki/%2525E5%25259C%25258B%2525E5%2525AE%2525B6%2525E6%2525A8%252599%2525E6%2525BA%252596%2525E4%2525B8%2525AD%2525E6%252596%252587%2525E4%2525BA%2525A4%2525E6%25258F%25259B%2525E7%2525A2%2525BC%26hl%3Den%26sa%3DG |
|||
| msg113380 - (view) | Author: Terry J. Reedy (terry.reedy) | Date: 2010-08-09 04:35 | |
It seems to me that the last few messages suggest that this should be closed. |
|||
| msg113731 - (view) | Author: STINNER Victor (haypo) | Date: 2010-08-13 01:05 | |
Hyeshik Chang, who opened this issue, wrote (msg83672): "When I asked Taiwanese developers how often they use these character sets, it appeared that they are almost useless in the usual computing environment in Taiwan. This will only serve for a historical compatibility and literal standard compliance. (...)" I don't think that Python is the right place to support such an encoding. E.g., a patch for iconv would be a better idea (if iconv doesn't support this encoding yet). I am closing this issue as "wont fix". |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2010-08-13 01:05:25 | haypo | set | status: open -> closed resolution: wont fix messages: + msg113731 |
| 2010-08-09 04:35:12 | terry.reedy | set | nosy: + terry.reedy messages: + msg113380 versions: + Python 3.2, - Python 3.1, Python 2.7 |
| 2009-03-17 12:54:24 | lemburg | set | messages: + msg83675 title: Adding new CNS11643, a *huge* charset, support in cjkcodecs -> Adding new CNS11643, a *huge* charset, support in cjkcodecs |
| 2009-03-17 12:30:25 | hyeshik.chang | set | messages: + msg83672 |
| 2009-03-17 12:15:43 | pitrou | set | messages: + msg83671 title: Adding new CNS11643, a *huge* charset, support in cjkcodecs -> Adding new CNS11643, a *huge* charset, support in cjkcodecs |
| 2009-03-17 12:05:38 | haypo | set | nosy: + haypo |
| 2009-03-17 10:56:28 | lemburg | set | messages: + msg83665 title: Adding new CNS11643, a *huge* charset, support in cjkcodecs -> Adding new CNS11643, a *huge* charset, support in cjkcodecs |
| 2009-03-14 01:32:11 | pitrou | set | versions: + Python 3.1, Python 2.7, - Python 2.6, Python 3.0 nosy: + pitrou messages: + msg83563 type: enhancement stage: commit review |
| 2008-02-17 11:07:29 | lemburg | set | messages: + msg62487 |
| 2008-02-16 18:21:24 | giovannibajo | set | nosy: + giovannibajo messages: + msg62462 |
| 2008-02-14 11:33:45 | lemburg | set | messages: + msg62388 |
| 2008-02-14 11:30:09 | hyeshik.chang | set | messages: + msg62387 |
| 2008-02-14 09:59:46 | lemburg | set | messages: + msg62385 |
| 2008-02-14 09:14:24 | hyeshik.chang | set | messages: + msg62384 |
| 2008-02-12 03:25:23 | hyeshik.chang | set | messages: + msg62304 |
| 2008-02-12 02:31:32 | kcwu | set | nosy: + kcwu messages: + msg62302 |
| 2008-02-11 23:57:07 | lemburg | set | messages: + msg62300 |
| 2008-02-11 23:08:17 | loewis | set | messages: + msg62298 |
| 2008-02-11 22:57:27 | loewis | set | nosy: + loewis messages: + msg62295 |
| 2008-02-11 12:15:53 | amaury.forgeotdarc | set | nosy: + amaury.forgeotdarc messages: + msg62284 |
| 2008-02-11 12:15:17 | lemburg | set | nosy: + lemburg messages: + msg62283 |
| 2008-02-11 11:59:25 | hyeshik.chang | set | title: Adding new CNS11643 support, a *huge* charset, in cjkcodecs -> Adding new CNS11643, a *huge* charset, support in cjkcodecs |
| 2008-02-11 11:58:54 | hyeshik.chang | create | |
