Classification
Title: Adding new CNS11643, a *huge* charset, support in cjkcodecs
Type: enhancement    Stage: commit review
Components: Unicode    Versions: Python 3.2

Process
Status: closed    Resolution: wont fix
Dependencies:    Superseder:
Assigned To:    Nosy List: amaury.forgeotdarc, giovannibajo, haypo, hyeshik.chang, jwilk, kcwu, lemburg, loewis, pitrou, terry.reedy
Priority: low    Keywords:

Created on 2008-02-11 11:58 by hyeshik.chang, last changed 2013-07-24 10:42 by jwilk. This issue is now closed.

Files
File name Uploaded Description Edit
cns11643-r1.diff.gz hyeshik.chang, 2008-02-11 11:58 initial patch
Messages (21)
msg62282 - (view) Author: Hyeshik Chang (hyeshik.chang) * (Python committer) Date: 2008-02-11 11:58
This patch adds CNS11643 support into Python unicode codecs.
CNS11643 is a huge character set used in EUC-TW and ISO-2022-CN.
CJKCodecs had CNS11643 support for at least four years, but I
dropped it because of its huge size when integrating into Python.
EUC-TW and ISO-2022-CN aren't widely used, although they are still
regarded as part of the major encodings.

In my patch, the CNS11643 charset support can be disabled by adding
-DNO_CNS11643 to CFLAGS, for lightweight platforms.  The mapping
source code for the charset is 900K, and it adds about 350K to
_codecs_tw.so (on POSIX) or python26.dll (on Win32).

What do you think about adding this code?
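Whether a given build carries such a codec can be probed from Python through the codec registry; a minimal sketch, assuming the patched codec would register under a name like "euc_tw" (the exact codec name is an assumption here):

```python
import codecs

def has_codec(name):
    """Return True if a codec with this name is registered."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

# big5 ships with every stock CPython; an EUC-TW codec would only
# appear in builds that include the CNS11643 patch.
print(has_codec("big5"))
print(has_codec("euc_tw"))  # depends on the build
```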
msg62283 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-02-11 12:15
How often would this character set be needed?

In any case, using a (pre)compiler switch is not a good idea. Please
add a way to enable/disable the support via a configure switch.
msg62284 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-02-11 12:15
In this case let's put the cjkcodecs modules in their own
DLL(s) on win32.
msg62295 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-02-11 22:57
I would like to see whether a compression mechanism of the tables could
be found. If all else fails, compressing with raw zlib might improve
things, but before that, I think other compression techniques should be
studied.
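The raw-zlib baseline is easy to estimate; the sketch below compresses a synthetic stand-in table (invented, regular data, not the real CNS11643 mapping) to show that such tables deflate well:

```python
import zlib

# Synthetic stand-in for a charset mapping table: pairs of
# (2-byte charset code, 2-byte Unicode code point) -- regular data.
table = b"".join(
    code.to_bytes(2, "big") + (0x4E00 + code % 20000).to_bytes(2, "big")
    for code in range(20000)
)

packed = zlib.compress(table, 9)
print(len(table), "->", len(packed))     # regular tables shrink a lot
assert zlib.decompress(packed) == table  # lossless round trip
```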

I'm still -1 on ad-hoc exclusion of extension modules from pythonxy.dll.
If this module is to be excluded, a general policy should be established
that determines what modules get compiled separately, and an automation
mechanism should be established that automates generation of appropriate
build infrastructure for modules built separately under this policy.
msg62298 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-02-11 23:08
BTW, which version of CNS11643 does that implement? AFAICT, there is CNS
11643-1986 and CNS 11643-1992. Where did you get the Unicode mapping from?
msg62300 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-02-11 23:57
Some background information: http://www.cns11643.gov.tw/eng/word.jsp

The most recent version appears to be: "CNS11643-2004", sometimes also
called "CNS11643 version 3" or "CNS11643-3"
(http://docs.hp.com/en/5991-7974/5991-7974.pdf).

Here's the table for version 1 (1986):
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT

Versions 1 and 2 (1992) are also included in the official Unicode Han
character database (along with several other mappings):
http://www.unicode.org/charts/unihan.html

I couldn't find a reference to a version 3 mapping table.
msg62302 - (view) Author: Kuang-che Wu (kcwu) Date: 2008-02-12 02:31
FYI, according to the new spec of CNS11643-2004 (you can find the
preview via http://www.cnsonline.com.tw/, at
http://www.cnsonline.com.tw/preview/preview.jsp?general_no=1164300&language=C&pagecount=524):
page 499 mentions the URL http://www.cnscode.org.tw/, and the
version 3 mapping table can be found at
http://www.cnscode.org.tw/cnscode/csic_ucs.jsp
msg62304 - (view) Author: Hyeshik Chang (hyeshik.chang) * (Python committer) Date: 2008-02-12 03:25
I've generated the mapping table from ICU's CNS11643-1992 mapping.
I see that CNS11643 is quite rarely used on the internet, but it's
the only national standard character set in Taiwan.  When I asked
Taiwanese Python users, even they didn't think it was necessary to
add it to Python.  I'll study how much compression is possible and
how efficient it is, then submit a revised patch.

Thank you for the comments!
msg62384 - (view) Author: Hyeshik Chang (hyeshik.chang) * (Python committer) Date: 2008-02-14 09:14
I have generated compressed mapping tables by several ways.

I extracted mapping data into individual files and reorganized
them by translating into Python source code or archiving into a zip file.

The following table shows the result: (in kilobytes)
(also available at
http://spreadsheets.google.com/pub?key=pWRBaY2ZM7mRgddF0Itd2IA )

                none    minimal MSjk    MSall   current
Text            0       207     312     342     570 
Data            904     696     592     562     333 
                                            
raw-py          3006    2392    2016    1932    996 
zip-py          720     496     416     384     304 
                                            
raw-pyc         952     734     624     590     346 
zip-pyc         560     384     336     304     240 
Text+zip-pyc    560     591     648     646     810 
                                            
raw-both        3954    3124    2638    2520    1340
zip-both        1248    864     736     672     512 
                                               
zip-bare        560     384     336     304     240 
tarbz2-bare     496     352     320     304     240 

Columns indicate which mapping files are separated out into external
files.  In "none", no mapping is left as static const C data, while
in the "current" column only the new CNS11643 mappings are extracted.
The "minimal" set keeps the major character set of each country in
static C data and moves the others out.  "MSjk" additionally keeps
some MS codepages for Japan and Korea, and "MSall" keeps all MS
codepage extensions in static const C data.  We could either fix the
list of character sets that remain as C data, or let users pick the
sets via a configure option.

"Text" is the portion that remains in static const C data, which is
where all the current mapping tables live.  As discussed when
CJKCodecs was integrated into Python, that data can be shared across
processes and is efficient, but it can't easily be compressed or
reorganized by users for redistribution.  "Data" is the externally
managed mapping tables.

"raw-py" row shows total volume of mapping tables as in Python
source code.  "raw-pyc" shows compiled (pyc) version of mapping
tables.  "zip-py" and "zip-pyc" are zip-compressed archive of
"raw-py" and "raw-pyc", respectively.  Those can be imported
using python zipimport machinery.

"zip-bare" and "tarbz2-bare" show the volume of the archived raw
mapping table files, as their names suggest.

We currently have 560KB of mapping tables in the Python CJKCodecs
part.  If we choose the "zip-pyc" form of the "minimal" set, the
binary distribution will stay about as big as before even with the
CNS11643 character set included, and pythonXY.dll will shrink by
363KB.

What do you think about this scheme?  Any other ideas for
compression?
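The raw-py / raw-pyc / zip-pyc pipeline measured above can be reproduced end-to-end with a toy table (the module name and table contents are invented for illustration):

```python
import os
import py_compile
import tempfile
import zipfile

sizes = {}
with tempfile.TemporaryDirectory() as d:
    # A toy mapping module standing in for one extracted codec table.
    src = os.path.join(d, "mapping.py")
    with open(src, "w") as f:
        f.write("TABLE = %r\n" % {i: 0x4E00 + i for i in range(5000)})

    # Byte-compile it, then pack the .pyc into a zipimport-able archive.
    pyc = py_compile.compile(src, cfile=os.path.join(d, "mapping.pyc"))
    zpath = os.path.join(d, "mapping.zip")
    with zipfile.ZipFile(zpath, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(pyc, "mapping.pyc")

    sizes = {name: os.path.getsize(path) for name, path in
             [("raw-py", src), ("raw-pyc", pyc), ("zip-pyc", zpath)]}

print(sizes)  # zip-pyc should come out smallest, as in the table above
```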
msg62385 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-02-14 09:59
I think Martin was looking for other optimizations that still leave the
data in a static C const (in order to be shared between processes and
only loaded on demand), but do compress the data representation, e.g.
using some form of Huffman coding.

While I don't see adding a few 100kB of static C data to a DLL as a
major problem (even less so, if it's possible to disable support via a
configure switch, e.g. for embedded systems), it would be interesting to
check whether the lookup tables can be compressed by way of their
structure.
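One structural approach of this kind is a two-stage (block-indexed) table, similar in spirit to the splitbins() helper CPython uses when generating its Unicode database; the sketch below uses an invented sparse table, not the real CNS11643 data:

```python
# Two-stage table: split the flat mapping into fixed-size blocks and
# share identical blocks.  Sparse charset tables shrink considerably
# while lookups stay O(1), and the result can remain static const C.
BLOCK = 128

def split_table(table):
    index, blocks, seen = [], [], {}
    for start in range(0, len(table), BLOCK):
        block = tuple(table[start:start + BLOCK])
        if block not in seen:           # deduplicate identical blocks
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks

def lookup(index, blocks, code):
    return blocks[index[code // BLOCK]][code % BLOCK]

# Invented sparse table: one mapped range, everything else unmapped (0).
table = [0] * 65536
for code in range(0x2121, 0x2921):
    table[code] = 0x4E00 + (code - 0x2121)

index, blocks = split_table(table)
assert all(lookup(index, blocks, c) == table[c] for c in range(65536))
print(len(table), "->", len(index) + BLOCK * len(blocks))  # entries kept
```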
msg62387 - (view) Author: Hyeshik Chang (hyeshik.chang) * (Python committer) Date: 2008-02-14 11:30
I couldn't find an appropriate method to implement an in-situ
compressed mapping table.  AFAIK, Python has the smallest
mapping table footprint for each charset among major open
source transcoding programs.  I have thought about the
compression many times, but every neat method required
severe performance sacrifice.
msg62388 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-02-14 11:33
In that case, I'm +1 on adding it.

The OS won't load those tables unless really needed, so it's more a
question of disk space than anything else.
msg62462 - (view) Author: Giovanni Bajo (giovannibajo) Date: 2008-02-16 18:21
Making the standard Windows Python DLL larger is not only a problem of
disk size: it will make all packages produced by PyInstaller or py2exe
larger, and that means lots of wasted bandwidth.

I see that MvL is still -1 on simply splitting the CJK codecs out,
and vetoes it by asking for generalization work of insane proportions
(a hard-to-define PEP, an entirely new build system for Windows, etc.).

I understand (and *agree*) that having a general rule would be a much
superior solution, but CJK is already almost 50% of python.dll, so
it *is* already a special case by any measure. And special cases like
these could be handled with special-case decisions.

Thus, I still strongly disagree with MvL and would like CJK to be
split out of python.dll as soon as possible. I would not ask this for
any other module but CJK, and I understand that further actions would
really require a PEP and a new build system for Windows.

So, I ask again MvL to soften his position and reconsider the CJK
splitting in all its singularity. Please!

(in case it's not clear, I would prepare a patch to split CJK out any
day if there were hopes of it being accepted)
msg62487 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-02-17 11:07
Whether or not to keep placing all builtin modules into the Windows
Python DLL is not really a question to be discussed on the tracker.
Given the size of the Python DLL (around 2MB) and the extra 350kB that
the support for CNS11643 would cost, I think such a discussion is pretty
pointless.

I'm still +1 on the basis of enhancing the Taiwanese Python experience
by adding their standard character set to the default Python install.
msg83563 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-14 01:32
Based on the feedback above, it seems this should be committed,
shouldn't it?
msg83665 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-03-17 10:56
On 2009-03-14 02:32, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
> Based on the feedback above, it seems this should be committed,
> shouldn't it?

+1

As mentioned several times on the ticket: static C data is not really
something to worry about these days.
msg83671 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-17 12:15
Le mardi 17 mars 2009 à 10:56 +0000, Marc-Andre Lemburg a écrit :
> +1
> 
> As mentioned several times on the ticket: static C data is not really
> something to worry about these days.

Well, I suggest that someone familiar with the codec-building machinery
do the committing, in order to avoid mistakes :-)
msg83672 - (view) Author: Hyeshik Chang (hyeshik.chang) * (Python committer) Date: 2009-03-17 12:30
When I asked Taiwanese developers how often they use these character
sets, it appeared that they are almost useless in the usual computing
environment in Taiwan.  This will only serve for a historical
compatibility and literal standard compliance.  I'm quite neutral in
adding this into python without any user's request from Taiwan (I'm from
South Korea :), but I can finish committing it with pleasure if you are
still fond of the codec.
msg83675 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-03-17 12:54
On 2009-03-17 13:30, Hye-Shik Chang wrote:
> Hye-Shik Chang <hyeshik@gmail.com> added the comment:
> 
> When I asked Taiwanese developers how often they use these character
> sets, it appeared that they are almost useless in the usual computing
> environment in Taiwan.  This will only serve for a historical
> compatibility and literal standard compliance.  I'm quite neutral in
> adding this into python without any user's request from Taiwan (I'm from
> South Korea :), but I can finish committing it with pleasure if you are
> still fond of the codec.

If there's no user base for it, then we should not include it.

I was under the impression that this charset is essential for the Taiwanese
and Chinese (http://www.cns11643.gov.tw/).

However, the wiki page http://en.wikipedia.org/wiki/CNS_11643
says "In practice, variants of Big5 are de facto standard.", so perhaps
there's no real need for the codec after all.

The German version of the wiki page mentions that CNS11643 is the legal
standard charset, but not used much in practice because it needs 3 bytes
per glyph instead of just 2 for Big5 variants.
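The size difference is easy to verify with codecs CPython does ship (EUC-TW itself is not in the stdlib, which is what this ticket is about):

```python
# Encoded length of a common Han character (U+4E2D) per encoding.
ch = "\u4e2d"
for enc in ("big5", "utf-8", "utf-16-le"):
    print(enc, len(ch.encode(enc)))  # big5: 2 bytes, utf-8: 3 bytes
```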

The Chinese version of the wiki page says more or less the same:

http://translate.google.de/translate?hl=en&sl=zh-TW&u=http://zh.wikipedia.org/wiki/%25E5%259C%258B%25E5%25AE%25B6%25E6%25A8%2599%25E6%25BA%2596%25E4%25B8%25AD%25E6%2596%2587%25E4%25BA%25A4%25E6%258F%259B%25E7%25A2%25BC&ei=C52_SZepPJKTsAbw8PW5DQ&sa=X&oi=translate&resnum=1&ct=result&prev=/search%3Fq%3Dhttp://zh.wikipedia.org/wiki/%2525E5%25259C%25258B%2525E5%2525AE%2525B6%2525E6%2525A8%252599%2525E6%2525BA%252596%2525E4%2525B8%2525AD%2525E6%252596%252587%2525E4%2525BA%2525A4%2525E6%25258F%25259B%2525E7%2525A2%2525BC%26hl%3Den%26sa%3DG
msg113380 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-08-09 04:35
It seems to me that the last few messages suggest that this should be closed.
msg113731 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-08-13 01:05
Hyeshik Chang, who opened this issue, wrote (msg83672): "When I asked
Taiwanese developers how often they use these character sets, it
appeared that they are almost useless in the usual computing
environment in Taiwan. This will only serve for a historical
compatibility and literal standard compliance. (...)"

I don't think that Python is the right place to support such an
encoding. E.g., a patch for iconv would be a better idea (if iconv
doesn't support this encoding yet).

I close this issue as "wont fix".
History
Date User Action Args
2013-07-24 10:42:18  jwilk  set  nosy: + jwilk
2010-08-13 01:05:25  haypo  set  status: open -> closed
resolution: wont fix
messages: + msg113731
2010-08-09 04:35:12  terry.reedy  set  nosy: + terry.reedy
messages: + msg113380
versions: + Python 3.2, - Python 3.1, Python 2.7
2009-03-17 12:54:24  lemburg  set  messages: + msg83675
2009-03-17 12:30:25  hyeshik.chang  set  messages: + msg83672
2009-03-17 12:15:43  pitrou  set  messages: + msg83671
2009-03-17 12:05:38  haypo  set  nosy: + haypo
2009-03-17 10:56:28  lemburg  set  messages: + msg83665
2009-03-14 01:32:11  pitrou  set  versions: + Python 3.1, Python 2.7, - Python 2.6, Python 3.0
nosy: + pitrou
messages: + msg83563
type: enhancement
stage: commit review
2008-02-17 11:07:29  lemburg  set  messages: + msg62487
2008-02-16 18:21:24  giovannibajo  set  nosy: + giovannibajo
messages: + msg62462
2008-02-14 11:33:45  lemburg  set  messages: + msg62388
2008-02-14 11:30:09  hyeshik.chang  set  messages: + msg62387
2008-02-14 09:59:46  lemburg  set  messages: + msg62385
2008-02-14 09:14:24  hyeshik.chang  set  messages: + msg62384
2008-02-12 03:25:23  hyeshik.chang  set  messages: + msg62304
2008-02-12 02:31:32  kcwu  set  nosy: + kcwu
messages: + msg62302
2008-02-11 23:57:07  lemburg  set  messages: + msg62300
2008-02-11 23:08:17  loewis  set  messages: + msg62298
2008-02-11 22:57:27  loewis  set  nosy: + loewis
messages: + msg62295
2008-02-11 12:15:53  amaury.forgeotdarc  set  nosy: + amaury.forgeotdarc
messages: + msg62284
2008-02-11 12:15:17  lemburg  set  nosy: + lemburg
messages: + msg62283
2008-02-11 11:59:25  hyeshik.chang  set  title: Adding new CNS11643 support, a *huge* charset, in cjkcodecs -> Adding new CNS11643, a *huge* charset, support in cjkcodecs
2008-02-11 11:58:54  hyeshik.chang  create