msg62282
Author: Hyeshik Chang (hyeshik.chang) *
Date: 2008-02-11 11:58
This patch adds CNS11643 support to the Python unicode codecs.
CNS11643 is a huge character set used in EUC-TW and ISO-2022-CN.
CJKCodecs had supported CNS11643 for at least four years, but I
dropped it when integrating CJKCodecs into Python because of its
huge size. EUC-TW and ISO-2022-CN are not widely used anymore,
but they are still regarded as major encodings.
In my patch, CNS11643 charset support can be disabled by adding
-DNO_CNS11643 to CFLAGS, for lightweight platforms. The mapping
source code for the charset is about 900K, and it adds about 350K
to _codecs_tw.so (on POSIX) or python26.dll (on Win32).
What do you think about adding this code?
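For illustration, a quick sketch of how the codec would be used once
installed, assuming the patch registers it under the usual "euc-tw"
alias (the alias is my assumption here, not taken from the patch):

    # Sketch of expected usage, assuming the patch registers an
    # "euc-tw" codec; the alias is an assumption, not confirmed
    # by the patch itself.
    text = u"\u4e2d\u6587"               # two Han characters
    data = text.encode("euc-tw")         # CNS 11643 via EUC-TW framing
    assert data.decode("euc-tw") == text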

msg62283
Author: Marc-Andre Lemburg (lemburg) *
Date: 2008-02-11 12:15
How often would this character set be needed?
In any case, using a (pre)compiler switch is not a good idea. Please
make it possible to enable/disable the support via a configure switch.

msg62284
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *
Date: 2008-02-11 12:15
In this case, let's put the cjkcodecs modules in their own
DLL(s) on win32.

msg62295
Author: Martin v. Löwis (loewis) *
Date: 2008-02-11 22:57
I would like to see whether a compression mechanism for the tables can
be found. If all else fails, compressing with raw zlib might improve
things, but before that, I think other compression techniques should be
studied.
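As a rough baseline experiment of the kind described here (the dump
file name below is hypothetical):

    # Rough sketch: measure how well raw zlib compresses a dump of
    # the mapping tables. "mappings_tw.bin" is a hypothetical file
    # holding the raw table bytes.
    import zlib

    raw = open("mappings_tw.bin", "rb").read()
    packed = zlib.compress(raw, 9)       # level 9 = best compression
    print len(raw), "->", len(packed)    # byte sizes before/after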
I'm still -1 on ad-hoc exclusion of extension modules from pythonxy.dll.
If this module is to be excluded, a general policy should be established
that determines which modules get compiled separately, along with an
automated mechanism that generates the appropriate build infrastructure
for modules built separately under this policy.

msg62298
Author: Martin v. Löwis (loewis) *
Date: 2008-02-11 23:08
BTW, which version of CNS11643 does this implement? AFAICT, there are CNS
11643-1986 and CNS 11643-1992. Where did you get the Unicode mapping from?

msg62300
Author: Marc-Andre Lemburg (lemburg) *
Date: 2008-02-11 23:57
Some background information: http://www.cns11643.gov.tw/eng/word.jsp
The most recent version appears to be: "CNS11643-2004", sometimes also
called "CNS11643 version 3" or "CNS11643-3"
(http://docs.hp.com/en/5991-7974/5991-7974.pdf).
Here's the table for version 1 (1986):
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT
Versions 1 and 2 (1992) are also included in the official Unicode Han
character database (along with several other mappings):
http://www.unicode.org/charts/unihan.html
I couldn't find a reference to a version 3 mapping table.

msg62302
Author: Kuang-che Wu (kcwu)
Date: 2008-02-12 02:31
FYI: according to the new CNS11643-2004 spec (a preview can be found
via http://www.cnsonline.com.tw/, at
http://www.cnsonline.com.tw/preview/preview.jsp?general_no=1164300&language=C&pagecount=524),
page 499 mentions the URL http://www.cnscode.org.tw/, and the
version 3 mapping table can be found at
http://www.cnscode.org.tw/cnscode/csic_ucs.jsp

msg62304
Author: Hyeshik Chang (hyeshik.chang) *
Date: 2008-02-12 03:25
I've generated the mapping table from ICU's CNS11643-1992 mapping.
I see that CNS11643 is quite rarely used on the internet, but it's the
only national standard character set in Taiwan. When I asked Taiwanese
Python users, even they didn't think it was necessary to add it to
Python. I'll study how much compression is possible and how efficient
it is, then submit a revised patch.
Thank you for the comments!

msg62384
Author: Hyeshik Chang (hyeshik.chang) *
Date: 2008-02-14 09:14
I have generated compressed mapping tables in several ways: I extracted
the mapping data into individual files and reorganized it, either
translating it into Python source code or archiving it into a zip file.
The following table shows the results, in kilobytes (also available at
http://spreadsheets.google.com/pub?key=pWRBaY2ZM7mRgddF0Itd2IA ):
                 none  minimal  MSjk  MSall  current
Text                0      207   312    342      570
Data              904      696   592    562      333
raw-py           3006     2392  2016   1932      996
zip-py            720      496   416    384      304
raw-pyc           952      734   624    590      346
zip-pyc           560      384   336    304      240
Text+zip-pyc      560      591   648    646      810
raw-both         3954     3124  2638   2520     1340
zip-both         1248      864   736    672      512
zip-bare          560      384   336    304      240
tarbz2-bare       496      352   320    304      240
Each column says which mapping files are moved out into external files.
In "none", no mapping remains as static const C data; in "current",
only the new cns11643 mappings are extracted. The "minimal" set keeps
the major character set for each country in static C data and moves the
others out. "MSjk" additionally keeps some MS codepages for Japan and
Korea, and "MSall" keeps all MS codepage extensions in static const C
data. We could either fix the list of character sets that remain as C
data, or let users pick the sets with a configure option.
"Text" is portion that remains in static const C data where is all
the current mapping tables are in. As discussed when CJKCodecs had
been integrated into python, it can be shared over processes in a
system and efficient, but it can't be compressed or reorganized
easily by users for redistribution. "Data" is externally managed
mapping tables.
"raw-py" row shows total volume of mapping tables as in Python
source code. "raw-pyc" shows compiled (pyc) version of mapping
tables. "zip-py" and "zip-pyc" are zip-compressed archive of
"raw-py" and "raw-pyc", respectively. Those can be imported
using python zipimport machinery.
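(The archive and module names below are hypothetical, just to show the
mechanism:)

    # Sketch: import a compiled mapping module straight from a zip
    # archive via the stock zipimport machinery. "cjkmaps.zip" and
    # "cns11643_map" are hypothetical names for this example.
    import sys
    sys.path.insert(0, "cjkmaps.zip")    # zipimport handles the rest
    import cns11643_map                  # loaded from the archived pyc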
"zip-bare" and "tarbz2-bare" shows volume of archived raw mapping
table files as you can notice from their name.
We currently have 560KB of mapping tables in the Python CJKCodecs part.
If we choose "zip-pyc" with the "minimal" set, the binary distribution
will be just as big as before even with the CNS11643 character set
included, and pythonXY.dll will get smaller by 363KB.
What do you think about this scheme? Any other ideas for compression?

msg62385
Author: Marc-Andre Lemburg (lemburg) *
Date: 2008-02-14 09:59
I think Martin was looking for other optimizations that still leave the
data in static const C data (in order to be shared between processes
and only loaded on demand), but compress the data representation, e.g.
using some form of Huffman coding.
While I don't see adding a few 100kB of static C data to a DLL as a
major problem (even less so if it's possible to disable the support via
a configure switch, e.g. for embedded systems), it would be interesting
to check whether the lookup tables can be compressed by way of their
structure.
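One structural technique of the kind hinted at here, purely as an
illustration: CJK mappings often contain long contiguous ranges with a
constant offset, which can be stored as (start, length, delta) runs
instead of one entry per code point. A minimal sketch, assuming the
table is a dense list indexed by code point:

    # Illustrative sketch: collapse a dense mapping (a list indexed
    # by code point) into (start, length, delta) runs; long monotone
    # stretches then cost three ints instead of one entry each.
    def pack_runs(table):
        runs, start = [], 0
        for i in range(1, len(table) + 1):
            if i == len(table) or table[i] - i != table[start] - start:
                runs.append((start, i - start, table[start] - start))
                start = i
        return runs

    def lookup(runs, code):
        for start, length, delta in runs:
            if start <= code < start + length:
                return code + delta      # real code would bisect
        raise KeyError(code)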

msg62387
Author: Hyeshik Chang (hyeshik.chang) *
Date: 2008-02-14 11:30
I couldn't find an appropriate method for implementing in-situ
compressed mapping tables. AFAIK, Python has the smallest mapping
table footprint per charset among the major open source transcoding
programs. I have thought about compression many times, but every
neat method required a severe performance sacrifice.

msg62388
Author: Marc-Andre Lemburg (lemburg) *
Date: 2008-02-14 11:33
In that case, I'm +1 on adding it.
The OS won't load those tables unless really needed, so it's more a
question of disk space than anything else.

msg62462
Author: Giovanni Bajo (giovannibajo)
Date: 2008-02-16 18:21
Making the standard Windows Python DLL larger is not only a problem of
disk space: it will make all packages produced by PyInstaller or py2exe
larger, and that means lots of wasted bandwidth.
I see that MvL is still -1 on simply splitting the CJK codecs out, and
vetoes it by asking for generalization work of insane proportions (a
hard-to-define PEP, an entirely new build system for Windows, etc.).
I understand (and *agree*) that having a general rule would be a much
superior solution, but CJK is already almost 50% of python.dll, so it
*is* already a special case by any measure. And special cases like
these can be handled with special-case decisions.
Thus, I still strongly disagree with MvL and would like CJK to be split
out of python.dll as soon as possible. I would not ask this for any
module other than CJK, and I understand that further actions would
really require a PEP and a new build system for Windows.
So, I ask MvL again to soften his position and reconsider the CJK
splitting in all its singularity. Please!
(In case it's not clear: I would prepare a patch to split CJK out any
day if there were hope of it being accepted.)

msg62487
Author: Marc-Andre Lemburg (lemburg) *
Date: 2008-02-17 11:07
Whether or not to keep placing all builtin modules into the Windows
Python DLL is not really a question to be discussed on the tracker.
Given the size of the Python DLL (around 2MB) and the extra 350kB that
the support for CNS11643 would cost, I think such a discussion is pretty
pointless.
I'm still +1 on the basis of enhancing the Taiwanese Python experience
by adding their standard character set to the default Python install.

msg83563
Author: Antoine Pitrou (pitrou) *
Date: 2009-03-14 01:32
Based on the feedback above, it seems this should be committed,
shouldn't it?

msg83665
Author: Marc-Andre Lemburg (lemburg) *
Date: 2009-03-17 10:56
On 2009-03-14 02:32, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> Based on the feedback above, it seems this should be committed,
> shouldn't it?
+1
As mentioned several times on the ticket: static C data is not really
something to worry about these days.

msg83671
Author: Antoine Pitrou (pitrou) *
Date: 2009-03-17 12:15
On Tuesday, 17 March 2009 at 10:56 +0000, Marc-Andre Lemburg wrote:
> +1
>
> As mentioned several times on the ticket: static C data is not really
> something to worry about these days.
Well, I suggest that someone familiar with the codec-building machinery
do the committing, in order to avoid mistakes :-)

msg83672
Author: Hyeshik Chang (hyeshik.chang) *
Date: 2009-03-17 12:30
When I asked Taiwanese developers how often they use these character
sets, it turned out that they are almost useless in the usual computing
environment in Taiwan. This would only serve historical compatibility
and literal standard compliance. I'm quite neutral on adding this to
Python without any user request from Taiwan (I'm from South Korea :),
but I'll happily finish committing it if you are still fond of the
codec.

msg83675
Author: Marc-Andre Lemburg (lemburg) *
Date: 2009-03-17 12:54
On 2009-03-17 13:30, Hye-Shik Chang wrote:
> Hye-Shik Chang <hyeshik@gmail.com> added the comment:
>
> When I asked Taiwanese developers how often they use these character
> sets, it appeared that they are almost useless in the usual computing
> environment in Taiwan. This will only serve for a historical
> compatibility and literal standard compliance. I'm quite neutral in
> adding this into python without any user's request from Taiwan (I'm from
> South Korea :), but I can finish committing it with pleasure if you are
> still fond of the codec.
If there's no user base for it, then we should not include it.
I was under the impression that this charset is essential for the Taiwanese
and Chinese (http://www.cns11643.gov.tw/).
However, the wiki page http://en.wikipedia.org/wiki/CNS_11643
says "In practice, variants of Big5 are de facto standard.", so perhaps
there's no real need for the codec after all.
The German version of the wiki page mentions that CNS11643 is the legal
standard charset, but not used much in practice because it needs 3 bytes
per glyph instead of just 2 for Big5 variants.
The Chinese version of the wiki page says more or less the same:
http://translate.google.de/translate?hl=en&sl=zh-TW&u=http://zh.wikipedia.org/wiki/%25E5%259C%258B%25E5%25AE%25B6%25E6%25A8%2599%25E6%25BA%2596%25E4%25B8%25AD%25E6%2596%2587%25E4%25BA%25A4%25E6%258F%259B%25E7%25A2%25BC&ei=C52_SZepPJKTsAbw8PW5DQ&sa=X&oi=translate&resnum=1&ct=result&prev=/search%3Fq%3Dhttp://zh.wikipedia.org/wiki/%2525E5%25259C%25258B%2525E5%2525AE%2525B6%2525E6%2525A8%252599%2525E6%2525BA%252596%2525E4%2525B8%2525AD%2525E6%252596%252587%2525E4%2525BA%2525A4%2525E6%25258F%25259B%2525E7%2525A2%2525BC%26hl%3Den%26sa%3DG

msg113380
Author: Terry J. Reedy (terry.reedy) *
Date: 2010-08-09 04:35
It seems to me that the last few messages suggest that this should be closed.

msg113731
Author: STINNER Victor (vstinner) *
Date: 2010-08-13 01:05
Hyeshik Chang, who opened this issue, wrote (msg83672): "When I asked
Taiwanese developers how often they use these character sets, it
appeared that they are almost useless in the usual computing
environment in Taiwan. This will only serve for a historical
compatibility and literal standard compliance. (...)"
I don't think that Python is the right place to support such an
encoding. E.g., a patch for iconv would be a better idea (if iconv
doesn't support this encoding yet).
I'm closing this issue as "wont fix".

Date | User | Action | Args
2022-04-11 14:56:30 | admin | set | github: 46342
2013-07-24 10:42:18 | jwilk | set | nosy: + jwilk
2010-08-13 01:05:25 | vstinner | set | status: open -> closed; resolution: wont fix; messages: + msg113731
2010-08-09 04:35:12 | terry.reedy | set | nosy: + terry.reedy; messages: + msg113380; versions: + Python 3.2, - Python 3.1, Python 2.7
2009-03-17 12:54:24 | lemburg | set | messages: + msg83675
2009-03-17 12:30:25 | hyeshik.chang | set | messages: + msg83672
2009-03-17 12:15:43 | pitrou | set | messages: + msg83671
2009-03-17 12:05:38 | vstinner | set | nosy: + vstinner
2009-03-17 10:56:28 | lemburg | set | messages: + msg83665
2009-03-14 01:32:11 | pitrou | set | versions: + Python 3.1, Python 2.7, - Python 2.6, Python 3.0; nosy: + pitrou; messages: + msg83563; type: enhancement; stage: commit review
2008-02-17 11:07:29 | lemburg | set | messages: + msg62487
2008-02-16 18:21:24 | giovannibajo | set | nosy: + giovannibajo; messages: + msg62462
2008-02-14 11:33:45 | lemburg | set | messages: + msg62388
2008-02-14 11:30:09 | hyeshik.chang | set | messages: + msg62387
2008-02-14 09:59:46 | lemburg | set | messages: + msg62385
2008-02-14 09:14:24 | hyeshik.chang | set | messages: + msg62384
2008-02-12 03:25:23 | hyeshik.chang | set | messages: + msg62304
2008-02-12 02:31:32 | kcwu | set | nosy: + kcwu; messages: + msg62302
2008-02-11 23:57:07 | lemburg | set | messages: + msg62300
2008-02-11 23:08:17 | loewis | set | messages: + msg62298
2008-02-11 22:57:27 | loewis | set | nosy: + loewis; messages: + msg62295
2008-02-11 12:15:53 | amaury.forgeotdarc | set | nosy: + amaury.forgeotdarc; messages: + msg62284
2008-02-11 12:15:17 | lemburg | set | nosy: + lemburg; messages: + msg62283
2008-02-11 11:59:25 | hyeshik.chang | set | title: Adding new CNS11643 support, a *huge* charset, in cjkcodecs -> Adding new CNS11643, a *huge* charset, support in cjkcodecs
2008-02-11 11:58:54 | hyeshik.chang | create