Message62384
I have generated compressed mapping tables in several ways.
I extracted the mapping data into individual files and reorganized
them, either by translating them into Python source code or by archiving
them into a zip file. The following table shows the results (in kilobytes):
(also available at
http://spreadsheets.google.com/pub?key=pWRBaY2ZM7mRgddF0Itd2IA )
               none  minimal  MSjk  MSall  current
Text              0      207   312    342      570
Data            904      696   592    562      333
raw-py         3006     2392  2016   1932      996
zip-py          720      496   416    384      304
raw-pyc         952      734   624    590      346
zip-pyc         560      384   336    304      240
Text+zip-pyc    560      591   648    646      810
raw-both       3954     3124  2638   2520     1340
zip-both       1248      864   736    672      512
zip-bare        560      384   336    304      240
tarbz2-bare     496      352   320    304      240
Columns represent which mapping files are separated into external
files. In the "none" column, no mapping is left as static const C data,
while in the "current" column only the new cns11643 mappings are extracted.
The "minimal" set keeps the major character set for each country in
static C data and moves the others out. "MSjk" additionally keeps some
MS codepages of Japan and Korea, and "MSall" keeps all MS
codepage extensions in static const C data. We may fix the list of
character sets that remain as C data, or let users pick the sets
with a configure option.
"Text" is the portion that remains in static const C data, which is where
all the current mapping tables live. As discussed when CJKCodecs was
integrated into Python, static data can be shared across processes in a
system and is efficient, but it can't be compressed or reorganized
easily by users for redistribution. "Data" is the externally managed
mapping tables.
The "raw-py" row shows the total volume of the mapping tables as Python
source code. "raw-pyc" shows the compiled (pyc) version of the mapping
tables. "zip-py" and "zip-pyc" are zip-compressed archives of
"raw-py" and "raw-pyc", respectively. These can be imported
using Python's zipimport machinery.
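As a rough illustration of the zipimport route, here is a minimal sketch. It assumes a modern Python 3 interpreter, and the module name `demo_mapping_table`, the archive name `mappings.zip` and the two-entry table are made up for the example; they are not the layout used by the actual patch.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Hypothetical module name and table contents, for illustration only.
MODULE_NAME = "demo_mapping_table"
MODULE_SOURCE = "DECODE_MAP = {0xA1A1: 0x3000, 0xA1A2: 0x3001}\n"


def build_archive(directory):
    """Write a zip archive holding one mapping-table module; return its path."""
    archive_path = os.path.join(directory, "mappings.zip")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(MODULE_NAME + ".py", MODULE_SOURCE)
    return archive_path


def load_table(archive_path):
    """Import the mapping module from the archive via zipimport."""
    # Putting a zip file on sys.path is what triggers the zipimport machinery.
    sys.path.insert(0, archive_path)
    try:
        importlib.invalidate_caches()
        module = importlib.import_module(MODULE_NAME)
    finally:
        sys.path.remove(archive_path)
    return module.DECODE_MAP


with tempfile.TemporaryDirectory() as workdir:
    table = load_table(build_archive(workdir))
    print(hex(table[0xA1A1]))  # prints 0x3000
```

A "zip-pyc" style archive would hold compiled .pyc files instead of .py sources; the import side stays the same.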
"zip-bare" and "tarbz2-bare" show the volume of the archived raw mapping
table files, as their names suggest.
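A comparison like the "zip-bare" versus "tarbz2-bare" rows could be reproduced along the following lines. This is only a sketch: the sample tables are synthetic stand-ins generated in memory (hypothetical file names, made-up contents with runs of 0xFFFD filler), not the real extracted mapping files, so the sizes it prints say nothing about the figures in the table above.

```python
import io
import tarfile
import zipfile


def make_sample_tables(count=4, entries=8192):
    """Build synthetic 2-byte tables; a stand-in for the real mapping files."""
    tables = {}
    for index in range(count):
        words = []
        for position in range(entries):
            if position % 256 < 94:
                # Pretend the first 94 slots of each row are mapped.
                code = (0x4E00 + index * entries + position) & 0xFFFF
            else:
                # Remaining slots hold an "unmapped" filler value.
                code = 0xFFFD
            words.append(code.to_bytes(2, "big"))
        tables["table%d.bin" % index] = b"".join(words)
    return tables


def zip_size(tables):
    """Return the size in bytes of a deflate-compressed zip of the tables."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, payload in sorted(tables.items()):
            archive.writestr(name, payload)
    return len(buffer.getvalue())


def tarbz2_size(tables):
    """Return the size in bytes of a bzip2-compressed tar of the tables."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:bz2") as archive:
        for name, payload in sorted(tables.items()):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return len(buffer.getvalue())


sample = make_sample_tables()
print("raw bytes:   ", sum(len(p) for p in sample.values()))
print("zip bytes:   ", zip_size(sample))
print("tarbz2 bytes:", tarbz2_size(sample))
```

Note that a tar.bz2 archive compresses all files as one stream, while zip compresses each member separately, which is one plausible reason the two rows can differ.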
We have 560KB of mapping tables in the Python CJKCodecs part.
If we choose "zip-pyc" of the "minimal" set, the binary distribution
will be just as big as before even if we include the CNS11643 character
set, and pythonXY.dll will get smaller by 363KB.
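The arithmetic behind these figures can be checked against the table with a few lines. The sketch below copies the "Text" and "zip-pyc" rows; reading the 363KB saving as the "Text" value of the "current" column minus that of the chosen column is an assumption that happens to match the number, and the helper names are made up for the example.

```python
# Values copied from the "Text" and "zip-pyc" rows of the table (kilobytes).
COLUMNS = ("none", "minimal", "MSjk", "MSall", "current")
TEXT = dict(zip(COLUMNS, (0, 207, 312, 342, 570)))
ZIP_PYC = dict(zip(COLUMNS, (560, 384, 336, 304, 240)))


def total_footprint(column):
    """Static C data plus the zip-pyc archive, i.e. the "Text+zip-pyc" row."""
    return TEXT[column] + ZIP_PYC[column]


def dll_reduction(column, baseline="current"):
    """Assumed reading: how much static C data leaves the DLL vs. the baseline."""
    return TEXT[baseline] - TEXT[column]


print(total_footprint("minimal"))  # prints 591
print(dll_reduction("minimal"))    # prints 363
```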
What do you think about this scheme? Do you have any other ideas for
compression?
Date | User | Action | Args
2008-02-14 09:14:26 | hyeshik.chang | set | spambayes_score: 0.000231277 -> 0.00023127731; recipients: + hyeshik.chang, lemburg, loewis, amaury.forgeotdarc, kcwu
2008-02-14 09:14:25 | hyeshik.chang | set | spambayes_score: 0.000231277 -> 0.000231277; messageid: <1202980465.62.0.0714879420319.issue2066@psf.upfronthosting.co.za>
2008-02-14 09:14:24 | hyeshik.chang | link | issue2066 messages
2008-02-14 09:14:21 | hyeshik.chang | create |