Author hyeshik.chang
Recipients amaury.forgeotdarc, hyeshik.chang, kcwu, lemburg, loewis
Date 2008-02-14.09:14:21
Message-id <1202980465.62.0.0714879420319.issue2066@psf.upfronthosting.co.za>
Content
I have generated compressed mapping tables in several ways.

I extracted mapping data into individual files and reorganized
them by translating into Python source code or archiving into a zip file.
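
As a rough illustration of the "translating into Python source code"
step, here is a minimal sketch. The function name
`render_mapping_module`, the variable name `decoding_table`, and the
two sample entries are assumptions made for this example; they are not
the actual script or data layout used for the measurements.

```python
def render_mapping_module(name, mapping):
    """Return Python source text defining one decoding table.

    `mapping` maps a multibyte code (int) to a Unicode code point (int).
    Both the layout and the names are hypothetical.
    """
    lines = ["# Generated mapping table for %s" % name,
             "decoding_table = {"]
    for code in sorted(mapping):
        lines.append("    0x%04X: 0x%04X," % (code, mapping[code]))
    lines.append("}")
    return "\n".join(lines) + "\n"


# Illustrative sample entries only, not a complete character set.
sample = {0xA1A1: 0x3000, 0xA1A2: 0x3001}
source = render_mapping_module("sample_charset", sample)

# Executing the generated source gives back the same table.
namespace = {}
exec(source, namespace)
```

The generated text could then be written to a .py file, compiled to
.pyc, or archived, which is what the rows of the table compare.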

The following table shows the results in kilobytes
(also available at
http://spreadsheets.google.com/pub?key=pWRBaY2ZM7mRgddF0Itd2IA ):

                none    minimal MSjk    MSall   current
Text            0       207     312     342     570 
Data            904     696     592     562     333 
                                            
raw-py          3006    2392    2016    1932    996 
zip-py          720     496     416     384     304 
                                            
raw-pyc         952     734     624     590     346 
zip-pyc         560     384     336     304     240 
Text+zip-pyc    560     591     648     646     810 
                                            
raw-both        3954    3124    2638    2520    1340
zip-both        1248    864     736     672     512 
                                               
zip-bare        560     384     336     304     240 
tarbz2-bare     496     352     320     304     240 

Columns show which mapping files are separated out into external
files.  In "none", no mapping is left as static const C data, while
in the "current" column only the new CNS11643 mappings are extracted.
The "minimal" set keeps the major character set for each country in
static C data and moves the others out.  "MSjk" additionally keeps
some MS codepages of Japan and Korea, and "MSall" keeps all MS
codepage extensions in static const C data.  We may fix the list of
character sets that remain as C data, or let users pick the sets
with a configure option.

"Text" is the portion that remains in static const C data, which is
where all the current mapping tables live.  As discussed when
CJKCodecs was integrated into Python, this data can be shared across
processes on a system and is efficient, but users can't easily
compress or reorganize it for redistribution.  "Data" is the
externally managed mapping tables.

"raw-py" row shows total volume of mapping tables as in Python
source code.  "raw-pyc" shows the compiled (pyc) version of the
mapping tables.  "zip-py" and "zip-pyc" are zip-compressed archives
of "raw-py" and "raw-pyc", respectively.  Those can be imported
using Python's zipimport machinery.
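
To make the zipimport route concrete, here is a minimal sketch that
packs one table module into a zip archive and imports it from there.
The archive name `cjkmaps.zip`, the module name `cjk_sample_map`, and
its single entry are hypothetical. The sketch stores .py source for
simplicity; zipimport also handles .pyc files.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a small archive holding one (hypothetical) mapping module.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "cjkmaps.zip")
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("cjk_sample_map.py",
                "decoding_table = {0xA1A1: 0x3000}\n")

# Putting the archive on sys.path lets the normal import statement
# find modules inside it through the zipimport machinery.
sys.path.insert(0, archive)
cjk_sample_map = importlib.import_module("cjk_sample_map")
table = cjk_sample_map.decoding_table
```

A codec using this scheme would presumably look up its tables this
way at first use instead of linking them in as C data.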

"zip-bare" and "tarbz2-bare" show the volume of the raw mapping
table files archived as zip and tar.bz2, respectively.
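
For anyone who wants to repeat this kind of comparison, the sketch
below measures zip and tar.bz2 archive sizes for a set of in-memory
files. The helper names and the synthetic file contents are
assumptions for illustration; the real mapping files will compress
differently, so the sizes this prints say nothing about the table
above.

```python
import io
import struct
import tarfile
import zipfile


def zip_size(files):
    """Size in bytes of a deflate-compressed zip holding `files`."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return len(buf.getvalue())


def tarbz2_size(files):
    """Size in bytes of a bzip2-compressed tar holding `files`."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


# Synthetic stand-in data: 8 files of 256 little-endian 16-bit values.
files = {
    "table%d.bin" % i: struct.pack("<256H", *range(i * 256, (i + 1) * 256))
    for i in range(8)
}
raw_size = sum(len(data) for data in files.values())
print("raw:", raw_size, "zip:", zip_size(files),
      "tar.bz2:", tarbz2_size(files))
```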

We currently have 560KB of mapping tables in the CJKCodecs part of
Python.  If we choose "zip-pyc" with the "minimal" set, the binary
distribution will be about as big as before even though it includes
the CNS11643 character set, and pythonXY.dll will get smaller by 363KB.
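
The figures in the paragraph above can be re-derived from the "Text"
and "zip-pyc" rows of the table. The short check below uses only
numbers from the table; the variable names are made up for the
example.

```python
# "Text" and "zip-pyc" rows of the table, in kilobytes.
text_kb = {"none": 0, "minimal": 207, "MSjk": 312,
           "MSall": 342, "current": 570}
zip_pyc_kb = {"none": 560, "minimal": 384, "MSjk": 336,
              "MSall": 304, "current": 240}

# Static C data plus the zipped pyc archive, per column.
# This reproduces the "Text+zip-pyc" row.
total_kb = {name: text_kb[name] + zip_pyc_kb[name] for name in text_kb}

# Static C data removed from the DLL when moving from the "current"
# layout to the "minimal" one.
dll_saving_kb = text_kb["current"] - text_kb["minimal"]

print(total_kb["minimal"])  # 591
print(dll_saving_kb)        # 363
```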

What do you think about this scheme?
Any other ideas for compression?