Message 62384 - Python tracker

Author	hyeshik.chang
Recipients	amaury.forgeotdarc, hyeshik.chang, kcwu, lemburg, loewis
Date	2008-02-14.09:14:21
Message-id	<1202980465.62.0.0714879420319.issue2066@psf.upfronthosting.co.za>

Content

I have generated compressed mapping tables in several ways.

I extracted the mapping data into individual files and repackaged
them, either by translating them into Python source code or by
archiving them into a zip file.

The following table shows the results, in kilobytes
(also available at
http://spreadsheets.google.com/pub?key=pWRBaY2ZM7mRgddF0Itd2IA ):

                none    minimal MSjk    MSall   current
Text            0       207     312     342     570 
Data            904     696     592     562     333 
                                            
raw-py          3006    2392    2016    1932    996 
zip-py          720     496     416     384     304 
                                            
raw-pyc         952     734     624     590     346 
zip-pyc         560     384     336     304     240 
Text+zip-pyc    560     591     648     646     810 
                                            
raw-both        3954    3124    2638    2520    1340
zip-both        1248    864     736     672     512 
                                               
zip-bare        560     384     336     304     240 
tarbz2-bare     496     352     320     304     240 

Each column represents a choice of which mapping tables are moved
out into external files.  In "none", no mapping is left as static
const C data, while in "current" only the new CNS11643 mappings are
extracted.  The "minimal" set keeps the major character set for each
country in static C data and moves the others out.  "MSjk"
additionally keeps some MS codepages for Japan and Korea, and "MSall"
keeps all MS codepage extensions in static const C data.  We may fix
the list of character sets that remain as C data, or let users pick
the sets using a configure option.

"Text" is the portion that remains in static const C data, which is
where all the current mapping tables live.  As discussed when
CJKCodecs was integrated into Python, such data can be shared between
processes on a system and is efficient, but it can't easily be
compressed or reorganized by users for redistribution.  "Data" is
the externally managed mapping tables.

The "raw-py" row shows the total size of the mapping tables as
Python source code.  "raw-pyc" shows the compiled (pyc) version of
the mapping tables.  "zip-py" and "zip-pyc" are zip-compressed
archives of "raw-py" and "raw-pyc", respectively.  These can be
imported using Python's zipimport machinery.
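
As a rough illustration of the zipimport route, here is a minimal
sketch written for a current Python 3.  The module name
_demo_cjk_mapping, the archive name mappings.zip and the tiny
two-entry table are all made up for this example; they are not the
actual CJKCodecs table layout.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Hypothetical module name and contents; real mapping tables are far larger.
MODULE_NAME = "_demo_cjk_mapping"
MODULE_SOURCE = "decode_map = {0x2121: 0x3000, 0x2122: 0x3001}\n"


def build_archive(directory):
    """Write the demo mapping module into a zip archive and return its path."""
    archive = os.path.join(directory, "mappings.zip")
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(MODULE_NAME + ".py", MODULE_SOURCE)
    return archive


def load_mapping(archive):
    """Import the mapping module from the archive through zipimport."""
    # Putting a zip file on sys.path lets the normal import system
    # hand the lookup over to zipimport.
    sys.path.insert(0, archive)
    try:
        importlib.invalidate_caches()
        return importlib.import_module(MODULE_NAME)
    finally:
        sys.path.remove(archive)


tmpdir = tempfile.mkdtemp()
module = load_mapping(build_archive(tmpdir))
print(hex(module.decode_map[0x2121]))
print(type(module.__loader__).__name__)
```

The same archive could hold .pyc files instead of .py files, which is
what the "zip-pyc" row measures.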

"zip-bare" and "tarbz2-bare" show the size of the archived raw
mapping table files, as their names suggest.
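
For anyone who wants to reproduce this kind of comparison, the sketch
below measures a zip archive against a tar.bz2 archive for the same
set of files.  The table contents are synthetic stand-ins (a sparse
table padded with a filler value), and the file names are invented,
so the sizes it prints say nothing about the real figures above.

```python
import io
import struct
import tarfile
import zipfile


def synthetic_table(entries):
    """Return stand-in table bytes (NOT real mapping data).

    Every third slot holds a distinct 16-bit value; the others hold
    0xFFFD as an "unmapped" filler, mimicking a sparse table.
    """
    values = (0x4E00 + i // 3 if i % 3 == 0 else 0xFFFD for i in range(entries))
    return b"".join(struct.pack(">H", v) for v in values)


def zip_size(files):
    """Size in bytes of a deflate-compressed zip holding the given files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return len(buf.getvalue())


def tarbz2_size(files):
    """Size in bytes of a bzip2-compressed tar holding the given files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


# Hypothetical file names for the two stand-in tables.
files = {"table_a.bin": synthetic_table(4096), "table_b.bin": synthetic_table(2048)}
raw = sum(len(data) for data in files.values())
print("raw", raw, "zip", zip_size(files), "tar.bz2", tarbz2_size(files))
```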

We have 560KB of mapping tables in the Python CJKCodecs part.
If we choose "zip-pyc" of the "minimal" set, the binary distribution
will be just as big as before even if we include the CNS11643
character set, and pythonXY.dll will get smaller by 363KB.
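
The figures can be checked against the table with a few lines of
Python.  The numbers are copied from the "Text" and "zip-pyc" rows;
the helper names total_size and dll_saving are only for this
illustration, and the saving is taken relative to the "current"
column on the assumption that its "Text" value is what the DLL
carries today.

```python
# Sizes in kilobytes, copied from the "Text" and "zip-pyc" rows of the table.
TEXT = {"none": 0, "minimal": 207, "MSjk": 312, "MSall": 342, "current": 570}
ZIP_PYC = {"none": 560, "minimal": 384, "MSjk": 336, "MSall": 304, "current": 240}


def total_size(column):
    """Static C data plus the zipped .pyc tables (the "Text+zip-pyc" row)."""
    return TEXT[column] + ZIP_PYC[column]


def dll_saving(column, baseline="current"):
    """How much static C data leaves the DLL compared with the baseline."""
    return TEXT[baseline] - TEXT[column]


for column in TEXT:
    print(column, total_size(column), dll_saving(column))
```

For the "minimal" column this gives a total of 591 and a saving of
363, matching the "Text+zip-pyc" row and the 363KB quoted above.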

What do you think about this scheme?  Are there any other ideas
for compression?
