This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: Compress the marshalled data in PYC files
Type: enhancement
Stage: needs patch
Components: Interpreter Core
Versions: Python 3.5

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Arfrever, FFY00, barry, christian.heimes, georg.brandl, gvanrossum, lemburg, pitrou, rhettinger, scoder, serhiy.storchaka, tim.peters
Priority: normal
Keywords:

Created on 2014-11-04 05:00 by rhettinger, last changed 2022-04-11 14:58 by admin.

Files
File name        Uploaded                      Description
compress_pyc.py  rhettinger, 2014-11-04 05:00  Estimate space savings potential in PYC files
Messages (14)
msg230576 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-11-04 05:00
Save space and reduce I/O time (reading and writing) by compressing the marshalled code in PYC files.

In my code tree for Python 3, there was a nice space savings: from 19M to 7M.  Here's some of the output from my test:

    8792 ->     4629 ./Tools/scripts/__pycache__/reindent.cpython-35.pyc
    1660 ->     1063 ./Tools/scripts/__pycache__/rgrep.cpython-35.pyc
    1995 ->     1129 ./Tools/scripts/__pycache__/run_tests.cpython-35.pyc
    1439 ->      973 ./Tools/scripts/__pycache__/serve.cpython-35.pyc
     727 ->      498 ./Tools/scripts/__pycache__/suff.cpython-35.pyc
    3240 ->     1808 ./Tools/scripts/__pycache__/svneol.cpython-35.pyc
   74866 ->    23611 ./Tools/scripts/__pycache__/texi2html.cpython-35.pyc
    5562 ->     2870 ./Tools/scripts/__pycache__/treesync.cpython-35.pyc
    1492 ->      970 ./Tools/scripts/__pycache__/untabify.cpython-35.pyc
    1414 ->      891 ./Tools/scripts/__pycache__/which.cpython-35.pyc
19627963 ->  6976410 Total

I haven't measured it yet, but I believe this will improve Python's start-up time (because fewer bytes get transferred from disk).
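The attached compress_pyc.py is not reproduced in this issue; the following is a sketch of that kind of measurement, walking a tree and printing raw versus zlib-compressed sizes for each .pyc file in the same format as the output above (`estimate_savings` is a hypothetical name, not from the attachment):

```python
import os
import zlib

def estimate_savings(root):
    """Walk a tree and report zlib-compressed sizes of its .pyc files.

    A rough sketch of the measurement behind the numbers above;
    it only estimates sizes and does not rewrite any files.
    """
    total_before = total_after = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith('.pyc'):
                continue
            path = os.path.join(dirpath, name)
            with open(path, 'rb') as f:
                data = f.read()
            compressed = zlib.compress(data)
            total_before += len(data)
            total_after += len(compressed)
            print('%8d -> %8d %s' % (len(data), len(compressed), path))
    print('%8d -> %8d Total' % (total_before, total_after))
    return total_before, total_after
```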
msg230581 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-11-04 06:16
Looking into this further, I suspect that the cleanest way to implement this would be to add zlib compression and decompression directly to marshal.c (bumping the format version number to 5).
msg230600 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-11-04 09:41
This is similar to the idea of loading the stdlib from a zip file (but less intrusive and more debugging-friendly). The time savings will depend on whether the filesystem cache is cold or hot. In the latter case, my intuition is that decompression will slow things down a bit :-)

Quick decompression benchmark on a popular stdlib module, and a fast CPU:

$ ./python -m timeit -s "import zlib; data = zlib.compress(open('Lib/__pycache__/threading.cpython-35.pyc', 'rb').read())" "zlib.decompress(data)"
10000 loops, best of 3: 180 usec per loop
msg230607 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2014-11-04 10:25
On 04.11.2014 10:41, Antoine Pitrou wrote:
> 
> Antoine Pitrou added the comment:
> 
> This is similar to the idea of loading the stdlib from a zip file (but less intrusive and more debugging-friendly). The time savings will depend on whether the filesystem cache is cold or hot. In the latter case, my intuition is that decompression will slow things down a bit :-)
> 
> Quick decompression benchmark on a popular stdlib module, and a fast CPU:
> 
> $ ./python -m timeit -s "import zlib; data = zlib.compress(open('Lib/__pycache__/threading.cpython-35.pyc', 'rb').read())" "zlib.decompress(data)"
> 10000 loops, best of 3: 180 usec per loop

zlib is rather slow when it comes to decompression. Something like
snappy or lz4 could work out, though:

https://code.google.com/p/snappy/
https://code.google.com/p/lz4/

Those were designed to be fast on decompression.
msg230610 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-11-04 10:39
Ok, comparison between zlib/snappy/lz4:

$ python3.4 -m timeit -s "import zlib; data = zlib.compress(open('Lib/__pycache__/threading.cpython-35.pyc', 'rb').read()); print(len(data))" "zlib.decompress(data)"
10000 loops, best of 3: 181 usec per loop

$ python3.4 -m timeit -s "import snappy; data = snappy.compress(open('Lib/__pycache__/threading.cpython-35.pyc', 'rb').read()); print(len(data))" "snappy.decompress(data)"
10000 loops, best of 3: 35 usec per loop

$ python3.4 -m timeit -s "import lz4; data = lz4.compress(open('Lib/__pycache__/threading.cpython-35.pyc', 'rb').read()); print(len(data))" "lz4.decompress(data)"
10000 loops, best of 3: 21.3 usec per loop

Compressed sizes for threading.cpython-35.pyc (the file used above):
- zlib: 14009 bytes
- snappy: 20573 bytes
- lz4: 21038 bytes
- uncompressed: 38973 bytes

Packages used:
https://pypi.python.org/pypi/lz4/0.7.0
https://pypi.python.org/pypi/python-snappy/0.5
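The figures above come from third-party lz4 and snappy bindings. For a self-contained stand-in, a similar size/decompression-time comparison can be run with standard-library codecs only (zlib, bz2, lzma); the absolute numbers differ from the ones above, but the same size-versus-speed trade-off shows up. `compare_codecs` is a hypothetical helper, not part of any of the packages listed:

```python
import bz2
import lzma
import timeit
import zlib

def compare_codecs(data):
    """Compare compressed size and decompression time for stdlib codecs.

    Returns {name: (compressed_size, usec_per_decompress)} and prints
    one line per codec, mirroring the comparison in this message.
    """
    results = {}
    codecs = [
        ('zlib', zlib.compress, zlib.decompress),
        ('bz2', bz2.compress, bz2.decompress),
        ('lzma', lzma.compress, lzma.decompress),
    ]
    for name, comp, decomp in codecs:
        blob = comp(data)
        # Best of several timing runs, scaled to microseconds per call.
        usec = min(timeit.repeat(lambda: decomp(blob), number=100)) * 1e4
        results[name] = (len(blob), usec)
        print('%-5s %7d bytes  %8.1f usec/decompress' % (name, len(blob), usec))
    return results
```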
msg230611 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-11-04 10:42
lz4 also has a "high compression" mode which improves the compression ratio (-> 17091 bytes compressed), for a similar decompression speed.
msg230615 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2014-11-04 10:49
Both lz4 and snappy are BSD-licensed, but snappy is written in C++.
msg230631 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2014-11-04 15:35
Just FYI, this can easily be added into importlib since it works through marshal's API to unmarshal the module's data. There are also two startup benchmarks in the benchmark suite to help measure possible performance gains/losses, which should also ferret out whether cache warmth will play a significant role in the performance impact.
msg230730 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-11-06 09:15
FWIW, I personally doubt this would actually reduce startup time. Disk I/O cost is in the first access, not in the transfer size (unless we're talking hundreds of megabytes). But in any case, someone interested has to do measurements :-)
msg230756 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2014-11-06 18:59
FWIW, LZ4HC compression sounds like an obvious choice for write-once-read-many data like .pyc files to me. Blosc shows that you can achieve a pretty major performance improvement just by stuffing more data into less space (although it does it for RAM and CPU cache, not disk). And even if it ends up not being substantially faster for the specific case of .pyc files, there is really no reason why they should take more space on disk than necessary, so it's a sure win in any case.
msg230838 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-11-08 06:32
> there is really no reason why they should take more space on disk
> than necessary, so it's a sure win in any case.

That is a nice summary.

> FWIW, LZ4HC compression sounds like an obvious choice for
> write-once-read-many data like .pyc files to me.

+1
msg230840 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-11-08 09:28
Compressing pyc files one by one wouldn't save much space because disk space is allocated in blocks (up to 32 KiB on FAT32). If the size of a pyc file is less than the block size, we will not gain anything. A ZIP file has an advantage due to more compact packing of files. In addition, it can have a lower access time due to less fragmentation. Unfortunately ZIP doesn't support LZ4 compression, but we can store LZ4-compressed files in a ZIP file without additional compression.

An uncompressed TAR file has the same advantages but needs a longer initialization time (for building the index).
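The last point — compressed payloads stored in a ZIP with no further compression — can be sketched with the standard library alone. zlib stands in for LZ4 here (the zipfile module has no LZ4 support), and both helper names are hypothetical:

```python
import zipfile
import zlib

def write_precompressed_zip(archive_path, members):
    """Store already-compressed payloads in a ZIP with no extra compression.

    `members` maps archive names to raw bytes.  Each payload is
    compressed up front (zlib as a stand-in for LZ4) and written with
    ZIP_STORED, so the ZIP layer contributes only the central index,
    not a second compression pass.
    """
    with zipfile.ZipFile(archive_path, 'w') as zf:
        for name, raw in members.items():
            zf.writestr(name, zlib.compress(raw),
                        compress_type=zipfile.ZIP_STORED)

def read_precompressed_zip(archive_path, name):
    """Read one member back and undo the up-front compression."""
    with zipfile.ZipFile(archive_path) as zf:
        return zlib.decompress(zf.read(name))
```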
msg230842 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2014-11-08 10:08
On 08.11.2014 10:28, Serhiy Storchaka wrote:
> Compressing pyc files one by one wouldn't save much space because disk space is allocated in blocks (up to 32 KiB on FAT32). If the size of a pyc file is less than the block size, we will not gain anything. A ZIP file has an advantage due to more compact packing of files. In addition, it can have a lower access time due to less fragmentation. Unfortunately ZIP doesn't support LZ4 compression, but we can store LZ4-compressed files in a ZIP file without additional compression.
> 
> An uncompressed TAR file has the same advantages but needs a longer initialization time (for building the index).

The aim is to reduce file load time, not really to save disk space.
By having less data to read from the disk, it may be possible
to achieve a small startup speedup.

However, you're right in that using a single archive with many PYC files
would be more efficient, since it lowers the number of stat() calls.
The trick to store LZ4 compressed data in a ZIP file would enable this.

BTW: We could add optional LZ4 compression to the marshal format to
make all this work transparently and without having to change the
import mechanism itself:

We'd just need to add a new flag or type code indicating that the rest
of the stream is LZ4 compressed. The PYC writer could then enable this
flag or type code by default (or perhaps enable it via some env var or
command line flag) and everything would then just work with both
LZ4-compressed byte code as well as non-compressed byte code.
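A minimal sketch of this flag-byte idea, at the Python level rather than inside marshal.c: zlib stands in for LZ4 (which is not in the stdlib), and the flag values and helper names are illustrative only, not part of the real marshal format.

```python
import marshal
import zlib

# Illustrative one-byte prefixes marking whether the marshal payload
# that follows is compressed; these values are hypothetical.
FLAG_PLAIN = b'\x00'
FLAG_COMPRESSED = b'\x01'

def dumps(obj, compress=True):
    """Marshal `obj`, optionally compressing the stream behind a flag byte.

    A real implementation would live inside marshal.c and bump the
    format version instead of prefixing a byte like this.
    """
    payload = marshal.dumps(obj)
    if compress:
        return FLAG_COMPRESSED + zlib.compress(payload)
    return FLAG_PLAIN + payload

def loads(data):
    """Inspect the flag byte and decompress only when needed."""
    flag, payload = data[:1], data[1:]
    if flag == FLAG_COMPRESSED:
        payload = zlib.decompress(payload)
    return marshal.loads(payload)
```

With this shape, a reader handles both compressed and uncompressed streams transparently, which is the "everything would then just work" property described above.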
msg404622 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-10-21 16:53
The space savings are nice, but I doubt that it will matter for startup time -- startup is most relevant in situations where it's *hot* (e.g. a shell script that repeatedly calls out to utilities written in Python).
History
Date                 User              Action  Args
2022-04-11 14:58:09  admin             set     github: 66978
2021-10-26 04:56:26  barry             set     nosy: + barry
2021-10-23 16:55:49  FFY00             set     nosy: + FFY00
2021-10-21 16:53:10  gvanrossum        set     nosy: + gvanrossum; messages: + msg404622
2020-03-18 18:02:23  brett.cannon      set     nosy: - brett.cannon
2014-11-08 10:08:17  lemburg           set     messages: + msg230842
2014-11-08 09:28:55  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg230840
2014-11-08 06:32:04  rhettinger        set     messages: + msg230838
2014-11-06 18:59:59  scoder            set     nosy: + scoder; messages: + msg230756
2014-11-06 09:15:20  pitrou            set     messages: + msg230730
2014-11-04 15:59:21  christian.heimes  set     nosy: + christian.heimes
2014-11-04 15:35:47  brett.cannon      set     messages: + msg230631
2014-11-04 11:58:01  Arfrever          set     nosy: + Arfrever
2014-11-04 10:49:49  georg.brandl      set     nosy: + georg.brandl; messages: + msg230615
2014-11-04 10:42:44  pitrou            set     messages: + msg230611
2014-11-04 10:39:59  pitrou            set     messages: + msg230610
2014-11-04 10:25:08  lemburg           set     nosy: + lemburg; messages: + msg230607
2014-11-04 09:41:05  pitrou            set     nosy: + tim.peters; messages: + msg230600
2014-11-04 06:16:55  rhettinger        set     messages: - msg230580
2014-11-04 06:16:46  rhettinger        set     messages: + msg230581
2014-11-04 06:09:19  rhettinger        set     nosy: + brett.cannon, pitrou; messages: + msg230580; components: + Interpreter Core
2014-11-04 05:00:45  rhettinger        create