msg230576 |
Author: Raymond Hettinger (rhettinger) * |
Date: 2014-11-04 05:00 |
Save space and reduce I/O time (reading and writing) by compressing the marshaled code in files.
In my code tree for Python 3, there was a nice space savings, from 19M down to 7M. Here's some of the output from my test:
8792 -> 4629 ./Tools/scripts/__pycache__/reindent.cpython-35.pyc
1660 -> 1063 ./Tools/scripts/__pycache__/rgrep.cpython-35.pyc
1995 -> 1129 ./Tools/scripts/__pycache__/run_tests.cpython-35.pyc
1439 -> 973 ./Tools/scripts/__pycache__/serve.cpython-35.pyc
727 -> 498 ./Tools/scripts/__pycache__/suff.cpython-35.pyc
3240 -> 1808 ./Tools/scripts/__pycache__/svneol.cpython-35.pyc
74866 -> 23611 ./Tools/scripts/__pycache__/texi2html.cpython-35.pyc
5562 -> 2870 ./Tools/scripts/__pycache__/treesync.cpython-35.pyc
1492 -> 970 ./Tools/scripts/__pycache__/untabify.cpython-35.pyc
1414 -> 891 ./Tools/scripts/__pycache__/which.cpython-35.pyc
19627963 -> 6976410 Total
I haven't measured it yet, but I believe this will improve Python's start-up time (because fewer bytes get transferred from disk).
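A minimal sketch of how such a measurement could be reproduced (the directory walk, the default zlib compression level, and the output format are assumptions, not the actual script used for the numbers above):

import os
import zlib

# Walk a checkout and report zlib-compressed vs. original .pyc sizes.
total_before = total_after = 0
for dirpath, dirnames, filenames in os.walk('.'):
    for name in filenames:
        if not name.endswith('.pyc'):
            continue
        path = os.path.join(dirpath, name)
        with open(path, 'rb') as f:
            data = f.read()
        compressed = zlib.compress(data)
        total_before += len(data)
        total_after += len(compressed)
        print('%d -> %d %s' % (len(data), len(compressed), path))
print('%d -> %d Total' % (total_before, total_after))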
|
msg230581 |
Author: Raymond Hettinger (rhettinger) * |
Date: 2014-11-04 06:16 |
Looking into this further, I suspect that the cleanest way to implement this would be to add zlib compression and decompression steps to marshal.c (bumping the version number to 5).
|
msg230600 |
Author: Antoine Pitrou (pitrou) * |
Date: 2014-11-04 09:41 |
This is similar to the idea of loading the stdlib from a zip file (but less intrusive and more debugging-friendly). The time savings will depend on whether the filesystem cache is cold or hot. In the latter case, my intuition is that decompression will slow things down a bit :-)
Quick decompression benchmark on a popular stdlib module, and a fast CPU:
$ ./python -m timeit -s "import zlib; data = zlib.compress(open('Lib/__pycache__/threading.cpython-35.pyc', 'rb').read())" "zlib.decompress(data)"
10000 loops, best of 3: 180 usec per loop
|
msg230607 |
Author: Marc-Andre Lemburg (lemburg) * |
Date: 2014-11-04 10:25 |
On 04.11.2014 10:41, Antoine Pitrou wrote:
>
> Antoine Pitrou added the comment:
>
> This is similar to the idea of loading the stdlib from a zip file (but less intrusive and more debugging-friendly). The time savings will depend on whether the filesystem cache is cold or hot. In the latter case, my intuition is that decompression will slow things down a bit :-)
>
> Quick decompression benchmark on a popular stdlib module, and a fast CPU:
>
> $ ./python -m timeit -s "import zlib; data = zlib.compress(open('Lib/__pycache__/threading.cpython-35.pyc', 'rb').read())" "zlib.decompress(data)"
> 10000 loops, best of 3: 180 usec per loop
zlib is rather slow when it comes to decompression. Something like
snappy or lz4 could work out, though:
https://code.google.com/p/snappy/
https://code.google.com/p/lz4/
Those were designed to be fast on decompression.
|
msg230610 |
Author: Antoine Pitrou (pitrou) * |
Date: 2014-11-04 10:39 |
Ok, comparison between zlib/snappy/lz4:
$ python3.4 -m timeit -s "import zlib; data = zlib.compress(open('Lib/__pycache__/threading.cpython-35.pyc', 'rb').read()); print(len(data))" "zlib.decompress(data)"
10000 loops, best of 3: 181 usec per loop
$ python3.4 -m timeit -s "import snappy; data = snappy.compress(open('Lib/__pycache__/threading.cpython-35.pyc', 'rb').read()); print(len(data))" "snappy.decompress(data)"
10000 loops, best of 3: 35 usec per loop
$ python3.4 -m timeit -s "import lz4; data = lz4.compress(open('Lib/__pycache__/threading.cpython-35.pyc', 'rb').read()); print(len(data))" "lz4.decompress(data)"
10000 loops, best of 3: 21.3 usec per loop
Compressed sizes for threading.cpython-35.pyc (the file used above):
- zlib: 14009 bytes
- snappy: 20573 bytes
- lz4: 21038 bytes
- uncompressed: 38973 bytes
Packages used:
https://pypi.python.org/pypi/lz4/0.7.0
https://pypi.python.org/pypi/python-snappy/0.5
|
msg230611 |
Author: Antoine Pitrou (pitrou) * |
Date: 2014-11-04 10:42 |
lz4 also has a "high compression" mode which improves the compression ratio (-> 17091 bytes compressed), for a similar decompression speed.
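For reference, with the current python-lz4 bindings (a newer API than the 0.7.0 package listed above), the high-compression mode can be selected roughly like this; the file path is just the one used in the benchmarks:

import lz4.block  # third-party python-lz4 package (block API)

with open('Lib/__pycache__/threading.cpython-35.pyc', 'rb') as f:
    data = f.read()

fast = lz4.block.compress(data)                          # default (fast) mode
hc = lz4.block.compress(data, mode='high_compression')   # LZ4HC: better ratio, similar decompression speed
print(len(data), len(fast), len(hc))

# Decompression is the same call regardless of which mode produced the data.
assert lz4.block.decompress(hc) == data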
|
msg230615 |
Author: Georg Brandl (georg.brandl) * |
Date: 2014-11-04 10:49 |
Both lz4 and snappy are BSD-licensed, but snappy is written in C++.
|
msg230631 |
Author: Brett Cannon (brett.cannon) * |
Date: 2014-11-04 15:35 |
Just FYI, this can easily be added into importlib since it works through marshal's API to unmarshal the module's data. There are also two startup benchmarks in the benchmark suite to help measure possible performance gains/losses, which should also ferret out whether cache warmth plays a significant role in the performance impact.
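A hedged sketch of that hook point from pure Python, assuming a hypothetical file containing a zlib-compressed marshal dump of a code object (no .pyc header handling, no cache invalidation, and not how importlib itself would actually grow the feature):

import importlib.machinery
import importlib.util
import marshal
import zlib

class CompressedBytecodeLoader(importlib.machinery.SourcelessFileLoader):
    """Load a module from a zlib-compressed marshal dump (hypothetical .pycz format)."""
    def get_code(self, fullname):
        with open(self.path, 'rb') as f:
            return marshal.loads(zlib.decompress(f.read()))

# Usage, given 'spam.pycz' written elsewhere as zlib.compress(marshal.dumps(code)):
loader = CompressedBytecodeLoader('spam', 'spam.pycz')
spec = importlib.util.spec_from_file_location('spam', 'spam.pycz', loader=loader)
module = importlib.util.module_from_spec(spec)
loader.exec_module(module)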
|
msg230730 |
Author: Antoine Pitrou (pitrou) * |
Date: 2014-11-06 09:15 |
FWIW, I personally doubt this would actually reduce startup time. Disk I/O cost is in the first access, not in the transfer size (unless we're talking hundreds of megabytes). But in any case, someone interested has to do measurements :-)
|
msg230756 |
Author: Stefan Behnel (scoder) * |
Date: 2014-11-06 18:59 |
FWIW, LZ4HC compression sounds like an obvious choice for write-once-read-many data like .pyc files to me. Blosc shows that you can achieve a pretty major performance improvement just by stuffing more data into less space (although it does it for RAM and CPU cache, not disk). And even if it ends up not being substantially faster for the specific case of .pyc files, there is really no reason why they should take more space on disk than necessary, so it's a sure win in any case.
|
msg230838 |
Author: Raymond Hettinger (rhettinger) * |
Date: 2014-11-08 06:32 |
> there is really no reason why they should take more space on disk
> than necessary, so it's a sure win in any case.
That is a nice summary.
> FWIW, LZ4HC compression sounds like an obvious choice for
> write-once-read-many data like .pyc files to me.
+1
|
msg230840 |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2014-11-08 09:28 |
Compressing pyc files one by one wouldn't save much space because disk space is allocated in blocks (up to 32 KiB on FAT32). If the size of a pyc file is less than the block size, we will not gain anything. A ZIP file has an advantage due to more compact packing of files. In addition, it can have lower access time due to less fragmentation. Unfortunately ZIP doesn't support LZ4 compression, but we can store LZ4-compressed files in a ZIP file without additional compression.
An uncompressed TAR file has the same advantages but needs a longer initialization time (for building the index).
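A small sketch of the "LZ4 inside a ZIP, stored uncompressed" idea (the archive name and the .lz4 suffix are made up for illustration; lz4 is the third-party binding mentioned above):

import zipfile
import lz4.block  # third-party python-lz4 binding

# Pre-compress the .pyc with LZ4, then add it with ZIP_STORED so zipfile
# does not deflate it a second time.
pyc_path = 'Lib/__pycache__/threading.cpython-35.pyc'
with open(pyc_path, 'rb') as f:
    payload = lz4.block.compress(f.read())

with zipfile.ZipFile('pyc_bundle.zip', 'w', compression=zipfile.ZIP_STORED) as zf:
    zf.writestr('threading.cpython-35.pyc.lz4', payload)

# Reading back: one lookup in the ZIP index, then a fast LZ4 decompress.
with zipfile.ZipFile('pyc_bundle.zip') as zf:
    data = lz4.block.decompress(zf.read('threading.cpython-35.pyc.lz4'))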
|
msg230842 |
Author: Marc-Andre Lemburg (lemburg) * |
Date: 2014-11-08 10:08 |
On 08.11.2014 10:28, Serhiy Storchaka wrote:
> Compressing pyc files one by one wouldn't save much space because disk space is allocated in blocks (up to 32 KiB on FAT32). If the size of a pyc file is less than the block size, we will not gain anything. A ZIP file has an advantage due to more compact packing of files. In addition, it can have lower access time due to less fragmentation. Unfortunately ZIP doesn't support LZ4 compression, but we can store LZ4-compressed files in a ZIP file without additional compression.
>
> An uncompressed TAR file has the same advantages but needs a longer initialization time (for building the index).
The aim is to reduce file load time, not really to save disk space.
By having less data to read from the disk, it may be possible
to achieve a small startup speedup.
However, you're right in that using a single archive with many PYC files
would be more efficient, since it lowers the number of stat() calls.
The trick of storing LZ4-compressed data in a ZIP file would enable this.
BTW: We could add optional LZ4 compression to the marshal format to
make all this work transparently and without having to change the
import mechanism itself:
We'd just need to add a new flag or type code indicating that the rest
of the stream is LZ4 compressed. The PYC writer could then enable this
flag or type code by default (or perhaps have it enabled via some env var
or command line flag) and everything would then just work with both
LZ4-compressed byte code as well as non-compressed byte code.
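A pure-Python sketch of that idea (the actual proposal is a change inside marshal.c; the marker byte here is hypothetical and zlib stands in for whichever codec would be chosen):

import marshal
import zlib

# Hypothetical marker; a real implementation would pick a byte that cannot
# start a valid marshal stream.
COMPRESSED_MARKER = b'\xfa'

def dump_maybe_compressed(code_obj, compress=True):
    payload = marshal.dumps(code_obj)
    return COMPRESSED_MARKER + zlib.compress(payload) if compress else payload

def load_maybe_compressed(data):
    # Old, uncompressed streams keep working: only the marker triggers decompression.
    if data[:1] == COMPRESSED_MARKER:
        return marshal.loads(zlib.decompress(data[1:]))
    return marshal.loads(data)

code = compile("print('hello')", '<demo>', 'exec')
exec(load_maybe_compressed(dump_maybe_compressed(code)))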
|
msg404622 |
Author: Guido van Rossum (gvanrossum) * |
Date: 2021-10-21 16:53 |
The space savings are nice, but I doubt that it will matter for startup time -- startup is most relevant in situations where it's *hot* (e.g. a shell script that repeatedly calls out to utilities written in Python).
|
Date | User | Action | Args
2022-04-11 14:58:09 | admin | set | github: 66978
2021-10-26 04:56:26 | barry | set | nosy: + barry
2021-10-23 16:55:49 | FFY00 | set | nosy: + FFY00
2021-10-21 16:53:10 | gvanrossum | set | nosy: + gvanrossum; messages: + msg404622
2020-03-18 18:02:23 | brett.cannon | set | nosy: - brett.cannon
2014-11-08 10:08:17 | lemburg | set | messages: + msg230842
2014-11-08 09:28:55 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg230840
2014-11-08 06:32:04 | rhettinger | set | messages: + msg230838
2014-11-06 18:59:59 | scoder | set | nosy: + scoder; messages: + msg230756
2014-11-06 09:15:20 | pitrou | set | messages: + msg230730
2014-11-04 15:59:21 | christian.heimes | set | nosy: + christian.heimes
2014-11-04 15:35:47 | brett.cannon | set | messages: + msg230631
2014-11-04 11:58:01 | Arfrever | set | nosy: + Arfrever
2014-11-04 10:49:49 | georg.brandl | set | nosy: + georg.brandl; messages: + msg230615
2014-11-04 10:42:44 | pitrou | set | messages: + msg230611
2014-11-04 10:39:59 | pitrou | set | messages: + msg230610
2014-11-04 10:25:08 | lemburg | set | nosy: + lemburg; messages: + msg230607
2014-11-04 09:41:05 | pitrou | set | nosy: + tim.peters; messages: + msg230600
2014-11-04 06:16:55 | rhettinger | set | messages: - msg230580
2014-11-04 06:16:46 | rhettinger | set | messages: + msg230581
2014-11-04 06:09:19 | rhettinger | set | nosy: + brett.cannon, pitrou; messages: + msg230580; components: + Interpreter Core
2014-11-04 05:00:45 | rhettinger | create |