classification
Title: Byte-code compilation uses excessive memory
Type: performance
Stage:
Components: Interpreter Core
Versions: Python 2.6

process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: Zhiping.Deng, collinwinter, georg.brandl, goddard, loewis, pitrou, vstinner
Priority: low
Keywords:

Created on 2009-03-24 19:48 by goddard, last changed 2013-10-13 17:53 by georg.brandl. This issue is now closed.

Messages (7)
msg84108 - Author: Tom Goddard (goddard) Date: 2009-03-24 19:48
Bytecode compiling large Python files uses an unexpectedly large amount
of memory.  For example, compiling a file containing a list of 5 million
integers uses about 2 Gbytes of memory while the Python file size is
about 40 Mbytes.  The memory used is 50 times the file size.  The
resulting list in Python consumes about 400 Mbytes of memory, so
compiling the byte codes uses about 5 times the memory of the list
object.  Can the byte-code compilation be made more memory efficient?

The application that creates similarly large Python files is a
molecular graphics program called UCSF Chimera that my lab develops.  It
writes session files which are Python code.  Sessions of reasonable size
for Chimera for a given amount of physical memory cannot be
byte-compiled without thrashing, crippling the interactivity of all
software running on the machine.

Here is Python code to produce the test file test.py containing a list
of 5 million integers:

print >>open('test.py','w'), 'x = ', repr(range(5000000))

I tried importing the test.py file with Python 2.5, 2.6.1 and 3.0.1 on
Mac OS 10.5.6.  In each case when the test.pyc file is not present the
python process as monitored by the unix "top" command took about 1.7 Gb
RSS and 2.2 Gb VSZ on a MacBook Pro which has 2 Gb of memory.
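
For readers on Python 3, the measurement above can be approximated at a smaller scale without writing a file. The following is only a sketch, not the method used in this report (which watched the process with "top"): it assumes Python 3.4 or later for the tracemalloc module, the helper name compile_peak_bytes is made up for illustration, and tracemalloc counts only memory requested through Python's own allocators, so the figures will not match RSS or the 2009 numbers.

```python
import tracemalloc


def compile_peak_bytes(n):
    """Compile the source 'x = [0, 1, ..., n-1]' and return
    (source size in characters, peak traced memory in bytes).

    Hypothetical helper for illustration only.
    """
    source = "x = " + repr(list(range(n)))
    tracemalloc.start()
    try:
        # Parse and byte-compile the source string; nothing is executed.
        compile(source, "<test>", "exec")
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return len(source), peak


if __name__ == "__main__":
    size, peak = compile_peak_bytes(100000)
    print("source: %d chars, peak traced memory: %d bytes" % (size, peak))
    print("ratio: %.1f" % (peak / float(size)))
```

Raising n towards 5000000 should approach the scenario described above, at the cost of the memory use this issue is about.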
msg84110 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-03-24 20:19
It might be possible to make it more efficient. However, the primary
purpose of source code is to support hand-written code, and such code
should never run into such problems.

So lowering the priority. If you want this resolved, it might be best if
you provide a patch.
msg84116 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-24 22:09
Python uses an inefficient memory structure for integers. You should use a
third-party library like numpy to manipulate large integer vectors.
msg84133 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-25 00:25
When compiling a source file to bytecode, Python first builds a syntax
tree in memory. It is very likely that the memory consumption you
observe is due to the size of the syntax tree. It is also unlikely that
anyone other than you will want to modify the parsing code to
accommodate such an extreme usage scenario :-)

For persistence of large data structures, I suggest using cPickle or a
similar mechanism. You can even embed the pickles in literal strings if
you still need your sessions to be Python source code:

>>> import cPickle
>>> f = open("test.py", "w")
>>> f.write("import cPickle\n")
>>> f.write("x = cPickle.loads(%s)" % repr(cPickle.dumps(range(5000000),
...     protocol=-1)))
>>> f.close()
>>> import test
>>> len(test.x)
5000000
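
The session above is Python 2 (cPickle, and range() returning a list). A rough Python 3 equivalent might look like the sketch below. It assumes the pickle module in place of cPickle and relies on repr() of a bytes object being a valid bytes literal; the function name make_session_source is hypothetical and not part of any library.

```python
import pickle


def make_session_source(data, name="x"):
    """Return Python source text that rebuilds `data` from an embedded pickle.

    Hypothetical helper; `name` is the variable the generated code assigns to.
    """
    payload = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
    # %r on a bytes object yields a b'...' literal that Python can parse back.
    return "import pickle\n%s = pickle.loads(%r)\n" % (name, payload)


if __name__ == "__main__":
    source = make_session_source(list(range(5000)))
    namespace = {}
    # Compile and run the generated source, as an import of the file would.
    exec(compile(source, "<session>", "exec"), namespace)
    print(len(namespace["x"]))
```

The compiler then sees one large bytes constant rather than millions of separate integer nodes, which is the point of the suggestion.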
msg84144 - Author: Tom Goddard (goddard) Date: 2009-03-25 07:02
I agree that having such large Python code files is a rare circumstance
and optimizing the byte-code compiler for that should be a low priority.

Thanks for the cPickle suggestion.  The Chimera session file Python code
is mostly large nested dictionaries and sequences.  I tested using cPickle
and repr() to embed the data structures in the Python code.  That gave a
rather larger file size because each 8-bit character became 4 bytes in the
text file string (e.g. "\xe8").  Using cPickle and base64 encoding dropped
the file size by about a factor of 2.5, and cPickle, bzip2 or zlib
compression, and base64 dropped the size by another factor of 2.  The big
win is that the byte-code compilation used 150 Mbytes and 5 seconds
instead of 2 Gbytes and 15 minutes of thrashing for a 40 Mbyte Python
file.  Our reason for not using pickled data in the session files
originally was, I think, that we like users to be able to look at and edit
the session files in a text editor.  (This is research software where
such hacks are sometimes handy.)  But the especially large data
structures in the sessions can't reasonably be meddled with by users, so
pickling should be fine.  Pickling adds about 15% to the session save
time and reduces the session opening time by about the same amount.
Compression slows the save down by another 15% and is probably not worth
the factor of 2 reduction in file size in our case.
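
The pickle, compression and base64 pipeline described above could be sketched as follows in Python 3. This is an illustration under assumptions, not the actual Chimera code: the function names encode_blob and decode_blob are invented, and zlib is used as the compressor (bz2 would be a drop-in alternative).

```python
import base64
import pickle
import zlib


def encode_blob(data, compress=True):
    """Pickle `data`, optionally zlib-compress it, and return ASCII base64 text."""
    raw = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_blob(text, compressed=True):
    """Reverse encode_blob(); `compressed` must match the flag used to encode."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    return pickle.loads(raw)


if __name__ == "__main__":
    session = {"atoms": list(range(100000)), "title": "example"}
    plain = encode_blob(session, compress=False)
    packed = encode_blob(session, compress=True)
    print("base64 only: %d chars, with zlib: %d chars" % (len(plain), len(packed)))
    assert decode_blob(packed) == session
```

Base64 text contains no characters that need escaping, so it avoids the 4-bytes-per-character blow-up that repr() causes for 8-bit data.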
msg84156 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-25 10:37
If you want editable data, you could use json instead of pickle. The
simplejson library has very fast encoding/decoding (faster than cPickle
according to its author).
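
A minimal sketch of the JSON route, using the standard-library json module rather than simplejson (an assumption made here so that the example has no third-party dependency). The helper names are hypothetical. Note that JSON handles only dicts, lists, strings, numbers, booleans and None, so tuples come back as lists and dictionary keys must be strings.

```python
import json


def to_editable_text(data):
    """Serialize `data` as indented JSON that can be edited in a text editor."""
    return json.dumps(data, indent=1, sort_keys=True)


def from_editable_text(text):
    """Parse text produced by to_editable_text() (possibly hand-edited)."""
    return json.loads(text)


if __name__ == "__main__":
    session = {"title": "example", "coords": [[0.0, 1.5], [2.0, 3.5]]}
    text = to_editable_text(session)
    print(text)
    assert from_editable_text(text) == session
```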
msg199737 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2013-10-13 17:53
Closing, as without a specific issue to fix it is unlikely that this will change.
History
Date                 User              Action  Args
2013-10-13 17:53:07  georg.brandl      set     status: pending -> closed; nosy: + georg.brandl; messages: + msg199737; resolution: wont fix
2013-05-18 19:27:18  serhiy.storchaka  set     status: open -> pending
2012-05-08 03:31:21  Zhiping.Deng      set     nosy: + Zhiping.Deng
2009-03-27 05:53:38  collinwinter      set     nosy: + collinwinter
2009-03-25 10:38:00  pitrou            set     messages: + msg84156
2009-03-25 07:02:48  goddard           set     messages: + msg84144
2009-03-25 00:26:00  pitrou            set     nosy: + pitrou; messages: + msg84133
2009-03-24 22:09:23  vstinner          set     nosy: + vstinner; messages: + msg84116
2009-03-24 20:19:32  loewis            set     priority: low; nosy: + loewis; messages: + msg84110
2009-03-24 19:48:43  goddard           create