Created on 2009-03-24 19:48 by goddard, last changed 2013-05-18 19:27 by serhiy.storchaka.
|msg84108 - (view)||Author: Tom Goddard (goddard)||Date: 2009-03-24 19:48|
Bytecode compiling large Python files uses an unexpectedly large amount of memory. For example, compiling a file containing a list of 5 million integers uses about 2 GB of memory, while the Python file itself is about 40 MB: the memory used is 50 times the file size. The resulting list consumes about 400 MB of memory, so compiling the bytecode uses about 5 times the memory of the list object. Can the bytecode compilation be made more memory efficient?

The application that creates similarly large Python files is a molecular graphics program called UCSF Chimera that my lab develops. It writes session files which are Python code. Sessions of reasonable size for Chimera for a given amount of physical memory cannot be byte-compiled without thrashing, crippling the interactivity of all software running on the machine.

Here is Python code to produce the test file test.py containing a list of 5 million integers:

print >>open('test.py','w'), 'x = ', repr(range(5000000))

I tried importing the test.py file with Python 2.5, 2.6.1 and 3.0.1 on Mac OS 10.5.6. In each case, when the test.pyc file is not present, the python process as monitored by the unix "top" command took about 1.7 GB RSS and 2.2 GB VSZ on a MacBook Pro with 2 GB of memory.
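The reproduction above uses Python 2 syntax. A Python 3 sketch of the same reproduction, scaled down so it runs quickly (the report used 5,000,000 integers and a ~40 MB source file), would be:

```python
# Generate a Python source file containing one large list literal,
# then byte-compile it. At the original scale (5,000,000 integers),
# the compile step is what consumed ~2 GB of memory.
import os
import py_compile

COUNT = 50_000  # the bug report used 5_000_000

with open("test.py", "w") as f:
    f.write("x = %r\n" % list(range(COUNT)))

print(os.path.getsize("test.py"))  # size of the generated source, in bytes

py_compile.compile("test.py")  # writes the .pyc under __pycache__
```

Watching this process with "top" (or a memory profiler) while raising COUNT shows the compile-time memory growing far faster than the file size.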
|msg84110 - (view)||Author: Martin v. Löwis (loewis) *||Date: 2009-03-24 20:19|
It might be possible to make it more efficient. However, the primary purpose of the compiler is to support hand-written source code, and such code should never run into these problems. So lowering the priority. If you want this resolved, it might be best if you provide a patch.
|msg84116 - (view)||Author: STINNER Victor (haypo) *||Date: 2009-03-24 22:09|
Python uses an inefficient memory structure for integers. You should use a third-party library like numpy to manipulate large integer vectors.
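numpy is third-party and may not be installed; the stdlib `array` module illustrates the same idea of storing integers in one contiguous typed buffer instead of millions of separate int objects:

```python
# Compare a list of Python ints with a typed array of C longs.
from array import array
import sys

n = 100_000
as_list = list(range(n))
as_array = array("l", range(n))  # one machine long per element, no per-int objects

# The array's data is a single raw buffer of itemsize * n bytes.
# The list stores n pointers, each pointing at a full int object.
print(as_array.itemsize * len(as_array))  # contiguous buffer size in bytes
print(sys.getsizeof(as_list))             # list's pointer array alone,
                                          # not counting the int objects
```

The same values round-trip between the two representations, but the typed buffer avoids the per-object overhead that makes a list of 5 million ints so large.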
|msg84133 - (view)||Author: Antoine Pitrou (pitrou) *||Date: 2009-03-25 00:25|
When compiling a source file to bytecode, Python first builds a syntax tree in memory. It is very likely that the memory consumption you observe is due to the size of that syntax tree. It is also unlikely that anyone other than you will want to modify the parsing code to accommodate such an extreme usage scenario :-) For persistence of large data structures, I suggest using cPickle or a similar mechanism. You can even embed the pickles in literal strings if you still need your sessions to be Python source code:

>>> import cPickle
>>> f = open("test.py", "w")
>>> f.write("import cPickle\n")
>>> f.write("x = cPickle.loads(%s)" % repr(cPickle.dumps(range(5000000), protocol=-1)))
>>> f.close()
>>> import test
>>> len(test.x)
5000000
|msg84144 - (view)||Author: Tom Goddard (goddard)||Date: 2009-03-25 07:02|
I agree that having such large Python code files is a rare circumstance and that optimizing the bytecode compiler for it should be a low priority. Thanks for the cPickle suggestion.

The Chimera session file Python code is mostly large nested dictionaries and sequences. I tested embedding the data structures in the Python code with cPickle and repr(), but got a rather larger file size because each 8-bit character became 4 bytes in the text file string (e.g. "\xe8"). Using cPickle with base64 encoding dropped the file size by about a factor of 2.5, and cPickle with bzip2 or zlib compression plus base64 dropped the size another factor of 2. The big win is that the bytecode compilation used 150 MB and 5 seconds instead of 2 GB and 15 minutes of thrashing for a 40 MB Python file.

I think our reason for not originally using pickled data in the session files was that we like users to be able to look at and edit the session files in a text editor. (This is research software, where such hacks are sometimes handy.) But the especially large data structures in the sessions can't reasonably be meddled with by users, so pickling should be fine. Pickling adds about 15% to the session save time and reduces session opening time by about the same amount. Compression slows the save down another 15% and is probably not worth the factor-of-2 reduction in file size in our case.
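In Python 3, the base64-embedded-pickle approach described above might be sketched like this (`session.py` is a hypothetical filename; pickle data should only ever be loaded from trusted sources):

```python
# Write a session file whose large data is one base64 string literal
# instead of a multi-megabyte list display, so byte-compiling it is cheap.
import base64
import pickle
import runpy

data = list(range(100_000))  # stand-in for a large session structure

payload = base64.b64encode(pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL))
with open("session.py", "w") as f:
    f.write("import base64, pickle\n")
    f.write("x = pickle.loads(base64.b64decode(%r))\n" % payload)

# Executing session.py compiles only a short script plus one string literal.
ns = runpy.run_path("session.py")
print(len(ns["x"]))
```

The compiler now sees two small statements rather than a literal with 100,000 elements, which is exactly why the compile-time memory drops so dramatically.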
|msg84156 - (view)||Author: Antoine Pitrou (pitrou) *||Date: 2009-03-25 10:37|
If you want editable data, you could use json instead of pickle. The simplejson library has very fast encoding/decoding (faster than cPickle according to its author).
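A sketch of the json variant, keeping the session data human-editable (the stdlib `json` module is used here; simplejson offers the same interface, and `session.json` is a hypothetical filename):

```python
# Store session data as json: readable and editable in a text editor,
# and never byte-compiled, since it is not Python source.
import json

data = {"atoms": list(range(1000)), "title": "demo session"}

with open("session.json", "w") as f:
    json.dump(data, f)

with open("session.json") as f:
    restored = json.load(f)

print(restored["title"])
```

Unlike a pickle blob, the resulting file can be inspected and tweaked by users, at the cost of supporting only json-representable types (dicts with string keys, lists, strings, numbers, booleans, null).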
|2013-05-18 19:27:18||serhiy.storchaka||set||status: open -> pending|
|2009-03-25 10:38:00||pitrou||set||messages: + msg84156|
|2009-03-25 07:02:48||goddard||set||messages: + msg84144|
|2009-03-25 00:25||pitrou||set||messages: + msg84133|
|2009-03-24 22:09||haypo||set||messages: + msg84116|
|2009-03-24 20:19:32||loewis||set||priority: low, nosy: + loewis, messages: + msg84110|