Message 96221 - Python tracker


Author	pitrou
Recipients	asnakelover, brian.curtin, pitrou
Date	2009-12-10.23:03:20
Message-id	<1260486235.3414.6.camel@localhost>
In-reply-to	<1260485677.82.0.178769470321.issue7471@psf.upfronthosting.co.za>

Content
> The gz in question is 17mb compressed and 247mb uncompressed. Calling
> zcat the python process uses between 250 and 260 mb with the whole
> string in memory using zcat as a fork. Numbers for the gzip module
> aren't obtainable except for readline(), which doesn't use much memory
> but is very slow. Other methods thrash the machine to death.
> 
> The machine has 300mb free RAM from a total of 1024mb.

That would be the explanation. Reading the whole file at once and then
doing splitlines() on the result consumes twice the memory, since a list
of lines must be constructed while the original data is still around. If
you had more than 600 MB free RAM the splitlines() solution would
probably be adequate :-)
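
To make the doubling concrete, here is a rough sketch (the helper name `split_footprint` is hypothetical, not part of the gzip module). It only compares the size of the original bytes with the payload held by the list that splitlines() builds, and it ignores per-object overhead, which makes the real cost somewhat higher.

```python
def split_footprint(data: bytes):
    """Return (bytes held by the original buffer, bytes held by the split lines).

    Both exist at the same time while `data` is still referenced, so the
    peak is roughly their sum (plus per-object overhead, not counted here).
    """
    lines = data.splitlines()
    payload = sum(len(line) for line in lines)
    return len(data), payload


# 1000 lines of 99 characters plus a newline each.
data = (b"x" * 99 + b"\n") * 1000
original, payload = split_footprint(data)
print(original, payload)  # the payload is nearly a full second copy

# For the file in question: about 247 MB of data plus about 247 MB of lines
# is close to 500 MB, well above the 300 MB of free RAM reported.
```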

Doing repeated calls to splitlines() on chunks of limited size (say 1MB)
would probably be fast enough without using too much memory. It would be
a bit less trivial to implement though, and it seems you are ok with the
subprocess solution.
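
A minimal sketch of the chunked approach, assuming Python 3 and a file opened with gzip.open(); the function name `iter_lines_chunked` and the 1 MB default are illustrative choices, not something the gzip module provides. The only subtle part is carrying the incomplete last line of each chunk over to the next one.

```python
import gzip


def iter_lines_chunked(path, chunk_size=1024 * 1024):
    """Yield lines (without line endings) from a gzip file.

    Reads `chunk_size` decompressed bytes at a time and calls splitlines()
    on each chunk, so only about one chunk of data is in memory at once.
    """
    tail = b""  # incomplete last line carried over from the previous chunk
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            lines = (tail + chunk).splitlines(True)  # keep line endings
            # A last piece that does not end in "\n" may be an unfinished
            # line (or a "\r" whose "\n" is in the next chunk): hold it back.
            if lines[-1].endswith(b"\n"):
                tail = b""
            else:
                tail = lines.pop()
            for line in lines:
                yield line.rstrip(b"\r\n")
    # Whatever is left after the final chunk is the last line, if any.
    yield from tail.splitlines()
```

Usage would look like `for line in iter_lines_chunked("big.gz"): ...`, where "big.gz" is a placeholder path. Peak memory then depends on the chunk size rather than on the uncompressed file size.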

History
Date	User	Action	Args
2009-12-10 23:03:22	pitrou	set	recipients: + pitrou, brian.curtin, asnakelover
2009-12-10 23:03:21	pitrou	link	issue7471 messages
2009-12-10 23:03:20	pitrou	create