
Author Robert.Elsner
Recipients Robert.Elsner, mark.dickinson, pitrou, serhiy.storchaka
Date 2012-04-16.14:21:30
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <4F8C2AE6.7070309@googlemail.com>
In-reply-to <1334584688.3426.3.camel@localhost.localdomain>
Content
Well, I stumbled across this leak while reading big files. What is the
point of having a fast C-level unpack if it cannot be used with big
files?
I am not averse to the idea of caching the format string, but if the
cache grows beyond a reasonable size, it should be freed. And
"reasonable" here means the amount of memory the cache consumes, not the
number of objects it contains. Caching an arbitrary amount of data (8 GB
in this case) is a waste of memory.
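
Roughly the pattern I was using (the sizes below are tiny stand-ins for
the real ~8 GB of data):

    import struct

    # struct.unpack() compiles the format string into a Struct object and
    # keeps it in the module's internal cache, so the compiled format is
    # not released even after both variables go away.
    n = 1000                       # stand-in; the real count was enormous
    data = bytes(8 * n)            # stand-in for the whole file read at once
    values = struct.unpack('%dd' % n, data)
    del values, data               # the cached compiled format lingers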

And according to the Python docs, struct.Struct.unpack, which is _not_
affected by the memory leak, is supposed to be faster. Quote:

> class struct.Struct(format)
> 
> Return a new Struct object which writes and reads binary data according to the format string format. Creating a Struct object once and calling its methods is more efficient than calling the struct functions with the same format since the format string only needs to be compiled once.

Caching in the case of struct.Struct is straightforward: as long as the
object exists, the format string is cached, and once the object is no
longer reachable, its memory gets freed - including the cached format
string. The problem is the "magic" creation of struct.Struct objects by
struct.unpack, which linger in the module's cache even after all
associated variables have gone out of scope.
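
The workaround I settled on looks roughly like this (again with a tiny
stand-in buffer):

    import struct

    # Compile the format explicitly; the compiled form is then freed
    # together with the Struct object instead of lingering in the
    # module-level cache used by struct.unpack().
    n = 1000
    data = bytes(8 * n)
    unpacker = struct.Struct('%dd' % n)
    values = unpacker.unpack(data)
    del unpacker                   # compiled format is released here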

Using, for example, a fixed 1 MB buffer to read files (regardless of
their size) incurs a huge performance penalty. Reading everything at once
into memory and unpacking it with struct.unpack (or, at the same speed,
struct.Struct.unpack) is the fastest way - approximately 40% faster than
array.fromfile and 70% faster than numpy.fromfile.
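
For reference, the one-shot pattern looks roughly like this (the file
name is just an example; it assumes a file containing nothing but packed
doubles):

    import struct

    # Read the whole file in one go instead of looping over a fixed-size
    # buffer, then unpack it in a single call.
    with open('samples.bin', 'rb') as f:
        data = f.read()
    n = len(data) // struct.calcsize('d')
    values = struct.Struct('%dd' % n).unpack(data)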

I had read an earlier, unspecific report about a possible memory leak in
struct.unpack, but its author did not investigate further, and it took me
quite some time to figure out what exactly happens. So there should be at
least a warning about this (ugly) behavior when reading big files for
speed, and a pointer to a quick workaround (using struct.Struct.unpack).

cheers

Am 16.04.2012 15:59, schrieb Antoine Pitrou:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> Perhaps the best quick fix would be to only cache small
>> PyStructObjects, for some value of 'small'.  (Total size < a few
>> hundred bytes, perhaps.)
> 
> Or perhaps not care at all? Is there a use case for huge repeat counts?
> (limiting cacheability could decrease performance in existing
> applications)
> 
> ----------
> 
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue14596>
> _______________________________________
History
Date                 User           Action  Args
2012-04-16 14:21:31  Robert.Elsner  set     recipients: + Robert.Elsner, mark.dickinson, pitrou, serhiy.storchaka
2012-04-16 14:21:30  Robert.Elsner  link    issue14596 messages
2012-04-16 14:21:30  Robert.Elsner  create