classification
Title: streaming struct unpacking
Type: enhancement Stage: committed/rejected
Components: Library (Lib) Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: isoschiz, mark.dickinson, meador.inge, pconnell, pitrou, python-dev, rhettinger, serhiy.storchaka, skrah, terry.reedy
Priority: normal Keywords: patch

Created on 2013-04-20 20:33 by pitrou, last changed 2013-04-26 22:21 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
iter_unpack.patch pitrou, 2013-04-20 21:36 review
bench_unpack.py pitrou, 2013-04-20 21:54
struct_array_view.py serhiy.storchaka, 2013-04-21 14:56
Messages (19)
msg187455 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-20 20:33
For certain applications, you want to unpack repeatedly the same pattern. This came in issue17618 (base85 decoding), where you want to unpack a stream of bytes as 32-bit big-endian unsigned ints. The solution adopted in issue17618 patch (struct.Struct("!{}I")) is clearly suboptimal.

I would suggest something like a iter_unpack() function which would repeatedly yield tuples until the bytes object is over.
msg187457 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-20 20:39
(my initial intuition here was to use memoryview.cast() but it doesn't support non-native formats)
msg187465 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-20 21:36
Here is a patch (still lacking docs). Comments welcome.
msg187467 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-04-20 21:45
Perhaps we need not iter_unpack(), but a grouper (some sort of)?

def grouper(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i: i + size]

unpack = struct.Struct('!I').unpack
for chunk in grouper(data, 4):
    word, = unpack(chunk)
    ...
msg187468 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-20 21:53
Well, according to a quick benchmark, iter_unpack() is 3x to 6x faster than the grouper() + unpack() recipe.
(it's also a bit more user-friendly)
msg187469 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-20 21:54
For the record, here is the benchmark script.
msg187470 - (view) Author: Martin Morrison (isoschiz) * Date: 2013-04-20 22:01
I like the idea of this. Two comments:

- I'm no expert on the C API, but in s_iter_unpack do you not need to check for failure of PyType_GenericAlloc before calling PyObject_GetBuffer?

- I'm not a fan of separate iter_ functions (and there seemed to be a general move away from them elsewhere in Python3; obviously here we have to maintain backwards compat though). Perhaps a boolean keyword "asiter" arg to the regular unpack() instead?
msg187471 - (view) Author: Martin Morrison (isoschiz) * Date: 2013-04-20 22:14
On 20 Apr 2013, at 23:01, Martin Morrison <report@bugs.python.org> wrote:
> - I'm not a fan of separate iter_ functions (and there seemed to be a general move away from them elsewhere in Python3; obviously here we have to maintain backwards compat though). Perhaps a boolean keyword "asiter" arg to the regular unpack() instead?

Thinking about this more, the functionality is probably too radically different to overload the same function, so I withdraw the suggestion.
msg187472 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-20 22:15
> - I'm no expert on the C API, but in s_iter_unpack do you not need to
> check for failure of PyType_GenericAlloc before calling
> PyObject_GetBuffer?

Yes, good catch.

> - I'm not a fan of separate iter_ functions (and there seemed to be a
> general move away from them elsewhere in Python3; obviously here we
> have to maintain backwards compat though). Perhaps a boolean keyword
> "asiter" arg to the regular unpack() instead?

We generally consider it bad API design when a parameter changes the
return *type* of the function. "iter_unpack" may not be terrific as a
name but it describes the semantics quite clearly (and it's not too
annoying to type).
msg187499 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2013-04-21 11:23
This seems like an attractive idea.  There's definitely a need for repeated unpacking with the same pattern, and I agree that putting the repetition into the pattern is suboptimal (not least from the point of view of caching structs).

One thing that feels a bit unorthogonal is that this is doing two things at once:  both allowing for repetition of a pattern, and also adding the lazy iteration.  I'd guess that there's also a use-case for allowing repetition but not returning an iterator;  but then that's easily covered by list(iter_unpack).

+1 from me.

Hmm;  the name.  'iterunpack'?  'iter_unpack'?  'unpack_stream'?  'unpack_all'?

Would we want something similar for packing, too?  I guess that's effectively covered by b''.join(s.pack(item) for item in ...).
msg187508 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-04-21 14:56
> Well, according to a quick benchmark, iter_unpack() is 3x to 6x faster than the grouper() + unpack() recipe.
> (it's also a bit more user-friendly)

Yes, It's mainly because a grouper written on Python. When it will be implemented in C, the difference will be less. This function will be useful beside struct. Note that in my patch for issue17618 struct.Struct("!{}I") is not used.

As for extending Struct, what you think about a more powerful feature? About a method which returns not an iterator, but an iterable and indexable sequence. Here is a sample Python implementation.
msg187509 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-21 14:59
> Yes, It's mainly because a grouper written on Python. When it will be
> implemented in C, the difference will be less. This function will be
> useful beside struct.

I'm not against adding useful C tools to itertools, but you may have to
convince Raymond ;)

> As for extending Struct, what you think about a more powerful feature?
> About a method which returns not an iterator, but an iterable and
> indexable sequence. Here is a sample Python implementation.

I'll take a look, but the question is how complex a C implementation
would be.
msg187554 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-04-22 11:46
> I'm not against adding useful C tools to itertools, but you may have to
convince Raymond ;)

C implementation speeds up the benchmark only 1.5x. Granting the fact that this idiom is used in stdlib less than two dozens times (without tests and iobench), I do not think more this is a worthful idea.

> I'll take a look, but the question is how complex a C implementation
would be.

Definitely it will be more complex than for iter_unpack. ;)
msg187602 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2013-04-23 01:32
This seems reasonable to me to.  So +1.

Small bikeshed on the name: I think 'unpack_iter' would be more
consistent with what is already there, e.g. 'unpack' and 'unpack_from'.
In fact, when experimenting with this patch I found myself typing
'unpack_iter' several times.
msg187609 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-23 06:07
> Small bikeshed on the name: I think 'unpack_iter' would be more
> consistent with what is already there, e.g. 'unpack' and 'unpack_from'.
> In fact, when experimenting with this patch I found myself typing
> 'unpack_iter' several times.

I thought so, but "unpack_iter" would mean we are unpacking an iterator,
while we are unpacking *as* an iterator (like iterkeys() and friends in
Python 2).
msg187682 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-04-24 03:55
iter_unpack() is closer to how we name other iter variants, so I think you ought to stick with that.  The other names are too creative (and hence less guessable).
msg187875 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-04-26 21:03
I think 'iter_unpack' is deceptive and wrong for the following reason. Up to now, 'ixyz' or 'iterxyz' or 'iter_xyz' has meant a version of 'xyz' that presents items one at a time rather than all at once in a collection object (usually in a list). Unpack returns a tuple, but the new function would *not* present the members of the tuple one at time. Instead, it would apply unpack multiple times, yielding multiple tuples. I would call the new thing 'unpack_all' or 'unpacker' (the latter works especially well for an iterator class). An unpacker is a person or machine that repeatedly unpacks. (I was once a bottle unpacker for a college summer job ;-).

struct.unpacker(fmt, buffer)
    Return an iterator that repeatedly unpacks successive buffer slices of size calcsize(fmt) into tuples according to the format string fmt. The buffer length must be an exact multiple of calcsize(fmt)). (? Not clear from text description. Add param to allow remainder?)
msg187880 - (view) Author: Roundup Robot (python-dev) Date: 2013-04-26 22:20
New changeset d232cff25bbd by Antoine Pitrou in branch 'default':
Issue #17804: New function ``struct.iter_unpack`` allows for streaming struct unpacking.
http://hg.python.org/cpython/rev/d232cff25bbd
msg187881 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-26 22:21
Ok, I don't want the bikeshedding to last too long, so I committed the patch with docs. Thanks everyone!
History
Date User Action Args
2013-04-26 22:21:03pitrousetstatus: open -> closed
resolution: fixed
messages: + msg187881

stage: patch review -> committed/rejected
2013-04-26 22:20:13python-devsetnosy: + python-dev
messages: + msg187880
2013-04-26 21:03:07terry.reedysetnosy: + terry.reedy
messages: + msg187875
2013-04-24 03:55:20rhettingersetnosy: + rhettinger
messages: + msg187682
2013-04-23 06:07:54pitrousetmessages: + msg187609
2013-04-23 01:32:37meador.ingesetmessages: + msg187602
stage: patch review
2013-04-22 11:46:58serhiy.storchakasetmessages: + msg187554
2013-04-21 14:59:07pitrousetmessages: + msg187509
2013-04-21 14:56:59serhiy.storchakasetfiles: + struct_array_view.py

messages: + msg187508
2013-04-21 11:23:56mark.dickinsonsetmessages: + msg187499
2013-04-20 22:24:35pconnellsetnosy: + pconnell
2013-04-20 22:15:32pitrousetmessages: + msg187472
2013-04-20 22:14:12isoschizsetmessages: + msg187471
2013-04-20 22:01:15isoschizsetmessages: + msg187470
2013-04-20 21:54:06pitrousetfiles: + bench_unpack.py

messages: + msg187469
2013-04-20 21:53:11pitrousetmessages: + msg187468
2013-04-20 21:45:25serhiy.storchakasetmessages: + msg187467
2013-04-20 21:37:44pitrousetnosy: + isoschiz
2013-04-20 21:36:53pitrousetfiles: + iter_unpack.patch
keywords: + patch
messages: + msg187465
2013-04-20 20:39:07pitrousetnosy: + skrah
messages: + msg187457
2013-04-20 20:33:23pitroucreate