classification
Title: struct module should support variable-length strings
Type: enhancement
Stage: resolved
Components: Library (Lib)
Versions: Python 3.7
process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To:
Nosy List: AlexWaygood, Elizacat, cameron, ethan.furman, mark.dickinson, pitrou, rhettinger, serhiy.storchaka, twouters, yselivanov
Priority: low
Keywords:

Created on 2017-01-19 18:24 by Elizacat, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (14)
msg285828 - Author: Elizabeth Myers (Elizacat) Date: 2017-01-19 18:24
There was some discussion on python-ideas about this, and I figured it would be more productive to bring it here since to me this appears to be a glaring omission.

The struct module has no capability to support variable-length strings; this includes null-terminated and Pascal-ish strings with a different integer datatype (usually in binary) specifying length.

This unfortunate omission makes the struct module extremely unwieldy to use in situations where you need to unpack a lot of variable-length strings, especially iteratively; see https://mail.python.org/pipermail/python-ideas/2017-January/044328.html for why. For zero-terminated strings, it is essentially impossible.

It's worth noting many modern protocols use variable-length strings, including DHCP.

I therefore propose the following extensions to the struct module (details can be bikeshedded over :P):

- Z (uppercase) format specifier (I did not invent this idea, see https://github.com/stendec/netstruct - although that uses $), which states the preceding whole-number datatype is the length of a string that follows.
- z (lowercase) format specifier, which specifies a null-terminated (also known as C style) string. An optional length parameter can be added to specify the maximum search length.

These two additions will make the struct module much more usable in a wider variety of contexts.
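
For readers unfamiliar with the two string styles, here is a rough sketch of what the proposed specifiers would have to do, emulated with the struct module as it exists today. The helper names `unpack_prefixed` and `unpack_cstring` are hypothetical and are not part of the proposal or of any library; the 16-bit big-endian length prefix is only an assumption for the example.

```python
import struct

def unpack_prefixed(buf, offset=0, prefix_fmt="!H"):
    """Emulate the proposed 'Z': an integer length followed by that many bytes.

    Returns (payload, offset_of_next_field).
    """
    (length,) = struct.unpack_from(prefix_fmt, buf, offset)
    start = offset + struct.calcsize(prefix_fmt)
    end = start + length
    if end > len(buf):
        raise ValueError("buffer too short for the declared length")
    return bytes(buf[start:end]), end

def unpack_cstring(buf, offset=0, maxlen=None):
    """Emulate the proposed 'z': bytes up to, but not including, a NUL byte.

    buf must be bytes or bytearray (it needs a .find method).
    Returns (payload, offset_of_next_field).
    """
    stop = len(buf) if maxlen is None else min(len(buf), offset + maxlen)
    nul = buf.find(b"\x00", offset, stop)
    if nul == -1:
        raise ValueError("no NUL terminator within the search range")
    return bytes(buf[offset:nul]), nul + 1

# One length-prefixed string, then one NUL-terminated string, then leftovers.
buf = struct.pack("!H", 5) + b"hello" + b"world\x00rest"
first, pos = unpack_prefixed(buf)
second, pos = unpack_cstring(buf, pos)
print(first, second, buf[pos:])   # b'hello' b'world' b'rest'
```

Both helpers return the offset of the next field, which is the bookkeeping a built-in specifier would have to do internally.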
msg285830 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-01-19 18:38
Could you provide some examples of using these format specifiers? I suppose that due to limitations of the struct module the way in which they can be implemented would not be particularly useful for you.
msg285834 - Author: Mark Dickinson (mark.dickinson) (Python committer) Date: 2017-01-19 19:41
IMO, as one of the previous maintainers of the struct module, this feature request isn't compatible with the current design and purpose of the struct module. I agree that there's an important problem to solve (and one I've had to solve many times for various formats in consulting work); it's simply that the struct module isn't the right place to solve it.
msg285846 - Author: Ethan Furman (ethan.furman) (Python committer) Date: 2017-01-19 21:15
From Yury Selivanov:
-------------------
This is a neat idea, but this will only work for parsing framed
binary protocols.  For example, if your protocol prefixes all packets
with a length field, you can write an efficient read buffer and
use your proposal to decode all of a message's fields in one shot.
Which is good.

Not all protocols use framing though.  For instance, your proposal
won't help to write Thrift or Postgres protocol parsers.

Overall, I'm not sure that this is worth the hassle.  With the proposal:

   data, = struct.unpack('!H$', buf)
   buf = buf[2+len(data):]

with the current struct module:

   length, = struct.unpack('!H', buf)
   data = buf[2:2+length]
   buf = buf[2+length:]

Another thing: struct.calcsize won't work with structs that use
variable length fields.
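
As an illustration of the point about struct.calcsize, the following sketch repeats the "current struct module" approach over a whole buffer. It relies on calcsize (through `Struct.size`) only for the fixed-size header and tracks the variable part by hand. The name `iter_strings` is hypothetical, and the 16-bit big-endian length prefix is taken from the `'!H'` example above.

```python
import struct

HEADER = struct.Struct("!H")   # fixed-size part: one 16-bit big-endian length

def iter_strings(buf):
    """Yield each length-prefixed byte string in buf.

    A memoryview plus unpack_from avoids copying the remaining buffer
    on every iteration, unlike repeated `buf = buf[n:]` slicing.
    """
    view = memoryview(buf)
    offset = 0
    while offset < len(view):
        (length,) = HEADER.unpack_from(view, offset)
        offset += HEADER.size
        if offset + length > len(view):
            raise ValueError("truncated payload")
        yield bytes(view[offset:offset + length])
        offset += length

packed = b"".join(HEADER.pack(len(s)) + s for s in (b"ab", b"", b"xyz"))
print(list(iter_strings(packed)))   # [b'ab', b'', b'xyz']
print(struct.calcsize("!H"))        # 2

try:
    struct.calcsize("!H$")          # '$' is not a format character today
except struct.error as exc:
    print("struct.error:", exc)
```

The size of the fixed header is known before any data is seen; the size of a whole record is not, which is why a single calcsize result cannot describe a format with variable-length fields.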
msg285858 - Author: Yury Selivanov (yselivanov) (Python committer) Date: 2017-01-19 23:18
Ethan, thanks for moving my reply from the list to here.  Also +1 to what Mark said.
msg285882 - Author: Raymond Hettinger (rhettinger) (Python committer) Date: 2017-01-20 06:19
FWIW, I concur with Mark and Yury that the feature request isn't compatible with the design and purpose of the struct module.
msg285894 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2017-01-20 09:24
To add a bit to what Yury said, even framing isn't always compatible with this proposal.  For example, in dask/distributed, we first have a word for the number of frames, then one word per frame to indicate each frame's length, then the frame bodies.
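
A minimal sketch of the layout described above (frame count, then one length per frame, then the frame bodies) may help show why a single "length, then string" specifier does not fit it. The word size is an assumption here: an unsigned 64-bit little-endian integer. The function names are hypothetical and this is not the actual dask/distributed code.

```python
import struct

WORD = "<Q"  # assumption: a "word" is an unsigned 64-bit little-endian integer

def pack_frames(frames):
    """Serialize frames as: count, then every length, then every body."""
    header = struct.pack(WORD, len(frames))
    lengths = struct.pack(f"<{len(frames)}Q", *(len(f) for f in frames))
    return header + lengths + b"".join(frames)

def unpack_frames(buf):
    """Inverse of pack_frames: returns a list of bytes objects."""
    (count,) = struct.unpack_from(WORD, buf, 0)
    offset = struct.calcsize(WORD)
    lengths_fmt = f"<{count}Q"            # the format depends on the data
    lengths = struct.unpack_from(lengths_fmt, buf, offset)
    offset += struct.calcsize(lengths_fmt)
    frames = []
    for length in lengths:
        frames.append(bytes(buf[offset:offset + length]))
        offset += length
    return frames

frames = [b"abc", b"", b"hello world"]
assert unpack_frames(pack_frames(frames)) == frames
```

Because all the lengths come before all the bodies, each length is not adjacent to the string it describes, so the decoding needs two passes over the header rather than one interleaved format string.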
msg285926 - Author: Elizabeth Myers (Elizacat) Date: 2017-01-20 23:15
Hi,

After discussing this on the python-ideas ML a bit more, this is actually a feature people want a great deal. It can't cover every use case, but to expand it further than this proposal and make it do so is way beyond the scope of this proposal.

It may not be completely useful for every protocol, but it is sufficiently useful for many people who have simpler use cases. There is no reason to prevent the addition of this feature other than what boils down to "well code will have to be written..."

I don't buy the argument that it's "outside the scope of the module"; I think it's more a case of "I don't like the idea of struct being used for non-fixed data." C structures support zero-terminated char arrays, and there is already a Pascal string type.

I didn't realise there'd be this much opposition to just adding two format specifiers... :/
msg285927 - Author: Elizabeth Myers (Elizacat) Date: 2017-01-20 23:24
Also, to add to the discussion:

* Rejecting this because "it doesn't cover every use case" is a red herring at best. If this can't cover your use case, odds are the struct module can *never* cover it. That alone is no reason to reject it; you would need something more heavyweight than the struct module anyway.

* If the module can't cover your use case with this feature, it can't cover it right now, so why obstruct it? It covers my use cases for it just fine.

* Not everyone needs something more heavyweight, or wants to import some bigger module just because they need variable-length strings.

* If the real goal is to discourage use of the struct module, too bad. People are actually using it in production and it serves its (rather small) purpose very well. Other people would like to use the module for their use cases, but presently cannot, and this proposal would help cover their particular cases.

* The fact that the netstruct module exists with this feature is proof enough there's demand; not to mention the discussion on the python-ideas ML shows that many people already would find this very useful. It's not like I'm proposing adding braces or some horrible huge proposal, I'm adding two format specifiers. *Two.*
msg285952 - Author: Mark Dickinson (mark.dickinson) (Python committer) Date: 2017-01-21 11:30
A couple of questions that haven't been brought up yet:

1. Do you have any thoughts on how alignment should behave for '@'-style structs containing variable-length strings? I suspect the easiest solution may be simply to disallow that combination, and only allow variable-length strings for "standard" struct types (those with a format string starting with one of "=", "<", ">", "!"), where alignment isn't an issue.

2. For the Struct object, what should the .size attribute give for a variable-length struct? (Or should accessing the attribute raise an exception?)

3. Any thoughts about how the internal representation of the Struct object would need to change? I guess you'd want to drop the "offset" field of the "formatcode" struct, and compute the offsets on the fly during packing and unpacking (or would you try to keep the offset for the non-variable-length cases?). You'd probably also want to find a way to encode the information about whether the struct is variable-length or not in the PyStructObject struct. A key requirement here is that there should be no substantial performance regression for packing / unpacking with structs that don't use the variable-length feature. It doesn't seem likely to me that getting rid of the precalculated offset would cause such a regression, but it's something that should be checked.
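
For context on questions 1 and 2, this small sketch shows the existing behaviour they refer to: native '@' mode inserts alignment padding, standard modes do not, and `Struct.size` is a single number fixed when the format is compiled. Only current struct calls are used; nothing here models the proposed feature.

```python
import struct

# Native mode ('@', the default) pads fields to the C compiler's alignment,
# so the total size is platform-dependent (commonly 8 for "bi").
native = struct.calcsize("@bi")

# Standard modes ('=', '<', '>', '!') use fixed sizes and no padding.
standard = struct.calcsize("=bi")     # 1 byte + 4 bytes = 5

print(native, standard)

# Struct.size is computed once from the format, before any data is seen.
record = struct.Struct("<HI")
print(record.size)                    # 6
```

A variable-length field would make both the padding that follows it and the value of `.size` depend on the data, which is what the questions above are probing.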
msg285954 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-01-21 12:22
If you want to add support for variable-size fields, this is incompatible with the struct module design.

If you want to add support for variable-length strings inside fixed-size fields (as with the 'p' format unit), I think this case is not common enough.

After getting so many negative responses from core developers well-versed in the internals of the struct module, I think this issue should be closed.
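
For reference, this is how the existing 'p' format unit mentioned above behaves: the string length varies, but the field it lives in has a fixed size given by the count, with one byte reserved for the length.

```python
import struct

# "10p": a 10-byte field = 1 length byte + up to 9 data bytes, NUL-padded.
packed = struct.pack("10p", b"hi")
print(packed)                          # b'\x02hi\x00\x00\x00\x00\x00\x00\x00'
print(len(packed))                     # 10
print(struct.unpack("10p", packed))    # (b'hi',)

# A string longer than count - 1 is truncated to fit the fixed field.
short = struct.pack("4p", b"abcdef")
print(struct.unpack("4p", short))      # (b'abc',)
```

The packed size is always the count, regardless of the string, which is why 'p' fits the fixed-size design while the proposed specifiers do not.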
msg286060 - Author: Raymond Hettinger (rhettinger) (Python committer) Date: 2017-01-23 08:33
FWIW, the existence of netstruct https://github.com/stendec/netstruct does at least show that a coherent proposal is possible (the docs provide an API, motivation and examples).

One question is whether this could be implemented in the stdlib struct module without major brain surgery to the existing code which is organized around an initial prepare_s() step whose output is a PyStructObject which includes a fixed s_size field.  The various parsing routines all depend on the s_size field.  Where this proposal would require too much of a total rewrite is likely only answerable by someone building a patch with tests and real-world examples.

Another question is whether supporting variable length input would be better served by providing an alternative API that is better designed for it.  For example, unpack(fmt, input_stream or iterator, limit=None) which would consume as many bytes as needed from the input stream with an optional limit.  This would be nicer than having to figure out in advance how to extract an input string of the appropriate size.

A last question is whether this should remain outside the standard library.  With PyPI becoming so rich and easy to access, we're often deciding that code is better off outside the standard library where it can flourish in a more fluid environment.
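
To make the stream-based alternative a little more concrete, here is a rough sketch of its shape using only today's fixed-size formats. The function `unpack_stream` and its `limit` semantics are assumptions made for illustration; no such function exists in the struct module, and a real version would also need format codes for the variable-length parts.

```python
import io
import struct

def unpack_stream(fmt, stream, limit=None):
    """Read exactly as many bytes as fmt needs from stream and unpack them.

    limit, if given, is the maximum number of bytes this call may consume.
    Assumes stream.read(n) returns n bytes unless the stream is exhausted
    (true for io.BytesIO and buffered files, not for raw sockets).
    """
    size = struct.calcsize(fmt)
    if limit is not None and size > limit:
        raise ValueError("format needs more bytes than the limit allows")
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("stream ended before the format was satisfied")
    return struct.unpack(fmt, data)

stream = io.BytesIO(struct.pack("!H", 5) + b"hello" + b"tail")
(length,) = unpack_stream("!H", stream)   # consumes only the 2-byte prefix
payload = stream.read(length)             # caller reads the variable part
print(length, payload, stream.read())     # 5 b'hello' b'tail'
```

The caller no longer has to slice a buffer to the right size up front; the stream position carries that state instead.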
msg286298 - Author: Raymond Hettinger (rhettinger) (Python committer) Date: 2017-01-26 08:44
s/Where/Whether/
msg408917 - Author: Alex Waygood (AlexWaygood) (Python triager) Date: 2021-12-19 17:22
I am closing this issue as "rejected", given the consensus that writing a patch could be a major undertaking, the lack of such a patch, and the fact that there has been no activity on the issue thread (or the python-ideas mailing list) for nearly 5 years.
History
Date | User | Action | Args
2022-04-11 14:58:42 | admin | set | github: 73514
2021-12-19 17:22:43 | AlexWaygood | set | status: open -> closed; nosy: + AlexWaygood; messages: + msg408917; resolution: rejected; stage: needs patch -> resolved
2017-01-30 09:11:25 | mark.dickinson | set | stage: needs patch
2017-01-26 08:44:09 | rhettinger | set | priority: normal -> low; messages: + msg286298
2017-01-24 17:07:26 | rhettinger | set | nosy: + twouters
2017-01-23 08:33:53 | rhettinger | set | messages: + msg286060
2017-01-21 12:22:17 | serhiy.storchaka | set | messages: + msg285954
2017-01-21 11:30:37 | mark.dickinson | set | messages: + msg285952
2017-01-21 02:29:01 | cameron | set | nosy: + cameron
2017-01-20 23:24:02 | Elizacat | set | messages: + msg285927
2017-01-20 23:15:39 | Elizacat | set | messages: + msg285926
2017-01-20 09:24:40 | pitrou | set | nosy: + pitrou; messages: + msg285894
2017-01-20 06:19:21 | rhettinger | set | nosy: + rhettinger; messages: + msg285882
2017-01-19 23:18:06 | yselivanov | set | nosy: + yselivanov; messages: + msg285858
2017-01-19 21:15:08 | ethan.furman | set | messages: + msg285846
2017-01-19 19:41:53 | mark.dickinson | set | nosy: + mark.dickinson; messages: + msg285834
2017-01-19 19:04:59 | ethan.furman | set | nosy: + ethan.furman
2017-01-19 18:38:34 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg285830
2017-01-19 18:24:32 | Elizacat | create |