This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: implement PEP 3118 struct changes
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, MrJean1, ajaksu2, barry, belopolsky, benjamin.peterson, inducer, mark.dickinson, martin.panter, meador.inge, ncoghlan, noufal, paulehoffman, pitrou, pv, skrah, teoliphant
Priority: high Keywords: patch

Created on 2008-06-17 22:30 by benjamin.peterson, last changed 2022-04-11 14:56 by admin.

Files
File name Uploaded Description Edit
pep-3118.patch meador.inge, 2010-02-17 03:30
struct-string.py3k.patch meador.inge, 2010-05-18 04:07 Patch for 'T{}' syntax and multiple byte order specifiers.
struct-string.py3k.2.patch meador.inge, 2010-05-20 13:50 Patch with fixed assertions
struct-string.py3k.3.patch meador.inge, 2011-01-07 03:59 Patch for 'T{}' against py3k r87813 review
grammar.y skrah, 2016-04-13 10:20
Messages (58)
msg68347 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-06-17 22:30
It seems the new modifiers to the struct.unpack/pack module that were
proposed in PEP 3118 haven't been implemented yet.
msg68507 - (view) Author: Jean Brouwers (MrJean1) Date: 2008-06-21 15:59
If the struct changes are made, also add two format codes for the C types
ssize_t and size_t, perhaps 'z' and 'Z' respectively.  In particular since
on some platforms sizeof(size_t) != sizeof(long).
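For context, native-mode codes along these lines were eventually added in Python 3.3 as 'n' (ssize_t) and 'N' (size_t); a minimal sketch of how they track the platform's actual sizes:

```python
import struct

# 'n' packs a platform ssize_t in native mode (Python 3.3+), so its size
# follows the ABI rather than assuming sizeof(size_t) == sizeof(long).
packed = struct.pack('n', -42)
assert len(packed) == struct.calcsize('n')
assert struct.unpack('n', packed)[0] == -42
```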
msg71313 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2008-08-18 03:00
It's looking unlikely that this is going to make it by beta 3.  If it
can't get in by then, it's too late.
msg71316 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-18 09:36
Let's retarget it to 3.1 then. It's a new feature, not a behaviour
change or a deprecation, so adding it to 3.0 isn't a necessity.
msg71338 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-08-18 15:02
Actually, this may be a requirement of #2394; PEP 3118 states that
memoryview.tolist would use the struct module to do the unpacking.
msg71342 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-18 15:51
> Actually, this may be a requirement of #2394; PEP 3118 states that
> memoryview.tolist would use the struct module to do the unpacking.

:-(
However, we don't have any examples of the buffer API / memoryview
object working with something else than 1-dimensional contiguous char
arrays (e.g. bytearray). Therefore, I suggest that Python 3.0 provide
official support only for 1-dimensional contiguous char arrays. Then
tolist() will be easy to implement even without using the struct module
(just a list of integers, if I understand the functionality).
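For a 1-dimensional contiguous bytes-like object, tolist() is indeed just a list of integers, which needs no struct machinery:

```python
# memoryview.tolist() on a 1-D contiguous char buffer returns the byte
# values as plain Python integers.
view = memoryview(bytearray(b'abc'))
assert view.tolist() == [97, 98, 99]
```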
msg71882 - (view) Author: Travis Oliphant (teoliphant) * (Python committer) Date: 2008-08-24 21:38
This can be re-targeted to 3.1 as described.
msg87921 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2009-05-16 20:33
Travis,
Do you think you can contribute to this so it actually lands in 3.2?  Having
a critical issue slipping from 3.0 to 3.3 would be bad...

Does this supersede issue 2395, or is this a subset of that one?
msg99296 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2010-02-13 01:28
Is anyone working on implementing these new struct modifiers?  If not, then I would love to take a shot at it.
msg99297 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2010-02-13 01:35
2010/2/12 Meador Inge <report@bugs.python.org>:
>
> Meador Inge <meadori@gmail.com> added the comment:
>
> Is anyone working on implementing these new struct modifiers?  If not, then I would love to take a shot at it.

Not to my knowledge.
msg99309 - (view) Author: Travis Oliphant (teoliphant) * (Python committer) Date: 2010-02-13 06:06
On Feb 12, 2010, at 7:29 PM, Meador Inge wrote:

>
> Meador Inge <meadori@gmail.com> added the comment:
>
> Is anyone working on implementing these new struct modifiers?  If  
> not, then I would love to take a shot at it.

That would be great.

-Travis
msg99312 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-02-13 11:07
Some of the proposed struct module additions look far from straightforward;  I find that section of the PEP significantly lacking in details and motivation.

"Unpacking a long-double will return a decimal object or a ctypes long-double."

Returning a Decimal object here doesn't make a lot of sense, since Decimal objects aren't generally compatible with floats.  And ctypes long double objects don't seem to exist, as far as I can tell.  It might be better not to add this code.

Another bit that's not clear to me:  how is unpacking an object pointer expected to work, and how would it typically be used?  What if the unpacked pointer no longer points to a valid Python object?  How would this work in other Python implementations?

For the 'X{}' format (pointer to a function), is this supposed to mean a Python function or a C function?

What's a 'specific pointer'?
msg99313 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-02-13 12:01
Whoops.  ctypes does have long double, of course.  Apologies.
msg99460 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2010-02-17 03:30
Hi All,

On Sat, Feb 13, 2010 at 5:07 AM, Mark Dickinson <report@bugs.python.org> wrote:

>
> Mark Dickinson <dickinsm@gmail.com> added the comment:
>
> Some of the proposed struct module additions look far from straightforward;
>  I find that section of the PEP significantly lacking in details and
> motivation.
>

I agree.

> "Unpacking a long-double will return a decimal object or a ctypes
> long-double."
>
> Returning a Decimal object here doesn't make a lot of sense, since Decimal
> objects aren't generally compatible with floats.  And ctypes long double
> objects don't seem to exist, as far as I can tell.  It might be better not
> to add this code.

And under what conditions would a ctypes long double be used vs. a Decimal
object?

> Another bit that's not clear to me:  how is unpacking an object pointer
> expected to work, and how would it typically be used?  What if the unpacked
> pointer no longer points to a valid Python object?  How would this work in
> other Python implementations?

I guess if an object associated with the packed address does not exist, then
you would unpack None (?).  This is especially a problem if the struct string
is being sent over the wire to another machine.

> For the 'X{}' format (pointer to a function), is this supposed to mean a
> Python function or a C function?
>

I read that as a Python function.  However, I am not completely sure how the
prototype would be enforced when unpacking.  I am also wondering how
the signatures of pointers-to-functions are specified.  Are
the arguments and return type full struct strings as well?

> What's a 'specific pointer'?

I think this means a pointer to a specific type, e.g. '&d' is a pointer to a
double. If this is the case, though, the use cases are not completely clear
to me.

I also have the following questions:

* Can pointers be nested, '&&d' ?
* What nesting level can structures have? Arbitrary?
* The new array syntax claims "multi-dimensional array of whatever follows".

  Truly whatever? Arrays of structures? Arrays of pointers?
* "complex (whatever the next specifier is)".  Not really 'whatever'.  You
  cannot have a 'complex bool' or 'complex int'.  What other types of
  complex are there besides complex double?
* How do array specifiers and pointer specifiers mix?  For example, would
  '(2, 2)&d' be a two-by-two array of pointers to doubles?  What about
  '&(2, 2)d'?  Is this a pointer to an two-by-two array of doubles?

The new features of the struct-string syntax are so different that I think we
need to specify a grammar.  I think it will clarify some of the open
questions.

In addition, I was thinking that a reasonable implementation strategy would
be to keep the current struct-string syntax mostly in place within the C
module implementation.  The C implementation would just provide an interface
to pack/unpack sequences of primitive data elements.  Then we could write a
layer in the Python 'struct' module that took care of the higher-order
concepts like nested structures, arrays, named values, and pointers to
functions.  The higher-order concepts would be mapped to the appropriate
primitive sequence strings.

I think this will simplify the implementation and will provide a way to
phase it.  We can implement the primitive type extensions in C first,
followed by the higher-level Python stuff.  The result of each phase is
immediately usable.

I have attached a patch against the PEP containing my current thoughts on
fleshing out the grammar and some of the current open questions.  This still
needs work, but I wanted to share it to see if I am on the right track.
Please advise on how to proceed.
msg99472 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-02-17 15:12
> And under what conditions would a ctype long double be used vs. a
> Decimal object.

Well, I'm guessing that this was really just an open question for the PEP, and that the PEP authors hadn't decided which of these two options was more appropriate.  If all long doubles were converted to Decimal, then we need to determine what precision is appropriate to use for the conversion: any long double *can* be represented exactly as a Decimal, but getting an exact representation can require thousands of digits in some cases, so it's probably better to always round to some fixed number of significant digits.  36 significant digits is a reasonable choice here: it's the minimum number of digits that's guaranteed to distinguish two distinct long doubles, for the case where a long double has 113 bits of precision (i.e., IEEE 754 binary128 format); other common long double formats have smaller precision than this (usually 53 (normal double), 64 (x87 extended doubles), or 106 (double-double)).  There would probably also need to be some way to 'repack' the Decimal instance.

The 'platform long double -> Decimal' conversion itself would also be nontrivial to implement;  I can lend a hand here if you want it.
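The exact-but-long conversion described above can be sketched with today's decimal module: a binary double converts to Decimal exactly, and a Context rounds it to a fixed number of significant digits.

```python
from decimal import Decimal, Context

# Decimal(float) is exact, but the exact form of a binary double can be
# very long; rounding to 36 significant digits (enough to distinguish
# any two IEEE 754 binary128 values) keeps it manageable.
exact = Decimal(0.1)                     # exact value of the double nearest 0.1
rounded = Context(prec=36).plus(exact)   # round to 36 significant digits
assert len(exact.as_tuple().digits) > 36
assert len(rounded.as_tuple().digits) <= 36
```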

Using ctypes makes more sense to me, since it doesn't involve trying to mix decimal and binary, except that I don't know whether it's acceptable for other standard library modules to have dependencies on ctypes.  I'm not sure whether ctypes is available on all platforms that Python runs on. It's also a bit ugly that, depending on the platform, sometimes a long double will unpack to an instance of ctypes.long_double, and sometimes (when long double == double) to a regular Python float.

Anyway, this particular case (long double) isn't a big deal:  it can be overcome, one way or another.  I'm more worried about some of the other aspects of the changes.

[About unpacking with the 'O' format.]
> I guess if an object associated with the packed address does not
> exist, then you would unpack None (?).  This is especially a problem 
> if the struct string is being sent over the wire to another machine.

And how do you determine whether an address gives a valid object or not?  I can only assume that packing and unpacking with the 'O' format is only supposed to be used in certain restricted circumstances, but it's not clear to me what those circumstances are.

> I also have the following questions: [...]

I think a lot of this discussion needs to go back to python-dev;  with luck, we can get some advice and clarifications from the PEP authors there.  I'm not sure whether it's appropriate to modify the original PEP (especially since it's already accepted), or whether it would be better to produce a separate document describing the proposed changes in detail.
msg99474 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-02-17 15:18
I'm looking for previous discussions of this PEP.  There's a python-dev thread in April 2007:

http://mail.python.org/pipermail/python-dev/2007-April/072537.html

Are there other discussions that I'm missing?
msg99551 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2010-02-19 01:02
Mark,

> I think a lot of this discussion needs to go back to python-dev;  with 
> luck, we can get some advice and clarifications from the PEP authors 
> there.

So the next step is to kick off a thread on python-dev summarizing the questions/problems we have come up with?  I can get that started.

> Are there other discussions that I'm missing?

I did a quick search and came up with the same.
msg99655 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-02-21 13:06
Closed issue 2395 as a duplicate of this one.
msg99656 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-02-21 13:09
[Meador Inge]
> So the next step is to kick off a thread on python-dev summarizing the
> questions/problems we have come up with?  I can get that started.

Sounds good.  I'd really like to see some examples of how these struct-module additions would be used in real life.
msg99677 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-02-21 19:23
About long doubles again:  I just encountered someone on the #python IRC channel who wanted to know whether struct.pack and struct.unpack supported reading and writing of x87 80-bit long doubles (padded to 12 bytes each in the input).  A few quotes from him/her, with permission (responses from others, including me, edited out;  I can supply a fuller transcript if necessary, but I hope what's below isn't misleading).

[18:39] bdesk: Hi, is struct.pack able to handle 80-bit x86 extended floats?
[18:40] bdesk: How can I read and write these 80-bit floats, in binary, using python?
[18:44] bdesk: dickinsm: I have a C program that uses binary files as input and output, and I want to deal with these files using python if possible.
[18:49] bdesk: I don't need to do arithmetic with the full 80 bits of precision within the python program, although It would be better if I could.
[18:50] bdesk: I would need to use the float in a more semantically useful manner than treating it as a black box of 12 bytes.
[18:55] bdesk: Until python gets higher precision floats, my preferred interface would be to lose some precision when unpacking the floats.

The main thing that I realized from this is that unpacking as a ctypes long double isn't all that useful for someone who wants to be able to do arithmetic on the unpacked result.  And if you don't want to do arithmetic on the unpacked result, then you're probably just shuffling the bytes around without caring about their meaning, so there's no need to unpack as anything other than a sequence of 12 bytes.

On the other hand, I suppose it's enough to be able to unpack as a ctypes c_longdouble and then convert to a Python float (losing precision) for the arithmetic.  Alternatively, we might consider simply unpacking a long double directly into a Python float (and accepting the loss of precision);  that seems to be what would be most useful for the use-case above.
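The round-trip concern can be made concrete with today's struct module: for the existing 'd' code, unpacking to a Python float is exact, so bytes always survive a round trip; a hypothetical long-double code ('g') that unpacked to a Python float could not make that guarantee.

```python
import struct

# Unpacking a 'd' to a Python float is exact, so the byte-level round
# trip always holds:
b = struct.pack('<d', 0.1)
assert struct.pack('<d', struct.unpack('<d', b)[0]) == b

# A hypothetical 'g' (80-bit long double) code that unpacked to a
# 64-bit Python float would discard bits, breaking this property.
```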
msg99711 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2010-02-22 04:13
> The main thing that I realized from this is that unpacking as a ctypes long 
> double isn't all that useful for someone who wants to be able to do arithmetic
> on the unpacked result.  

I agree.  Especially since ctypes 'long double' maps to a Python float and
'.value' would have to be referenced on the ctype 'long double' instance
for doing arithmetic.

> And if you don't want to do arithmetic on the unpacked result, then you're 
> probably just shuffling the bytes around without caring about their meaning,
> so there's no need to unpack as anything other than a sequence of 12 bytes.

One benefit of having a type code for 'long double' (assuming you are mapping
the value to the platform's 'long double') is that you don't have to know
how many bytes are in the underlying representation.  As you know, it isn't
always just 12 bytes.  It depends on the architecture and ABI being used.
From a quick sample, I see it can be anywhere from 8 to 16 bytes:

===========================================
| Compiler  | Arch     | Bytes            |
===========================================
| VC++ 8.0  | x86      | 8                |
| VC++ 9.0  | x86      | 8                |
| GCC 4.2.4 | x86      | 12 (default), 16 |
| GCC 4.2.4 | x86-64   | 12, 16 (default) |  
| GCC 4.2.4 | PPC IBM  | 16               |
| GCC 4.2.4 | PPC IEEE | 16               |
===========================================
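The platform's long double size can be queried directly with ctypes, which is what such a type code would delegate to; the sampled range above (8 to 16 bytes) covers the mainstream ABIs.

```python
import ctypes

# sizeof(long double) varies by compiler and ABI; on the platforms
# sampled above it ranges from 8 to 16 bytes.
size = ctypes.sizeof(ctypes.c_longdouble)
assert 8 <= size <= 16
```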

> On the other hand, I suppose it's enough to be able to unpack as a ctypes 
> c_longdouble and then convert to a Python float (losing precision) for the 
> arithmetic.  Alternatively, we might consider simply unpacking a long double 
> directly into a Python float (and accepting the loss of precision);
 
I guess that would be acceptable.  The only thing that I don't like is that
since the transformation is lossy, you can't round trip:

   # this will not hold
   pack('g', unpack('g', byte_str)[0]) == byte_str

> that seems to be what would be most useful for the use-case above.

Which use case?  From the given IRC trace it seems that 'bdesk' was mainly
concerned with (1) pushing bytes around, but (2) thought "it would be better"
to be able to do arithmetic and that it would be more useful if it were
not a "black box of 12 bytes".  For use case (1) the loss of precision would
probably not be acceptable, due to the round-trip issue mentioned above.

So using ctypes 'long double' is easier to implement, but is lossy and clunky 
for arithmetic.  Using Python 'float' is easy to implement and easy for 
arithmetic, but is lossy.  Using Decimal is non-lossy and easy for arithmetic, 
but the implementation would be non-trivial and architecture specific 
(unless we just picked a fixed number of bytes regardless of the architecture).
msg99771 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-02-22 16:14
> One benefit of having a type code for 'long double' (assuming you are
> mapping the value to the platform's 'long double') is that you
> don't have to know how many bytes are in the underlying representation.

Agreed:  it's nice to have struct.pack already know your machine.

Actually, this brings up (yet) another open question:  native packing/unpacking of a long double would presumably return something corresponding to the platform long double, as above;  but non-native packing/unpacking should do something standard, instead, for the sake of interoperability between platforms.  Currently, I believe that packing a Python float always---even in native mode---packs in IEEE 754 format, even when the platform doubles aren't IEEE 754.

For native packing/unpacking, I'm slowly becoming convinced that unpacking as a ctypes long double is the only thing that makes any sense, so that we keep round-tripping, as you point out.  The user can easily enough extract the Python float for numerical work.  I still don't like having the struct module depend on ctypes, though.
msg105952 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2010-05-18 04:07
Attached is a patch that implements part of the additions.  More specifically, the 'T{}' syntax and the ability to place byte-order specifiers ('<', '>', '@', '^', '!', '=') anywhere in the struct string.

The changes dictated by the PEP are so big that it is better to split things up into multiple patches.  These two features will lay some ground work and are probably less controversial than the others.

Surely some more tweaks will be needed, but I think what I have now is at least good enough for review.  I tested on OS X 10.6 and Ubuntu 10.4.  I also used valgrind and 'regrtest.py -R:' to check for memory and 
reference leaks, respectively.
msg105955 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-05-18 07:12
Thanks for this.

Any chance you could upload the patch to Rietveld (http://codereview.appspot.com/) for ease of review?
msg105970 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2010-05-18 12:22
Sure - http://codereview.appspot.com/1258041
msg106087 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-05-19 19:27
Thanks for the Rietveld upload.  I haven't had a chance to review this properly yet, but hope to do so within the next few days.

One question: the production list you added to the docs says:

  format_string: (`byte_order_specifier`? `type_string`)*

This suggests that format strings like '<' and '<>b' are invalid;  is that correct, or should the production list be something like:

  format_string: (`byte_order_specifier` | `type_string`)*

?  Whether these cases are valid or not (personally, I think they should be), we should add some tests for them.  '<' *is* currently valid, I believe.
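For reference, what unpatched CPython does today: a bare specifier is accepted (an empty struct), while a specifier appearing after the first position is rejected.

```python
import struct

# A lone byte-order specifier is valid and denotes an empty struct.
assert struct.calcsize('<') == 0
assert struct.pack('<') == b''

# A specifier after position 0 (as in '<>b') is a format error in the
# current implementation.
raised = False
try:
    struct.calcsize('<>b')
except struct.error:
    raised = True
assert raised
```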


The possibility of mixing native size/alignment with standard size/alignment in a single format string makes me a bit uneasy, but I can't see any actual problems that might arise from it (equally, I can't imagine why anyone would want to do it).  I wondered briefly whether padding has clear semantics when a '@' appears in the middle of a format string, but I can't see why it wouldn't have.
msg106088 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-05-19 19:38
Travis, this issue is still assigned to you.  Do you plan to work on this at some stage, or may I unassign you?
msg106089 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-05-19 19:46
Hmm.  Something's not quite right: the _struct module fails to compile for me with this patch.  I get:

/Users/dickinsm/python/svn/py3k/Modules/_struct.c: In function ‘s_unpack’:
/Users/dickinsm/python/svn/py3k/Modules/_struct.c:1730: error: ‘PyStructObject’ has no member named ‘s_codes’
/Users/dickinsm/python/svn/py3k/Modules/_struct.c: In function ‘s_unpack_from’:
/Users/dickinsm/python/svn/py3k/Modules/_struct.c:1765: error: ‘PyStructObject’ has no member named ‘s_codes’

The offending lines both look like:

    assert(soself->s_codes != NULL);

presumably that should be:

    assert(soself->s_tree->s_codes != NULL);

After making that change, and successfully rebuilding, this assert triggers:

test_705836 (__main__.StructTest) ... Assertion failed: (soself->s_tree->s_codes != NULL), function s_unpack, file /Users/dickinsm/python/svn/py3k/Modules/_struct.c, line 1730.
msg106090 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-05-19 19:51
Ah, it should have been:

assert(soself->s_tree != NULL);

Got it now.  All tests pass. :)
msg106091 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-05-19 19:56
One more design question.  For something like: '<HT{>H}H', what endianness should be used when packing/unpacking the last 'H'?  Should the switch to '>' within the embedded struct be regarded as local to the struct?

With your patch, I get:

>>> pack('<HT{>H}H', 1, (2,), 3)
b'\x01\x00\x00\x02\x00\x03'
msg106153 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2010-05-20 13:50
> is that correct, or should the production list be something like:

Yup, you are right.  I will change the grammar.

> Whether these cases are valid or not (personally, I think they should 
> be), we should add some tests for them.  '<' *is* currently valid, I 
> believe.

I agree, they should be valid.  I will add more test cases.

> The possibility of mixing native size/alignment with standard 
> size/alignment in a single format string makes me a bit uneasy

I agree.  It is hard for me to see how this might be used.  In any case,
the relevant part of the PEP that I was following is:

"Endian-specification ('!', '@','=','>','<', '^') is also allowed inside the string so that it can change if needed. The previously-specified endian string is in force until changed. The default endian is '@' which means native data-types and alignment. If un-aligned, native data-types are requested, then the endian specification is '^'."

However, I am not quite sure how to interpret the last sentence.

> Should the switch to '>' within the embedded struct be regarded as 
> local to the struct?

No, there is no notion of scope here.  A given specifier is active until the next one is found.

> Ah, it should have been:
> 
> assert(soself->s_tree != NULL);

D'oh!  I missed that when I merged over to py3k -- I started this work on trunk.  Thanks.
msg106155 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-05-20 14:17
> > The possibility of mixing native size/alignment with standard 
> > size/alignment in a single format string makes me a bit uneasy
> 
> I agree.  It is hard for me to see how this might be used.

Without having anything more constructive to add, I also agree with this
gut feeling. Perhaps not all of the PEP needs implementing; we can just
add what is genuinely useful.
msg106157 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-05-20 14:26
Thanks for the new patch.

> "... If un-aligned, native data-types are requested, then the
> endian specification is '^'."
>
> However, I am not quite sure how to interpret the last sentence.

Hmm.  Seems like the PEP authors are proposing a new byteorder/alignment/size specifier here:  '^' = native byte-order + native size + no alignment.  I missed this before.
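The closest existing analog of the proposed '^' is today's '=' prefix: native byte order with standard sizes and no alignment padding. A quick comparison against native '@' mode:

```python
import struct

# '=' suppresses alignment padding; '@' (native) may insert padding
# between the 'B' and the 'I' to satisfy int alignment.
assert struct.calcsize('=BI') == 5                       # 1 + 4, no padding
assert struct.calcsize('@BI') >= struct.calcsize('=BI')  # padding possible
```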

>> Should the switch to '>' within the embedded struct be regarded as 
>> local to the struct?

> No, there is no notion of scope here.  A given specifier is active
> until the next one is found.

Okay.  I wonder whether that's the most useful thing to do, though.

As a separate issue, I notice that the new 'T{}' code doesn't respect multiplicities, e.g., as in 'H3T{HHL}'.  Is that intentional/desirable?

>>> struct.pack('H3T{HHL}', 1, (2, 3, 4))
b'\x01\x00\x02\x00\x03\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00'

If we don't allow multiplicities, this should produce an exception, I think.  If we do allow multiplicities (and I don't immediately see why we shouldn't), then we're going to have to be clear about how endianness behaves in something like:

'>H3T{H<H}'

So the first inner struct here would be treated as '{>H<H}'.  Would the next two be identical to this, or would they be as though the whole thing were '>HT{H<H}T{H<H}T{H<H}', in which case the 2nd and 3rd substructs are both effectively '<H<H', while the first is '>H<H'.
msg106164 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-05-20 16:24
After a bit more thought (and after soliciting a couple of opinions on #python-dev), I'm convinced that endianness changes within a substruct should be local to that substruct:

- it makes the meaning of '>2T{H<H}' both unsurprising and easy to understand:  i.e., it would be interpreted exactly as '>T{H<H}T{H<H}', and  both substructs would behave like '>H<H'.

- I suspect it's the behaviour that people expect

- it may make dynamic creation of struct format strings easier/less bug-prone.
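Since 'T{}' does not exist yet, the locally-scoped reading of '>T{H<H}T{H<H}' can be sketched by composing sub-format strings with today's struct module; the helper below is purely illustrative.

```python
import struct

# Hypothetical local scoping: each substruct of '>T{H<H}T{H<H}' behaves
# like '>H<H', regardless of specifiers in neighboring substructs.
def pack_substruct(values):
    # assumed layout: one big-endian H followed by one little-endian H
    return struct.pack('>H', values[0]) + struct.pack('<H', values[1])

packed = pack_substruct((1, 2)) + pack_substruct((3, 4))
assert packed == b'\x00\x01\x02\x00\x00\x03\x04\x00'
```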


But now I've got a new open issue:  how much padding should be inserted/expected (for pack/unpack respectively) between the 'B' and the 'T{...}' in a struct format string of the form 'BT{...}'?

For this, I don't think we can hope to exactly comply with the platform ABI in all cases.  But I think there's a simple rule that would work 99% of the time, and that does match the ABI on common platforms (though I'll check this), namely that the alignment requirement of the 'T{...}' struct should be the least common multiple of the alignment requirements of any of its elements.  (Which will usually translate to the largest of the alignment requirements, since those alignments are usually all going to be powers of 2.)

And *this* is where things get tricky if the alignment/byteorder/size specifier is changed midstream, since then it doesn't seem clear what alignments would contribute to the lcm above.  I'm tempted to suggest that for native mode, changing the specifier be disallowed entirely.

Travis, any comments on any of this?
msg106168 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2010-05-20 17:17
> As a separate issue, I notice that the new 'T{}' code doesn't respect 
> multiplicities, e.g., as in 'H3T{HHL}'.  Is that 
> intentional/desirable?

That could have been an oversight on my part.  I don't see any immediate reason why we wouldn't allow it.

> But now I've got a new open issue:  how much padding should be 
> inserted/expected (for pack/unpack respectively) between the 'B' and 
> the 'T{...}' in a struct format string of the form 'BT{...}'?

Doesn't that depend on what is in the '...'?  For example, I would expect the same padding for 'BT{I}' and 'BI'.  In general, I would expect the padding to be the same for 'x+T{y+}' and 'x+y+'.  The 'T{...}'s are merely organizational, right?

> I'm tempted to suggest that for native mode, changing the specifier be 
> disallowed entirely.

I am tempted to suggest that we just go back to having one specifier at the beginning of the string :).  Things seem to be getting complicated without any clear benefits.
msg106173 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-05-20 17:30
> For example, I would expect the same padding for 'BT{I}' and 'BI'.

Granted, yes.  But I wouldn't expect the same padding for 'BT{BI}' and 'BBI'.  'BT{BI}' should match a C struct which itself has an embedded struct.  For C, I get the following results on my machine:

#include <stdio.h>

/* corresponds to 'T{BI}' */
typedef struct {
  char y;
  int z;
} A;

/* corresponds to 'BT{BI}' */
typedef struct {
  char x;
  A yz;
} B;

/* corresponds to 'BBI' */
typedef struct {
  char x;
  char y;
  int z;
} C;

int main(void) {
  printf("sizeof(A) = %zu\n", sizeof(A));
  printf("sizeof(B) = %zu\n", sizeof(B));
  printf("sizeof(C) = %zu\n", sizeof(C));
  return 0;
}

/*                                                                               
Results on a (64-bit) OS X 10.6 machine:                                         
                                                                                 
sizeof(A) = 8                                                                    
sizeof(B) = 12                                                                   
sizeof(C) = 8                                                                    
*/
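The same comparison can be made with today's struct module for the layouts it can express, assuming a typical ABI with 4-byte, 4-aligned int; the nested case is exactly the one that needs the new 'T{}' syntax.

```python
import struct

# Analog of sizeof(C): 'BBI' packs char, char, pad, int.
assert struct.calcsize('BBI') == 8
# Analog of sizeof(A): 'BI' packs char, 3 pad bytes, int.
assert struct.calcsize('BI') == 8
# sizeof(B) == 12 has no current equivalent: 'BT{BI}' needs the new syntax.
```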
msg106175 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-05-20 17:33
(whoops -- used unsigned struct codes to correspond to signed C types there;  but it shouldn't make any difference to the sizes and padding).

> I am tempted to suggest that we just go back to having one specifier at the beginning of the string :).  Things seem to be getting complicated without any clear benefits.

Agreed.  Though if anyone following this issue wants to make the case that there are benefits to being able to change the endianness midway through a string, please do so!
msg106177 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2010-05-20 18:13
> Granted, yes.  But I wouldn't expect the same padding for 'BT{BI}' and 
> 'BBI'.  'BT{BI}' should match a C struct which itself has an embedded 
> struct.  For C, I get the following results on my machine:

I wasn't sure.  The C99 standard does not specify what the behavior should be.  It is implementation defined.  I guess most implementations just set the alignment of the struct to the alignment of its most demanding member.

I need to change how the alignment for nested structures is computed.  Right now alignments are being computed as if the 'T{...}' codes were not there.  I will hold off until we decide what that rule should be, but I think the most demanding element rule seems reasonable.
msg106180 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-05-20 18:50
> The C99 standard does not specify what the behavior should be.

Right;  it's down to the platform ABI.

I think the least common multiple of the alignment requirements of the struct members is the way to go, though.  It's difficult to imagine an ABI for which this lcm isn't the same thing as the largest struct member alignment, but I don't want to categorically say that such ABIs don't exist.

Here's a snippet from the gcc manual [1]:

"Note that the alignment of any given struct or union type is required by the ISO C standard to be at least a perfect multiple of the lowest common multiple of the alignments of all of the members of the struct or union in question."

I'm not sure I could identify the precise pieces of the standard that imply that requirement, though.

[1] http://gcc.gnu.org/onlinedocs/gcc-4.5.0/gcc/Type-Attributes.html#Type-Attributes
msg106181 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-05-20 19:01
Another snippet, from the latest public draft of the System V x86-64 ABI [1]:

"""Structures and unions assume the alignment of their most strictly aligned component. Each member is assigned to the lowest available offset with the appropriate alignment. The size of any object is always a multiple of the object's alignment."""

I'd be fine with using the largest alignment, as above, instead of computing an lcm;  I can't believe it'll ever make a difference in practice.  For an empty struct (not allowed in C99, but allowed as a gcc extension, and allowed by the struct module), the alignment would be 1, of course.

[1] http://www.x86-64.org/documentation/abi.pdf
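The "most strictly aligned component" rule quoted above can be sketched in a few lines. This is an illustrative model (the function and its interface are mine, not from any patch in this issue), computing offsets, total size, and alignment for a struct given (size, alignment) pairs for its members:

```python
def layout(members):
    """Compute offsets, size and alignment of a struct.

    members: list of (size, align) pairs, one per field.
    Implements the System V-style rule: each field goes at the next
    offset that is a multiple of its alignment; the struct's own
    alignment is the max of its members'; total size is rounded up
    to a multiple of that alignment.
    """
    offset = 0
    offsets = []
    align = 1
    for size, a in members:
        offset = (offset + a - 1) // a * a    # pad up to field alignment
        offsets.append(offset)
        offset += size
        align = max(align, a)
    size = (offset + align - 1) // align * align   # trailing padding
    return offsets, size, align

# struct { unsigned char; unsigned int; } -> offsets [0, 4], size 8, align 4
off, size, align = layout([(1, 1), (4, 4)])
print(off, size, align)

# struct { unsigned char; struct { unsigned char; unsigned int; }; }:
# the substruct keeps its own (size, align), so the outer struct is
# 12 bytes -- matching sizeof(B) in the C program above.
print(layout([(1, 1), (size, align)]))
```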
msg106188 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-05-20 20:49
One more reference:

http://msdn.microsoft.com/en-us/library/9dbwhz68(v=VS.80).aspx

gives essentially the same rules for MSVC. "The alignment of the beginning of a structure or a union is the maximum alignment of any individual member. Each member within the structure or union must be placed at its proper alignment as defined in the previous table, which may require implicit internal padding, depending on the previous member."
msg106416 - (view) Author: Travis Oliphant (teoliphant) * (Python committer) Date: 2010-05-25 04:26
On May 19, 2010, at 2:38 PM, Mark Dickinson wrote:

> 
> Mark Dickinson <dickinsm@gmail.com> added the comment:
> 
> Travis, this issue is still assigned to you.  Do you plan to work on this at some stage, or may I unassign you?
> 

You may unassign it from me.   Unfortunately, I don't have time anymore to work on it and I don't see that changing in the coming months. 

Thanks,

-Travis
msg123093 - (view) Author: Pauli Virtanen (pv) * Date: 2010-12-02 18:14
For reference, Numpy's PEP 3118 implementation is here:

http://github.com/numpy/numpy/blob/master/numpy/core/_internal.py#L357

http://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/buffer.c#L76

It would be a good idea to ensure that the numpy and struct implementations are in agreement about details of the format strings.
(I wouldn't take the Numpy implementation as the definitive one, though.)

- The "sub-structs" in Numpy arrays (in align=True mode) are aligned
  according to the maximum alignment of the fields.

- I assumed the 'O' format in the PEP is supposed to be similar to Numpy
  object arrays. This implies some reference counting semantics. The
  Numpy PEP 3118 implementation assumes the memory contains borrowed
  references, valid at least until the buffer is released.
  Unpacking 'O' should probably INCREF whatever PyObject* pointer is
  there.

- I assumed the alignment specifiers were unscoped. I'm not sure
  however whether this is the best thing to do.

- The function pointers and pointers to pointers were not implemented.
  (Numpy cannot represent those as data types.)
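For readers unfamiliar with where these format strings surface, PEP 3118 format strings are what buffer exporters report through memoryview.format; a minimal check with a stdlib exporter:

```python
import array
import struct

# array.array exports a buffer whose format is a struct-style code.
buf = memoryview(array.array('i', [1, 2, 3]))
print(buf.format)     # 'i' -- a PEP 3118 / struct-style format code
print(buf.itemsize == struct.calcsize(buf.format))   # True
```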
msg123204 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-12-03 08:36
> For reference, Numpy's PEP 3118 implementation is here:

Thanks for that, and the other information you give;  that's helpful.

It sounds like we're on the same page with respect to alignment of substructs.  (Bar the mostly academic question of max versus lcm.)

I still like the idea of scoped endianness markers in the substructs, but  if we have to abandon that for compatibility with NumPy that's okay.

> - I assumed the 'O' format in the PEP is supposed to be similar to Numpy
>   object arrays. This implies some reference counting semantics. The
>   Numpy PEP 3118 implementation assumes the memory contains borrowed
>   references, valid at least until the buffer is released.
>   Unpacking 'O' should probably INCREF whatever PyObject* pointer is
>   there.

I'm still confused about how this could work:  when unpacking, how do you know whether the PyObject* pointer points to a valid object or not?  You can ensure that the pointer will always point to a valid object by having the *pack* operation increment reference counts, but then you need a way to automatically decref when the packed string goes out of scope.  So the object returned by 'pack' would somehow have to be something other than a plain string, so that it can deal with automatically doing the DECREF of the held PyObject* pointers when it goes out of scope.

What's the need to have the 'O' format in the struct module?  Is it really necessary there?  Can we get away with not implementing it?
msg123205 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-12-03 08:38
> For reference, Numpy's PEP 3118 implementation is here

BTW, does this already exist in a released version of NumPy?  If not, when is it likely to appear in the wild?
msg123226 - (view) Author: Pauli Virtanen (pv) * Date: 2010-12-03 10:33
> I still like the idea of scoped endianness markers in the substructs,
> but  if we have to abandon that for compatibility with NumPy that's
> okay.

That, or change the Numpy implementation. I don't believe there's yet much code in the wild that changes the alignment specifier on the fly.

[clip: 'O' format code]
> So the object returned by 'pack' would somehow
> have to be something other than a plain string, so that it can deal
> with automatically doing the DECREF of the held PyObject* pointers
> when it goes out of scope.

Yes, the packed object would need to own the references, and it would be the responsibility of the provider of the buffer to ensure that the pointers are valid.

It seems that it's not possible for the `struct` module to correctly implement packing for the 'O' format. Unpacking could be possible, though (but then if you don't have packing, how do you write tests for it?).

Another possibility is to implement the 'O' format unsafely and leave managing the reference counting to whoever uses the `struct` module's capabilities. (And maybe return ctypes pointers on unpacking.)

[clip]
> What's the need to have the 'O' format in the struct module?  Is it
> really necessary there?  Can we get away with not implementing it?

Numpy arrays, when containing Python objects, function as per the 'O' format.

However, for the struct module, I don't see what would be the use case for the 'O' format.

> BTW, does this already exist in a released version of NumPy?  If not,
> when is it likely to appear in the wild?

It's included since the 1.5.0 release which came out last July.

    ***

I think after the implementation is done, the PEP probably needs to be amended with clarifications (and possibly cutting out what is not really needed).
msg123366 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-12-04 16:02
> Another possibility is to implement the 'O' format unsafely [...]

Hmm.  I don't much like that idea.  Historically, it's supposed to be very difficult to segfault the Python interpreter with pure Python code (well except if you're using ctypes, I guess).
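The concern above can be made concrete. A pointer-sized value already round-trips through struct's existing 'P' code, and turning the bare address back into an object (here via ctypes, a CPython-specific trick) is exactly the unsafe operation an 'O' code would have to perform, with nothing in the packed bytes owning a reference:

```python
import ctypes
import struct

x = ["some", "object"]             # must stay alive for this to be safe
packed = struct.pack("P", id(x))   # 'P': native void*; in CPython, id() is the address
addr, = struct.unpack("P", packed)

# CPython-specific and unsafe: `packed` holds no reference to x, so if
# x were garbage-collected this cast would dereference freed memory.
y = ctypes.cast(addr, ctypes.py_object).value
print(y is x)                      # True -- but only because x is still alive
```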
msg125617 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2011-01-07 03:59
Attached is the latest version of the struct string patch.  I tested on OS X 10.6.5 (64-bit) and Ubuntu 10.04 (32-bit).  I also scanned for memory problems with Valgrind.  There is one test failing on 32-bit systems ('test_crasher').  This is due to the fact that 'struct.pack("357913941b", ...)' no longer tries to allocate 357913941 format codes.  This implementation just allocates *one* code and assigns a count of 357913941, which is utilized later when packing/unpacking.  Some work could be done to add better large memory consumption checks, though.

Previous feedback has been incorporated:

   1. Multiplicities allowed on struct specifiers.
   2. Maximum alignment rule.
   3. Struct nesting depth limited (64 levels).
   4. The old behavior of allowing only one byte order specifier
      is kept.  However, the code is written in a way such that the
      scoped behavior would be easy to add.

As before, there will surely be more iterations, but this is good enough for general review to see if things are headed in the right direction.  

This is a difficult one for review because the diffs are really large.  I placed a review on Rietveld here: http://codereview.appspot.com/3863042/.  If anyone has any ideas on how to reduce the number of diffs (perhaps a way to do multiple smaller patches), then that would be cool.  I don't see an obvious way to do this at this point.
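The "one code with a count" representation described above amounts to parsing the format string into (code, repeat) pairs instead of expanding repeats. A hypothetical sketch (not the patch's actual data structures):

```python
import re

def parse_codes(fmt):
    """Split a struct format string into (code, count) pairs.

    '357913941b' becomes a single ('b', 357913941) entry instead of
    357913941 separate one-byte codes, so parsing cost no longer
    scales with the repeat count.
    """
    pairs = []
    for count, code in re.findall(r"(\d*)([a-zA-Z?])", fmt):
        pairs.append((code, int(count) if count else 1))
    return pairs

print(parse_codes("357913941b"))   # [('b', 357913941)]
print(parse_codes("2h3s"))         # [('h', 2), ('s', 3)]
```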
msg130694 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2011-03-12 19:33
Is there still any interest in this work?
msg130695 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2011-03-12 19:41
Yes, there's interest (at least here).  I've just been really short on Python-time recently, so haven't found time to review your patch.
msg130696 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2011-03-12 19:44
I'm going to unassign for now;  I still hope to look at this at some point, but can't see a time in the near future when it's going to happen.
msg143505 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2011-09-05 01:44
Is this work something that might be suitable for the features/pep-3118 repo (http://hg.python.org/features/pep-3118/) ?
msg143509 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-09-05 07:53
Yes, definitely. I'm going to push a new memoryview implementation
(complete for all 1D/native format cases) in a couple of days.

Once that is done, perhaps we could create a memoryview-struct
branch on top of that.
msg167963 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-08-11 14:56
Following up here after rejecting #15622 as invalid

The "unicode" codes in PEP 3118 need to be seriously rethought before any related changes are made in the struct module.

1. The 'c' and 's' codes are currently used for raw bytes data (represented as bytes objects at the Python layer). This means the 'c' code cannot be used as described in PEP 3118 in a world with strict binary/text separation.

2. Any format codes for UCS1, UCS2 and UCS4 are more usefully modelled on 's' than they are on 'c' (so that repeat counts create longer strings rather than lists of strings that each contain a single code point)

3. Given some of the other proposals in PEP 3118, it seems more useful to define an embedded text format as "S{<encoding>}".

UCS1 would then be "S{latin-1}", UCS2 would be approximated as "S{utf-16}" and UCS4 would be "S{utf-32}" and arbitrary encodings would also be supported. struct packing would implicitly encode from text to bytes while unpacking would implicitly decode bytes to text. As with 's' a length mismatch in the encoded form would mean an error.
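The proposed "S{<encoding>}" semantics could look roughly like this. This is entirely hypothetical -- no such code exists in the struct module, and the helper names are mine -- but it captures the encode-on-pack, decode-on-unpack, exact-length behavior described above:

```python
def pack_S(text, encoding, size):
    """Hypothetical 'NS{encoding}' packing: encode the text, then
    enforce the declared byte length exactly, as 's' does."""
    data = text.encode(encoding)
    if len(data) != size:
        raise ValueError("encoded length %d != declared %d" % (len(data), size))
    return data

def unpack_S(data, encoding):
    """Hypothetical 'S{encoding}' unpacking: decode bytes to text."""
    return data.decode(encoding)

print(pack_S("café", "latin-1", 4))      # b'caf\xe9'
print(unpack_S(b"caf\xe9", "latin-1"))   # café
```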
msg187583 - (view) Author: Paul Hoffman (paulehoffman) * Date: 2013-04-22 19:35
Following up on http://mail.python.org/pipermail/python-ideas/2011-March/009656.html, I would like to request that struct also handle half-precision floats directly. It's a short change, and half-precision floats are becoming much more popular in applications.

Adding this to struct would also maybe need to change math.isinf and math.isnan, but maybe not.
msg187589 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2013-04-22 20:14
Paul: there's already an open issue for adding float16 to the struct module: see issue 11734.
msg187591 - (view) Author: Paul Hoffman (paulehoffman) * Date: 2013-04-22 20:18
Whoops, never mind. Thanks for the pointer to 11734.
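For the record, the half-precision code requested above did land via issue 11734: from Python 3.6 on, struct supports 'e' (IEEE 754 binary16). A quick check:

```python
import struct

# '<e' packs a half-precision (16-bit) IEEE 754 float, little-endian.
# 1.0 encodes as 0x3C00.
print(struct.pack("<e", 1.0))              # b'\x00<'
print(struct.unpack("<e", b"\x00<")[0])    # 1.0

# Values outside the binary16 range (max ~65504) fail to pack:
# struct.pack("<e", 1e6) raises OverflowError.
```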
msg263321 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-04-13 10:20
Here's a grammar that roughly describes the subset that NumPy supports.

As for implementing this in the struct module: There is a new data
description language on the horizon:

  http://datashape.readthedocs.org/en/latest/


It does not have all the low-level capabilities (e.g. changing alignment
on the fly), but it is far more readable. Example:

PEP-3118:  "(2,3)10f0fZdT{10B:x:(2,3)d:y:Q:z:}B"
Datashape: "2 * 3 * (10 * float32, 0 * float32, complex128, {x: 10 * uint8, y: 2 * 3 * float64, z: int64}, uint8)"


There are a lot of open questions still. Should "10f" be viewed as an
array[10] of float, i.e. equivalent to (10)f?

In the context of PEP-3118, I think so.
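The current struct module already treats "10f" as ten consecutive floats with no extra structure, which is consistent with reading it as an array[10] of float:

```python
import struct

# "10f" is ten 4-byte floats packed back to back: 40 bytes total.
print(struct.calcsize("10f"))    # 40

# Unpacking yields a flat tuple of ten values, not a nested array.
vals = struct.unpack("10f", bytes(40))
print(len(vals), vals[0])        # 10 0.0
```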
History
Date User Action Args
2022-04-11 14:56:35  admin  set  github: 47382
2016-04-13 10:20:21  skrah  set  versions: + Python 3.6, - Python 3.3
2016-04-13 10:20:08  skrah  set  files: + grammar.y
nosy: + skrah
messages: + msg263321

2014-10-14 16:49:42  skrah  set  nosy: - skrah
2013-11-29 11:07:39  skrah  link  issue19803 superseder
2013-04-22 20:18:40  paulehoffman  set  nosy: barry, teoliphant, mark.dickinson, ncoghlan, belopolsky, pitrou, inducer, ajaksu2, MrJean1, benjamin.peterson, pv, Arfrever, noufal, skrah, meador.inge, martin.panter, paulehoffman
messages: + msg187591
2013-04-22 20:14:27  mark.dickinson  set  messages: + msg187589
2013-04-22 19:35:25  paulehoffman  set  nosy: + paulehoffman
messages: + msg187583
2012-10-27 03:46:16  martin.panter  set  nosy: + martin.panter
2012-08-21 03:14:59  belopolsky  set  nosy: + belopolsky, - Alexander.Belopolsky
2012-08-11 14:56:58  ncoghlan  set  nosy: + ncoghlan
messages: + msg167963
2012-08-11 14:40:31  ncoghlan  unlink  issue15622 dependencies
2012-08-11 10:15:40  Arfrever  set  nosy: + Arfrever
2012-08-11 10:08:26  skrah  link  issue15622 dependencies
2011-09-05 07:53:48  skrah  set  nosy: + skrah
messages: + msg143509
2011-09-05 01:44:51  meador.inge  set  messages: + msg143505
2011-03-12 19:44:04  mark.dickinson  set  assignee: mark.dickinson ->
messages: + msg130696
nosy: barry, teoliphant, mark.dickinson, pitrou, inducer, ajaksu2, MrJean1, benjamin.peterson, pv, noufal, meador.inge, Alexander.Belopolsky
2011-03-12 19:41:51  mark.dickinson  set  nosy: barry, teoliphant, mark.dickinson, pitrou, inducer, ajaksu2, MrJean1, benjamin.peterson, pv, noufal, meador.inge, Alexander.Belopolsky
messages: + msg130695
2011-03-12 19:33:21  meador.inge  set  nosy: barry, teoliphant, mark.dickinson, pitrou, inducer, ajaksu2, MrJean1, benjamin.peterson, pv, noufal, meador.inge, Alexander.Belopolsky
messages: + msg130694
2011-01-08 13:39:03  pitrou  set  assignee: meador.inge -> mark.dickinson
stage: needs patch -> patch review
nosy: barry, teoliphant, mark.dickinson, pitrou, inducer, ajaksu2, MrJean1, benjamin.peterson, pv, noufal, meador.inge, Alexander.Belopolsky
versions: + Python 3.3, - Python 3.2
2011-01-07 03:59:10  meador.inge  set  files: + struct-string.py3k.3.patch
nosy: barry, teoliphant, mark.dickinson, pitrou, inducer, ajaksu2, MrJean1, benjamin.peterson, pv, noufal, meador.inge, Alexander.Belopolsky
messages: + msg125617
2010-12-04 16:02:40  mark.dickinson  set  messages: + msg123366
2010-12-03 10:33:46  pv  set  messages: + msg123226
2010-12-03 08:38:53  mark.dickinson  set  messages: + msg123205
2010-12-03 08:36:31  mark.dickinson  set  messages: + msg123204
2010-12-02 18:14:17  pv  set  nosy: + pv
messages: + msg123093
2010-08-14 20:13:18  meador.inge  set  priority: critical -> high
assignee: teoliphant -> meador.inge
stage: test needed -> needs patch
2010-05-25 04:26:31  teoliphant  set  messages: + msg106416
2010-05-20 20:49:08  mark.dickinson  set  messages: + msg106188
2010-05-20 19:01:14  mark.dickinson  set  messages: + msg106181
2010-05-20 18:50:03  mark.dickinson  set  messages: + msg106180
2010-05-20 18:13:23  meador.inge  set  messages: + msg106177
2010-05-20 17:33:57  mark.dickinson  set  messages: + msg106175
2010-05-20 17:30:35  mark.dickinson  set  messages: + msg106173
2010-05-20 17:17:40  meador.inge  set  messages: + msg106168
2010-05-20 16:24:48  mark.dickinson  set  messages: + msg106164
2010-05-20 14:26:13  mark.dickinson  set  messages: + msg106157
2010-05-20 14:17:42  pitrou  set  messages: + msg106155
2010-05-20 13:50:43  meador.inge  set  files: + struct-string.py3k.2.patch

messages: + msg106153
2010-05-19 19:56:16  mark.dickinson  set  messages: + msg106091
2010-05-19 19:51:30  mark.dickinson  set  messages: + msg106090
2010-05-19 19:46:19  mark.dickinson  set  messages: + msg106089
2010-05-19 19:38:54  mark.dickinson  set  messages: + msg106088
2010-05-19 19:27:14  mark.dickinson  set  messages: + msg106087
2010-05-18 12:22:25  meador.inge  set  messages: + msg105970
2010-05-18 07:12:57  mark.dickinson  set  messages: + msg105955
2010-05-18 04:08:01  meador.inge  set  files: + struct-string.py3k.patch

messages: + msg105952
2010-04-20 18:26:40  noufal  set  nosy: + noufal
2010-03-01 16:24:10  inducer  set  nosy: + inducer
2010-02-26 05:37:20  Alexander.Belopolsky  set  nosy: + Alexander.Belopolsky
2010-02-22 16:14:46  mark.dickinson  set  messages: + msg99771
2010-02-22 04:13:52  meador.inge  set  messages: + msg99711
2010-02-21 19:23:08  mark.dickinson  set  messages: + msg99677
2010-02-21 13:09:18  mark.dickinson  set  messages: + msg99656
2010-02-21 13:06:18  mark.dickinson  set  messages: + msg99655
2010-02-21 13:05:24  mark.dickinson  link  issue2395 superseder
2010-02-21 13:05:24  mark.dickinson  unlink  issue2395 dependencies
2010-02-19 01:02:23  meador.inge  set  messages: + msg99551
2010-02-17 15:18:06  mark.dickinson  set  messages: + msg99474
2010-02-17 15:12:23  mark.dickinson  set  messages: + msg99472
2010-02-17 03:30:54  meador.inge  set  files: - unnamed
2010-02-17 03:30:15  meador.inge  set  files: + unnamed, pep-3118.patch
keywords: + patch
messages: + msg99460
2010-02-13 12:01:36  mark.dickinson  set  messages: + msg99313
2010-02-13 11:07:31  mark.dickinson  set  messages: + msg99312
2010-02-13 06:06:32  teoliphant  set  messages: + msg99309
2010-02-13 01:35:13  benjamin.peterson  set  messages: + msg99297
2010-02-13 01:29:00  meador.inge  set  nosy: + meador.inge
messages: + msg99296
2009-12-24 23:14:07  mark.dickinson  set  nosy: + mark.dickinson
2009-05-16 20:33:49  ajaksu2  link  issue2395 dependencies
2009-05-16 20:33:40  ajaksu2  set  versions: + Python 3.2, - Python 3.1
nosy: + ajaksu2

messages: + msg87921

stage: test needed
2008-08-24 21:38:02  teoliphant  set  messages: + msg71882
2008-08-18 15:51:22  pitrou  set  messages: + msg71342
2008-08-18 15:02:32  benjamin.peterson  set  messages: + msg71338
2008-08-18 09:36:17  pitrou  set  priority: release blocker -> critical
nosy: + pitrou
messages: + msg71316
components: + Library (Lib)
versions: + Python 3.1, - Python 3.0
2008-08-18 03:00:57  barry  set  nosy: + barry
messages: + msg71313
2008-07-31 02:17:13  benjamin.peterson  set  priority: critical -> release blocker
2008-06-21 15:59:21  MrJean1  set  nosy: + MrJean1
messages: + msg68507
2008-06-17 22:30:31  benjamin.peterson  create