This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Struct incorrectly compiles format strings
Type: behavior Stage: needs patch
Components: Documentation, Extension Modules Versions: Python 3.1, Python 3.2, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: mark.dickinson Nosy List: eric.smith, mark.dickinson, meador.inge, sgk284
Priority: low Keywords: easy, patch

Created on 2009-11-19 09:42 by sgk284, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
issue-7355.patch meador.inge, 2010-04-11 01:41
unnamed meador.inge, 2010-04-12 22:11
Messages (18)
msg95467 - (view) Author: Steve Krenzel (sgk284) Date: 2009-11-19 09:41
The struct module has a calcsize() method which reports the size of the data for a specified format 
string. In some instances, to the best of my knowledge, this is wrong.

To repro:
>>> from struct import calcsize
>>> calcsize("ci")
8
>>> calcsize("ic")
5

The correct answer is 5 (a single byte character and a four byte int take up 5 bytes of space). For 
some reason when a 'c' is followed by an 'i', this is wrong and instead allocates 4 bytes to the 'c'.
This has been verified in 2.6 and 2.5.

You can also repro this by using 's', '2c', and similar combinations in place of 'c'. as well as 'I' 
in place of 'i'. This might effect other combinations as well.
msg95468 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2009-11-19 12:46
It's a padding issue, having to do with putting values at the correct
word boundaries.
msg95474 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-11-19 14:33
What Eric said.  You can see the padding explicitly in the results of
struct.pack:

>>> struct.pack("ci", '*', 0x12131415) # 8-byte result, 3 padding bytes
'*\x00\x00\x00\x15\x14\x13\x12'
>>> struct.pack("ic", 0x12131415, '*') # 5-byte result, no padding.
'\x15\x14\x13\x12*'

Note the 3 zero bytes in the first result string.

This gets reported frequently enough that I wonder whether the docs
should be rearranged and/or expanded.  The existence of padding is
mentioned, but not particularly prominently or thoroughly.
msg95480 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-11-19 15:02
Reopening for possible doc clarification.  Suggestions welcome!
msg95508 - (view) Author: Steve Krenzel (sgk284) Date: 2009-11-19 19:03
Just for clarification, why does "ci" get padded but "ic" doesn't? 
While I agree that updating the documentation would help clarify, 
perhaps either everything should be padded to word boundaries or 
nothing should.

It is weird behavior that "ic" != "ci". If both formats were 8 bytes 
then my first thought would have been "Oh, it's just getting padded", 
but with this inconsistency it appeared as a bug.

Whatever the reason behind this discrepancy is, it should definitely be 
included in the doc updates.
msg95509 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2009-11-19 19:07
It's basically because nothing comes after it. If you put something
after it, such as a zero length integer, you'll see:

>>> from struct import calcsize
>>> calcsize("ci")
8
>>> calcsize("ic")
5
>>> calcsize("ic0i")
8
msg95510 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-11-19 19:10
> Just for clarification, why does "ci" get padded but "ic" doesn't?

Because no padding is necessary in the second case:  both the integer and 
the character already start at a position that's a multiple of 4---the 
integer at position 0 and the character at position 4.

In the first case, without padding, the integer wouldn't start at a word 
boundary.

The aim is to make sure that the byte sequence output by struct.pack 
matches the layout of a corresponding C struct.  In the first case inter-
item padding is necessary to make that work, in the second it isn't.

You could argue that in the second case, Python should add trailing 
padding, but I'm not sure what the point would be.
msg95512 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-11-19 19:47
I'm half-convinced that struct.pack *should* ideally add trailing padding 
in the same situation that C does, for consistency with C.  Then calcsize 
would match C's sizeof.  If you're writing or reading a struct from C, 
it's probably easiest/most natural to write or read sizeof(my_struct) 
bytes, rather than worrying about stripping trailing padding for 
efficiency.

I don't see a sensible way to make this change without breaking backwards 
compatibility, though.

(Note: this still wouldn't mean that the calcsize result would be 
independent of order: calcsize('cci') and calcsize('cic') would still be 
different, for example, on a typical platform.)  

Eric's solution of adding '0i' should be included in the documentation 
update.
msg102790 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2010-04-10 19:05
> I'm half-convinced that struct.pack *should* ideally add trailing 
> padding in the same situation that C does, for consistency with C.  
> Then calcsize would match C's sizeof.

Maybe...  AFAIK, the C language does not mandate structure padding.
Section 6.7.2.1 of the C99 standard says that "unnamed padding" may be inserted anywhere inside of a structure object *except* at the beginning.  In other words, padding between members and trailing  padding is very implementation specific.  

When and why things are padded usually depends on the compiler, OS, and computer architecture.  For example, padding between members is usually
just an optimization for architectures like x86, but on architectures like ARM and MIPS you may end up with alignment faults.  Moreover, I 
think trailing padding is typically added to maintain alignment for 
arrays of structures, which again may be an optimization or correctness issue depending on the architecture.

So, I don't think we can match what C does since C leaves it open.
Perhaps we could match the C implementation that compiles the 'struct'
module, though?  Maybe there is a way we can compute the trailing padding in a fashion similar to how alignment is computed for various
types in the '*_ALIGN' macros in '_struct.c'?

Another thing that came to mind when thinking about structure layout issues was whether it might be useful to provide "packed" structures, e.g. no padding.  If the goal is to provide a way to read\write C-compatible structs, then currently there is no easy way to handle structures that have been serialized after being compiled with something like GCC's '-fpack-struct' option.
msg102792 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-04-10 20:26
> When and why things are padded usually depends on the compiler, OS,
> and computer architecture. [...]

Sure.  The struct module has a rather simplistic set of rules for adding padding, but I believe that it manages to match the platform's C rules (assuming no relevant extra compile flags) in most cases.

If we wanted something that's *guaranteed* to match C then I agree that's probably not easy within the struct module itself;  it probably belongs in ctypes somewhere instead.

For trailing padding, one fairly obvious and simple rule to use (to match the simplicity of the current rules for padding) would be to add enough padding so that the alignment of the struct matches the largest alignment of any member of the struct.  So e.g. on a platform where doubles are aligned to 8 bytes and ints to 4 bytes, 'db' would have 7 bytes of trailing padding, while 'iib' would have 3 bytes.  'b' would have no trailing padding at all.  I *think* (but would have to check) that this again matches what happens on common platforms.

> Another thing that came to mind when thinking about structure layout
> issues was whether it might be useful to provide "packed" structures

These already exist, provided that you're willing to accept 'standard' rather than 'native' sizes.  The struct modes for '<', '>', '=' and '!' never introduce padding (unless it's explicitly asked for with 'x').

---

On the subject of documentation, I think there are two points that need to be made more clearly in the current docs:

(1) *By default* (i.e. without a '<', '>', ...), padding *is* used;  but is automatically turned off with '<', '>', '=', '!'.  Ideally this should be mentioned very early on, since it seems to trip up a lot of people.

(2) Padding is only added between successive struct members; there's no trailing padding.  But padding can be forced using Eric's trick (though you end up having to choose exactly what alignment you want, e.g. by specifying '0i' rather than '0q' or '0d').
msg102805 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2010-04-11 01:41
> would be to add enough padding so that the alignment of the struct 
> matches the largest alignment of any member of the struct. 

That might work.  I will have to think about it a bit.

> On the subject of documentation

Attached a doc patch which addresses Mark's two points plus the following:

(1) Organized things more clearly into sub-sections.
(2) Fixed up the "notes" column from the '__index__' change that
    was submitted last week.
(3) Added some examples.
msg102968 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-04-12 20:02
Many thanks for the patch.

Applied (minus trailing whitespace :) in r80013.

Leaving open to remind me to merge to other branches.
msg102970 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-04-12 20:35
Hmm.  I note that the examples are for a big-endian machine.  I wonder whether they should be converted to little-endian, given that most users are on x86 or x86_64 hardware these days.
msg102971 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2010-04-12 20:39
> wonder whether they should be converted to little-endian

I thought about that as well.  It sure would make creating new examples easier :)  I had to construct the new examples "by hand".

I can regenerate the examples for little-endian if you would like?
msg102973 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-04-12 21:09
Sure, if you want to.  [Then I won't be able to reproduce what's in the docs on either of my machines. :)  One's 32-bit big-endian;  the other's 64-bit little-endian.]

Merged your changes and subsequent tweaks to py3k in r80016.
msg102981 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2010-04-12 22:11
> Sure, if you want to.  [Then I won't be able to reproduce what's in the
> docs on either of my machines. :)  One's 32-bit big-endian;  the other's
> 64-bit little-endian.]
>

I guess it's painful either way.  I think the only fair thing to do is
provide examples for multiple scenarios, e.g. 32-bit little-endian, 64-bit
big-endian, etc...  Share the pain :)

Cool, thanks!  Your padding note tweak is a definite improvement over what I
originally had.
msg102983 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2010-04-12 22:19
Those last two sentences where meant to be in response to your "Merged your changes and subsequent tweaks to py3k in r80016." comment.  I tried to reply to the tracker from my mail client and things got reformatted :(
msg106321 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010-05-22 18:59
Doc changes merged in r81477 (release26-maint) and r81480 (release31-maint).
History
Date User Action Args
2022-04-11 14:56:54adminsetgithub: 51604
2013-04-02 13:37:44mark.dickinsonlinkissue17617 superseder
2010-05-22 18:59:00mark.dickinsonsetstatus: open -> closed
resolution: fixed
messages: + msg106321
2010-04-12 22:19:29meador.ingesetmessages: + msg102983
2010-04-12 22:11:18meador.ingesetfiles: + unnamed

messages: + msg102981
2010-04-12 21:09:03mark.dickinsonsetmessages: + msg102973
2010-04-12 20:39:38meador.ingesetmessages: + msg102971
2010-04-12 20:36:00mark.dickinsonsetmessages: + msg102970
2010-04-12 20:02:13mark.dickinsonsetmessages: + msg102968
versions: - Python 2.7
2010-04-11 01:41:19meador.ingesetfiles: + issue-7355.patch
keywords: + patch
messages: + msg102805
2010-04-10 20:26:42mark.dickinsonsetmessages: + msg102792
2010-04-10 19:05:25meador.ingesetmessages: + msg102790
2010-04-10 13:04:36meador.ingesetnosy: + meador.inge
2009-11-19 19:47:01mark.dickinsonsetmessages: + msg95512
2009-11-19 19:10:20mark.dickinsonsetmessages: + msg95510
2009-11-19 19:07:51eric.smithsetmessages: + msg95509
2009-11-19 19:03:18sgk284setmessages: + msg95508
2009-11-19 15:02:36mark.dickinsonsetstatus: closed -> open
priority: low

components: + Documentation, Extension Modules, - Library (Lib)
assignee: mark.dickinson
versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.5
keywords: + easy
resolution: not a bug -> (no value)
messages: + msg95480
stage: resolved -> needs patch
2009-11-19 14:33:02mark.dickinsonsetnosy: + mark.dickinson
messages: + msg95474
2009-11-19 12:46:03eric.smithsetstatus: open -> closed

nosy: + eric.smith
messages: + msg95468

resolution: not a bug
stage: resolved
2009-11-19 09:42:00sgk284create