msg95467 - (view) |
Author: Steve Krenzel (sgk284) |
Date: 2009-11-19 09:41 |
The struct module has a calcsize() method which reports the size of the data for a specified format
string. In some instances, to the best of my knowledge, this is wrong.
To repro:
>>> from struct import calcsize
>>> calcsize("ci")
8
>>> calcsize("ic")
5
The correct answer is 5 (a single byte character and a four byte int take up 5 bytes of space). For
some reason when a 'c' is followed by an 'i', this is wrong and instead allocates 4 bytes to the 'c'.
This has been verified in 2.6 and 2.5.
You can also repro this by using 's', '2c', and similar combinations in place of 'c'. as well as 'I'
in place of 'i'. This might effect other combinations as well.
|
msg95468 - (view) |
Author: Eric V. Smith (eric.smith) * |
Date: 2009-11-19 12:46 |
It's a padding issue, having to do with putting values at the correct
word boundaries.
|
msg95474 - (view) |
Author: Mark Dickinson (mark.dickinson) * |
Date: 2009-11-19 14:33 |
What Eric said. You can see the padding explicitly in the results of
struct.pack:
>>> struct.pack("ci", '*', 0x12131415) # 8-byte result, 3 padding bytes
'*\x00\x00\x00\x15\x14\x13\x12'
>>> struct.pack("ic", 0x12131415, '*') # 5-byte result, no padding.
'\x15\x14\x13\x12*'
Note the 3 zero bytes in the first result string.
This gets reported frequently enough that I wonder whether the docs
should be rearranged and/or expanded. The existence of padding is
mentioned, but not particularly prominently or thoroughly.
|
msg95480 - (view) |
Author: Mark Dickinson (mark.dickinson) * |
Date: 2009-11-19 15:02 |
Reopening for possible doc clarification. Suggestions welcome!
|
msg95508 - (view) |
Author: Steve Krenzel (sgk284) |
Date: 2009-11-19 19:03 |
Just for clarification, why does "ci" get padded but "ic" doesn't?
While I agree that updating the documentation would help clarify,
perhaps either everything should be padded to word boundaries or
nothing should.
It is weird behavior that "ic" != "ci". If both formats were 8 bytes
then my first thought would have been "Oh, it's just getting padded",
but with this inconsistency it appeared as a bug.
Whatever the reason behind this discrepancy is, it should definitely be
included in the doc updates.
|
msg95509 - (view) |
Author: Eric V. Smith (eric.smith) * |
Date: 2009-11-19 19:07 |
It's basically because nothing comes after it. If you put something
after it, such as a zero length integer, you'll see:
>>> from struct import calcsize
>>> calcsize("ci")
8
>>> calcsize("ic")
5
>>> calcsize("ic0i")
8
|
msg95510 - (view) |
Author: Mark Dickinson (mark.dickinson) * |
Date: 2009-11-19 19:10 |
> Just for clarification, why does "ci" get padded but "ic" doesn't?
Because no padding is necessary in the second case: both the integer and
the character already start at a position that's a multiple of 4---the
integer at position 0 and the character at position 4.
In the first case, without padding, the integer wouldn't start at a word
boundary.
The aim is to make sure that the byte sequence output by struct.pack
matches the layout of a corresponding C struct. In the first case inter-
item padding is necessary to make that work, in the second it isn't.
You could argue that in the second case, Python should add trailing
padding, but I'm not sure what the point would be.
|
msg95512 - (view) |
Author: Mark Dickinson (mark.dickinson) * |
Date: 2009-11-19 19:47 |
I'm half-convinced that struct.pack *should* ideally add trailing padding
in the same situation that C does, for consistency with C. Then calcsize
would match C's sizeof. If you're writing or reading a struct from C,
it's probably easiest/most natural to write or read sizeof(my_struct)
bytes, rather than worrying about stripping trailing padding for
efficiency.
I don't see a sensible way to make this change without breaking backwards
compatibility, though.
(Note: this still wouldn't mean that the calcsize result would be
independent of order: calcsize('cci') and calcsize('cic') would still be
different, for example, on a typical platform.)
Eric's solution of adding '0i' should be included in the documentation
update.
|
msg102790 - (view) |
Author: Meador Inge (meador.inge) * |
Date: 2010-04-10 19:05 |
> I'm half-convinced that struct.pack *should* ideally add trailing
> padding in the same situation that C does, for consistency with C.
> Then calcsize would match C's sizeof.
Maybe... AFAIK, the C language does not mandate structure padding.
Section 6.7.2.1 of the C99 standard says that "unnamed padding" may be inserted anywhere inside of a structure object *except* at the beginning. In other words, padding between members and trailing padding is very implementation specific.
When and why things are padded usually depends on the compiler, OS, and computer architecture. For example, padding between members is usually
just an optimization for architectures like x86, but on architectures like ARM and MIPS you may end up with alignment faults. Moreover, I
think trailing padding is typically added to maintain alignment for
arrays of structures, which again may be an optimization or correctness issue depending on the architecture.
So, I don't think we can match what C does since C leaves it open.
Perhaps we could match the C implementation that compiles the 'struct'
module, though? Maybe there is a way we can compute the trailing padding in a fashion similar to how alignment is computed for various
types in the '*_ALIGN' macros in '_struct.c'?
Another thing that came to mind when thinking about structure layout issues was whether it might be useful to provide "packed" structures, e.g. no padding. If the goal is to provide a way to read\write C-compatible structs, then currently there is no easy way to handle structures that have been serialized after being compiled with something like GCC's '-fpack-struct' option.
|
msg102792 - (view) |
Author: Mark Dickinson (mark.dickinson) * |
Date: 2010-04-10 20:26 |
> When and why things are padded usually depends on the compiler, OS,
> and computer architecture. [...]
Sure. The struct module has a rather simplistic set of rules for adding padding, but I believe that it manages to match the platform's C rules (assuming no relevant extra compile flags) in most cases.
If we wanted something that's *guaranteed* to match C then I agree that's probably not easy within the struct module itself; it probably belongs in ctypes somewhere instead.
For trailing padding, one fairly obvious and simple rule to use (to match the simplicity of the current rules for padding) would be to add enough padding so that the alignment of the struct matches the largest alignment of any member of the struct. So e.g. on a platform where doubles are aligned to 8 bytes and ints to 4 bytes, 'db' would have 7 bytes of trailing padding, while 'iib' would have 3 bytes. 'b' would have no trailing padding at all. I *think* (but would have to check) that this again matches what happens on common platforms.
> Another thing that came to mind when thinking about structure layout
> issues was whether it might be useful to provide "packed" structures
These already exist, provided that you're willing to accept 'standard' rather than 'native' sizes. The struct modes for '<', '>', '=' and '!' never introduce padding (unless it's explicitly asked for with 'x').
---
On the subject of documentation, I think there are two points that need to be made more clearly in the current docs:
(1) *By default* (i.e. without a '<', '>', ...), padding *is* used; but is automatically turned off with '<', '>', '=', '!'. Ideally this should be mentioned very early on, since it seems to trip up a lot of people.
(2) Padding is only added between successive struct members; there's no trailing padding. But padding can be forced using Eric's trick (though you end up having to choose exactly what alignment you want, e.g. by specifying '0i' rather than '0q' or '0d').
|
msg102805 - (view) |
Author: Meador Inge (meador.inge) * |
Date: 2010-04-11 01:41 |
> would be to add enough padding so that the alignment of the struct
> matches the largest alignment of any member of the struct.
That might work. I will have to think about it a bit.
> On the subject of documentation
Attached a doc patch which addresses Mark's two points plus the following:
(1) Organized things more clearly into sub-sections.
(2) Fixed up the "notes" column from the '__index__' change that
was submitted last week.
(3) Added some examples.
|
msg102968 - (view) |
Author: Mark Dickinson (mark.dickinson) * |
Date: 2010-04-12 20:02 |
Many thanks for the patch.
Applied (minus trailing whitespace :) in r80013.
Leaving open to remind me to merge to other branches.
|
msg102970 - (view) |
Author: Mark Dickinson (mark.dickinson) * |
Date: 2010-04-12 20:35 |
Hmm. I note that the examples are for a big-endian machine. I wonder whether they should be converted to little-endian, given that most users are on x86 or x86_64 hardware these days.
|
msg102971 - (view) |
Author: Meador Inge (meador.inge) * |
Date: 2010-04-12 20:39 |
> wonder whether they should be converted to little-endian
I thought about that as well. It sure would make creating new examples easier :) I had to construct the new examples "by hand".
I can regenerate the examples for little-endian if you would like?
|
msg102973 - (view) |
Author: Mark Dickinson (mark.dickinson) * |
Date: 2010-04-12 21:09 |
Sure, if you want to. [Then I won't be able to reproduce what's in the docs on either of my machines. :) One's 32-bit big-endian; the other's 64-bit little-endian.]
Merged your changes and subsequent tweaks to py3k in r80016.
|
msg102981 - (view) |
Author: Meador Inge (meador.inge) * |
Date: 2010-04-12 22:11 |
> Sure, if you want to. [Then I won't be able to reproduce what's in the
> docs on either of my machines. :) One's 32-bit big-endian; the other's
> 64-bit little-endian.]
>
I guess it's painful either way. I think the only fair thing to do is
provide examples for multiple scenarios, e.g. 32-bit little-endian, 64-bit
big-endian, etc... Share the pain :)
Cool, thanks! Your padding note tweak is a definite improvement over what I
originally had.
|
msg102983 - (view) |
Author: Meador Inge (meador.inge) * |
Date: 2010-04-12 22:19 |
Those last two sentences where meant to be in response to your "Merged your changes and subsequent tweaks to py3k in r80016." comment. I tried to reply to the tracker from my mail client and things got reformatted :(
|
msg106321 - (view) |
Author: Mark Dickinson (mark.dickinson) * |
Date: 2010-05-22 18:59 |
Doc changes merged in r81477 (release26-maint) and r81480 (release31-maint).
|
|
Date |
User |
Action |
Args |
2022-04-11 14:56:54 | admin | set | github: 51604 |
2013-04-02 13:37:44 | mark.dickinson | link | issue17617 superseder |
2010-05-22 18:59:00 | mark.dickinson | set | status: open -> closed resolution: fixed messages:
+ msg106321
|
2010-04-12 22:19:29 | meador.inge | set | messages:
+ msg102983 |
2010-04-12 22:11:18 | meador.inge | set | files:
+ unnamed
messages:
+ msg102981 |
2010-04-12 21:09:03 | mark.dickinson | set | messages:
+ msg102973 |
2010-04-12 20:39:38 | meador.inge | set | messages:
+ msg102971 |
2010-04-12 20:36:00 | mark.dickinson | set | messages:
+ msg102970 |
2010-04-12 20:02:13 | mark.dickinson | set | messages:
+ msg102968 versions:
- Python 2.7 |
2010-04-11 01:41:19 | meador.inge | set | files:
+ issue-7355.patch keywords:
+ patch messages:
+ msg102805
|
2010-04-10 20:26:42 | mark.dickinson | set | messages:
+ msg102792 |
2010-04-10 19:05:25 | meador.inge | set | messages:
+ msg102790 |
2010-04-10 13:04:36 | meador.inge | set | nosy:
+ meador.inge
|
2009-11-19 19:47:01 | mark.dickinson | set | messages:
+ msg95512 |
2009-11-19 19:10:20 | mark.dickinson | set | messages:
+ msg95510 |
2009-11-19 19:07:51 | eric.smith | set | messages:
+ msg95509 |
2009-11-19 19:03:18 | sgk284 | set | messages:
+ msg95508 |
2009-11-19 15:02:36 | mark.dickinson | set | status: closed -> open priority: low
components:
+ Documentation, Extension Modules, - Library (Lib) assignee: mark.dickinson versions:
+ Python 3.1, Python 2.7, Python 3.2, - Python 2.5 keywords:
+ easy resolution: not a bug -> (no value) messages:
+ msg95480 stage: resolved -> needs patch |
2009-11-19 14:33:02 | mark.dickinson | set | nosy:
+ mark.dickinson messages:
+ msg95474
|
2009-11-19 12:46:03 | eric.smith | set | status: open -> closed
nosy:
+ eric.smith messages:
+ msg95468
resolution: not a bug stage: resolved |
2009-11-19 09:42:00 | sgk284 | create | |