struct.pack() and Unicode strings #54992

dabeaz · 2010-12-27T20:09:51Z

BPO	10783
Nosy	@birkenfeld, @rhettinger, @amauryfa, @mdickinson, @vstinner, @bitdancer
Files	struct.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2010-12-28.13:27:48.589>
created_at = <Date 2010-12-27.20:09:50.712>
labels = ['type-bug', 'library']
title = 'struct.pack() and Unicode strings'
updated_at = <Date 2010-12-28.13:40:18.588>
user = 'https://bugs.python.org/dabeaz'

bugs.python.org fields:

activity = <Date 2010-12-28.13:40:18.588>
actor = 'dabeaz'
assignee = 'none'
closed = True
closed_date = <Date 2010-12-28.13:27:48.589>
closer = 'vstinner'
components = ['Library (Lib)']
creation = <Date 2010-12-27.20:09:50.712>
creator = 'dabeaz'
dependencies = []
files = ['20175']
hgrepos = []
issue_num = 10783
keywords = ['patch']
message_count = 23.0
messages = ['124727', '124730', '124732', '124733', '124739', '124740', '124741', '124742', '124743', '124745', '124748', '124758', '124765', '124767', '124768', '124769', '124770', '124779', '124789', '124791', '124792', '124793', '124794']
nosy_count = 7.0
nosy_names = ['georg.brandl', 'rhettinger', 'amaury.forgeotdarc', 'mark.dickinson', 'vstinner', 'r.david.murray', 'dabeaz']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue10783'
versions = ['Python 3.2']

dabeaz · 2010-12-27T20:09:51Z

Is the struct.pack() function supposed to automatically encode Unicode strings into binary? For example:

>>> struct.pack("10s","Jalape\u00f1o")
b'Jalape\xc3\xb1o\x00'
>>>

This is Python 3.2b1.

dabeaz · 2010-12-27T20:11:40Z

Note: This is what happens in Python 2.6.4:

>>> import struct
>>> struct.pack("10s",u"Jalape\u00f1o")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
struct.error: argument for 's' must be a string
>>>

dabeaz · 2010-12-27T20:16:59Z

Hmmm. Well, the docs seem to say that it's allowed and that it will be encoded as UTF-8.

Given the treatment of Unicode/bytes elsewhere in Python 3, all I can say is that this behavior is rather surprising.

bitdancer · 2010-12-27T20:18:48Z

But clearly intentional, and now enshrined in released code.

rhettinger · 2010-12-27T22:55:54Z

Can we at least offer an optional choice of encoding?

dabeaz · 2010-12-27T23:31:59Z

Why is it even encoding at all? Almost every other part of Python 3 forces you to be explicit about bytes/string conversion. For example:

struct.pack("10s", x.encode('utf-8'))

Given that automatic conversion is documented, it's not clear what can be done at this point. However, there are very few other parts of Python 3 that perform implicit string-byte conversions like this (at least that I know of off-hand).

rhettinger · 2010-12-28T00:03:37Z

Many of these kind of "decisions" were made quickly, haphazardly, and with almost no discussion and were made by contributors who were new to Python core development (no familiar with the API norms).

Given the rats nest of bytes/text problems in Py3.0 and Py3.1, I think it is fair game to fix it now. The APIs have not been shaken-out and battle-tested through wide-spread adoption, so it was fair to expect that the first experienced user to come along would find these rough patches.

ISTM, this should get fixed. The most innocuous way to do it is to add a warning for the implicit conversion. That way, any existing 3.x code (probably precious little) would continue to run. Another option is to just finish the job by adding an encoding parameter that defaults to utf-8.

rhettinger · 2010-12-28T00:14:00Z

A possible answer to "why is this encoding at all" was probably to make it easier to transition code from python 2.x where strings were usually ascii and it would make no difference in output if encoded in utf-8. The 2-to-3 fixer was good at handling name changes but not bytes/text issues. That is just a guess at what the developer may have been thinking.

dabeaz · 2010-12-28T00:21:09Z

I encountered this issue is in the context of distributed computing/interprocess communication involving binary-encoded records (and encoding/decoding such records using struct). At its core, this is all about I/O--something where encodings and decoding matter a lot. Frankly, it was quite surprising that a unicode string would silently pass through struct and turn into bytes. IMHO, the fact that this is even possible encourages a sloppy usage of struct that favors programming convenience over correctness--something that's only going to end badly for the poor soul who passes non-ASCII characters into struct without knowing it.

A default encoding might be okay as long as it was set to something like ASCII or Latin-1 (not UTF-8). At least then you'd get an encoding error for characters that don't fit into a byte.

dabeaz · 2010-12-28T00:47:20Z

Actually, here's another one of my favorite examples:

>>> import struct
>>> struct.pack("s","\xf1")
b'\xc3'
>>>

Not only does this not encode the correct value, it doesn't even encode the entire UTF-8 encoding (just the first byte of it). Like I said, pity the poor bastard who puts something that in their code and they spend the whole day trying figure out where in the hell '\xf1' magically got turned into '\xc3'.

vstinner · 2010-12-28T01:13:36Z

This "feature" was introduced in a big commit from Guido van Rossum (made before Python 3.0): r55500. The changelog is strange because it starts with "Make test_zipfile pass. The zipfile module now does all I/O in binary mode using bytes." but ends with "The _struct needed a patch to support bytes, str8 and str for the 's' and 'p' formats.". Why was _struct patched at the same time?

Implicit conversion bytes and str is a very bad idea, it is the root of all confusion related to Unicode. The experience with Python 2 demonstrated that it should be changed, and it was changed in Python 3.0. But "Python 3.0" is a big project, it has many modules. Some modules were completly broken in Python 3.0, it works better with 3.1, and we hope that it will be even better with 3.2.

Attached patch removes the implicit conversion for 'c', 's' and 'p' formats. I did a similar change in ctypes, 5 months ago: issue bpo-8966.

If a program written for Python 3.1 fails because of the patch, it can use explicit conversion to stay compatible with 3.1 and 3.2 (patched). I think that it's better to use explicit conversion.

Implicit conversion on 'c' format is really weird and it was not documented correctly: the note (1) is attached to "b" format, not to the "c" format. Example:

   >>> struct.pack('c', 'é')
   struct.error: char format requires bytes or string of length 1
   >>> len('é')
   1

There is also a length issue with the s format: struct.pack() truncates unicode string to a length in bytes, not in character, it is confusiong.

  >>> struct.pack('2s', 'ha')
   b'ha'
   >>> struct.pack('2s', 'hé')
   b'h\xc3'
   >>> struct.pack('3s', 'hé')
   b'h\xc3\xa9'

Finally, I don't like implicit conversion from unicode to bytes on pack, because it's not symmetrical.

   >>> struct.pack('3s', 'hé')
   b'h\xc3\xa9'
   >>> struct.unpack('3s', b'h\xc3\xa9')
   (b'h\xc3\xa9',)

(str -> pack() -> unpack() -> bytes)

bitdancer · 2010-12-28T03:47:40Z

>>> struct.pack('2s', 'ha')
   b'ha'
   >>> struct.pack('2s', 'hé')
   b'h\xc3'
   >>> struct.pack('3s', 'hé')
   b'h\xc3\xa9'

That looks like a *buggy* api to me, too. I don't see how we can let that stand.

amauryfa · 2010-12-28T07:41:48Z

At this point a feature change seems unlikely, but it's not too late to emit a DeprecationWarning.

rhettinger · 2010-12-28T07:56:13Z

Errors should not pass silently :-)

Given the buggy behavior, we have several options including just removing the implicit conversion and going back to bytes only.

rhettinger · 2010-12-28T08:00:12Z

> At this point a feature change seems unlikely,

Amaury, it is not too late to fix anything that's broken. New features are out, but we are free to fix anything hosed this badly.

amauryfa · 2010-12-28T08:24:05Z

But there are probably working usages with unicode strings out there. For example, I've seen code like struct.pack('<6sHHBBB', 'GIF87a', ...)
Do you suggest to make this 3.1 code stop working in 3.2?

In any case, the 'c' format should probably be changed as well.

vstinner · 2010-12-28T09:07:11Z

Amaury> At this point a feature change seems unlikely,
Amaury> but it's not too late to emit a DeprecationWarning.

I prefer to break the API today than having to maintain a broken API for
10 or 20 years :-) And we have a very small user base using Python 3,
it's easier to change it now, than in the next release.

Amaury> But there are probably working usages with unicode strings out
Amaury> there. For example, I've seen code like
Amaury> struct.pack('<6sHHBBB','GIF87a', ...)
Amaury> Do you suggest to make this 3.1 code stop working in 3.2?

Yes. As I wrote in my previous message, this can be changed to
struct.pack('<6sHHBBB', b'GIF87a', ...), which works on 3.1 and 3.2
(patched).

birkenfeld · 2010-12-28T10:25:46Z

I agree this automatic conversion is broken and should be fixed. Not sure if emitting a DeprecationWarning now and fixing it 18 months later is the right thing, especially since DeprecationWarnings are now silent.
As Victor says, the incompatibility is explicit, and the fix is simple and fully backwards compatible, while the current behavior will cause nasty intermittent surprises.

amauryfa · 2010-12-28T12:09:58Z

Since the Release Manager agrees with the change, I withdraw my objection.

I have three remarks to the patch:

- Some examples in the documentation should be fixed:
http://docs.python.org/dev/py3k/library/struct.html#examples
>>> pack('ci', '*', 0x12131415)
will now raise an exception.

the message "argument for 's' must be a bytes" looks a bit weird. "a bytes object" seems better.
the 'p' format (Pascal String) should be changed as well. This is the last call to _PyUnicode_AsDefaultEncodedString() in this file...

dabeaz · 2010-12-28T13:16:01Z

As a user of Python 3, I would like echo Victor's comment about fixing the API right now as opposed to having to deal with it later. I can only speak for myself, but I would guess that anyone using Python 3 already understands that it's bleeding edge and that the bytes/strings distinction is really important. If fixing this breaks some third party libraries, I say good--they shouldn't have been blindly passing Unicode into struct in the first place. Better to deal with it now when the number of users is relatively small.

vstinner · 2010-12-28T13:27:49Z

Fixed by r87537. Thanks Amaury for your review!

I also removed some ugly (implicit) conversions from test_struct.

birkenfeld · 2010-12-28T13:32:45Z

Thanks, Victor!

dabeaz · 2010-12-28T13:40:19Z

Thanks everyone for looking at this!

dabeaz mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Dec 27, 2010

bitdancer closed this as completed Dec 27, 2010

bitdancer added the invalid label Dec 27, 2010

vstinner reopened this Dec 28, 2010

vstinner removed the invalid label Dec 28, 2010

vstinner closed this as completed Dec 28, 2010

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

struct.pack() and Unicode strings #54992

struct.pack() and Unicode strings #54992

dabeaz mannequin commented Dec 27, 2010

dabeaz mannequin commented Dec 27, 2010

dabeaz mannequin commented Dec 27, 2010

dabeaz mannequin commented Dec 27, 2010

bitdancer commented Dec 27, 2010

rhettinger commented Dec 27, 2010

dabeaz mannequin commented Dec 27, 2010

rhettinger commented Dec 28, 2010

rhettinger commented Dec 28, 2010

dabeaz mannequin commented Dec 28, 2010

dabeaz mannequin commented Dec 28, 2010

vstinner commented Dec 28, 2010

bitdancer commented Dec 28, 2010

amauryfa commented Dec 28, 2010

rhettinger commented Dec 28, 2010

rhettinger commented Dec 28, 2010

amauryfa commented Dec 28, 2010

vstinner commented Dec 28, 2010

birkenfeld commented Dec 28, 2010

amauryfa commented Dec 28, 2010

dabeaz mannequin commented Dec 28, 2010

vstinner commented Dec 28, 2010

birkenfeld commented Dec 28, 2010

dabeaz mannequin commented Dec 28, 2010

struct.pack() and Unicode strings #54992

struct.pack() and Unicode strings #54992

Comments

dabeaz mannequin commented Dec 27, 2010

dabeaz mannequin commented Dec 27, 2010

dabeaz mannequin commented Dec 27, 2010

dabeaz mannequin commented Dec 27, 2010

bitdancer commented Dec 27, 2010

rhettinger commented Dec 27, 2010

dabeaz mannequin commented Dec 27, 2010

rhettinger commented Dec 28, 2010

rhettinger commented Dec 28, 2010

dabeaz mannequin commented Dec 28, 2010

dabeaz mannequin commented Dec 28, 2010

vstinner commented Dec 28, 2010

bitdancer commented Dec 28, 2010

amauryfa commented Dec 28, 2010

rhettinger commented Dec 28, 2010

rhettinger commented Dec 28, 2010

amauryfa commented Dec 28, 2010

vstinner commented Dec 28, 2010

birkenfeld commented Dec 28, 2010

amauryfa commented Dec 28, 2010

dabeaz mannequin commented Dec 28, 2010

vstinner commented Dec 28, 2010

birkenfeld commented Dec 28, 2010

dabeaz mannequin commented Dec 28, 2010