classification
Title: struct.pack() and Unicode strings
Type: behavior Stage: committed/rejected
Components: Library (Lib) Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, dabeaz, georg.brandl, haypo, mark.dickinson, r.david.murray, rhettinger
Priority: normal Keywords: patch

Created on 2010-12-27 20:09 by dabeaz, last changed 2010-12-28 13:40 by dabeaz. This issue is now closed.

Files
File name Uploaded Description Edit
struct.patch haypo, 2010-12-28 01:13
Messages (23)
msg124727 - (view) Author: David Beazley (dabeaz) Date: 2010-12-27 20:09
Is the struct.pack() function supposed to automatically encode Unicode strings into binary?  For example:

>>> struct.pack("10s","Jalape\u00f1o")
b'Jalape\xc3\xb1o\x00'
>>>

This is Python 3.2b1.
msg124730 - (view) Author: David Beazley (dabeaz) Date: 2010-12-27 20:11
Note: This is what happens in Python 2.6.4:

>>> import struct
>>> struct.pack("10s",u"Jalape\u00f1o")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
struct.error: argument for 's' must be a string
>>>
msg124732 - (view) Author: David Beazley (dabeaz) Date: 2010-12-27 20:16
Hmmm. Well, the docs seem to say that it's allowed and that it will be encoded as UTF-8.  

Given the treatment of Unicode/bytes elsewhere in Python 3, all I can say is that this behavior is rather surprising.
msg124733 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-27 20:18
But clearly intentional, and now enshrined in released code.
msg124739 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-12-27 22:55
Can we at least offer an optional choice of encoding?
msg124740 - (view) Author: David Beazley (dabeaz) Date: 2010-12-27 23:31
Why is it even encoding at all?  Almost every other part of Python 3 forces you to be explicit about bytes/string conversion.  For example:

struct.pack("10s", x.encode('utf-8'))

Given that automatic conversion is documented, it's not clear what can be done at this point.  However, there are very few other parts of Python 3 that perform implicit string-byte conversions like this (at least that I know of off-hand).
msg124741 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-12-28 00:03
Many of these kind of "decisions" were made quickly, haphazardly, and with almost no discussion and were made by contributors who were new to Python core development (no familiar with the API norms).

Given the rats nest of bytes/text problems in Py3.0 and Py3.1, I think it is fair game to fix it now.  The APIs have not been shaken-out and battle-tested through wide-spread adoption, so it was fair to expect that the first experienced user to come along would find these rough patches.  

ISTM, this should get fixed.  The most innocuous way to do it is to add a warning for the implicit conversion.  That way, any existing 3.x code (probably precious little) would continue to run.  Another option is to just finish the job by adding an encoding parameter that defaults to utf-8.
msg124742 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-12-28 00:13
A possible answer to "why is this encoding at all" was probably to make it easier to transition code from python 2.x where strings were usually ascii and it would make no difference in output if encoded in utf-8.  The 2-to-3 fixer was good at handling name changes but not bytes/text issues.  That is just a guess at what the developer may have been thinking.
msg124743 - (view) Author: David Beazley (dabeaz) Date: 2010-12-28 00:21
I encountered this issue is in the context of distributed computing/interprocess communication involving binary-encoded records (and encoding/decoding such records using struct). At its core, this is all about I/O--something where encodings and decoding matter a lot.  Frankly, it was quite surprising that a unicode string would silently pass through struct and turn into bytes.  IMHO, the fact that this is even possible encourages a sloppy usage of struct that favors programming convenience over correctness--something that's only going to end badly for the poor soul who passes non-ASCII characters into struct without knowing it. 

A default encoding might be okay as long as it was set to something like ASCII or Latin-1 (not UTF-8).  At least then you'd get an encoding error for characters that don't fit into a byte.
msg124745 - (view) Author: David Beazley (dabeaz) Date: 2010-12-28 00:47
Actually, here's another one of my favorite examples:

>>> import struct
>>> struct.pack("s","\xf1")
b'\xc3'
>>> 

Not only does this not encode the correct value, it doesn't even encode the entire UTF-8 encoding (just the first byte of it).   Like I said, pity the poor bastard who puts something that in their code and they spend the whole day trying figure out where in the hell '\xf1' magically got turned into '\xc3'.
msg124748 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-12-28 01:13
This "feature" was introduced in a big commit from Guido van Rossum (made before Python 3.0): r55500. The changelog is strange because it starts with "Make test_zipfile pass. The zipfile module now does all I/O in binary mode using bytes." but ends with "The _struct needed a patch to support bytes, str8 and str for the 's' and 'p' formats.". Why was _struct patched at the same time?

Implicit conversion bytes and str is a very bad idea, it is the root of all confusion related to Unicode. The experience with Python 2 demonstrated that it should be changed, and it was changed in Python 3.0. But "Python 3.0" is a big project, it has many modules. Some modules were completly broken in Python 3.0, it works better with 3.1, and we hope that it will be even better with 3.2.

Attached patch removes the implicit conversion for 'c', 's' and 'p' formats. I did a similar change in ctypes, 5 months ago: issue #8966.

If a program written for Python 3.1 fails because of the patch, it can use explicit conversion to stay compatible with 3.1 and 3.2 (patched). I think that it's better to use explicit conversion.

Implicit conversion on 'c' format is really weird and it was not documented correctly: the note (1) is attached to "b" format, not to the "c" format. Example:

   >>> struct.pack('c', 'é')
   struct.error: char format requires bytes or string of length 1
   >>> len('é')
   1

There is also a length issue with the s format: struct.pack() truncates unicode string to a length in bytes, not in character, it is confusiong.

  >>> struct.pack('2s', 'ha')
   b'ha'
   >>> struct.pack('2s', 'hé')
   b'h\xc3'
   >>> struct.pack('3s', 'hé')
   b'h\xc3\xa9'

Finally, I don't like implicit conversion from unicode to bytes on pack, because it's not symmetrical.

   >>> struct.pack('3s', 'hé')
   b'h\xc3\xa9'
   >>> struct.unpack('3s', b'h\xc3\xa9')
   (b'h\xc3\xa9',)

(str -> pack() -> unpack() -> bytes)
msg124758 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-28 03:47
>>> struct.pack('2s', 'ha')
   b'ha'
   >>> struct.pack('2s', 'hé')
   b'h\xc3'
   >>> struct.pack('3s', 'hé')
   b'h\xc3\xa9'

That looks like a *buggy* api to me, too.  I don't see how we can let that stand.
msg124765 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-12-28 07:41
At this point a feature change seems unlikely, but it's not too late to emit a DeprecationWarning.
msg124767 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-12-28 07:56
Errors should not pass silently :-)

Given the buggy behavior, we have several options including just removing the implicit conversion and going back to bytes only.
msg124768 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-12-28 08:00
>> At this point a feature change seems unlikely,

Amaury, it is not too late to fix anything that's broken.  New features are out, but we are free to fix anything hosed this badly.
msg124769 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-12-28 08:24
But there are probably working usages with unicode strings out there. For example, I've seen code like struct.pack('<6sHHBBB', 'GIF87a', ...)
Do you suggest to make this 3.1 code stop working in 3.2?

In any case, the 'c' format should probably be changed as well.
msg124770 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-12-28 09:07
Amaury> At this point a feature change seems unlikely,
Amaury> but it's not too late to emit a DeprecationWarning.

I prefer to break the API today than having to maintain a broken API for
10 or 20 years :-)  And we have a very small user base using Python 3,
it's easier to change it now, than in the next release.

Amaury> But there are probably working usages with unicode strings out
Amaury> there. For example, I've seen code like
Amaury> struct.pack('<6sHHBBB','GIF87a', ...)
Amaury> Do you suggest to make this 3.1 code stop working in 3.2?

Yes. As I wrote in my previous message, this can be changed to
struct.pack('<6sHHBBB', b'GIF87a', ...), which works on 3.1 and 3.2
(patched).
msg124779 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-12-28 10:25
I agree this automatic conversion is broken and should be fixed.  Not sure if emitting a DeprecationWarning now and fixing it 18 months later is the right thing, especially since DeprecationWarnings are now silent.
As Victor says, the incompatibility is explicit, and the fix is simple and fully backwards compatible, while the current behavior will cause nasty intermittent surprises.
msg124789 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-12-28 12:09
Since the Release Manager agrees with the change, I withdraw my objection.

I have three remarks to the patch: 

- Some examples in the documentation should be fixed:
http://docs.python.org/dev/py3k/library/struct.html#examples
>>> pack('ci', '*', 0x12131415)
will now raise an exception.

- the message "argument for 's' must be a bytes" looks a bit weird. "a bytes object" seems better.

- the 'p' format (Pascal String) should be changed as well. This is the last call to _PyUnicode_AsDefaultEncodedString() in this file...
msg124791 - (view) Author: David Beazley (dabeaz) Date: 2010-12-28 13:16
As a user of Python 3, I would like echo Victor's comment about fixing the API right now as opposed to having to deal with it later.  I can only speak for myself, but I would guess that anyone using Python 3 already understands that it's bleeding edge and that the bytes/strings distinction is really important.  If fixing this breaks some third party libraries, I say good--they shouldn't have been blindly passing Unicode into struct in the first place.  Better to deal with it now when the number of users is relatively small.
msg124792 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-12-28 13:27
Fixed by r87537. Thanks Amaury for your review!

I also removed some ugly (implicit) conversions from test_struct.
msg124793 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-12-28 13:32
Thanks, Victor!
msg124794 - (view) Author: David Beazley (dabeaz) Date: 2010-12-28 13:40
Thanks everyone for looking at this!
History
Date User Action Args
2010-12-28 13:40:18dabeazsetnosy: georg.brandl, rhettinger, amaury.forgeotdarc, mark.dickinson, haypo, r.david.murray, dabeaz
messages: + msg124794
2010-12-28 13:32:45georg.brandlsetnosy: georg.brandl, rhettinger, amaury.forgeotdarc, mark.dickinson, haypo, r.david.murray, dabeaz
messages: + msg124793
2010-12-28 13:27:48hayposetstatus: open -> closed

messages: + msg124792
resolution: fixed
nosy: georg.brandl, rhettinger, amaury.forgeotdarc, mark.dickinson, haypo, r.david.murray, dabeaz
2010-12-28 13:16:00dabeazsetnosy: georg.brandl, rhettinger, amaury.forgeotdarc, mark.dickinson, haypo, r.david.murray, dabeaz
messages: + msg124791
2010-12-28 12:09:57amaury.forgeotdarcsetnosy: georg.brandl, rhettinger, amaury.forgeotdarc, mark.dickinson, haypo, r.david.murray, dabeaz
messages: + msg124789
2010-12-28 10:25:45georg.brandlsetnosy: + georg.brandl
messages: + msg124779
2010-12-28 09:07:11hayposetnosy: rhettinger, amaury.forgeotdarc, mark.dickinson, haypo, r.david.murray, dabeaz
messages: + msg124770
2010-12-28 08:24:05amaury.forgeotdarcsetnosy: rhettinger, amaury.forgeotdarc, mark.dickinson, haypo, r.david.murray, dabeaz
messages: + msg124769
2010-12-28 08:00:12rhettingersetnosy: rhettinger, amaury.forgeotdarc, mark.dickinson, haypo, r.david.murray, dabeaz
messages: + msg124768
2010-12-28 07:56:13rhettingersetnosy: rhettinger, amaury.forgeotdarc, mark.dickinson, haypo, r.david.murray, dabeaz
messages: + msg124767
2010-12-28 07:41:47amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg124765
2010-12-28 03:47:40r.david.murraysetnosy: rhettinger, mark.dickinson, haypo, r.david.murray, dabeaz
messages: + msg124758
2010-12-28 01:14:44hayposetstatus: closed -> open
nosy: rhettinger, mark.dickinson, haypo, r.david.murray, dabeaz
resolution: not a bug -> (no value)
2010-12-28 01:13:37hayposetfiles: + struct.patch

nosy: + haypo
messages: + msg124748

keywords: + patch
2010-12-28 00:47:20dabeazsetnosy: rhettinger, mark.dickinson, r.david.murray, dabeaz
messages: + msg124745
2010-12-28 00:21:08dabeazsetnosy: rhettinger, mark.dickinson, r.david.murray, dabeaz
messages: + msg124743
2010-12-28 00:13:59rhettingersetnosy: rhettinger, mark.dickinson, r.david.murray, dabeaz
messages: + msg124742
2010-12-28 00:03:36rhettingersetnosy: rhettinger, mark.dickinson, r.david.murray, dabeaz
messages: + msg124741
2010-12-27 23:31:59dabeazsetnosy: rhettinger, mark.dickinson, r.david.murray, dabeaz
messages: + msg124740
2010-12-27 22:55:54rhettingersetnosy: + rhettinger
messages: + msg124739
2010-12-27 20:18:47r.david.murraysetstatus: open -> closed

nosy: + mark.dickinson, r.david.murray
messages: + msg124733

resolution: not a bug
stage: committed/rejected
2010-12-27 20:16:59dabeazsetmessages: + msg124732
2010-12-27 20:11:39dabeazsetmessages: + msg124730
2010-12-27 20:09:50dabeazcreate