Message 124748 - Python tracker



Author	vstinner
Recipients	dabeaz, mark.dickinson, r.david.murray, rhettinger, vstinner
Date	2010-12-28.01:13:36
Message-id	<1293498819.06.0.567356166964.issue10783@psf.upfronthosting.co.za>
In-reply-to

Content

This "feature" was introduced in a big commit by Guido van Rossum (made before Python 3.0): r55500. The changelog is strange because it starts with "Make test_zipfile pass. The zipfile module now does all I/O in binary mode using bytes." but ends with "The _struct needed a patch to support bytes, str8 and str for the 's' and 'p' formats." Why was _struct patched at the same time?

Implicit conversion between bytes and str is a very bad idea: it is the root of all confusion related to Unicode. The experience with Python 2 demonstrated that it should be changed, and it was changed in Python 3.0. But "Python 3.0" is a big project with many modules. Some modules were completely broken in Python 3.0; they work better in 3.1, and we hope that they will be even better in 3.2.

The attached patch removes the implicit conversion for the 'c', 's' and 'p' formats. I made a similar change in ctypes 5 months ago: issue #8966.

If a program written for Python 3.1 fails because of the patch, it can use explicit conversion to stay compatible with 3.1 and 3.2 (patched). I think that it's better to use explicit conversion.
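
A minimal sketch of what such an explicit conversion could look like. The helper name `pack_name` and the default field size are hypothetical, and UTF-8 is assumed only because it matches the bytes shown in the examples in this message:

```python
import struct

def pack_name(name, size=8):
    """Hypothetical helper: encode the str explicitly, then pack the bytes."""
    data = name.encode("utf-8")           # explicit str -> bytes conversion
    return struct.pack("%ds" % size, data)  # 's' pads with NUL bytes up to size

# "h\u00e9" is 'hé' written with an escape so the source stays ASCII.
packed = pack_name("h\u00e9")
print(packed)
```

Because the caller passes bytes to struct.pack(), this style should behave the same whether or not the implicit conversion is present.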

Implicit conversion for the 'c' format is really weird, and it was not documented correctly: note (1) is attached to the "b" format, not to the "c" format. Example:

   >>> struct.pack('c', 'é')
   struct.error: char format requires bytes or string of length 1
   >>> len('é')
   1
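
The error above is easier to understand once the encoding step is made explicit. The following sketch assumes UTF-8 (the encoding visible in the other examples here) and uses Latin-1 purely as an illustration of a one-byte encoding:

```python
import struct

one_char = "\u00e9"                    # 'é': one character
utf8 = one_char.encode("utf-8")        # two bytes: b'\xc3\xa9'

# The 'c' format holds exactly one byte, so the two-byte encoding is rejected.
try:
    struct.pack("c", utf8)
    fits_in_c = True
except struct.error:
    fits_in_c = False

# With a one-byte encoding the same character does fit.
latin1_packed = struct.pack("c", one_char.encode("latin-1"))

print(len(one_char), len(utf8), fits_in_c, latin1_packed)
```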

There is also a length issue with the 's' format: struct.pack() truncates a Unicode string to a length in bytes, not in characters, which is confusing.

   >>> struct.pack('2s', 'ha')
   b'ha'
   >>> struct.pack('2s', 'hé')
   b'h\xc3'
   >>> struct.pack('3s', 'hé')
   b'h\xc3\xa9'
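
As an illustration of the byte-versus-character mismatch, here is a small sketch that encodes explicitly (UTF-8 assumed) and sizes the field from the encoded length. The variable names are illustrative only:

```python
import struct

text = "h\u00e9"                 # 'hé'
encoded = text.encode("utf-8")

char_count = len(text)           # length in characters
byte_count = len(encoded)        # length in bytes, which is what 's' counts

# A field sized by character count cuts the multi-byte sequence in half.
truncated = struct.pack("%ds" % char_count, encoded)
try:
    truncated.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

# A field sized by byte count keeps the whole string.
full = struct.pack("%ds" % byte_count, encoded)

print(char_count, byte_count, truncated, decodes_cleanly, full)
```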

Finally, I don't like the implicit conversion from Unicode to bytes on pack, because it's not symmetrical: unpack() does not convert back to str.

   >>> struct.pack('3s', 'hé')
   b'h\xc3\xa9'
   >>> struct.unpack('3s', b'h\xc3\xa9')
   (b'h\xc3\xa9',)

(str -> pack() -> unpack() -> bytes)
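
With explicit conversion on both sides, the round trip becomes symmetrical. A minimal sketch, again assuming UTF-8 as the agreed encoding:

```python
import struct

original = "h\u00e9"                                   # 'hé'
packed = struct.pack("3s", original.encode("utf-8"))   # str -> bytes -> pack()
(raw,) = struct.unpack("3s", packed)                   # unpack() yields bytes
restored = raw.decode("utf-8")                         # bytes -> str

print(type(raw).__name__, restored == original)
```

(str -> encode() -> pack() -> unpack() -> decode() -> str)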

History
Date	User	Action	Args
2010-12-28 01:13:39	vstinner	set	recipients: + vstinner, rhettinger, mark.dickinson, r.david.murray, dabeaz
2010-12-28 01:13:39	vstinner	set	messageid: <1293498819.06.0.567356166964.issue10783@psf.upfronthosting.co.za>
2010-12-28 01:13:37	vstinner	link	issue10783 messages
2010-12-28 01:13:36	vstinner	create