Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

struct.pack() and Unicode strings #54992

Closed
dabeaz mannequin opened this issue Dec 27, 2010 · 23 comments
Closed

struct.pack() and Unicode strings #54992

dabeaz mannequin opened this issue Dec 27, 2010 · 23 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@dabeaz
Copy link
Mannequin

dabeaz mannequin commented Dec 27, 2010

BPO 10783
Nosy @birkenfeld, @rhettinger, @amauryfa, @mdickinson, @vstinner, @bitdancer
Files
  • struct.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2010-12-28.13:27:48.589>
    created_at = <Date 2010-12-27.20:09:50.712>
    labels = ['type-bug', 'library']
    title = 'struct.pack() and Unicode strings'
    updated_at = <Date 2010-12-28.13:40:18.588>
    user = 'https://bugs.python.org/dabeaz'

    bugs.python.org fields:

    activity = <Date 2010-12-28.13:40:18.588>
    actor = 'dabeaz'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-12-28.13:27:48.589>
    closer = 'vstinner'
    components = ['Library (Lib)']
    creation = <Date 2010-12-27.20:09:50.712>
    creator = 'dabeaz'
    dependencies = []
    files = ['20175']
    hgrepos = []
    issue_num = 10783
    keywords = ['patch']
    message_count = 23.0
    messages = ['124727', '124730', '124732', '124733', '124739', '124740', '124741', '124742', '124743', '124745', '124748', '124758', '124765', '124767', '124768', '124769', '124770', '124779', '124789', '124791', '124792', '124793', '124794']
    nosy_count = 7.0
    nosy_names = ['georg.brandl', 'rhettinger', 'amaury.forgeotdarc', 'mark.dickinson', 'vstinner', 'r.david.murray', 'dabeaz']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue10783'
    versions = ['Python 3.2']

    @dabeaz
    Copy link
    Mannequin Author

    dabeaz mannequin commented Dec 27, 2010

    Is the struct.pack() function supposed to automatically encode Unicode strings into binary? For example:

    >>> struct.pack("10s","Jalape\u00f1o")
    b'Jalape\xc3\xb1o\x00'
    >>>

    This is Python 3.2b1.

    @dabeaz dabeaz mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Dec 27, 2010
    @dabeaz
    Copy link
    Mannequin Author

    dabeaz mannequin commented Dec 27, 2010

    Note: This is what happens in Python 2.6.4:

    >>> import struct
    >>> struct.pack("10s",u"Jalape\u00f1o")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    struct.error: argument for 's' must be a string
    >>>

    @dabeaz
    Copy link
    Mannequin Author

    dabeaz mannequin commented Dec 27, 2010

    Hmmm. Well, the docs seem to say that it's allowed and that it will be encoded as UTF-8.

    Given the treatment of Unicode/bytes elsewhere in Python 3, all I can say is that this behavior is rather surprising.

    @bitdancer
    Copy link
    Member

    But clearly intentional, and now enshrined in released code.

    @rhettinger
    Copy link
    Contributor

    Can we at least offer an optional choice of encoding?

    @dabeaz
    Copy link
    Mannequin Author

    dabeaz mannequin commented Dec 27, 2010

    Why is it even encoding at all? Almost every other part of Python 3 forces you to be explicit about bytes/string conversion. For example:

    struct.pack("10s", x.encode('utf-8'))

    Given that automatic conversion is documented, it's not clear what can be done at this point. However, there are very few other parts of Python 3 that perform implicit string-byte conversions like this (at least that I know of off-hand).

    @rhettinger
    Copy link
    Contributor

    Many of these kind of "decisions" were made quickly, haphazardly, and with almost no discussion and were made by contributors who were new to Python core development (no familiar with the API norms).

    Given the rats nest of bytes/text problems in Py3.0 and Py3.1, I think it is fair game to fix it now. The APIs have not been shaken-out and battle-tested through wide-spread adoption, so it was fair to expect that the first experienced user to come along would find these rough patches.

    ISTM, this should get fixed. The most innocuous way to do it is to add a warning for the implicit conversion. That way, any existing 3.x code (probably precious little) would continue to run. Another option is to just finish the job by adding an encoding parameter that defaults to utf-8.

    @rhettinger
    Copy link
    Contributor

    A possible answer to "why is this encoding at all" was probably to make it easier to transition code from python 2.x where strings were usually ascii and it would make no difference in output if encoded in utf-8. The 2-to-3 fixer was good at handling name changes but not bytes/text issues. That is just a guess at what the developer may have been thinking.

    @dabeaz
    Copy link
    Mannequin Author

    dabeaz mannequin commented Dec 28, 2010

    I encountered this issue is in the context of distributed computing/interprocess communication involving binary-encoded records (and encoding/decoding such records using struct). At its core, this is all about I/O--something where encodings and decoding matter a lot. Frankly, it was quite surprising that a unicode string would silently pass through struct and turn into bytes. IMHO, the fact that this is even possible encourages a sloppy usage of struct that favors programming convenience over correctness--something that's only going to end badly for the poor soul who passes non-ASCII characters into struct without knowing it.

    A default encoding might be okay as long as it was set to something like ASCII or Latin-1 (not UTF-8). At least then you'd get an encoding error for characters that don't fit into a byte.

    @dabeaz
    Copy link
    Mannequin Author

    dabeaz mannequin commented Dec 28, 2010

    Actually, here's another one of my favorite examples:

    >>> import struct
    >>> struct.pack("s","\xf1")
    b'\xc3'
    >>> 

    Not only does this not encode the correct value, it doesn't even encode the entire UTF-8 encoding (just the first byte of it). Like I said, pity the poor bastard who puts something that in their code and they spend the whole day trying figure out where in the hell '\xf1' magically got turned into '\xc3'.

    @vstinner
    Copy link
    Member

    This "feature" was introduced in a big commit from Guido van Rossum (made before Python 3.0): r55500. The changelog is strange because it starts with "Make test_zipfile pass. The zipfile module now does all I/O in binary mode using bytes." but ends with "The _struct needed a patch to support bytes, str8 and str for the 's' and 'p' formats.". Why was _struct patched at the same time?

    Implicit conversion bytes and str is a very bad idea, it is the root of all confusion related to Unicode. The experience with Python 2 demonstrated that it should be changed, and it was changed in Python 3.0. But "Python 3.0" is a big project, it has many modules. Some modules were completly broken in Python 3.0, it works better with 3.1, and we hope that it will be even better with 3.2.

    Attached patch removes the implicit conversion for 'c', 's' and 'p' formats. I did a similar change in ctypes, 5 months ago: issue bpo-8966.

    If a program written for Python 3.1 fails because of the patch, it can use explicit conversion to stay compatible with 3.1 and 3.2 (patched). I think that it's better to use explicit conversion.

    Implicit conversion on 'c' format is really weird and it was not documented correctly: the note (1) is attached to "b" format, not to the "c" format. Example:

       >>> struct.pack('c', 'é')
       struct.error: char format requires bytes or string of length 1
       >>> len('é')
       1

    There is also a length issue with the s format: struct.pack() truncates unicode string to a length in bytes, not in character, it is confusiong.

      >>> struct.pack('2s', 'ha')
       b'ha'
       >>> struct.pack('2s', 'hé')
       b'h\xc3'
       >>> struct.pack('3s', 'hé')
       b'h\xc3\xa9'

    Finally, I don't like implicit conversion from unicode to bytes on pack, because it's not symmetrical.

       >>> struct.pack('3s', 'hé')
       b'h\xc3\xa9'
       >>> struct.unpack('3s', b'h\xc3\xa9')
       (b'h\xc3\xa9',)

    (str -> pack() -> unpack() -> bytes)

    @vstinner vstinner reopened this Dec 28, 2010
    @vstinner vstinner removed the invalid label Dec 28, 2010
    @bitdancer
    Copy link
    Member

    >>> struct.pack('2s', 'ha')
       b'ha'
       >>> struct.pack('2s', 'hé')
       b'h\xc3'
       >>> struct.pack('3s', 'hé')
       b'h\xc3\xa9'

    That looks like a *buggy* api to me, too. I don't see how we can let that stand.

    @amauryfa
    Copy link
    Member

    At this point a feature change seems unlikely, but it's not too late to emit a DeprecationWarning.

    @rhettinger
    Copy link
    Contributor

    Errors should not pass silently :-)

    Given the buggy behavior, we have several options including just removing the implicit conversion and going back to bytes only.

    @rhettinger
    Copy link
    Contributor

    > At this point a feature change seems unlikely,

    Amaury, it is not too late to fix anything that's broken. New features are out, but we are free to fix anything hosed this badly.

    @amauryfa
    Copy link
    Member

    But there are probably working usages with unicode strings out there. For example, I've seen code like struct.pack('<6sHHBBB', 'GIF87a', ...)
    Do you suggest to make this 3.1 code stop working in 3.2?

    In any case, the 'c' format should probably be changed as well.

    @vstinner
    Copy link
    Member

    Amaury> At this point a feature change seems unlikely,
    Amaury> but it's not too late to emit a DeprecationWarning.

    I prefer to break the API today than having to maintain a broken API for
    10 or 20 years :-) And we have a very small user base using Python 3,
    it's easier to change it now, than in the next release.

    Amaury> But there are probably working usages with unicode strings out
    Amaury> there. For example, I've seen code like
    Amaury> struct.pack('<6sHHBBB','GIF87a', ...)
    Amaury> Do you suggest to make this 3.1 code stop working in 3.2?

    Yes. As I wrote in my previous message, this can be changed to
    struct.pack('<6sHHBBB', b'GIF87a', ...), which works on 3.1 and 3.2
    (patched).

    @birkenfeld
    Copy link
    Member

    I agree this automatic conversion is broken and should be fixed. Not sure if emitting a DeprecationWarning now and fixing it 18 months later is the right thing, especially since DeprecationWarnings are now silent.
    As Victor says, the incompatibility is explicit, and the fix is simple and fully backwards compatible, while the current behavior will cause nasty intermittent surprises.

    @amauryfa
    Copy link
    Member

    Since the Release Manager agrees with the change, I withdraw my objection.

    I have three remarks to the patch:

    - Some examples in the documentation should be fixed:
    http://docs.python.org/dev/py3k/library/struct.html#examples
    >>> pack('ci', '*', 0x12131415)
    will now raise an exception.
    • the message "argument for 's' must be a bytes" looks a bit weird. "a bytes object" seems better.

    • the 'p' format (Pascal String) should be changed as well. This is the last call to _PyUnicode_AsDefaultEncodedString() in this file...

    @dabeaz
    Copy link
    Mannequin Author

    dabeaz mannequin commented Dec 28, 2010

    As a user of Python 3, I would like echo Victor's comment about fixing the API right now as opposed to having to deal with it later. I can only speak for myself, but I would guess that anyone using Python 3 already understands that it's bleeding edge and that the bytes/strings distinction is really important. If fixing this breaks some third party libraries, I say good--they shouldn't have been blindly passing Unicode into struct in the first place. Better to deal with it now when the number of users is relatively small.

    @vstinner
    Copy link
    Member

    Fixed by r87537. Thanks Amaury for your review!

    I also removed some ugly (implicit) conversions from test_struct.

    @birkenfeld
    Copy link
    Member

    Thanks, Victor!

    @dabeaz
    Copy link
    Mannequin Author

    dabeaz mannequin commented Dec 28, 2010

    Thanks everyone for looking at this!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants