Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting a buffer from a Unicode array uses invalid format #57281

Closed
vstinner opened this issue Sep 30, 2011 · 45 comments
Closed

Getting a buffer from a Unicode array uses invalid format #57281

vstinner opened this issue Sep 30, 2011 · 45 comments
Assignees
Labels
release-blocker stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@vstinner
Copy link
Member

BPO 13072
Nosy @loewis, @birkenfeld, @mdickinson, @ncoghlan, @pitrou, @vstinner, @skrah, @meadori
Files
  • array_revert_pep393.patch
  • array_revert_pep393-2.patch
  • array_unicode_format.patch
  • array_deprecate_u.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/vstinner'
    closed_at = <Date 2012-08-24.18:22:37.401>
    created_at = <Date 2011-09-30.00:09:50.065>
    labels = ['type-bug', 'library', 'release-blocker']
    title = 'Getting a buffer from a Unicode array uses invalid format'
    updated_at = <Date 2012-08-24.18:22:37.400>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2012-08-24.18:22:37.400>
    actor = 'skrah'
    assignee = 'vstinner'
    closed = True
    closed_date = <Date 2012-08-24.18:22:37.401>
    closer = 'skrah'
    components = ['Library (Lib)']
    creation = <Date 2011-09-30.00:09:50.065>
    creator = 'vstinner'
    dependencies = []
    files = ['26646', '26647', '26704', '26892']
    hgrepos = []
    issue_num = 13072
    keywords = ['patch']
    message_count = 45.0
    messages = ['144658', '144812', '144814', '144817', '144818', '158381', '158892', '167091', '167109', '167112', '167119', '167122', '167165', '167173', '167520', '167521', '167522', '167540', '167545', '167546', '167547', '167549', '167551', '167561', '167566', '167571', '167673', '167702', '167703', '167708', '167732', '167936', '167947', '167997', '168005', '168369', '168373', '168558', '168561', '168567', '168571', '168575', '169026', '169063', '169065']
    nosy_count = 10.0
    nosy_names = ['loewis', 'georg.brandl', 'mark.dickinson', 'ncoghlan', 'pitrou', 'vstinner', 'Arfrever', 'skrah', 'meador.inge', 'python-dev']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue13072'
    versions = ['Python 3.3']

    @vstinner
    Copy link
    Member Author

    In Python 3.2, when you get a buffer from array.array('u'), "u" is used as buffer format. The format is supposed to be a format from the struct module, and "u" is an invalid struct format. "w" is used on wide mode.

    I just upgraded the array module to use the new Unicode API (PEP-393). The array now uses a Py_UCS4 buffer. I used "I" or "L" format depending on the size of int and long C types.

    It would be better to use a format for a Py_UCS4 string, but struct doesn't support such type.

    For Python 2.7 and 3.2, I don't know if it is really a bug or not.

    @vstinner vstinner added the stdlib Python modules in the Lib dir label Sep 30, 2011
    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Oct 3, 2011

    The automatic conversion of 'u' to 'I' or 'L' causes test_buffer
    (PEP-3118 repo) to fail:

    # Not implemented formats. Ugly, but inevitable. This is the same as
    # issue python/cpython#46783: equality is also used for membership testing and must
    # return a result.
    a = array.array('u', 'xyz')
    v = memoryview(a)
    self.assertNotEqual(v, a)
    self.assertNotEqual(a, v)

    I don't have a better idea though what to do about 'u' except
    officially implementing it for struct and memoryview as well.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Oct 3, 2011

    It would be better to use a format for a Py_UCS4 string, but struct doesn't support such type.

    PEP-3118 suggests for the extended struct syntax:

    'c' -> ucs-1 (latin-1) encoding
    'u' -> ucs-2
    'w' -> ucs-4

    @vstinner
    Copy link
    Member Author

    vstinner commented Oct 3, 2011

    The automatic conversion of 'u' to 'I' or 'L' causes test_buffer
    (PEP-3118 repo) to fail:

    Not implemented formats. Ugly, but inevitable. This is the same as

    issue bpo-2531: equality is also used for membership testing and must

    return a result.

    a = array.array('u', 'xyz')
    v = memoryview(a)
    self.assertNotEqual(v, a)
    self.assertNotEqual(a, v)

    I don't understand: a buffer format is a format for the struct module,
    or for the array module?

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Oct 3, 2011

    STINNER Victor <report@bugs.python.org> wrote:

    > # Not implemented formats. Ugly, but inevitable. This is the same as
    > # issue bpo-2531: equality is also used for membership testing and must
    > # return a result.
    > a = array.array('u', 'xyz')
    > v = memoryview(a)
    > self.assertNotEqual(v, a)
    > self.assertNotEqual(a, v)

    I don't understand: a buffer format is a format for the struct module,
    or for the array module?

    It's like this: memoryview follows the current struct syntax, which
    doesn't have 'u'. memory_richcompare() does not understand 'u', but
    is required to return something for __eq__ and __ne__, so it returns
    'not equal'.

    This isn't so important, since I discovered (see my later post)
    that 'u' and 'w' were scheduled for inclusion in the struct
    module anyway.

    So I think we should focus on whether the proposed 'c', 'u' and 'w'
    format specifiers still make sense after the PEP-393 changes.

    @vstinner
    Copy link
    Member Author

    @Stefan: What is the status of this issue?

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Apr 20, 2012

    I'm not sure what to do. Martin's opinion was that the change should
    be reverted:

    http://mail.python.org/pipermail/python-dev/2012-March/117390.html

    @vstinner
    Copy link
    Member Author

    vstinner commented Aug 1, 2012

    Should we do something before Python 3.3 final?

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Aug 1, 2012

    Is it possible without too much effort to keep the old behavior
    ('u' -> Py_UNICODE)? Then I'd say that should go into 3.3.

    The problem with the current behavior is that it's neither backwards
    compatible nor PEP-3118 compliant.

    If it is too much work to restore the status quo, we could leave this
    change with the excuse that 'u' is probably not used very often.

    @vstinner
    Copy link
    Member Author

    vstinner commented Aug 1, 2012

    Here is a patch reverting changes of the PEP-393, as suggested by Martin von Loewis. With the patch, array uses Py_UNICODE* type for the 'u' format. So array.array('u', '\u0010ffff')[0] should return '\uDBFF' on Windows.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Aug 1, 2012

    The diff between b9558df8cc58 and default with array_revert_pep393.patch
    applied is small, but I noticed that in some places you switched back to
    Py_UNICODE typecode and in others not. For instance, in struct arraydescr
    typecode is still char.

    I'm not sure why typecode was originally Py_UNICODE though.

    @vstinner
    Copy link
    Member Author

    vstinner commented Aug 1, 2012

    The diff between b9558df8cc58 and default with array_revert_pep393.patch
    applied is small, but I noticed that in some places you switched back to
    Py_UNICODE typecode and in others not.

    I just copied code from Python 3.2, I forgot to update typecode type
    (Py_UNICODE => char). I attach a new patch which changes also the
    documentation of the "u" format.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Aug 1, 2012

    array_revert_pep393-2.patch looks good (checked against 7042a83f37e
    and all following commits that should be kept).

    @vstinner
    Copy link
    Member Author

    vstinner commented Aug 1, 2012

    @georg: are you ok with this change? It reverts the behaviour of Python 3.2 and avoids to have to maintain an API that nobody wants to use ('u' format using Py_UCS4, 32 bits unsigned).

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Aug 5, 2012

    New changeset 95da47ddebe0 by Victor Stinner in branch 'default':
    Close bpo-13072: Restore code before the PEP-393 for the array module
    http://hg.python.org/cpython/rev/95da47ddebe0

    @python-dev python-dev mannequin closed this as completed Aug 5, 2012
    @vstinner
    Copy link
    Member Author

    vstinner commented Aug 5, 2012

    Oops, the initial issue is not solved. Attached fixes the array == memoryview issue by using a valid format for the buffer.

    @vstinner vstinner reopened this Aug 5, 2012
    @vstinner
    Copy link
    Member Author

    vstinner commented Aug 5, 2012

    Hum, this issue is a regression from Python 3.2. I would like to see it fixed in Python 3.3. Example:

    Python 3.2.3+ (3.2:243ad1a6f638+, Aug  4 2012, 01:36:41) 
    [GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux2
    >>> import array
    >>> a=array.array('u', 'xyz')
    >>> b=memoryview(a)
    >>> a == b
    True
    >>> b == a
    True

    @birkenfeld
    Copy link
    Member

    Victor: the revert commit brought back "Python's Unicode character type" into the docs. This needs to be fixed to say "legacy" somewhere, as the characters in a normal Unicode string are not of that type anymore.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Aug 6, 2012

    STINNER Victor <report@bugs.python.org> wrote:
    > Hum, this issue is a regression from Python 3.2.
    > 
    > Python 3.2.3+ (3.2:243ad1a6f638+, Aug  4 2012, 01:36:41) 
    > [GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux2
    > >>> import array
    > >>> a=array.array('u', 'xyz')
    > >>> b=memoryview(a)
    > >>> a == b
    > True
    > >>> b == a
    > True

    [3.3 returns False]

    That's actually deliberate. The new memoryview does not consider arrays equal
    if the format codes do not match, to avoid situations like (32-bit Python):

    Python 3.2a0 (py3k:76143M, Nov  7 2009, 17:05:38) 
    [GCC 4.2.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import array
    >>> a = array.array('f', [0])
    >>> b = array.array('i', [0])
    >>> x = memoryview(a)
    >>> y = memoryview(b)
    >>> 
    >>> a == b
    True
    >>> x == y
    True
    >>> 

    I think that (for buffers at least) an array of float should not compare
    equal to an array of int, especially since the 3.2 memoryview uses memcmp()
    in richcompare().

    See also the comment in the documentation for memoryview.format:

    http://docs.python.org/dev/library/stdtypes.html#memoryview-type

    memoryview is not aware of the 'u' format code, since it's not part of
    the struct syntax and the PEP-3118 proposition 'u' -> UCS2, 'w' -> UCS4
    wasn't considered useful.

    Now in your example I see that array's getbufferproc actually already uses
    'w' for UCS4. It would still be an option to make memoryview aware of
    'u' and 'w' (as suggested by the PEP).

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Aug 6, 2012

    Also, it was suggested that 'u' should be deprecated:

    http://mail.python.org/pipermail/python-dev/2012-March/117392.html

    Personally, I don't have an opinion on that; I don't use the 'u'
    format code.

    Nick, could you have a look at msg167545 and see if any action
    should be taken?

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Aug 6, 2012

    Of course, if two formats *are* the same, it is possible to use
    memcmp(). I'll work on a patch.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Aug 6, 2012

    Perhaps if memoryview doesn't understand the format code, it can fall back on memcmp() if strcmp() indicates the format codes are the same?

    Otherwise we're at risk of breaking backwards compatibility with more than just array('u').

    Also, if it isn't already, the change to take format codes into a account in memoryview comparisons should be mentioned in the What's New porting section.

    @vstinner
    Copy link
    Member Author

    vstinner commented Aug 6, 2012

    memoryview is not aware of the 'u' format code, since it's not part of
    the struct syntax and the PEP-3118 proposition 'u' -> UCS2, 'w' -> UCS4
    wasn't considered useful.

    Did you see attached patch array_unicode_format.patch? It uses struct
    format "H" or "I" depending on the size of wchar_t.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Aug 6, 2012

    Did you see attached patch array_unicode_format.patch? It uses struct
    format "H" or "I" depending on the size of wchar_t.

    I totally overlooked that. Given that memoryview can be fixed to
    compare buffers with unknown formats, I don't have a strong opinion
    on whether array's getbufferproc should alter the format codes of 'u'
    and 'w' or not.

    The only advantage for memoryview would be that tolist() etc.
    would work. However, tolist() previously only worked for bytes,
    so in this case raising an exception for 'u' and 'w' is not a
    regression but an improvement. :)

    If we're deprecating 'u' and 'w' anyway, the getbufferproc should
    probably continue to return 'u' and 'w' until the removal of these
    format codes.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Aug 6, 2012

    I think Victor's patch is a good solution to killing the 'u' and 'w' exports in 3.4, but we need to restore some tolerance for unknown format codes to memoryview in 3.3 regardless.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Aug 6, 2012

    I have a patch already for the unknown format codes in memoryview.
    Currently fighting (as usual) with the case explosions in the tests.
    I think I can have a full patch by next weekend.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 8, 2012

    Someone broke the Windows buildbots.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Aug 8, 2012

    New changeset e0f3406c43e4 by Victor Stinner in branch 'default':
    Issue bpo-13072: Fix test_array for Windows with 16-bit wchar_t
    http://hg.python.org/cpython/rev/e0f3406c43e4

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Aug 8, 2012

    New changeset 67a994d5657d by Victor Stinner in branch 'default':
    Issue bpo-13072: Ooops, now fix test_array for Linux with 32-bit wchar_t...
    http://hg.python.org/cpython/rev/67a994d5657d

    @pitrou
    Copy link
    Member

    pitrou commented Aug 8, 2012

    And the test fails on machines without ctypes :)

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Aug 8, 2012

    New changeset 4ee4cceda047 by Victor Stinner in branch 'default':
    Issue bpo-13072: Fix test_array for installation without the ctypes module
    http://hg.python.org/cpython/rev/4ee4cceda047

    @birkenfeld
    Copy link
    Member

    Deferring.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 11, 2012

    Is there anything that still needs to be done on this issue? ISTM that the code is correct as it stands (i.e. Getting a buffer now only uses valid format codes)

    @ncoghlan
    Copy link
    Contributor

    There's still work to be done. The current status in 3.3 trunk is that:

    Wide build:
    >>> memoryview(array("u")).format
    'w'
    
    Narrow build:
    >>> memoryview(array("u")).format
    'u'

    Neither of these are valid struct formats, thus they don't play nicely with the assumptions of memoryview (or any other PEP-3118 consumer). Stefan's memoryview changes are needed because there are *valid* struct formats that memoryview doesn't understand (yet), but it's only coincidental that they will reduce the severity of this problem.

    Victor's latest patch switches the 'w' and 'u' for the appropriate integer sizes 'I' and 'H' which I think is an excellent approach.

    There are also the post-reversion documentation changes Georg requested to bring the docs back into line with PEP-393

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 11, 2012

    Wide build:
    >>> memoryview(array("u")).format
    'w'

    Narrow build:
    >>> memoryview(array("u")).format
    'u'

    Neither of these are valid struct formats, thus they don't play
    nicely with the assumptions of memoryview (or any other PEP-3118
    consumer).

    Why do you say that? They have been added by PEP-3118 (and are
    just not implemented in the struct module yet).

    If you think that their mentioning in PEP-3118 is a mistake,
    and they should not get implemented in struct, we should
    a) get consensus on that interpretation of the PEP, and
    b) actually remove them from the PEP, since otherwise it
    is very confusing that they keep being mentioned.
    I believe that the addition of these codes was fully
    intended by the PEP author, and also part of its acceptance.

    If these codes are indeed meant to be in the struct module,
    this usage in the array module looks right to me - hence
    my proposal to close the issue (the documentation problem
    aside).

    I agree that it is then desirable that the memoryview object
    supports the codes. However, this is separate issue from this
    one (as the codes are not invalid, just unsupported).

    @ncoghlan
    Copy link
    Contributor

    Adding a link to bpo-15625, which is discussing the other end of this issue (whether or not memorview should support 'u' as a typecode).

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 16, 2012

    Based on the discussion in bpo-15625, it seems that the consensus is to take no action on the format codes in this issue for 3.3, and reconsider in 3.4, to determine in what way the struct module should support Unicode.

    Instead, the 'u' array code will be deprecated, in the same way in which the rest of the Py_UNICODE API is deprecated.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Aug 19, 2012

    If everyone agrees on deprecating 'u', here's a patch. I think
    that should be sufficient to close this issue (unless we absolutely
    need deprecation warnings).

    @pitrou
    Copy link
    Member

    pitrou commented Aug 19, 2012

    If everyone agrees on deprecating 'u', here's a patch. I think
    that should be sufficient to close this issue (unless we absolutely
    need deprecation warnings).

    I think a proper deprecation warning is preferable.

    @ncoghlan
    Copy link
    Contributor

    I guess the analogy with bytes objects is that UCS-2 code points can be
    handled as 16-bit integer objects.

    If we're going to do a programmatic deprecation now, that's the only
    alternative typecode currently available. Do we want to recommend that? Or
    do we want to postpone programmatic deprecation until we add a 2-byte code
    point type code for 3.4?

    @pitrou
    Copy link
    Member

    pitrou commented Aug 19, 2012

    I guess the analogy with bytes objects is that UCS-2 code points can be
    handled as 16-bit integer objects.

    If we're going to do a programmatic deprecation now, that's the only
    alternative typecode currently available. Do we want to recommend that? Or
    do we want to postpone programmatic deprecation until we add a 2-byte code
    point type code for 3.4?

    I don't understand. If you want to handle 16-bit integers, you already
    have the "h" and "H" type codes.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Aug 19, 2012

    Since actual removal is only scheduled for 4.0, I think user warnings
    can wait until 3.4.

    By then, we should have sorted out the struct format codes. In this
    scenario we would be sort of forced to use 'C', 'U' and 'W' as the
    new codes, while 'u' and 'w' would continue to linger in the array
    module for a while.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Aug 24, 2012

    Stefan, your patch array_deprecate_u.diff is fine. If you get to it, please also rephrase the clause "Python's unicode type"; not sure what the convention is to refer to Py_UNICODE now (perhaps "historical unicode type").

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Aug 24, 2012

    New changeset 9c7515e29219 by Stefan Krah in branch 'default':
    Issue bpo-13072: The array module's 'u' format code is now deprecated and
    http://hg.python.org/cpython/rev/9c7515e29219

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Aug 24, 2012

    Good, I think this can be closed then.

    @skrah skrah mannequin closed this as completed Aug 24, 2012
    @skrah skrah mannequin added the type-bug An unexpected behavior, bug, or error label Aug 24, 2012
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    release-blocker stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants