Support u and w codes in memoryview #59830

Closed
loewis mannequin opened this issue Aug 11, 2012 · 15 comments

Comments

@loewis
Mannequin

loewis mannequin commented Aug 11, 2012

BPO 15625
Nosy @loewis, @ronaldoussoren, @ncoghlan, @skrah, @wiggin15, @socketpair, @zooba
Files
  • uwcodes.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2015-04-15.20:03:15.886>
    created_at = <Date 2012-08-11.21:16:27.276>
    labels = []
    title = 'Support u and w codes in memoryview'
    updated_at = <Date 2016-05-25.12:04:08.486>
    user = 'https://github.com/loewis'

    bugs.python.org fields:

    activity = <Date 2016-05-25.12:04:08.486>
    actor = 'socketpair'
    assignee = 'none'
    closed = True
    closed_date = <Date 2015-04-15.20:03:15.886>
    closer = 'steve.dower'
    components = []
    creation = <Date 2012-08-11.21:16:27.276>
    creator = 'loewis'
    dependencies = []
    files = ['26769']
    hgrepos = []
    issue_num = 15625
    keywords = ['patch']
    message_count = 15.0
    messages = ['168009', '168313', '168318', '168345', '168365', '168368', '168370', '168371', '168372', '168375', '168377', '168380', '241148', '241150', '266335']
    nosy_count = 9.0
    nosy_names = ['loewis', 'teoliphant', 'ronaldoussoren', 'ncoghlan', 'Arfrever', 'skrah', 'wiggin15', 'socketpair', 'steve.dower']
    pr_nums = []
    priority = 'normal'
    resolution = 'out of date'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue15625'
    versions = ['Python 3.4']

    @loewis
    Mannequin Author

    loewis mannequin commented Aug 11, 2012

    Currently, the following test case fails:

    >>> import array
    >>> a=array.array('u', 'foo')
    >>> memoryview(a)==memoryview(a)
    False

    This is because the memoryview object doesn't support the u and w codes, as it should per PEP-3118. This patch fixes it.
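
A minimal sketch for inspecting what the buffer export of a 'u' array looks like on a given interpreter. The reported format code and item size depend on the platform and Python version, so no particular output is assumed here; the warning filter is only a precaution for newer versions where the 'u' typecode is deprecated.

```python
import array
import warnings

# The 'u' typecode is deprecated; silence a possible DeprecationWarning.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    a = array.array("u", "foo")

view = memoryview(a)

# Format code and item size that the array exports through the buffer API.
print("format:", view.format, "itemsize:", view.itemsize)

# Element access may fail if memoryview does not understand the format code.
try:
    print("first element:", view[0])
except NotImplementedError as exc:
    print("indexing failed:", exc)
```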

    @skrah
    Mannequin

    skrah mannequin commented Aug 15, 2012

    Nick's comment in msg167963 got me thinking. Indeed, in Numpy the 'U'
    specifier is similar to the struct module's 's' format code, only for
    UCS4. So I'm questioning whether the current semantics of 'u' and 'w'
    used by array.array were ever intended by the PEP authors:

    >>> import numpy
    >>> nd = numpy.array(["A", "B"], dtype='U')
    >>> nd
    array(['A', 'B'],
          dtype='<U1')
    >>> nd.tostring()
    b'A\x00\x00\x00B\x00\x00\x00'
    >>>
    >>> nd = numpy.array(["ABC", "D"], dtype='U')
    >>> nd
    array(['ABC', 'D'],
          dtype='<U3')
    >>> nd.tostring()
    b'A\x00\x00\x00B\x00\x00\x00C\x00\x00\x00D\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
    >>>

    Internally, in NumPy 'U' is always UCS4, and the data type is a fixed
    length string that has the length of the longest initializer element.
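
The fixed-width layout described above can be imitated with the standard library alone. This is only an illustration of the byte layout, not NumPy's implementation; `pack_fixed_ucs4` is a hypothetical helper name, and little-endian UCS4 is assumed to match the '<U' dtypes shown.

```python
def pack_fixed_ucs4(words):
    """Pack strings as fixed-width little-endian UCS4 fields (hypothetical helper)."""
    # Field width in code points is the length of the longest string.
    width = max(len(w) for w in words)
    # Each code point takes 4 bytes; shorter strings are padded with NUL bytes.
    return b"".join(
        w.encode("utf-32-le").ljust(4 * width, b"\x00") for w in words
    )

packed = pack_fixed_ucs4(["ABC", "D"])
print(len(packed))  # 2 fields * 3 code points * 4 bytes = 24
print(packed)
```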

    NumPy's use of 'U' seems vastly more useful for arrays than the behavior
    of array.array:

    >>> array.array('u', ['A', 'B'])
    array('u', 'AB')
    >>> array.array('u', ['ABC', 'D'])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: array item must be unicode character

    In NumPy, arrays of words are possible; with array.array they are not.

    An additional thought: The convention in the struct module is to use
    uppercase for unsigned types. So it would be a possibility to use
    'C', 'U' and 'W', where '3C' would denote the same as '3s', except
    for UCS1 instead of bytes.

    @loewis
    Mannequin Author

    loewis mannequin commented Aug 15, 2012

    Travis: can you please comment on the intended semantics of the 'u' and 'w' specifiers in PEP-3118? More specifically:

    • "an array/memoryview with format 'u' can support exactly one-character values (i.e. unicode strings of length 1)": true or false?
    • "in a struct, an element of type 'u' will use up two bytes exactly (ignoring padding)": true or false?

    @ncoghlan
    Contributor

    I admit that the main thing that bothers me with the proposal in PEP-3118 is the inconsistency between c -> bytes and u, w -> str.

    This was less of an issue in 2.x (which was the main frame of reference when the PEP was written), with implicit str/unicode interoperability, but seems quite jarring in the 3.x world.

    Status quo:
    struct module: 'c' = individual bytes, 's' = multi-byte sequence
    array module: 'u' typecode may be either 2 bytes or 4 bytes (Py_UNICODE) (the addition of the 'w' typecode has been reverted)
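
The struct half of this status quo can be checked directly. A short sketch using only existing struct format codes; the proposed 'C', 'U' and 'W' codes are not part of the struct module and are not used here.

```python
import struct

data = b"abc"

# 'c' treats each byte as a separate one-byte item.
as_chars = struct.unpack("3c", data)

# 's' treats the same three bytes as a single bytes object.
as_string = struct.unpack("3s", data)

print(as_chars)   # three separate items
print(as_string)  # one item
```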

    My current inclination is still to apply Victor's patch from bpo-13072 (which changes array to export the appropriate integer typecodes for 'u' arrays) and otherwise punt on this for 3.3 and try to sort out the mess for 3.4.

    For 3.4, I'm inclined to favour Stefan's proposal of C, U, W mapping to multi-point sequences of UCS-1, UCS-2, UCS-4 code points (with corresponding typecodes in the array module).

    Support for lowercase 'u' would then never become an official part of the buffer API, existing only as an array typecode.

    @loewis
    Mannequin Author

    loewis mannequin commented Aug 16, 2012

    > My current inclination is still to apply Victor's patch from bpo-13072
    > (which changes array to export the appropriate integer typecodes for
    > 'u' arrays) and otherwise punt on this for 3.3 and try to sort out
    > the mess for 3.4.

    I think this would be the worst choice. It would mean that we change
    the format for exported array.arrays now for 3.3, and then change it
    in 3.4 again. So anybody who cares about this would have to deal
    with three different behaviors.

    Note that the array module had been using 'u' and 'w' essentially
    "forever" (i.e. since 3.0).

    > For 3.4, I'm inclined to favour Stefan's proposal of C, U, W mapping
    > to multi-point sequences of UCS-1, UCS-2, UCS-4 code points (with
    > corresponding typecodes in the array module).

    Fine with me in principle, although I see a problem when NumPy uses
    'U' for UCS-4, yet CPython declares it to be UCS-2. I also think that
    Travis' explicit agreement must be sought.

    @ncoghlan
    Contributor

    I wouldn't change the export formats used for the 'u' typecode at all in 3.4 - I'd add new typecodes to array that match any new struct format characters and are exported accordingly. 'u' would *never* become a formally defined struct character, instead lingering in the array module as a legacy of the narrow/wide build distinction.

    And good point that U would need to match UCS-4 to be consistent with NumPy. Perhaps we can just add 'U' in 3.4 and forget about UCS-2 entirely?

    @loewis
    Mannequin Author

    loewis mannequin commented Aug 16, 2012

    > I wouldn't change the export formats used for the 'u' typecode at
    > all in 3.4 - I'd add new typecodes to array that match any new
    > struct format characters and are exported accordingly. 'u' would
    > *never* become a formally defined struct character, instead
    > lingering in the array module as a legacy of the narrow/wide build
    > distinction.

    I think it is a desirable property that for an array A and an index
    I, that A[I] == memoryview(A)[I]. Exporting the elements of an 'u'
    array as integers would break that property.
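
For a typecode that memoryview does understand, the property can be checked as follows. This is a sketch using the 'i' typecode as a stand-in; it says nothing about how 'u' arrays behave.

```python
import array

a = array.array("i", [10, 20, 30])
m = memoryview(a)

# For every valid index, the array element and the memoryview element agree.
holds = all(a[i] == m[i] for i in range(len(a)))
print(m.format, holds)
```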

    So if we do want to support Unicode arrays (which some people apparently
    want to see - I haven't heard anybody saying they actually *need* such
    a type), the buffer type of it should be "unicode", in some form, not
    "number".

    I would be fine with deprecating the 'u' type arrays, acknowledging
    that the Py_UNICODE element type is even more useless now than before.
    If that is done, there is no point in fixing anything about it. If
    it exports using the 'u' and 'w' codes - fine. If then memoryview
    doesn't work properly - fine; this is a deprecated feature.

    It should be fixed only if we want to support it "properly" (which I
    believe this patch would do).

    @ncoghlan
    Contributor

    I guess the main alternative to deprecation that preserves the invariant you describe would be to propagate the "u == Py_UNICODE" definition to memoryview. Since we're trying to phase out Py_UNICODE, deprecation seems the more sensible course.

    Perhaps just a documented deprecation for now, like the rest of the Py_UNICODE based APIs?

    @skrah
    Mannequin

    skrah mannequin commented Aug 16, 2012

    Martin v. Loewis <report@bugs.python.org> wrote:

    > I would be fine with deprecating the 'u' type arrays, acknowledging
    > that the Py_UNICODE element type is even more useless now than before.
    > If that is done, there is no point in fixing anything about it. If
    > it exports using the 'u' and 'w' codes - fine. If then memoryview
    > doesn't work properly - fine; this is a deprecated feature.

    From the perspective of memoryview backwards compatibility, deprecation is fine.
    In 3.2, memoryview could really only handle one-dimensional buffers of unsigned
    bytes:

    >>> import array
    >>> a = array.array('u', "ABC")
    >>> x = memoryview(a)
    >>> a[0] == x[0]
    False
    >>> a[0]
    'A'
    
    # Indexing returns bytes instead of str:
    >>> x[0]
    b'A\x00'
    >>> 
    
    # Index assignment attempts to do slice assignment:
    >>> x[0] = 'Z'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'str' does not support the buffer interface
    >>> 

    I'm +1 for deprecating 'u' and 'w' in the array module, accepting that memoryview
    cannot handle 'u' and 'w', and fixing the situation properly in 3.4. I agree that
    the latter would require people to come up with actual use cases.

    @skrah
    Mannequin

    skrah mannequin commented Aug 16, 2012

    Well, apparently people do use 'u', see bpo-15035.

    @loewis
    Mannequin Author

    loewis mannequin commented Aug 16, 2012

    bpo-15035 indicates that there is a need for UCS-2 arrays. Using 'u' arrays there was technically incorrect, since 'u' is based on Py_UNICODE, whereas the API in question uses UniChar (which apparently is a two-byte type).

    @skrah
    Mannequin

    skrah mannequin commented Aug 16, 2012

    Martin v. Löwis <report@bugs.python.org> wrote:

    > bpo-15035 indicates that there is a need for UCS-2 arrays, using 'u' arrays was technically incorrect, since it is based on Py_UNICODE, whereas the API in question uses UniChar (which apparently is a two-byte type).

    Right, thanks for clearing that up. Then bpo-15035 would indeed support deprecating
    'u' and 'w' and moving on to UCS2 and UCS4 arrays.

    @wiggin15
    Mannequin

    wiggin15 mannequin commented Apr 15, 2015

    The documentation already specifies that 'u' is deprecated and doesn't mention the 'w' code. I think we can close this issue.

    @zooba
    Member

    zooba commented Apr 15, 2015

    Closing sounds good to me

    @zooba zooba closed this as completed Apr 15, 2015
    @socketpair
    Mannequin

    socketpair mannequin commented May 25, 2016

    I hit the same bug.

    I want to slice a big unicode string efficiently, so I decided to use memoryview to avoid copying memory.

    In [33]: a = array.array('u', 'превед')
    In [34]: m = memoryview(a)
    In [35]: m[2:]
    Out[35]: <memory at 0x7efc98fcc048>
    In [36]: m[0]
    ...
    NotImplementedError: memoryview: format w not supported

    1. Why does the error mention format 'w' when I asked for 'u'?
    2. Format 'w' is not listed in https://docs.python.org/3.5/library/array.html
    3. What is the alternative for fast slicing, like memoryview(bytearray(b'test')), but for unicode?
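
One possible workaround for the third question, offered as a sketch rather than a recommendation from this thread: encode the string once as UTF-32 in native byte order and cast the buffer to unsigned 32-bit integers. It assumes the 'I' format is 4 bytes wide on the platform. The initial encode copies the text once; slices of the resulting view do not copy.

```python
import sys

text = "превед"

# Pick the UTF-32 variant that matches the machine's byte order, so that
# each 4-byte item read through the 'I' format is one code point.
codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"

# Assumes 'I' is 4 bytes wide, which holds on common platforms.
view = memoryview(text.encode(codec)).cast("I")

# Slicing the view does not copy the underlying buffer.
tail = view[2:]

# Items are integer code points; convert back to str only when needed.
first_char = chr(view[0])
tail_text = tail.tobytes().decode(codec)
print(first_char, tail_text)
```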

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022