Support u and w codes in memoryview #59830
Currently, the following test case fails:
>>> import array
>>> a=array.array('u', 'foo')
>>> memoryview(a)==memoryview(a)
False
This is because the memoryview object doesn't support the u and w codes, as it should per PEP-3118. This patch fixes it. |
Nick's comment in msg167963 got me thinking. Indeed, in NumPy the 'U'
>>> import numpy
>>> nd = numpy.array(["A", "B"], dtype='U')
>>> nd
array(['A', 'B'],
dtype='<U1')
>>> nd.tostring()
b'A\x00\x00\x00B\x00\x00\x00'
>>>
>>> nd = numpy.array(["ABC", "D"], dtype='U')
>>> nd
array(['ABC', 'D'],
dtype='<U3')
>>> nd.tostring()
b'A\x00\x00\x00B\x00\x00\x00C\x00\x00\x00D\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>>
Internally, in NumPy 'U' is always UCS4, and the data type is a fixed
NumPy's use of 'U' seems vastly more useful for arrays than the behavior
>>> array.array('u', ['A', 'B'])
array('u', 'AB')
>>> array.array('u', ['ABC', 'D'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: array item must be unicode character
In NumPy, arrays of words are possible; with array.array they are not.
An additional thought: The convention in the struct module is to use |
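The byte strings in the NumPy session above can be reproduced without NumPy, which may make the fixed-width UCS4 layout easier to see. This is only a sketch: `fixed_width_ucs4` is a hypothetical helper, and it assumes little-endian storage (as in the `<U1` and `<U3` dtypes shown) with NUL padding up to the longest word.

```python
def fixed_width_ucs4(words):
    """Encode words as equal-width UTF-32-LE records (hypothetical helper).

    Every record is padded with NUL code points to the length of the
    longest word, so each word occupies 4 * width bytes.
    """
    width = max(len(w) for w in words)
    return b"".join(w.ljust(width, "\0").encode("utf-32-le") for w in words)


print(fixed_width_ucs4(["A", "B"]))
print(fixed_width_ucs4(["ABC", "D"]))
```

With equal-width records, the start of word i is always at byte offset i * 4 * width, which is what makes arrays of words possible in this layout.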
Travis: can you please comment on what the intended semantics of the 'u' and 'w' specifiers are in PEP-3118? More specifically:
|
I admit that the main thing that bothers me with the proposal in PEP-3118 is the inconsistency between
c -> bytes, while u, w -> str
This was less of an issue in 2.x (which was the main frame of reference when the PEP was written), with implicit str/unicode interoperability, but seems quite jarring in the 3.x world.
Status quo:
My current inclination is still to apply Victor's patch from bpo-13072 (which changes array to export the appropriate integer typecodes for 'u' arrays) and otherwise punt on this for 3.3 and try to sort out the mess for 3.4.
For 3.4, I'm inclined to favour Stefan's proposal of C, U, W mapping to multi-point sequences of UCS-1, UCS-2, UCS-4 code points (with corresponding typecodes in the array module). Support for lowercase 'u' would then never become an official part of the buffer API, existing only as an array typecode. |
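The bytes side of that inconsistency can be checked with the struct module. The sketch below assumes a CPython 3.x interpreter in which struct defines no 'u' or 'w' format characters; `struct_accepts` is a hypothetical helper introduced only for this illustration.

```python
import struct

# 'c' works on length-1 bytes objects, never on str.
chars = struct.unpack("3c", b"ABC")
print(chars)


def struct_accepts(fmt):
    """Return True if struct recognises the format string (hypothetical helper)."""
    try:
        struct.calcsize(fmt)
        return True
    except struct.error:
        return False


# 'c' is a struct format character; 'u' and 'w' are assumed not to be.
print(struct_accepts("c"), struct_accepts("u"), struct_accepts("w"))
```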
I think this would be the worst choice. It would mean that we change
Note that the array module had been using 'u' and 'w' essentially
Fine with me in principle, although I see a problem when NumPy uses |
I wouldn't change the export formats used for the 'u' typecode at all in 3.4 - I'd add new typecodes to array that match any new struct format characters and are exported accordingly. 'u' would *never* become a formally defined struct character, instead lingering in the array module as a legacy of the narrow/wide build distinction. And good point that U would need to match UCS-4 to be consistent with NumPy. Perhaps we can just add 'U' in 3.4 and forget about UCS-2 entirely? |
I think it is a desirable property that for an array A and an index
So if we do want to support Unicode arrays (which some people apparently
I would be fine with deprecating the 'u' type arrays, acknowledging
It should be fixed only if we want to support it "properly" (which I |
I guess the main alternative to deprecation that preserves the invariant you describe would be to propagate the "u == Py_UNICODE" definition to memoryview. Since we're trying to phase out Py_UNICODE, deprecation seems the more sensible course. Perhaps just a documented deprecation for now, like the rest of the Py_UNICODE based APIs? |
Martin v. Loewis <report@bugs.python.org> wrote:
From the perspective of memoryview backwards compatibility, deprecation is fine.
>>> import array
>>> a = array.array('u', "ABC")
>>> x = memoryview(a)
>>> a[0] == x[0]
False
>>> a[0]
'A'
# Indexing returns bytes instead of str:
>>> x[0]
b'A\x00'
>>>
# Index assignment attempts to do slice assignment:
>>> x[0] = 'Z'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface
>>>
I'm +1 for deprecating 'u' and 'w' in the array module, accept that memoryview |
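For contrast, here is a sketch of the same checks with an integer typecode, where array indexing and memoryview indexing agree. It assumes Python 3.3 or later and stores code points as unsigned ints ('I'); that choice is an illustration, not something proposed in this thread.

```python
import array

# Store the code points of "ABC" as unsigned ints instead of using 'u'.
a = array.array("I", [ord(ch) for ch in "ABC"])
x = memoryview(a)

# The view reports the array's typecode and yields the same items.
print(x.format, a[0] == x[0], a.tolist() == x.tolist())

# Index assignment through the view writes a single item into the array.
x[0] = ord("Z")
print(chr(a[0]))
```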
Well, apparently people do use 'u', see bpo-15035. |
bpo-15035 indicates that there is a need for UCS-2 arrays; using 'u' arrays there was technically incorrect, since 'u' is based on Py_UNICODE, whereas the API in question uses UniChar (which apparently is a two-byte type). |
Martin v. Löwis <report@bugs.python.org> wrote:
Right, thanks for clearing that up. Then bpo-15035 would indeed support deprecating |
The documentation already specifies that 'u' is deprecated and doesn't mention the 'w' code. I think we can close this issue. |
Closing sounds good to me |
I hit the same bug. I want to slice a large Unicode string efficiently, so I decided to use memoryview to avoid copying memory.
In [33]: a = array.array('u', 'превед')
|
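One possible way to get copy-free slices without the 'u' typecode is to encode the string once as UTF-32 and cast the buffer to 4-byte integers. This is a sketch under assumptions, not a fix from this issue: it needs Python 3.3 or later for `memoryview.cast`, it assumes the native 'I' type is 4 bytes wide, and the variable names are illustrative.

```python
import sys

text = "превед"

# Native-endian UTF-32 without a BOM: exactly 4 bytes per code point.
codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
buf = text.encode(codec)          # the only full copy, made up front

view = memoryview(buf).cast("I")  # items are code points as unsigned ints
part = view[1:4]                  # slicing the view copies no data

# Decoding copies only the selected code points.
print(part.tobytes().decode(codec))
```

The trade-off is memory: every code point takes 4 bytes, regardless of how compactly the original str was stored.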