
Author skrah
Recipients loewis, ncoghlan, skrah
Date 2012-08-15.17:25:02
Message-id <1345051504.76.0.71358066623.issue15625@psf.upfronthosting.co.za>
Content
Nick's comment in msg167963 got me thinking. Indeed, in NumPy the 'U'
specifier is similar to the struct module's 's' format code, only for
UCS4. So I'm questioning whether the current semantics of 'u' and 'w'
used by array.array were ever intended by the PEP authors:


>>> import numpy

>>> nd = numpy.array(["A", "B"], dtype='U')
>>> nd
array(['A', 'B'],
      dtype='<U1')
>>> nd.tostring()
b'A\x00\x00\x00B\x00\x00\x00'
>>>
>>> nd = numpy.array(["ABC", "D"], dtype='U')
>>> nd
array(['ABC', 'D'],
      dtype='<U3')
>>> nd.tostring()
b'A\x00\x00\x00B\x00\x00\x00C\x00\x00\x00D\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>>


Internally, NumPy's 'U' is always UCS4, and the data type is a
fixed-length string whose length is that of the longest initializer
element. Shorter elements are padded with NUL code points.
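
The layout can be reproduced without NumPy. The following is only a
sketch: pack_fixed_ucs4 is a hypothetical helper, not a NumPy or stdlib
function, and it assumes little-endian byte order to match the '<U3'
dtype above.

```python
def pack_fixed_ucs4(words):
    """Pack strings into one little-endian UCS4 buffer of fixed-width items."""
    # Item width in code points: the length of the longest initializer element.
    width = max(len(word) for word in words)
    # Pad each item with NUL code points, then encode 4 bytes per code point.
    return b"".join(word.ljust(width, "\0").encode("utf-32-le") for word in words)


buf = pack_fixed_ucs4(["ABC", "D"])
print(len(buf))  # 2 items * 3 code points * 4 bytes per code point
```

For ["ABC", "D"] this should yield the same bytes as the tostring()
output shown above.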


NumPy's use of 'U' seems vastly more useful for arrays than the behavior
of array.array:

>>> import array
>>> array.array('u', ['A', 'B'])
array('u', 'AB')
>>> array.array('u', ['ABC', 'D'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: array item must be unicode character


In NumPy, arrays of words are possible; with array.array they are not.

An additional thought: The convention in the struct module is to use
uppercase for unsigned types. So one possibility would be to use
'C', 'U' and 'W', where '3C' would denote the same as '3s', except
for UCS1 instead of bytes.
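
To make the '3C' idea concrete, here is a rough sketch built on the
existing 's' code. The 'C' code does not exist in the struct module;
pack_ucs1 is a hypothetical helper, and Latin-1 is assumed as the
UCS1 encoding.

```python
import struct


def pack_ucs1(text, count):
    """Emulate a hypothetical '<count>C' code: '<count>s', but for UCS1 text."""
    # Latin-1 maps code points 0..255 to single bytes, i.e. UCS1.
    raw = text.encode("latin-1")
    # The existing 's' code pads with NUL bytes or truncates to `count` bytes.
    return struct.pack(f"{count}s", raw)


print(pack_ucs1("AB", 3))
print(struct.pack("3s", b"AB"))
```

Both calls pad the two-character input to three bytes; the only
difference is that the emulated 'C' takes text rather than bytes.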