Message 168313 - Python tracker


Author	skrah
Recipients	loewis, ncoghlan, skrah
Date	2012-08-15.17:25:02
Message-id	<1345051504.76.0.71358066623.issue15625@psf.upfronthosting.co.za>

Content

Nick's comment in msg167963 got me thinking. Indeed, in NumPy the 'U'
specifier is similar to the struct module's 's' format code, but for
UCS4. So I'm questioning whether the current semantics of 'u' and 'w'
used by array.array were ever intended by the PEP authors:


>>> import numpy
>>> nd = numpy.array(["A", "B"], dtype='U')
>>> nd
array(['A', 'B'],
      dtype='<U1')
>>> nd.tostring()
b'A\x00\x00\x00B\x00\x00\x00'
>>>
>>> nd = numpy.array(["ABC", "D"], dtype='U')
>>> nd
array(['ABC', 'D'],
      dtype='<U3')
>>> nd.tostring()
b'A\x00\x00\x00B\x00\x00\x00C\x00\x00\x00D\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>>


Internally, in NumPy 'U' is always UCS4, and the data type is a
fixed-length string whose length is that of the longest initializer
element. Shorter elements are padded with NUL bytes.
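
The layout shown in the NumPy session can be reproduced without NumPy. The
following is only a sketch of that layout, not NumPy's implementation: the
helper name pack_fixed_ucs4 is made up for illustration, and it assumes
little-endian byte order (the '<' in '<U3') and a non-empty list of strings.

```python
def pack_fixed_ucs4(words):
    """Pack strings into fixed-width, NUL-padded, little-endian UCS4 fields.

    Hypothetical helper: it mimics the byte layout shown in the session
    above, it is not how NumPy is implemented.
    """
    # Field width in code points: the length of the longest element.
    width = max(len(w) for w in words)
    out = b""
    for w in words:
        # 4 bytes per code point, padded on the right with NUL bytes.
        out += w.encode("utf-32-le").ljust(4 * width, b"\x00")
    return width, out


width, data = pack_fixed_ucs4(["ABC", "D"])
print(width)      # number of code points per field
print(len(data))  # total size in bytes: 2 fields * width * 4
print(data)
```

Every element occupies the same number of bytes, which is what makes the
'U' type usable as an array item type for whole words.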


NumPy's use of 'U' seems vastly more useful for arrays than the behavior
of array.array:

>>> import array
>>> array.array('u', ['A', 'B'])
array('u', 'AB')
>>> array.array('u', ['ABC', 'D'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: array item must be unicode character


In NumPy, arrays of words are possible; with array.array they are not.

An additional thought: the convention in the struct module is to use
uppercase for unsigned types. So one possibility would be to use
'C', 'U' and 'W', where '3C' would denote the same as '3s', except
for UCS1 instead of bytes.
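
To make the comparison with '3s' concrete, here is a rough emulation using
the existing 's' code. The 'C' and 'U' codes do not exist in the struct
module; pack_ucs1 and pack_ucs4 are hypothetical helpers that only
illustrate the proposed semantics, assuming Latin-1 for UCS1 and
little-endian UTF-32 for UCS4.

```python
import struct

# Existing behavior: 'Ns' is a fixed-size bytes field, NUL-padded or
# truncated to exactly N bytes.
print(struct.pack("3s", b"AB"))    # padded to 3 bytes
print(struct.pack("3s", b"ABCD"))  # truncated to 3 bytes


def pack_ucs1(text, count):
    """Hypothetical '<count>C': a field of `count` UCS1 characters."""
    return struct.pack(f"{count}s", text.encode("latin-1"))


def pack_ucs4(text, count):
    """Hypothetical '<count>U': a field of `count` UCS4 characters."""
    return struct.pack(f"{4 * count}s", text.encode("utf-32-le"))


print(pack_ucs1("AB", 3))
print(pack_ucs4("D", 3))
```

With these semantics a count prefix describes one fixed-width string
rather than a repetition of single characters, which matches how NumPy
treats 'U3'.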

History
Date	User	Action	Args
2012-08-15 17:25:04	skrah	set	recipients: + skrah, loewis, ncoghlan
2012-08-15 17:25:04	skrah	set	messageid: <1345051504.76.0.71358066623.issue15625@psf.upfronthosting.co.za>
2012-08-15 17:25:04	skrah	link	issue15625 messages
2012-08-15 17:25:02	skrah	create