Message 167280 - Python tracker



Author	certik
Recipients	Arfrever, alex, certik, dmalcolm, loewis, ncoghlan, pitrou, skrah, teoliphant, vstinner
Date	2012-08-03.00:21:50
Message-id	<1343953313.37.0.330986246841.issue15540@psf.upfronthosting.co.za>
In-reply-to

Content
I wrote this initial patch for the issue last week:

https://github.com/numpy/numpy/pull/366

with huge help from Stefan and others.

As far as the unicode issue goes, Travis and I just talked about this, and I think I now understand what is going on: the unicode object itself (as returned by the PyArray_Scalar() function in NumPy) should *never* have byte-swapped internals.

In other words, byte swapping only comes into play when NumPy is pointing at memory that holds byte-swapped data (for example, you save some data on a big-endian machine and load it on a little-endian one). Say that data contains some (unicode) strings. Inside NumPy they are always stored as UCS4, possibly byte-swapped. When the user indexes the array, e.g. my_array[1], PyArray_Scalar() looks at the memory, swaps the bytes if necessary, and returns a valid unicode object for the current platform (with the correct endianness). The returned unicode object can use any internal width (UCS1, UCS2 or UCS4, whatever Python likes); that doesn't really matter.
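
To make the read path above concrete, here is a rough pure-Python sketch. It is only a toy stand-in for the idea, not NumPy's actual code: the helper name read_ucs4_item and its parameters are made up for illustration, and the standard utf-32-be / utf-32-le codecs play the role of the "swap if necessary" step.

```python
def read_ucs4_item(buf: bytes, index: int, nchars: int, byteorder: str) -> str:
    """Hypothetical helper: read one fixed-width UCS4 slot from a raw buffer.

    buf       -- raw memory holding consecutive slots of `nchars` code points
    index     -- which slot to read
    byteorder -- "big" or "little": the order the data was *stored* in
    """
    slot_bytes = nchars * 4  # UCS4 uses 4 bytes per code point
    raw = buf[index * slot_bytes:(index + 1) * slot_bytes]
    # Decoding with the stored byte order is the "swap if necessary" step:
    # the buffer is left untouched, only the returned object is native.
    codec = "utf-32-be" if byteorder == "big" else "utf-32-le"
    text = raw.decode(codec)
    # Fixed-width slots are padded with NUL code points; drop the padding.
    return text.rstrip("\x00")


# Simulate data saved on a big-endian machine: two slots of 3 code points.
stored = "ab".ljust(3, "\x00").encode("utf-32-be") + "héé".encode("utf-32-be")

first = read_ucs4_item(stored, 0, 3, "big")
second = read_ucs4_item(stored, 1, 3, "big")
print(first, second)
```

The point of the sketch is that the swap happens while reading, so the returned str is an ordinary native object and Python is free to pick whatever internal width it likes for it. Reading the same bytes with the wrong byte order would turn 'a' (0x00000061) into 0x61000000, which is outside the Unicode range.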

So no changes are necessary to Python itself. As far as NumPy goes, the tests are clearly wrong, because they happen to create invalid unicode objects. So the NumPy tests need to be fixed.

Otherwise there is no problem. I am now working on a better version of my patch, one that doesn't need to force the unicode to be UCS4 in order to swap its contents.

History
Date	User	Action	Args
2012-08-03 00:21:53	certik	set	recipients: + certik, loewis, teoliphant, ncoghlan, pitrou, vstinner, Arfrever, alex, skrah, dmalcolm
2012-08-03 00:21:53	certik	set	messageid: <1343953313.37.0.330986246841.issue15540@psf.upfronthosting.co.za>
2012-08-03 00:21:52	certik	link	issue15540 messages
2012-08-03 00:21:50	certik	create