
Author teoliphant
Recipients Arfrever, alex, certik, dmalcolm, loewis, ncoghlan, pitrou, skrah, teoliphant, vstinner
Date 2012-08-03.15:59:11
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <C8A9A526-AC51-40CB-A044-CB61F13A44F7@gmail.com>
In-reply-to <20120803083522.Horde.FdzmXqGZi1VQG3EqdU-WNkA@webmail.df.eu>
Content
On Aug 3, 2012, at 1:35 AM, Martin v. Löwis wrote:

> 
> Martin v. Löwis added the comment:
> 
>> This is a mis-understanding of what NumPy does and why.    There is  
>> a need to byte-swap only when the data is stored on disk in the  
>> reverse order from the native machine
> 
> So is there ever a need to byte-swap Unicode strings? I can see how *numeric*
> data are stored using the internal representation on disk; this is a common
> technique. For strings, there is the notion of encodings which makes the
> relationship between internal and disk representations. So if NumPy applies
> the numeric concept to string data, then this is a flaw.

Apologies for not using correct terminology. I had to spend a lot of time getting to know Unicode when I wrote NumPy, but I am rusty on the key points, so I may communicate incorrectly. The NumPy representation of Unicode strings is always UTF-32BE or UTF-32LE (depending on the data-type of the array). The question is what to do when extracting this data into an array scalar (which for Unicode objects has exactly the same representation as a PyUnicodeObject). In fact, the NumPy Unicode array scalar is a C subtype of PyUnicodeObject and inherits from both PyUnicodeObject and the NumPy "Character" interface --- a likely rare example of dual inheritance at the C level.
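A rough sketch of what this looks like from Python (the string "abc" and the 3-character dtype are just illustrative; nothing here depends on NumPy internals beyond the documented dtype behavior):

```python
import sys
import numpy as np

# NumPy stores each Unicode element as fixed-width UTF-32 in the
# array's byte order: '<U3' is little-endian, '>U3' big-endian.
native = np.array(["abc"], dtype="<U3" if sys.byteorder == "little" else ">U3")
swapped = native.astype(native.dtype.newbyteorder())

# The raw buffers differ byte-for-byte...
assert native.tobytes() != swapped.tobytes()
# ...but indexing either array yields the same Python str, because
# NumPy byte-swaps (if needed) before building the Unicode scalar.
assert native[0] == swapped[0] == "abc"
```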

> 
> It may be that people really do store text data in the same memory blob
> as numeric data and dump it to a file, but they really should think of this
> data as "UTF-16-BE" or "UTF-32-LE" and the like, not in terms of byte  
> swapping.
> You can use PyUnicode_Decode to create a Unicode object given a void*,
> a length, and a codec name. The concept "native Unicode representation"
> does not exist - people use all of two-byte, four-byte and UTF-8  
> representations
> in memory, on a single processor architecture and operating system.

I understand all the representations of Unicode data. There is, however, a native byte-order, and that's what I was talking about.
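For illustration, `sys.byteorder` and NumPy's `dtype.byteorder` are the documented ways to ask about this (the concrete values depend on the machine running the sketch):

```python
import sys
import numpy as np

# The machine has a single native byte order...
print(sys.byteorder)  # 'little' on x86, 'big' on e.g. s390x

# ...and a NumPy dtype records whether its data matches it:
# '=' native, '<' little-endian, '>' big-endian, '|' not applicable.
assert np.dtype("=i4").byteorder == "="
# An explicit order that happens to match the native one is
# normalized to '=', so this holds on either kind of machine:
assert np.dtype(">i4").byteorder in ("=", ">")
```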

> 
>> The byte-swapping must be done prior to conversion to a Python  
>> Unicode-Object when selecting data out of the array.
> 
> So if the byte swapping is done before the Unicode object is created:
> why did Dave and Ondřej run into problems then?

There were at least two issues: 1) a bad test, written by someone who didn't understand that you shouldn't treat byte-swapped Unicode data as ordinary strings, and 2) a misunderstanding of what happens in the conversion from the data stored in a NumPy array to the Python "scalar" object being created.
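The two viewpoints can be made concrete side by side (a sketch only; the string "hi" and the dtypes are just illustrative):

```python
import numpy as np

# Raw buffer of "hi" as big-endian UTF-32, as it might sit in a file
# written on a big-endian machine.
raw = "hi".encode("utf-32-be")

# The codec-based route: name the byte order in the encoding.
assert raw.decode("utf-32-be") == "hi"

# The NumPy route: view the buffer with a big-endian dtype; indexing
# byte-swaps (if needed) before the Python str is built.
arr = np.frombuffer(raw, dtype=">U2")
assert arr[0] == "hi"
```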
Thank you for your explanations; they are very helpful. Also, thank you for the PEP and the improvements in Python 3.3. The situation is *much* nicer now, as NumPy has been doing all kinds of hackery to support both narrow and wide builds. This hackery could likely be improved even pre-Python 3.3, but it is much clearer how to handle the situation in Python 3.3.

> 
> ----------
> 
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue15540>
> _______________________________________
History
Date User Action Args
2012-08-03 15:59:14  teoliphant  set  recipients: + loewis, ncoghlan, pitrou, vstinner, Arfrever, certik, alex, skrah, dmalcolm
2012-08-03 15:59:14  teoliphant  link  issue15540 messages
2012-08-03 15:59:11  teoliphant  create