Message 167292 - Python tracker


Author	loewis
Recipients	Arfrever, alex, certik, dmalcolm, loewis, ncoghlan, pitrou, skrah, teoliphant, vstinner
Date	2012-08-03.06:35:23
Message-id	<20120803083522.Horde.FdzmXqGZi1VQG3EqdU-WNkA@webmail.df.eu>
In-reply-to	<2BF1B7E4-F115-4C3C-8011-822965DF8A98@gmail.com>

Content
> This is a mis-understanding of what NumPy does and why. There is
> a need to byte-swap only when the data is stored on disk in the
> reverse order from the native machine

So is there ever a need to byte-swap Unicode strings? I can see how *numeric*
data are stored on disk using the internal representation; this is a common
technique. For strings, there is the notion of encodings, which define the
relationship between internal and on-disk representations. So if NumPy applies
the numeric concept to string data, then this is a flaw.
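
To illustrate the point about encodings, here is a minimal Python sketch
(not taken from NumPy or from this issue; the sample string is an arbitrary
choice). It shows that one string maps to different byte sequences depending
on the codec, so the bytes on disk only make sense together with a codec name:

```python
# One text value, three on-disk representations.
text = "Ondřej"

utf8 = text.encode("utf-8")          # variable width, no byte order
utf16_le = text.encode("utf-16-le")  # two bytes per code unit, little-endian
utf32_be = text.encode("utf-32-be")  # four bytes per code point, big-endian

# The byte sequences differ in length and content...
sizes = (len(utf8), len(utf16_le), len(utf32_be))

# ...but each one decodes back to the same string when the codec is named.
round_trips = (
    utf8.decode("utf-8"),
    utf16_le.decode("utf-16-le"),
    utf32_be.decode("utf-32-be"),
)
```

None of the three byte sequences is "the" internal representation; each is
just one encoding of the same text.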

It may be that people really do store text data in the same memory blob
as numeric data and dump it to a file, but they really should think of this
data as "UTF-16-BE" or "UTF-32-LE" and the like, not in terms of byte
swapping. You can use PyUnicode_Decode to create a Unicode object given a
void*, a length, and a codec name. The concept of a "native Unicode
representation" does not exist: people use two-byte, four-byte, and UTF-8
representations in memory, all on a single processor architecture and
operating system.
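
As a rough Python-level analogue of calling PyUnicode_Decode with a codec
name, the sketch below decodes fixed-width four-byte text by naming the byte
order in the codec instead of swapping bytes first. The helper name
decode_fixed_width and the sample data are hypothetical, chosen only for
illustration; this is not how NumPy implements it.

```python
import struct


def decode_fixed_width(buf: bytes, big_endian: bool) -> str:
    """Decode four-byte-per-character text by naming its byte order.

    Hypothetical helper: the codec name carries the byte order, so no
    explicit byte swap is needed before decoding.
    """
    codec = "utf-32-be" if big_endian else "utf-32-le"
    return buf.decode(codec)


# The code points for "Num", written as 4-byte integers in each byte order,
# as they might appear in a file produced on different machines.
code_points = (0x4E, 0x75, 0x6D)
raw_be = struct.pack(">3I", *code_points)
raw_le = struct.pack("<3I", *code_points)

from_be = decode_fixed_width(raw_be, big_endian=True)
from_le = decode_fixed_width(raw_le, big_endian=False)
```

Both buffers decode to the same string, although their bytes differ; the
only thing the reader of the file needs to know is which codec applies.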

> The byte-swapping must be done prior to conversion to a Python  
> Unicode-Object when selecting data out of the array.

So if the byte swapping is done before the Unicode object is created:
why did Dave and Ondřej run into problems then?

History
Date	User	Action	Args
2012-08-03 06:35:24	loewis	set	recipients: + loewis, teoliphant, ncoghlan, pitrou, vstinner, Arfrever, certik, alex, skrah, dmalcolm
2012-08-03 06:35:23	loewis	link	issue15540 messages
2012-08-03 06:35:23	loewis	create