classification
Title: Python 3.3 and numpy
Type:
Stage: resolved
Components: Interpreter Core
Versions: Python 3.3
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: Arfrever, alex, certik, dmalcolm, loewis, ncoghlan, pitrou, skrah, teoliphant, vstinner
Priority: normal
Keywords: patch

Created on 2012-08-02 21:10 by dmalcolm, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
get-numpy-working-with-python-3.3.patch dmalcolm, 2012-08-02 21:10 Work-in-progress patch for numpy to get it working with 3.3
hack-out-test-against-MAX_UNICODE-from-cpython-3.3.patch dmalcolm, 2012-08-02 21:11 Dirty dirty hack applied to CPython 3.3 to get numpy to work
Messages (16)
msg167256 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-08-02 21:10
I've been trying to get numpy working with Python 3.3, and to do so I had to make some changes to CPython - hence I'm posting this to the Python bug tracker.

numpy pokes at the insides of PyUnicodeObject in a few places and is thus affected by the PEP 393 changes in Python 3.3
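
For readers unfamiliar with PEP 393, the following is a rough sketch of its storage rule, which is why code that assumes a fixed-width internal buffer breaks in 3.3. The helper name `pep393_char_width` is hypothetical and only models the rule in pure Python; it does not inspect the real object layout.

```python
def pep393_char_width(text: str) -> int:
    """Return the bytes per character that PEP 393 selects for `text`.

    Hypothetical helper: it mirrors the rule (narrowest width that can hold
    the highest code point) rather than reading CPython's internals.
    """
    highest = max(map(ord, text), default=0)
    if highest < 0x100:      # fits in Latin-1: 1 byte per character
        return 1
    if highest < 0x10000:    # fits in the BMP: 2 bytes per character
        return 2
    return 4                 # needs full UCS4: 4 bytes per character


for sample in ("abc", "\u20ac", "\U0001F600"):
    print(repr(sample), pep393_char_width(sample))
```

The width therefore depends on the string's contents, not on how the interpreter was built.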

I'm attaching my latest work-in-progress patch for numpy, which mostly works (it has 3 remaining errors).

AIUI, the "numpy.str_" type subclasses PyUnicodeObject but with its own custom allocator, which takes a size (this is called in PyArray_Scalar when type_num == NPY_UNICODE). unicode_new_subtype calls tp_alloc but passes in 0 for the size, so we can't use that. So I had to reimplement parts of unicode creation in place within numpy's PyArray_Scalar, copying macros out of CPython's unicodeobject.c.

The other wart is that, AIUI, numpy supports byte-swapping the values within an array, and when this is done for a unicode array, it byte-swaps the 4-byte UCS4 values. At that point, it's unlikely that the resulting 4-byte values are at or below 0x10ffff, leading to various failures from inside CPython's unicode handling.
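
A small sketch of why a byte-swapped UCS4 value falls outside the valid range, using only the standard library. The helper `byteswap_u32` is a hypothetical name for illustration and is not numpy's implementation.

```python
import struct

MAX_UNICODE = 0x10FFFF  # highest valid Unicode code point


def byteswap_u32(value: int) -> int:
    """Reverse the byte order of a 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack(">I", value))[0]


swapped = byteswap_u32(ord("a"))  # 0x00000061 becomes 0x61000000
print(hex(swapped), swapped > MAX_UNICODE)

try:
    chr(swapped)  # the interpreter refuses out-of-range code points
except ValueError as exc:
    print("rejected:", exc)
```

Any code point with a non-zero low byte ends up far above 0x10ffff after swapping, which is what the consistency checks catch.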

So I hacked those tests out of CPython :)   I'm attaching the diff I've got against cpython (clearly just a hack at this stage).

I may of course be misunderstanding the insides of numpy.

With these changes, the numpy test suite runs, with just these remaining errors:

======================================================================
ERROR: Ticket #16
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/david/coding/python3.3/local-install/lib/python3.3/site-packages/numpy/core/tests/test_regression.py", line 41, in test_pickle_transposed
    b = pickle.load(f)
EOFError

======================================================================
ERROR: test_power_zero (test_umath.TestPower)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/david/coding/python3.3/local-install/lib/python3.3/site-packages/numpy/core/tests/test_umath.py", line 139, in test_power_zero
    assert_complex_equal(np.power(zero, 0+1j), cnan)
RuntimeWarning: invalid value encountered in power

======================================================================
ERROR: Failure: ValueError (can't handle version 187 of numpy.ndarray pickle)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/david/coding/python3.3/local-install/lib/python3.3/site-packages/nose-1.1.2-py3.3.egg/nose/failure.py", line 37, in runTest
    raise self.exc_class(self.exc_val).with_traceback(self.tb)
  File "/home/david/coding/python3.3/local-install/lib/python3.3/site-packages/nose-1.1.2-py3.3.egg/nose/loader.py", line 232, in generate
    for test in g():
  File "/home/david/coding/python3.3/local-install/lib/python3.3/site-packages/numpy/lib/tests/test_format.py", line 429, in test_roundtrip
    arr2 = roundtrip(arr)
  File "/home/david/coding/python3.3/local-install/lib/python3.3/site-packages/numpy/lib/tests/test_format.py", line 420, in roundtrip
    arr2 = format.read_array(f2)
  File "/home/david/coding/python3.3/local-install/lib/python3.3/site-packages/numpy/lib/format.py", line 449, in read_array
    array = pickle.load(fp)
ValueError: can't handle version 187 of numpy.ndarray pickle

----------------------------------------------------------------------
Ran 4776 tests in 68.189s

FAILED (KNOWNFAIL=6, SKIP=1, errors=3)
msg167262 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2012-08-02 22:02
A couple of days ago there was another effort by Ondřej Čertík to get
NumPy working with 3.3, see the thread starting here:

http://comments.gmane.org/gmane.comp.python.numeric.general/51087

I participated in that discussion and we hit the same problem with
the byte-swapped arrays: Since the generated Unicode strings are invalid,
the consistency checks fail in debug mode.

I wonder if taking out that assert is the right thing though: The
use case is pretty unique; could NumPy not convert these byte-swapped
arrays to uint32?
msg167263 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-08-02 22:13
The byte swapping in numpy is clearly a misfeature when applied to Unicode. I don't think Python should support that, and I can't imagine anybody is using this for a purpose that couldn't also be achieved in a better way. Not sure why you are submitting these patches here - consider this one rejected.

I'm also not sure why you posted the numpy patch here; we cannot process it in any meaningful way. If you just need to find a place in the web to post it, please use wiki.python.org.

In general, I'm (personally) quite opposed to using the tracker for work-in-progress. Tracker issues should be actionable by a core committer (reject, commit, request further changes). If the author is aware that it is not ready for review yet, it shouldn't go into the tracker.

If you primarily wanted to report these incompatibilities: that's appreciated. I can add them to

http://www.python.org/dev/peps/pep-0393/#deprecations-removals-and-incompatibilities

I don't think anything can be done about this: any C extension type subclassing the Unicode type needs to be reviewed; I don't think any reasonable backwards compatibility can be applied. This is unfortunate, but without a good solution.

Likewise, code that mutates what is designed to be an immutable object is prone to break. I have no regrets about this breakage.
msg167268 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-08-02 22:28
Agreed with Martin. Byte-swapped unicode data in unicode objects doesn't make sense, since it will break the semantics of many operations. If numpy wants to support byte-swapped unicode data (what for?), they should store it in a different object type.
msg167271 - (view) Author: Travis Oliphant (teoliphant) * (Python committer) Date: 2012-08-02 23:01
On Aug 2, 2012, at 5:28 PM, Antoine Pitrou wrote:

> 
> Antoine Pitrou added the comment:
> 
> Agreed with Martin. Byte-swapped unicode data in unicode objects doesn't make sense, since it will break the semantics of many operations. If numpy wants to support byte-swapped unicode data (what for?), they should store it in a different object type.

This is a mis-understanding of what NumPy does and why.    There is a need to byte-swap only when the data is stored on disk in the reverse order from the native machine (i.e. NumPy is pointing to memory-mapped data).    

The byte-swapping must be done prior to conversion to a Python Unicode-Object when selecting data out of the array.   

-Travis
msg167274 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-08-02 23:09
> The byte-swapping must be done prior to conversion to a Python
> Unicode-Object when selecting data out of the array.

But then it shouldn't affect the invariants which are commented out in
Dave's patch.
msg167275 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2012-08-02 23:15
> There is a need to byte-swap only when the data is stored on disk in the reverse order from the native machine (i.e. NumPy is pointing to memory-mapped data).

In that case it should be a matter of disabling some NumPy unit tests.
It seems that currently generating a Unicode object with non-native
byte-order is being tested, hence the reported failures.
msg167276 - (view) Author: Travis Oliphant (teoliphant) * (Python committer) Date: 2012-08-02 23:26
On Aug 2, 2012, at 6:09 PM, Antoine Pitrou wrote:

> 
> Antoine Pitrou added the comment:
> 
>> The byte-swapping must be done prior to conversion to a Python
>> Unicode-Object when selecting data out of the array.
> 
> But then it shouldn't affect the invariants which are commented out in
> Dave's patch.
> 

My impression is that Python should not have to change anything, but NumPy needs to adapt to the new Unicode concepts (which I think are great, by the way --- I'm a big supporter of getting rid of the wide/narrow build distinction). 

msg167278 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-08-03 00:18
While I agree this is important (e.g. I know Dave started looking into this as getting NumPy working is currently a blocker for Fedora migrating their Python 3 stack to 3.3), the burden is definitely on the NumPy side to get the code point values using the right byte order before handing them over to the CPython API. If the sanity checks in the Unicode implementation are firing, it means something is still wrong on the NumPy side (perhaps erroneous unit tests, as Stefan reports).

From the point of view of CPython, an array of byte-swapped code points isn't text, it's an array of 16 or 32 bit integers, and NumPy's text handling needs to work within that constraint.

FWIW, with regard to Martin's tangential comment about appropriate use of the tracker, I'm personally fine with using the tracker for 'I found this problem, attempted to fix it (but failed), here's my attempt'. It's part of the wide spectrum of issue reporting that ranges from 'I found this problem, but have no idea what is causing it or where to even begin trying to fix it' through 'I found this problem, here's a fix that worked for me' all the way to 'I found this problem, here's a fix that works on multiple platforms with new tests, insightful doc adjustments and a pony' :)
msg167280 - (view) Author: Ondřej Čertík (certik) Date: 2012-08-03 00:21
I wrote this initial patch for the issue last week:

https://github.com/numpy/numpy/pull/366

with huge help from Stefan and others.

As far as the unicode issue goes, Travis and I just talked about this and I think I now understand what is going on ---- the unicode type itself (as returned by the PyArray_Scalar() function in NumPy) should *never* have the byte swapped internals.

In other words, the use case for byte swapping is that numpy may be pointing at memory holding byte-swapped data (for example, you save some data on a big-endian machine and load it on a little-endian one). Let's say you have some strings (unicode). They will always be UCS4 inside numpy, possibly swapped. When the user actually calls things like my_array[1], PyArray_Scalar() looks at the memory, does any swapping (if necessary) and returns a valid unicode object on the current platform (with the correct endianness). The returned unicode can have any width (UCS1, UCS2 or UCS4 -- whatever Python likes); that doesn't really matter.
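
The flow described above can be modelled in pure Python as a rough sketch. The function `scalar_from_buffer` is hypothetical and stands in for the swap-then-build step; the real logic lives in C inside PyArray_Scalar() and is not shown here.

```python
def scalar_from_buffer(raw: bytes, byteorder: str) -> str:
    """Build a str from a buffer of 4-byte code points.

    `byteorder` is "little" or "big" and describes how the buffer was
    stored, so each value is read in the right order before any str exists.
    """
    if len(raw) % 4:
        raise ValueError("buffer length must be a multiple of 4")
    code_points = [
        int.from_bytes(raw[i:i + 4], byteorder)
        for i in range(0, len(raw), 4)
    ]
    # chr() validates each value, so a wrongly ordered buffer fails loudly
    # instead of producing an invalid string.
    return "".join(chr(cp) for cp in code_points)


stored_big = "h\u00e9\u20ac".encode("utf-32-be")  # data saved on a big-endian machine
print(scalar_from_buffer(stored_big, "big"))
```

Reading the same buffer with the wrong byte order raises ValueError rather than yielding byte-swapped text.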

So no changes are necessary to Python itself. As far as NumPy goes -- the tests are obviously wrong, because they happen to create unicode that is invalid. So the NumPy tests need to be fixed.

Otherwise there is no problem. I am now working on a better version of my patch that doesn't need to force the unicode to be UCS4 so that it can swap its contents.
msg167292 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-08-03 06:35
> This is a mis-understanding of what NumPy does and why.    There is  
> a need to byte-swap only when the data is stored on disk in the  
> reverse order from the native machine

So is there ever a need to byte-swap Unicode strings? I can see how *numeric*
data are stored using the internal representation on disk; this is a common
technique. For strings, there is the notion of encodings which makes the
relationship between internal and disk representations. So if NumPy applies
the numeric concept to string data, then this is a flaw.

It may be that people really do store text data in the same memory blob
as numeric data and dump it to a file, but they really should think of this
data as "UTF-16-BE" or "UTF-32-LE" and the like, not in terms of byte swapping.
You can use PyUnicode_Decode to create a Unicode object given a void*,
a length, and a codec name. The concept "native Unicode representation"
does not exist - people use all of two-byte, four-byte and UTF-8
representations in memory, on a single processor architecture and operating system.
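
As a Python-level analogue of the PyUnicode_Decode suggestion (a sketch, not the C call itself), `bytes.decode` with an explicit codec name expresses the same idea: the byte order is part of the encoding, and no swapping step appears in user code. The helper `decode_or_none` is hypothetical.

```python
text = "na\u00efve \u20ac"

# The same text in two on-disk layouts, each named by its encoding.
as_be = text.encode("utf-32-be")
as_le = text.encode("utf-32-le")

decoded_be = as_be.decode("utf-32-be")
decoded_le = as_le.decode("utf-32-le")
print(decoded_be == decoded_le == text)


def decode_or_none(raw: bytes, codec: str):
    """Decode `raw` with `codec`, returning None if the data is invalid."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


# Naming the wrong byte order is detected by the codec, not silently accepted.
print(decode_or_none(as_be, "utf-32-le"))
```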

> The byte-swapping must be done prior to conversion to a Python  
> Unicode-Object when selecting data out of the array.

So if the byte swapping is done before the Unicode object is created:
why did Dave and Ondřej run into problems then?
msg167301 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-08-03 09:18
> FWIW, with regard to Martin's tangential comment about appropriate  
> use of the tracker, I'm personally fine with using the tracker for  
> 'I found this problem, attempted to fix it (but failed), here's my  
> attempt'.

I don't mind that at all, either. What I dislike is "I have this issue,
here is what I've got, and I will continue to work on it" kind of reports
(when Dave clearly said that his patch is work-in-progress). There is a
worse kind, where people say "I have this issue", followed (after 15 minutes)
with "I found out more", and then going on with that for a while. I feel
that this wastes the reader's time (who may actually start to work on the
issue as well, duplicating efforts) - this is something that submitters
really need to consider. It's not this bad in this issue, since Dave said
from the beginning that it is work-in-progress.

On-topic: what action do you suggest for this issue?
msg167303 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2012-08-03 09:51
Martin v. Löwis <report@bugs.python.org> wrote:
> I don't mind that at all, either. What I dislike is "I have this issue,
> here is what I've got, and I will continue to work on it" kind of reports
> (when Dave clearly said that his patch is work-in-progress). There is a
> worse kind, where people say "I have this issue", followed (after 15 minutes)
> with "I found out more", and then going on with that for a while. I feel
> that this wastes the reader's time (who may actually start to work on the
> issue as well, duplicating efforts) - this is something that submitters
> really need to consider. It's not this bad in this issue, since Dave said
> from the beginning that it is work-in-progress.

I think Dave's main issue was: Here's a problem with NumPy/3.3, could we
perhaps remove an assert() to keep NumPy working. That was the one-liner
patch.

The NumPy patches were probably there for illustration. -- It might be unusual
to post patches for third party packages on the tracker, but NumPy is in a
very special situation right now:

Version 1.7 is about to be released and is currently not 3.3 compatible.
As Nick said, NumPy is a blocker for Python-3.3 integration into Fedora.

So, I hope that Travis doesn't mind if people keep pushing the issue a bit.
On the positive side, it's a sign that people care a lot about NumPy.

> On-topic: what action do you suggest for this issue?

As for the action, I think everyone agrees that no changes to Python are
necessary.
msg167327 - (view) Author: Ondřej Čertík (certik) Date: 2012-08-03 13:41
Martin,

> So if the byte swapping is done before the Unicode object is created:
> why did Dave and Ondřej run into problems then?

As I wrote above (http://bugs.python.org/msg167280), this happened because of wrong NumPy tests, that need to be fixed. They are testing some byte swapping issues and they do produce an invalid unicode in the process --- this is a bug and we need to fix it in NumPy.
msg167335 - (view) Author: Ondřej Čertík (certik) Date: 2012-08-03 14:57
Here is my new patch:

https://github.com/numpy/numpy/pull/372

It implements what I was talking about (and fixes the NumPy tests bug).
msg167338 - (view) Author: Travis Oliphant (teoliphant) * (Python committer) Date: 2012-08-03 15:59
On Aug 3, 2012, at 1:35 AM, Martin v. Löwis wrote:

> 
> Martin v. Löwis added the comment:
> 
>> This is a mis-understanding of what NumPy does and why.    There is  
>> a need to byte-swap only when the data is stored on disk in the  
>> reverse order from the native machine
> 
> So is there ever a need to byte-swap Unicode strings? I can see how *numeric*
> data are stored using the internal representation on disk; this is a common
> technique. For strings, there is the notion of encodings which makes the
> relationship between internal and disk representations. So if NumPy applies
> the numeric concept to string data, then this is a flaw.

Apologies for not using correct terminology.   I had to spend a lot of time getting to know Unicode when I wrote NumPy, but am rusty on the key points and so I may communicate incorrectly.   The NumPy representation of Unicode strings is always UTF-32BE or UTF-32LE (depending on the data-type of the array).    The question is what to do when extracting this data into an array-scalar (which for Unicode objects has exactly the same representation as a PyUnicodeObject).  In fact, the NumPy Unicode array scalar is a C-sub-type of PyUnicodeObject and inherits from both the PyUnicodeObject and the NumPy "Character" interface --- a likely rare example of dual-inheritance at the C-level.  

> 
> It may be that people really do store text data in the same memory blob
> as numeric data and dump it to a file, but they really should think of this
> data as "UTF-16-BE" or "UTF-32-LE" and the like, not in terms of byte  
> swapping.
> You can use PyUnicode_Decode to create a Unicode object given a void*,
> a length, and a codec name. The concept "native Unicode representation"
> does not exist - people use all of two-byte, four-byte and UTF-8  
> representations
> in memory, on a single processor architecture and operating system.

I understand all the representations of Unicode data.   There is, however, a native byte-order and that's what I was talking about. 

> 
>> The byte-swapping must be done prior to conversion to a Python  
>> Unicode-Object when selecting data out of the array.
> 
> So if the byte swapping is done before the Unicode object is created:
> why did Dave and Ondřej run into problems then?

There were at least 2 issues: 1) a bad test that was written by someone who didn't understand you shouldn't have "byte-swapped" unicode strings as "strings" and 2) a mis-understanding of what was happening going from the data stored in a NumPy array and the Python "scalar" object that was being created.

Thank you for your explanations.   It's very helpful.   Also, thank you for the PEP and improvements in Python 3.3.   The situation is *much* nicer now as NumPy is doing all kinds of hackery to support both narrow and wide builds.    This hackery could likely be improved even pre Python 3.3, but it's more clear how to handle the situation now in Python 3.3

History
Date User Action Args
2022-04-11 14:57:33 admin set github: 59745
2012-08-03 15:59:14 teoliphant set messages: + msg167338
2012-08-03 14:57:43 certik set messages: + msg167335
2012-08-03 13:41:11 certik set messages: + msg167327
2012-08-03 09:51:55 skrah set messages: + msg167303
2012-08-03 09:18:09 loewis set messages: + msg167301
2012-08-03 06:35:23 loewis set messages: + msg167292
2012-08-03 00:21:52 certik set messages: + msg167280
2012-08-03 00:18:53 ncoghlan set status: open -> closed; resolution: not a bug; messages: + msg167278; stage: resolved
2012-08-02 23:26:41 teoliphant set messages: + msg167276
2012-08-02 23:23:30 skrah set nosy: + certik
2012-08-02 23:15:41 skrah set messages: + msg167275
2012-08-02 23:09:48 pitrou set messages: + msg167274
2012-08-02 23:01:47 teoliphant set messages: + msg167271
2012-08-02 22:28:03 pitrou set nosy: + pitrou; messages: + msg167268
2012-08-02 22:13:58 loewis set messages: + msg167263
2012-08-02 22:02:10 skrah set nosy: + skrah; messages: + msg167262
2012-08-02 21:31:36 Arfrever set nosy: + Arfrever
2012-08-02 21:16:22 dmalcolm set nosy: + teoliphant
2012-08-02 21:16:05 dmalcolm set nosy: + ncoghlan
2012-08-02 21:13:16 alex set nosy: + alex
2012-08-02 21:12:31 dmalcolm set nosy: + loewis, vstinner
2012-08-02 21:11:30 dmalcolm set files: + hack-out-test-against-MAX_UNICODE-from-cpython-3.3.patch
2012-08-02 21:10:45 dmalcolm create