This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: slicing of memoryviews when itemsize != 1 is wrong
Type: behavior Stage: patch review
Components: Interpreter Core Versions: Python 3.0, Python 3.1
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: pitrou Nosy List: amaury.forgeotdarc, belopolsky, benjamin.peterson, ncoghlan, pitrou, teoliphant
Priority: release blocker Keywords: patch

Created on 2008-12-07 20:31 by pitrou, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
issue4580.patch pitrou, 2008-12-08 23:41
issue4580ws.patch pitrou, 2008-12-09 15:03
issue4580-2.patch pitrou, 2008-12-10 17:30
Messages (28)
msg77248 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-07 20:31
The problem is with the `shape` member which doesn't get recalculated as
it should when instantiating a memoryview slice:

>>> a = array('i', range(10))
>>> m = memoryview(a)[2:8]
>>> a
array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> m[:] = array('i', range(6))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: cannot modify size of memoryview object
>>> len(m)
6
>>> m.shape
(10,)

An additional problem is that `shape` is a pointer to an array of
integers, and we don't know how to reallocate it.
msg77367 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-08 23:41
Here is patch, using the "small storage at the end of Py_buffer" idea.
It adds a lot of unit tests as well.
msg77408 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-12-09 12:29
Please, can you provide a patch that does not change whitespace
everywhere? Even if these files use indentation inconsistently; the
patch will be smaller and much easier to proofread.

(I vaguely remember a document saying that it's better to separate
whitespace change from real development, but I cannot find it.)
msg77413 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2008-12-09 13:37
Copy & paste from python-dev post:

The problem with memoryview appears to be related to the way it
calculates its own length (since that is the check that is failing when
the view blows up):

>>> a = array('i', range(10))
>>> m = memoryview(a)
>>> len(m) # This is the length in bytes, which is WRONG!
40
>>> m2 = memoryview(a)[2:8]
>>> len(m2) # This is correct
6
>>> a2 = array('i', range(6))
>>> m[:] = a    # But this works
>>> m2[:] = a2  # and this does not
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: cannot modify size of memoryview object
>>> len(memoryview(a2)) # Ah, 24 != 6 is our problem!
24

Looks to me like there are a couple of bugs here:

The first is that memoryview is treating the len field in the Py_buffer
struct as the number of objects in the view in a few places instead of
as the total number of bytes being exposed (it is actually the latter,
as defined in PEP 3118).

The second is that the getbuf implementation in array.array is broken.
It is ONLY OK for shape to be null when ndim=0 (i.e. a scalar value). An
array is NOT a scalar value, so the array objects should be setting the
shape pointer to point to an single item array (where shape[0] is the
length of the array).

memoryview can then be fixed to use shape[0] instead of len to get the
number of objects in the view.
msg77419 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-09 15:03
Sorry for the whitespace changes, here is a patch that has less of
them... hope that helps ;-S
msg77456 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2008-12-09 21:12
The patch still gets the length of memory view objects wrong - it just
makes it *consistently* wrong so that the specific assignment being
tested appears to work.

Note the following behaviour with the current memoryview :

>>> from array import array
>>> a10 = array('i', range(10))
>>> a6 = array('i', range(6))
>>> m10 = memoryview(a10)
>>> m6 = memoryview(a6)
>>> m = m10[2:8]
>>> len(m10)
40
>>> len(m6)
24
>>> len(m)
6

The patch will change the last item to 24 which will make the assignment
work, but it's still not the right answer!

As it turns out though, in addition to the length problems, this is far
from the only problem with memory view assignment - the damn things are
so finicky, it's practically impossible to assign anything to them:

>>> a = array('i', range(10))
>>> m = memoryview(a)
>>> m[1] = 1 # An integer perhaps?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'int' does not have the buffer interface
>>> m[1] = array('i', range(1)) # A single valued array?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: cannot modify size of memoryview object
>>> m[1] = b"1234" # A bytes object of the correct size?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: mismatching item sizes for "array.array" and "bytes"
>>> m[:1] = [0] # How about a sequence?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'list' does not have the buffer interface
>>> m[1] = memoryview(array('i', range(1)))[:1] # Force the returned
'len' to be wrong (but right for what memoryview is going to do with it)
and the assignment works

I guess fixing most of those qualify as enhancements though, as opposed
to the length calculation which is simply an outright bug.

So...

1. memoryview needs to be fixed so that internally self->view.len is
always the length in bytes, even after taking a slice of the view

2. memoryview needs to be fixed so that len() reflects the number of
items in the view (i.e. self->view.shape[0]), not the total number of bytes

3. array.array needs to be fixed so that it populates the shape array (a
pointer to the arrays own internal length field should be adequate)

4. PyObject_GetBuffer needs to raise an exception if it sees ndim > 0
with a NULL shape array in the resulting Py_buffer. The only time
shape=NULL is allowed is when ndim=0 (i.e. scalar values) because at
that point the length of the shape array is 0. The /* XXX */ comment in
the memoryview code about shape not being required by PEP 3118 needs to
go away because it *is* required - array.array is just broken in not
providing it.
msg77470 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-09 23:21
Hi Nick,

> 1. memoryview needs to be fixed so that internally self->view.len is
> always the length in bytes, even after taking a slice of the view

This is in my patch, unless I'm missing something?

> 2. memoryview needs to be fixed so that len() reflects the number of
> items in the view (i.e. self->view.shape[0]), not the total number of bytes

Why should that be? There's nothing suggesting that in the PEP. On the
other hand I agree it looks a bit higher-level.

> 3. array.array needs to be fixed so that it populates the shape array (a
> pointer to the arrays own internal length field should be adequate)

array.array is broken but that's a separate issue IMO (see e.g. #4509).
This issue is for fixing memoryview.

> 4. PyObject_GetBuffer needs to raise an exception if it sees ndim > 0
> with a NULL shape array in the resulting Py_buffer.

I'm for it, except that we must make sure that nobody really relies on
the current behaviour. I'm not sure there's only array.array.
msg77533 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2008-12-10 11:07
The reason memoryview's current len() implementation is wrong is because
its indexing is per element in the original object, not per byte:

>>> a = array('i', range(10))
>>> m = memoryview(a)
>>> for i in range(len(m)):
...   print(m[i])
...
b'\x00\x00\x00\x00'
b'\x01\x00\x00\x00'
b'\x02\x00\x00\x00'
b'\x03\x00\x00\x00'
b'\x04\x00\x00\x00'
b'\x05\x00\x00\x00'
b'\x06\x00\x00\x00'
b'\x07\x00\x00\x00'
b'\x08\x00\x00\x00'
b'\t\x00\x00\x00'
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
IndexError: index out of bounds

Oops.
msg77534 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2008-12-10 11:09
Regarding ndim > 0 and shape=NULL, the PEP doesn't allow it. Any code
that does it is already broken - we're just making it fail cleanly at
the point where it is broken rather than in some obscure fashion
somewhere else.
msg77538 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-10 12:57
Thanks for your two messages, Nick, they make things clearer for me and
I'll make the necessary modifications in the patch some time today.
msg77547 - (view) Author: Travis Oliphant (teoliphant) * (Python committer) Date: 2008-12-10 15:05
I will take some time in the next week to look at this issue.  Thank you
for bringing these bugs to my attention.  

I did not finish the memoryview implementation, due to other demands on
my time.  I appreciate the efforts of so many to pick up and understand
the PEP and contribute code.   I've noticed some incorrect statements
being passed around, and I apologize for not providing a perspective on
them.   

I hope there is a clear distinction between the buffer protocol and the
memoryview object.   The buffer protocol was thought through fairly
well, but the memoryview object has not received the same attention.  My
mental model of the memoryview object is basically an (enhanced) NumPy
array without the math support.  

It does not suprise me that there are a few remaining issues in the
memoryview object.  However, I'm confident they can be cleared up.   A
few changes to the code were made by others without changing the PEP.  I
accept responsibility for this because people were cleaning up what I
should have already finished.  

All the help is appreciated.  I will review the patches and fix the
problem.
msg77548 - (view) Author: Travis Oliphant (teoliphant) * (Python committer) Date: 2008-12-10 15:13
My perspective on statements made:
 
 * Memoryview object should report it's length as self->view.shape[0]
unless self->view.shape is NULL (indicating in this case a 0-d array or
scalar).  In this case, it should raise an error. 

 * The buffer protocol is clear about who owns the memory for shape and
strides (and suboffsets).  The exporter does and it is responsible for
not changing them until releasebuffer is called.

 * It should also be clear that shape, strides, and suboffsets will be
NULL in exactly two cases
   1) The corresponding flag was not set indicating the consumer is not
interested in shape, strides, or suboffsets
   2) ndim == 0 indicating a 0-dimensional array (scalar-like).
msg77552 - (view) Author: Travis Oliphant (teoliphant) * (Python committer) Date: 2008-12-10 15:33
Another comment on statements made

  * I don't see where array.array getbuf implementation is broken.  It
looks correct to me.  It sets view.shape to NULL unless the consumer
asked for the shape information to be reported in which case it sets it
equal to a pointer to the number of elements in the array.
msg77553 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-10 16:16
>   * I don't see where array.array getbuf implementation is broken.  It
> looks correct to me.  It sets view.shape to NULL unless the consumer
> asked for the shape information to be reported in which case it sets it
> equal to a pointer to the number of elements in the array.

What I meant by that is that array.array frequently bypasses its own
resize helper and calls PyMem_Resize directly instead. This is the
likely cause of #4583.
msg77554 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-10 16:20
Thanks for your comments. I'll provide an updated patch later today and
let you take a look.

>  * Memoryview object should report it's length as self->view.shape[0]
> unless self->view.shape is NULL (indicating in this case a 0-d array or
> scalar).  In this case, it should raise an error.

Are 0-d arrays (scalars) really useful? Could we make things simpler by
simply removing support for them?

>  * The buffer protocol is clear about who owns the memory for shape and
> strides (and suboffsets).  The exporter does and it is responsible for
> not changing them until releasebuffer is called.

But how do we do when slicing a memoryview? The new (sliced) memoryview
object clearly must alter the shape after the buffer has been filled in
by the exporter.
msg77560 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-10 17:29
New patch addressing the above issues. You can also review it at
http://codereview.appspot.com/10441
msg77567 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2008-12-10 20:47
It's quite possible that I misread the array.array implementation since
I only glanced at the code where it fills in the Py_buffer details - I
saw the point where it set shape to NULL, but there may have been code
later on to fill it in correctly that I missed.

Regardless, a coupe of sanity checks on the Py_buffer contents before
returning from PyObject_GetBuffer may still be useful to avoid the need
for buffer protocol consumers to repeat them all the time (I see this as
similar to the type-correctness and range checks we do for things like
the index protocol).

Antoine, regarding the shapes and strides info for subviews: don't think
of it as reallocating or altering the shape of the underlying buffer.
Think of it as having separate shape and stride information for the
contents of the underlying buffer (i.e. what memoryview gets back from
the PyObject_GetBuffer call) and for the view it exposes through
indexing and the like (i.e. taking into account the start and stop
offsets from the slicing operation). This is what I mean when I say that
the current memoryview relies too heavily on the Py_buffer that
describes the underlying object - the memoryview needs to maintain its
own state *outside* that description. It just so happens that fully
describing an arbitrary fragment of the underlying buffer will take a
number of the fields that would exist in a separate Py_buffer struct
(specifically len, ndim, shape, strides and offsets if we want to cover
the non-contiguous and ndim > 1 cases - for the contiguous 1-D cases, we
don't needs strides or offsets).

Once we consider that aspect, I think it makes the most sense to just
maintain 2 Py_buffer instances inside the memoryview - one that
describes the underlying buffer, and one that describes the memoryview
itself after slicing is taken into account.

I also think it is worth considering changing the memoryview to also
take start/stop/step arguments in addition to the object to be viewed
(initially only supporting step=1, just like slicing, but we should be
able to lift that limitation as the implementation matures). The
FromBuffer C-level constructor could probably go away at that point.
msg77568 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2008-12-10 21:02
Antoine's latest patch looks to me like it will fix the current actual
errors in the 1-D case.

Something that may also be considered part of this bug is the fact that
"m[i] = m[i]" doesn't currently work for itemsize > 1.

There is a XXX comment in the memory view implementation as to whether
we should require "itemsize == itemsize" in addition to "len == len"
when replacing sections of a memoryview. I think the failure of
self-assignment shows that we should only be caring about whether or not
the total memory size being overwritten matches, rather than the
underlying format or size.

Alternatively, we could just special case it and say that assignemnt
from buffers with itemsize=1 (i.e. raw bytes) is always acceptable,
regardless of the itemsize of the buffer underlying the memoryview.
msg77606 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-11 12:43
> Antoine, regarding the shapes and strides info for subviews: don't think
> of it as reallocating or altering the shape of the underlying buffer.
> Think of it as having separate shape and stride information for the
> contents of the underlying buffer

I'm trying to be practical. If the additional shape/strides storage is
in Py_buffer, it can benefit anyone wanting to create a subview or doint
to do something else with shape and strides. If the storage is in
memoryview, it only benefits memoryview.

> I also think it is worth considering changing the memoryview to also
> take start/stop/step arguments in addition to the object to be viewed
> (initially only supporting step=1, just like slicing, but we should be
> able to lift that limitation as the implementation matures). The
> FromBuffer C-level constructor could probably go away at that point.

Could you open a separate bug for this? I agree simplifications are
welcome. As for the additional arguments, if we have proper slicing
support I'm not sure they are really useful (at the Python level at least).
msg77629 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2008-12-11 20:37
> I'm trying to be practical. If the additional shape/strides storage is
> in Py_buffer, it can benefit anyone wanting to create a subview or doint
> to do something else with shape and strides. If the storage is in
> memoryview, it only benefits memoryview.

I'm not sure I follow that - no matter where they're stored, doing
PyObject_GetBuffer on the memoryview should use the *memoryview's*
shape/stride information rather than that of the underlying object.
That's how getting this right in memoryview can help third parties: to
get a subview of something else, take a memoryview of it, slice the
memoryview, then get the Py_buffer information from the memoryview slice.

Hence my opinion that it makes sense to have an easy API (at least at
the C level) to create a memoryview given both an original object and
start/stop/step information - not only will it be useful internally for
the implementation of proper multi-dimensional slicing support in
memoryview itself, but it will also be convenient for third parties
trying to use or support the PEP 3118 protocol.

I agree wholeheartedly with your complaints about Py_buffer and the
PyObject_GetBuffer protocol being somewhat awkward to use. However, I
don't agree that it is a significant problem. One focus for the
underlying protocol design was on being fairly minimalist while still
providing sufficient expressiveness to cover all of the use cases
identified by the number crunching crowd - as Greg pointed out on
python-dev, it's best to think of Py_buffer's role in the protocol as
merely a location to store a fairly large set of output variables
describing the internal storage of the object supplying the data buffer.

The convenience layer that you're after (the one that will make the PEP
3118 API simpler to use by supporting normal PyObject semantics and easy
creation of subviews) should actually *be* memoryview, but the
implementation isn't there yet.
msg77647 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-11 23:36
> > I'm trying to be practical. If the additional shape/strides storage is
> > in Py_buffer, it can benefit anyone wanting to create a subview or doint
> > to do something else with shape and strides. If the storage is in
> > memoryview, it only benefits memoryview.
> 
> I'm not sure I follow that

Not everyone wanting to do something with a Py_buffer constructs a
memoryview, so it sounds sensible to give Py_buffer users as much power
as memoryview users.

Otherwise, let's kill Py_buffer as an independent struct, keep
memoryview as the only building block of the buffer API... something
which I totally agree with (but apparently others don't).

> it's best to think of Py_buffer's role in the protocol as
> merely a location to store a fairly large set of output variables
> describing the internal storage of the object supplying the data buffer.

And I think I've made my point that it's a broken design principle,
leading to two types instead of one.

May I remind you that:
  Simple is better than complex.
  Complex is better than complicated.
?
msg77662 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2008-12-12 09:35
I still don't see the problem:

Py_buffer is a minimal description of a region of memory. No more, no
less. It doesn't *do* anything, it's just a description.

memoryview is an object that will (when it's finished) provide an easy
way to access the memory of another object.

Having a fairly raw low level API
(Py_buffer/PyObject_GetBuffer/PyBuffer_Release) and a higher level, more
convenient, but higher overhead API (memoryview) seems like a perfectly
natural split to me. Multi-level APIs are a great way of freeing the
more abstract interface from the need to handle every single possible
edge case - we can quite reasonably tell people that care about the more
obscure edge cases to drop down a level and handle them with the raw API.

We just need to finish the job of making the convenience wrapper
actually able to handle the task we would like it to handle. Your latest
patch goes a long way towards doing that for the 1 dimensional case - if
Travis doesn't get to it first, I plan on taking a look a closer look at
it this weekend.
msg77663 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2008-12-12 09:47
Trying another way of getting the point across... if Py_buffer wasn't
defined in PEP 3118, there would have to be some *other* API there
whereby the memoryview implementation could ask an object for a
description of the data layout of the object's buffer.

It is Py_buffer which allows the implementation of memoryview to even be
contemplated.

The one point where the PEP is silent (and I think this might be what
you've been attempting to get at) is as to whether or not it is OK for a
client to *modify* the contents of the Py_buffer structure before
passing it back in to PyBuffer_Release.

And to that I would say that a robust type implementation shouldn't be
assuming *anything* about the contents of the Py_buffer passed to
PyBuffer_Release. If it needs to do more than decrement a reference
count when the buffer is released (e.g. releasing memory allocated for
shapes and strides information), then it should be doing its own
internal bookkeeping based on the address of the Py_buffer object
originally passed to PyObject_GetBuffer.
msg77671 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-12 14:27
> We just need to finish the job of making the convenience wrapper
> actually able to handle the task we would like it to handle. Your latest
> patch goes a long way towards doing that for the 1 dimensional case - if
> Travis doesn't get to it first, I plan on taking a look a closer look at
> it this weekend.

Ok, thanks. I hope we can at least solve this bunch of severe problems
before 3.0.1 :-)
msg77927 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-16 20:33
Any news?
msg78465 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-29 17:26
Nick, sorry for waving at you again, but do you have time for a review?
Otherwise, is it ok if I commit the patch as is?
(it solves the problems and there's no API or feature change anyway)
msg78929 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2009-01-03 03:55
Finally got around to looking at the most recent patch
(issue4580-2.patch). It looks good to me.

I still think something needs to change when it comes to item assignment
for memoryviews with itemsize > 1, but that's not as urgent as fixing
the rest of the slicing behaviour (and probably should be a different
issue).
msg78977 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-03 17:11
The patch is committed in r68200, r68202. Thanks!
History
Date User Action Args
2022-04-11 14:56:42adminsetnosy: + benjamin.peterson
github: 48830
2009-01-03 17:14:34pitrousetstatus: open -> closed
resolution: accepted -> fixed
2009-01-03 17:11:33pitrousetmessages: + msg78977
2009-01-03 13:00:30loewissetassignee: pitrou
resolution: accepted
2009-01-03 12:59:58loewissetkeywords: - needs review
2009-01-03 03:55:51ncoghlansetmessages: + msg78929
2008-12-29 17:26:44pitrousetmessages: + msg78465
2008-12-19 22:45:20barrysetpriority: high -> release blocker
2008-12-16 20:33:56pitrousetmessages: + msg77927
2008-12-12 14:27:37pitrousetmessages: + msg77671
2008-12-12 09:47:03ncoghlansetmessages: + msg77663
2008-12-12 09:36:00ncoghlansetmessages: + msg77662
2008-12-11 23:36:36pitrousetmessages: + msg77647
2008-12-11 20:37:20ncoghlansetmessages: + msg77629
2008-12-11 12:43:04pitrousetmessages: + msg77606
2008-12-10 21:02:04ncoghlansetmessages: + msg77568
2008-12-10 20:54:27belopolskysetnosy: + belopolsky
2008-12-10 20:47:37ncoghlansetmessages: + msg77567
2008-12-10 17:30:03pitrousetfiles: + issue4580-2.patch
messages: + msg77560
2008-12-10 16:20:35pitrousetmessages: + msg77554
2008-12-10 16:16:59pitrousetmessages: + msg77553
2008-12-10 15:33:42teoliphantsetmessages: + msg77552
2008-12-10 15:13:34teoliphantsetmessages: + msg77548
2008-12-10 15:05:04teoliphantsetmessages: + msg77547
2008-12-10 12:57:39pitrousetmessages: + msg77538
2008-12-10 11:09:39ncoghlansetmessages: + msg77534
2008-12-10 11:07:37ncoghlansetmessages: + msg77533
2008-12-09 23:21:16pitrousetmessages: + msg77470
2008-12-09 21:12:35ncoghlansetmessages: + msg77456
2008-12-09 15:03:27pitrousetfiles: + issue4580ws.patch
messages: + msg77419
2008-12-09 13:37:50ncoghlansetnosy: + ncoghlan
messages: + msg77413
2008-12-09 12:29:16amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg77408
2008-12-08 23:53:42pitrousetnosy: + teoliphant
2008-12-08 23:41:28pitrousetkeywords: + patch, needs review
files: + issue4580.patch
messages: + msg77367
stage: test needed -> patch review
2008-12-07 20:31:46pitroucreate