Issue 4580: slicing of memoryviews when itemsize != 1 is wrong

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/48830

classification

Title:	slicing of memoryviews when itemsize != 1 is wrong
Type:	behavior	Stage:	patch review
Components:	Interpreter Core	Versions:	Python 3.0, Python 3.1

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	pitrou	Nosy List:	amaury.forgeotdarc, belopolsky, benjamin.peterson, ncoghlan, pitrou, teoliphant
Priority:	release blocker	Keywords:	patch

Created on 2008-12-07 20:31 by pitrou, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
issue4580.patch	pitrou, 2008-12-08 23:41
issue4580ws.patch	pitrou, 2008-12-09 15:03
issue4580-2.patch	pitrou, 2008-12-10 17:30

Messages (28)
msg77248 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-12-07 20:31
The problem is with the `shape` member which doesn't get recalculated as it should when instantiating a memoryview slice: >>> a = array('i', range(10)) >>> m = memoryview(a)[2:8] >>> a array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) >>> m[:] = array('i', range(6)) Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: cannot modify size of memoryview object >>> len(m) 6 >>> m.shape (10,) An additional problem is that `shape` is a pointer to an array of integers, and we don't know how to reallocate it.
msg77367 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-12-08 23:41
Here is patch, using the "small storage at the end of Py_buffer" idea. It adds a lot of unit tests as well.
msg77408 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2008-12-09 12:29
Please, can you provide a patch that does not change whitespace everywhere? Even if these files use indentation inconsistently; the patch will be smaller and much easier to proofread. (I vaguely remember a document saying that it's better to separate whitespace change from real development, but I cannot find it.)
msg77413 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2008-12-09 13:37
Copy & paste from python-dev post: The problem with memoryview appears to be related to the way it calculates its own length (since that is the check that is failing when the view blows up): >>> a = array('i', range(10)) >>> m = memoryview(a) >>> len(m) # This is the length in bytes, which is WRONG! 40 >>> m2 = memoryview(a)[2:8] >>> len(m2) # This is correct 6 >>> a2 = array('i', range(6)) >>> m[:] = a # But this works >>> m2[:] = a2 # and this does not Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: cannot modify size of memoryview object >>> len(memoryview(a2)) # Ah, 24 != 6 is our problem! 24 Looks to me like there are a couple of bugs here: The first is that memoryview is treating the len field in the Py_buffer struct as the number of objects in the view in a few places instead of as the total number of bytes being exposed (it is actually the latter, as defined in PEP 3118). The second is that the getbuf implementation in array.array is broken. It is ONLY OK for shape to be null when ndim=0 (i.e. a scalar value). An array is NOT a scalar value, so the array objects should be setting the shape pointer to point to an single item array (where shape[0] is the length of the array). memoryview can then be fixed to use shape[0] instead of len to get the number of objects in the view.
msg77419 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-12-09 15:03
Sorry for the whitespace changes, here is a patch that has less of them... hope that helps ;-S
msg77456 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2008-12-09 21:12
The patch still gets the length of memory view objects wrong - it just makes it consistently wrong so that the specific assignment being tested appears to work. Note the following behaviour with the current memoryview : >>> from array import array >>> a10 = array('i', range(10)) >>> a6 = array('i', range(6)) >>> m10 = memoryview(a10) >>> m6 = memoryview(a6) >>> m = m10[2:8] >>> len(m10) 40 >>> len(m6) 24 >>> len(m) 6 The patch will change the last item to 24 which will make the assignment work, but it's still not the right answer! As it turns out though, in addition to the length problems, this is far from the only problem with memory view assignment - the damn things are so finicky, it's practically impossible to assign anything to them: >>> a = array('i', range(10)) >>> m = memoryview(a) >>> m[1] = 1 # An integer perhaps? Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: 'int' does not have the buffer interface >>> m[1] = array('i', range(1)) # A single valued array? Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: cannot modify size of memoryview object >>> m[1] = b"1234" # A bytes object of the correct size? Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: mismatching item sizes for "array.array" and "bytes" >>> m[:1] = [0] # How about a sequence? Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: 'list' does not have the buffer interface >>> m[1] = memoryview(array('i', range(1)))[:1] # Force the returned 'len' to be wrong (but right for what memoryview is going to do with it) and the assignment works I guess fixing most of those qualify as enhancements though, as opposed to the length calculation which is simply an outright bug. So... 1. memoryview needs to be fixed so that internally self->view.len is always the length in bytes, even after taking a slice of the view 2. memoryview needs to be fixed so that len() reflects the number of items in the view (i.e. self->view.shape[0]), not the total number of bytes 3. array.array needs to be fixed so that it populates the shape array (a pointer to the arrays own internal length field should be adequate) 4. PyObject_GetBuffer needs to raise an exception if it sees ndim > 0 with a NULL shape array in the resulting Py_buffer. The only time shape=NULL is allowed is when ndim=0 (i.e. scalar values) because at that point the length of the shape array is 0. The /* XXX / comment in the memoryview code about shape not being required by PEP 3118 needs to go away because it is* required - array.array is just broken in not providing it.
msg77470 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-12-09 23:21
Hi Nick, > 1. memoryview needs to be fixed so that internally self->view.len is > always the length in bytes, even after taking a slice of the view This is in my patch, unless I'm missing something? > 2. memoryview needs to be fixed so that len() reflects the number of > items in the view (i.e. self->view.shape[0]), not the total number of bytes Why should that be? There's nothing suggesting that in the PEP. On the other hand I agree it looks a bit higher-level. > 3. array.array needs to be fixed so that it populates the shape array (a > pointer to the arrays own internal length field should be adequate) array.array is broken but that's a separate issue IMO (see e.g. #4509). This issue is for fixing memoryview. > 4. PyObject_GetBuffer needs to raise an exception if it sees ndim > 0 > with a NULL shape array in the resulting Py_buffer. I'm for it, except that we must make sure that nobody really relies on the current behaviour. I'm not sure there's only array.array.
msg77533 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2008-12-10 11:07
The reason memoryview's current len() implementation is wrong is because its indexing is per element in the original object, not per byte: >>> a = array('i', range(10)) >>> m = memoryview(a) >>> for i in range(len(m)): ... print(m[i]) ... b'\x00\x00\x00\x00' b'\x01\x00\x00\x00' b'\x02\x00\x00\x00' b'\x03\x00\x00\x00' b'\x04\x00\x00\x00' b'\x05\x00\x00\x00' b'\x06\x00\x00\x00' b'\x07\x00\x00\x00' b'\x08\x00\x00\x00' b'\t\x00\x00\x00' Traceback (most recent call last): File "<stdin>", line 2, in <module> IndexError: index out of bounds Oops.
msg77534 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2008-12-10 11:09
Regarding ndim > 0 and shape=NULL, the PEP doesn't allow it. Any code that does it is already broken - we're just making it fail cleanly at the point where it is broken rather than in some obscure fashion somewhere else.
msg77538 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-12-10 12:57
Thanks for your two messages, Nick, they make things clearer for me and I'll make the necessary modifications in the patch some time today.
msg77547 - (view)	Author: Travis Oliphant (teoliphant) *	Date: 2008-12-10 15:05
I will take some time in the next week to look at this issue. Thank you for bringing these bugs to my attention. I did not finish the memoryview implementation, due to other demands on my time. I appreciate the efforts of so many to pick up and understand the PEP and contribute code. I've noticed some incorrect statements being passed around, and I apologize for not providing a perspective on them. I hope there is a clear distinction between the buffer protocol and the memoryview object. The buffer protocol was thought through fairly well, but the memoryview object has not received the same attention. My mental model of the memoryview object is basically an (enhanced) NumPy array without the math support. It does not suprise me that there are a few remaining issues in the memoryview object. However, I'm confident they can be cleared up. A few changes to the code were made by others without changing the PEP. I accept responsibility for this because people were cleaning up what I should have already finished. All the help is appreciated. I will review the patches and fix the problem.
msg77548 - (view)	Author: Travis Oliphant (teoliphant) *	Date: 2008-12-10 15:13
My perspective on statements made: * Memoryview object should report it's length as self->view.shape[0] unless self->view.shape is NULL (indicating in this case a 0-d array or scalar). In this case, it should raise an error. * The buffer protocol is clear about who owns the memory for shape and strides (and suboffsets). The exporter does and it is responsible for not changing them until releasebuffer is called. * It should also be clear that shape, strides, and suboffsets will be NULL in exactly two cases 1) The corresponding flag was not set indicating the consumer is not interested in shape, strides, or suboffsets 2) ndim == 0 indicating a 0-dimensional array (scalar-like).
msg77552 - (view)	Author: Travis Oliphant (teoliphant) *	Date: 2008-12-10 15:33
Another comment on statements made * I don't see where array.array getbuf implementation is broken. It looks correct to me. It sets view.shape to NULL unless the consumer asked for the shape information to be reported in which case it sets it equal to a pointer to the number of elements in the array.
msg77553 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-12-10 16:16
> * I don't see where array.array getbuf implementation is broken. It > looks correct to me. It sets view.shape to NULL unless the consumer > asked for the shape information to be reported in which case it sets it > equal to a pointer to the number of elements in the array. What I meant by that is that array.array frequently bypasses its own resize helper and calls PyMem_Resize directly instead. This is the likely cause of #4583.
msg77554 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-12-10 16:20
Thanks for your comments. I'll provide an updated patch later today and let you take a look. > * Memoryview object should report it's length as self->view.shape[0] > unless self->view.shape is NULL (indicating in this case a 0-d array or > scalar). In this case, it should raise an error. Are 0-d arrays (scalars) really useful? Could we make things simpler by simply removing support for them? > * The buffer protocol is clear about who owns the memory for shape and > strides (and suboffsets). The exporter does and it is responsible for > not changing them until releasebuffer is called. But how do we do when slicing a memoryview? The new (sliced) memoryview object clearly must alter the shape after the buffer has been filled in by the exporter.
msg77560 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-12-10 17:29
New patch addressing the above issues. You can also review it at http://codereview.appspot.com/10441
msg77567 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2008-12-10 20:47
It's quite possible that I misread the array.array implementation since I only glanced at the code where it fills in the Py_buffer details - I saw the point where it set shape to NULL, but there may have been code later on to fill it in correctly that I missed. Regardless, a coupe of sanity checks on the Py_buffer contents before returning from PyObject_GetBuffer may still be useful to avoid the need for buffer protocol consumers to repeat them all the time (I see this as similar to the type-correctness and range checks we do for things like the index protocol). Antoine, regarding the shapes and strides info for subviews: don't think of it as reallocating or altering the shape of the underlying buffer. Think of it as having separate shape and stride information for the contents of the underlying buffer (i.e. what memoryview gets back from the PyObject_GetBuffer call) and for the view it exposes through indexing and the like (i.e. taking into account the start and stop offsets from the slicing operation). This is what I mean when I say that the current memoryview relies too heavily on the Py_buffer that describes the underlying object - the memoryview needs to maintain its own state outside that description. It just so happens that fully describing an arbitrary fragment of the underlying buffer will take a number of the fields that would exist in a separate Py_buffer struct (specifically len, ndim, shape, strides and offsets if we want to cover the non-contiguous and ndim > 1 cases - for the contiguous 1-D cases, we don't needs strides or offsets). Once we consider that aspect, I think it makes the most sense to just maintain 2 Py_buffer instances inside the memoryview - one that describes the underlying buffer, and one that describes the memoryview itself after slicing is taken into account. I also think it is worth considering changing the memoryview to also take start/stop/step arguments in addition to the object to be viewed (initially only supporting step=1, just like slicing, but we should be able to lift that limitation as the implementation matures). The FromBuffer C-level constructor could probably go away at that point.
msg77568 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2008-12-10 21:02
Antoine's latest patch looks to me like it will fix the current actual errors in the 1-D case. Something that may also be considered part of this bug is the fact that "m[i] = m[i]" doesn't currently work for itemsize > 1. There is a XXX comment in the memory view implementation as to whether we should require "itemsize == itemsize" in addition to "len == len" when replacing sections of a memoryview. I think the failure of self-assignment shows that we should only be caring about whether or not the total memory size being overwritten matches, rather than the underlying format or size. Alternatively, we could just special case it and say that assignemnt from buffers with itemsize=1 (i.e. raw bytes) is always acceptable, regardless of the itemsize of the buffer underlying the memoryview.
msg77606 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-12-11 12:43
> Antoine, regarding the shapes and strides info for subviews: don't think > of it as reallocating or altering the shape of the underlying buffer. > Think of it as having separate shape and stride information for the > contents of the underlying buffer I'm trying to be practical. If the additional shape/strides storage is in Py_buffer, it can benefit anyone wanting to create a subview or doint to do something else with shape and strides. If the storage is in memoryview, it only benefits memoryview. > I also think it is worth considering changing the memoryview to also > take start/stop/step arguments in addition to the object to be viewed > (initially only supporting step=1, just like slicing, but we should be > able to lift that limitation as the implementation matures). The > FromBuffer C-level constructor could probably go away at that point. Could you open a separate bug for this? I agree simplifications are welcome. As for the additional arguments, if we have proper slicing support I'm not sure they are really useful (at the Python level at least).
msg77629 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2008-12-11 20:37
> I'm trying to be practical. If the additional shape/strides storage is > in Py_buffer, it can benefit anyone wanting to create a subview or doint > to do something else with shape and strides. If the storage is in > memoryview, it only benefits memoryview. I'm not sure I follow that - no matter where they're stored, doing PyObject_GetBuffer on the memoryview should use the memoryview's shape/stride information rather than that of the underlying object. That's how getting this right in memoryview can help third parties: to get a subview of something else, take a memoryview of it, slice the memoryview, then get the Py_buffer information from the memoryview slice. Hence my opinion that it makes sense to have an easy API (at least at the C level) to create a memoryview given both an original object and start/stop/step information - not only will it be useful internally for the implementation of proper multi-dimensional slicing support in memoryview itself, but it will also be convenient for third parties trying to use or support the PEP 3118 protocol. I agree wholeheartedly with your complaints about Py_buffer and the PyObject_GetBuffer protocol being somewhat awkward to use. However, I don't agree that it is a significant problem. One focus for the underlying protocol design was on being fairly minimalist while still providing sufficient expressiveness to cover all of the use cases identified by the number crunching crowd - as Greg pointed out on python-dev, it's best to think of Py_buffer's role in the protocol as merely a location to store a fairly large set of output variables describing the internal storage of the object supplying the data buffer. The convenience layer that you're after (the one that will make the PEP 3118 API simpler to use by supporting normal PyObject semantics and easy creation of subviews) should actually be memoryview, but the implementation isn't there yet.
msg77647 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-12-11 23:36
> > I'm trying to be practical. If the additional shape/strides storage is > > in Py_buffer, it can benefit anyone wanting to create a subview or doint > > to do something else with shape and strides. If the storage is in > > memoryview, it only benefits memoryview. > > I'm not sure I follow that Not everyone wanting to do something with a Py_buffer constructs a memoryview, so it sounds sensible to give Py_buffer users as much power as memoryview users. Otherwise, let's kill Py_buffer as an independent struct, keep memoryview as the only building block of the buffer API... something which I totally agree with (but apparently others don't). > it's best to think of Py_buffer's role in the protocol as > merely a location to store a fairly large set of output variables > describing the internal storage of the object supplying the data buffer. And I think I've made my point that it's a broken design principle, leading to two types instead of one. May I remind you that: Simple is better than complex. Complex is better than complicated. ?
msg77662 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2008-12-12 09:35
I still don't see the problem: Py_buffer is a minimal description of a region of memory. No more, no less. It doesn't do anything, it's just a description. memoryview is an object that will (when it's finished) provide an easy way to access the memory of another object. Having a fairly raw low level API (Py_buffer/PyObject_GetBuffer/PyBuffer_Release) and a higher level, more convenient, but higher overhead API (memoryview) seems like a perfectly natural split to me. Multi-level APIs are a great way of freeing the more abstract interface from the need to handle every single possible edge case - we can quite reasonably tell people that care about the more obscure edge cases to drop down a level and handle them with the raw API. We just need to finish the job of making the convenience wrapper actually able to handle the task we would like it to handle. Your latest patch goes a long way towards doing that for the 1 dimensional case - if Travis doesn't get to it first, I plan on taking a look a closer look at it this weekend.
msg77663 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2008-12-12 09:47
Trying another way of getting the point across... if Py_buffer wasn't defined in PEP 3118, there would have to be some other API there whereby the memoryview implementation could ask an object for a description of the data layout of the object's buffer. It is Py_buffer which allows the implementation of memoryview to even be contemplated. The one point where the PEP is silent (and I think this might be what you've been attempting to get at) is as to whether or not it is OK for a client to modify the contents of the Py_buffer structure before passing it back in to PyBuffer_Release. And to that I would say that a robust type implementation shouldn't be assuming anything about the contents of the Py_buffer passed to PyBuffer_Release. If it needs to do more than decrement a reference count when the buffer is released (e.g. releasing memory allocated for shapes and strides information), then it should be doing its own internal bookkeeping based on the address of the Py_buffer object originally passed to PyObject_GetBuffer.
msg77671 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-12-12 14:27
> We just need to finish the job of making the convenience wrapper > actually able to handle the task we would like it to handle. Your latest > patch goes a long way towards doing that for the 1 dimensional case - if > Travis doesn't get to it first, I plan on taking a look a closer look at > it this weekend. Ok, thanks. I hope we can at least solve this bunch of severe problems before 3.0.1 :-)
msg77927 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-12-16 20:33
Any news?
msg78465 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-12-29 17:26
Nick, sorry for waving at you again, but do you have time for a review? Otherwise, is it ok if I commit the patch as is? (it solves the problems and there's no API or feature change anyway)
msg78929 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2009-01-03 03:55
Finally got around to looking at the most recent patch (issue4580-2.patch). It looks good to me. I still think something needs to change when it comes to item assignment for memoryviews with itemsize > 1, but that's not as urgent as fixing the rest of the slicing behaviour (and probably should be a different issue).
msg78977 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-01-03 17:11
The patch is committed in r68200, r68202. Thanks!

History
Date	User	Action	Args
2022-04-11 14:56:42	admin	set	nosy: + benjamin.peterson github: 48830
2009-01-03 17:14:34	pitrou	set	status: open -> closed resolution: accepted -> fixed
2009-01-03 17:11:33	pitrou	set	messages: + msg78977
2009-01-03 13:00:30	loewis	set	assignee: pitrou resolution: accepted
2009-01-03 12:59:58	loewis	set	keywords: - needs review
2009-01-03 03:55:51	ncoghlan	set	messages: + msg78929
2008-12-29 17:26:44	pitrou	set	messages: + msg78465
2008-12-19 22:45:20	barry	set	priority: high -> release blocker
2008-12-16 20:33:56	pitrou	set	messages: + msg77927
2008-12-12 14:27:37	pitrou	set	messages: + msg77671
2008-12-12 09:47:03	ncoghlan	set	messages: + msg77663
2008-12-12 09:36:00	ncoghlan	set	messages: + msg77662
2008-12-11 23:36:36	pitrou	set	messages: + msg77647
2008-12-11 20:37:20	ncoghlan	set	messages: + msg77629
2008-12-11 12:43:04	pitrou	set	messages: + msg77606
2008-12-10 21:02:04	ncoghlan	set	messages: + msg77568
2008-12-10 20:54:27	belopolsky	set	nosy: + belopolsky
2008-12-10 20:47:37	ncoghlan	set	messages: + msg77567
2008-12-10 17:30:03	pitrou	set	files: + issue4580-2.patch messages: + msg77560
2008-12-10 16:20:35	pitrou	set	messages: + msg77554
2008-12-10 16:16:59	pitrou	set	messages: + msg77553
2008-12-10 15:33:42	teoliphant	set	messages: + msg77552
2008-12-10 15:13:34	teoliphant	set	messages: + msg77548
2008-12-10 15:05:04	teoliphant	set	messages: + msg77547
2008-12-10 12:57:39	pitrou	set	messages: + msg77538
2008-12-10 11:09:39	ncoghlan	set	messages: + msg77534
2008-12-10 11:07:37	ncoghlan	set	messages: + msg77533
2008-12-09 23:21:16	pitrou	set	messages: + msg77470
2008-12-09 21:12:35	ncoghlan	set	messages: + msg77456
2008-12-09 15:03:27	pitrou	set	files: + issue4580ws.patch messages: + msg77419
2008-12-09 13:37:50	ncoghlan	set	nosy: + ncoghlan messages: + msg77413
2008-12-09 12:29:16	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg77408
2008-12-08 23:53:42	pitrou	set	nosy: + teoliphant
2008-12-08 23:41:28	pitrou	set	keywords: + patch, needs review files: + issue4580.patch messages: + msg77367 stage: test needed -> patch review
2008-12-07 20:31:46	pitrou	create