classification
Title: Tighten definition of bytes-like objects
Type: enhancement Stage: resolved
Components: Documentation Versions: Python 3.6, Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: docs@python, ezio.melotti, martin.panter, pitrou, python-dev, r.david.murray, serhiy.storchaka, skrah
Priority: normal Keywords: patch

Created on 2015-03-24 08:25 by martin.panter, last changed 2015-08-08 12:37 by skrah. This issue is now closed.

Files
File name Uploaded Description
c-contig.patch martin.panter, 2015-04-01 11:33
c-contig.v2.patch martin.panter, 2015-04-03 11:37
c-contig.v3.patch martin.panter, 2015-07-29 04:34
Messages (23)
msg239097 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-03-24 08:25
There are moves toward documenting and implementing support for “bytes-like” objects in more APIs, such as the “io” module (Issue 20699) and http.client (Issue 23740). The glossary definition is currently “An object that supports the Buffer Protocol, like bytes, bytearray or memoryview.” This was originally added for Issue 16518. However, after reading Issue 23688, I realized that it should probably not mean absolutely _any_ object supporting the buffer protocol. For instance:

>>> import sys
>>> reverse_view = memoryview(b"123")[::-1]
>>> sys.stdout.buffer.write(reverse_view)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
BufferError: memoryview: underlying buffer is not C-contiguous

I think the definition should at least be tightened to only objects with a contiguous buffer, and “contiguous” should be defined (probably in the linked C API page or the memoryview.contiguous flag definition, not the glossary). So far, my understanding is these are contiguous:

* A zero-dimensional object, such as a ctypes object
* A multi-dimensional array with items stored contiguously in order of increasing indexes, i.e. a_2,2 is stored somewhere after both a_1,2 and a_2,1, and all the strides are positive.

and these are not contiguous:

* memoryview(contiguous)[::2], because there are memory gaps between the items
* memoryview(contiguous)[::-1], despite there being no gaps nor overlapping items
* Views that set the “suboffsets” field (i.e. include pointers to further memory)
* Views where different array items overlap each other (e.g. 0 in view.strides)
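The one-dimensional cases in the two lists above can be checked directly with the flags that memoryview exposes. This is only a small sketch; the variable names are illustrative, and it does not cover suboffsets or overlapping items.

```python
# Contiguity flags reported by memoryview for views over the same bytes object.
data = b"123456"

whole = memoryview(data)              # plain 1-D view, strides == (1,)
every_other = memoryview(data)[::2]   # memory gaps between items, strides == (2,)
reverse = memoryview(data)[::-1]      # no gaps, but strides == (-1,)

for name, view in [("whole", whole),
                   ("every_other", every_other),
                   ("reverse", reverse)]:
    # Print the strides next to the generic and the C-specific flag.
    print(name, view.strides, view.contiguous, view.c_contiguous)
```

Only the unsliced view reports itself as contiguous; the stepped and reversed views do not, even though the reversed one has no gaps.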

Perhaps the bytes-like definition should be tightened further, to match the above error message, to only “C-contiguous” buffers. I understand that C-contiguous means the strides tuple has to be in non-strict decreasing order, e.g. for 2 × 1 × 3 arrays, strides == (3, 3, 1) is C-contiguous, but strides == (1, 3, 3) is not. This also needs documenting.
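As a rough illustration of where strides such as (3, 3, 1) come from, the sketch below derives the strides of a C-contiguous layout from a shape and compares them with what memoryview reports. The helper name c_contiguous_strides is hypothetical, not part of any library.

```python
def c_contiguous_strides(shape, itemsize=1):
    """Return the strides (in bytes) of a C-contiguous array of this shape.

    The last dimension steps by one item; each earlier dimension steps by
    the total size of all the dimensions after it.
    """
    strides = []
    step = itemsize
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))


print(c_contiguous_strides((2, 1, 3)))  # (3, 3, 1)

# Cross-check against a real C-contiguous view of 6 bytes shaped 2 x 1 x 3.
view = memoryview(bytes(6)).cast("B", [2, 1, 3])
print(view.strides, view.c_contiguous)
```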

I’m not so sure about these, but the definition could be tightened even further:

* Require memoryview(x).cast("B") to be supported. Otherwise, native Python code would have to use workarounds like struct.pack_into() to write to the “bytes-like” object. See Issue 15944.
* Require len(view) == view.nbytes. This would help avoid a bug I have seen, where code naively calls len(data) to get the size in bytes. The downside is that ctypes objects would no longer be considered bytes-like objects.
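To make the last two points concrete, here is a small sketch using array.array, whose items are wider than one byte. The helper byte_view is hypothetical; it simply wraps memoryview(x).cast("B").

```python
from array import array


def byte_view(obj):
    """Return a flat view of obj's buffer as unsigned bytes (hypothetical helper)."""
    return memoryview(obj).cast("B")


words = array("I", [1, 2, 3])
view = memoryview(words)

# len() counts items, nbytes counts bytes, so the two differ here.
print(len(words), len(view), view.nbytes)

# After casting to "B", the length and the byte size agree.
flat = byte_view(words)
print(len(flat) == flat.nbytes)
```

Code that calls len(data) on such an object gets the item count, not the byte count, which is the kind of mistake the second requirement would rule out.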
msg239101 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-03-24 09:20
Totally agree. Current definition is too wide. Actually in different places the term "bytes-like object" can imply different requirements.

* Supports buffer protocol. list isn't.

* Contiguous. memoryview()[::2] isn't.

* len() returns bytes size. array('I') isn't.

* Supported indexing (and slicing) of bytes. array('I') isn't.

* Indexing returns integers in the range 0-255. array('b') isn't.

* Supports concatenation. memoryview isn't.

* Supports common bytes and bytearray methods, such as find() or lower().

* A subclass of (bytes, bytearray).

* A subclass of bytes.

* A bytes itself.
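A few of the levels in this list can be probed from Python. This is only a sketch; the helper names supports_buffer and concatenates are hypothetical and were invented for the illustration.

```python
from array import array


def supports_buffer(obj):
    """True if memoryview() accepts obj, i.e. it exports the buffer protocol."""
    try:
        memoryview(obj).release()
    except TypeError:
        return False
    return True


def concatenates(obj, tail=b"cd"):
    """True if obj + tail works, as it does for bytes and bytearray."""
    try:
        obj + tail
    except TypeError:
        return False
    return True


print(supports_buffer([1, 2, 3]))          # a list exports no buffer
print(supports_buffer(array("I", [1])))    # an array does

signed = array("b", [-1])
print(signed[0], bytes(signed))            # indexing gives -1, outside 0-255

print(concatenates(b"ab"), concatenates(memoryview(b"ab")))
```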
msg239136 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-03-24 15:00
This is not dissimilar to the problem we have with "file like object" or "file object".  The requirements on them are not consistent.  I'm not sure what the solution is in either case, but for file like objects we have decided to ignore the issue, I think.  (ie: what exactly a file like object needs to support in order to work with a given API depends on that API, and we just live with that).
msg239783 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-04-01 11:33
After doing a bit of reading and experimenting, I think we should at least restrict bytes-like objects to “C-contiguous”. Any looser definition risks memoryview(byteslike).tobytes() returning the bytes in a different order than their true in-memory order. Fortran-style contiguous arrays aren’t enough:

>>> import _testbuffer, sys
>>> fortran = memoryview(_testbuffer.ndarray([11, 12, 21, 22], format="B", flags=0, shape=[2, 2], strides=[1, 2], offset=0))
>>> fortran.f_contiguous
True
>>> fortran.c_contiguous
False
>>> fortran.tolist()
[[11, 21], [12, 22]]
>>> tuple(bytes(fortran))
(11, 21, 12, 22)
>>> sys.stdout.buffer.write(fortran)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
BufferError: memoryview: underlying buffer is not C-contiguous

So I am proposing a patch which:

* Restricts the bytes-like object definition to C-contiguous buffers
* Explains what I think is actually meant by “contiguous” in the C API buffer protocol page. Turns out it is generally a more strict definition than I originally assumed.
* Explains why memoryview.tobytes() is out of order for non C-contiguous buffers
* Has a couple of other fixes taking into account that memoryview.tolist() doesn’t work for zero dimensions, and is nested for more than one dimension
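The session above relies on _testbuffer, a module from CPython's own test suite that may not be installed. The reordering can also be sketched in pure Python by simulating how items are located through shape and strides. The function name items_in_logical_order is hypothetical.

```python
from itertools import product


def items_in_logical_order(memory, shape, strides):
    """Read items in C (row-major) index order, the order tolist() presents them."""
    items = []
    # product() varies the last index fastest, which is C index order.
    for index in product(*(range(n) for n in shape)):
        offset = sum(i * s for i, s in zip(index, strides))
        items.append(memory[offset])
    return items


memory = bytes([11, 12, 21, 22])

# Fortran layout: the first index varies fastest in memory.
fortran_order = items_in_logical_order(memory, shape=(2, 2), strides=(1, 2))
# C layout: the last index varies fastest in memory.
c_order = items_in_logical_order(memory, shape=(2, 2), strides=(2, 1))

print(fortran_order)  # [11, 21, 12, 22]
print(c_order)        # [11, 12, 21, 22]
```

Only with C strides does the logical order match the raw order of the bytes in memory.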
msg239784 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-04-01 11:44
+1 for the idea overall and the patch LGTM.
msg239795 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2015-04-01 12:35
I have a somewhat general concern: In the last year or so, issues seem
to expand far beyond the scope that's indicated by the issue title.

For example, I don't particularly care about the definition of
"bytes-like", but the patch contains changes to areas I *do* care
about.

I don't think all of the changes are an improvement: What is
the "flattened length", why does C-contiguous have to be explained?
msg239815 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-04-01 14:01
I found the explanation of C-contiguous vs Fortran-contiguous helpful (and I've programmed in both of those languages, though granted not much :).  However, based on that it is not obvious to me why having a fortran-contiguous buffer prevents it from being used in the bytes-like object contexts (though granted the order might be surprising to someone who is not thinking about the memory ordering and just assuming C).

I don't have much of an opinion on the other non-glossary-entry changes, but at a quick read I'm not sure how much clarity they add, if any.
msg239816 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-04-01 14:05
Oh, and about the general concern: I agree that this issue was apparently about the glossary entry, so making other changes is suspect and at a minimum requires adding relevant people from the experts list to nosy to get proper review of other proposed changes.
msg239817 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-04-01 14:24
Which people are needed? The patch looks like a great improvement to me.
msg239818 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-04-01 14:43
Stefan, since he's the current maintainer of the memoryview implementation.  Fortunately he spotted the issue :)  Maybe Antoine, too; he's done work in this area.  I'll add him.
msg239819 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-04-01 15:06
See also issue23376.
msg239851 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-04-01 21:37
I will pull the stdtypes.rst changes out into a separate patch and issue, if that will make review easier. I think they are an improvement because the previous version was incorrect and misleading, but they are probably not necessary for people to understand what a C-contiguous bytes-like object is.

I don’t think “flattened length” is explicitly defined anywhere, but it is already used in the memoryview() documentation and elsewhere. I took it to mean the number of elements in the nested list, if you ignore the fact that it is nested; i.e. ignore the extra "[" and "]" delimiters in the repr(), and just count the other values inside them.
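Under that reading, the "flattened length" could be computed as in the sketch below. The function flattened_length is a hypothetical helper written for this illustration, not an existing API.

```python
def flattened_length(nested):
    """Count the leaf values of a nested list, ignoring the nesting."""
    if isinstance(nested, list):
        return sum(flattened_length(item) for item in nested)
    return 1


# A 2 x 3 view over six bytes; tolist() returns a nested list.
view = memoryview(bytes(range(6))).cast("B", [2, 3])
print(view.tolist())  # [[0, 1, 2], [3, 4, 5]]

# len() reports the first dimension only; the flattened length counts every item.
print(len(view), flattened_length(view.tolist()), view.nbytes)
```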

The reason for defining C-contiguous was that I originally understood “contiguous” to be much more general than what seems to be meant. I assumed both memoryview(contiguous)[::-1] and a 2×2×2 array with strides=[4, 1, 2] would be contiguous, although neither has the C or Fortran array layout.
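The 2×2×2 case can be illustrated by listing the memory offset of every item. The sketch below assumes one-byte items; the helper name offsets is hypothetical.

```python
from itertools import product


def offsets(shape, strides):
    """Memory offset of every item, visited in C (last index fastest) order."""
    return [sum(i * s for i, s in zip(index, strides))
            for index in product(*(range(n) for n in shape))]


shape = (2, 2, 2)
mixed = offsets(shape, (4, 1, 2))      # the layout mentioned above
c_layout = offsets(shape, (4, 2, 1))   # C-contiguous strides
f_layout = offsets(shape, (1, 2, 4))   # Fortran-contiguous strides

# The mixed layout uses each of the 8 bytes exactly once (no gaps, no overlap)...
print(sorted(mixed) == list(range(8)))
# ...but visits them in neither the C nor the Fortran order.
print(mixed, c_layout, f_layout)
```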

I think we need to define C-contiguous well enough for people to understand which standard Python objects are bytes-like objects. Maybe not Fortran-contiguous, because it doesn’t seem relevant to standard Python objects. Considering Serhiy asked if array.array() is always C-contiguous, I may not have done enough to explain that. (I think the answer is always yes.)

David: If a Fortran array was allowed in a bytes-like context without memory copying, the order of the array elements would differ from the order returned by the memoryview.tobytes() method, which essentially is defined to copy them out in C-array or flattened-tolist() order.
msg239901 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2015-04-02 11:10
If you think that the previous version was "incorrect and misleading",
the bar for changes to be accepted seems pretty high for me.

"grep -r" doesn't seem to find "flattened length" in Doc/*.

"An object that supports the :ref:`bufferobject` and is C-contiguous, like :class:`bytes`, :class:`bytearray` or :class:`memoryview`."


This implies that all memoryviews are C-contiguous.
msg239919 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-04-02 14:57
> If a Fortran array was allowed in a bytes-like context without memory copying, the order of the array elements would differ from the order returned by the memoryview.tobytes() method, which essentially is defined to copy them out in C-array or flattened-tolist() order.

I'm still not seeing how this would cause such an object to throw an error if used in a bytes-like context.  I presume by the above that you mean that the results of passing the object directly to a bytes like context differs from the results of calling .tobytes() on it and passing *that* to the bytes like context.  That's not what your suggested documentation change says, though.
msg239971 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-04-03 11:37
I’m sorry Stefan, I now realize my changes for len(view) were indeed wrong, and the original was much more correct. I still think the tobytes() and maybe tolist() documentation could be improved, but that is a separate issue to the bytes-like definition.

I am posting c-contig.v2.patch. Hopefully you will find it is truer to my original scope :)

* Removed changes to stdtypes.rst
* Scaled back changes in buffer.rst to only explain “C-contiguous”
* Tweaked glossary definition. Not all memoryview() objects are applicable.

David: The result of passing a Fortran array directly in a bytes-like context is typically BufferError. If this were relaxed, then we would get the inconsistency with tobytes().

>>> import _testbuffer, sys
>>> layout = [11, 21, 12, 22]
>>> fortran_array = _testbuffer.ndarray(layout, format="B", flags=0, shape=[2, 2], strides=[1, 2], offset=0)
>>> sys.stdout.buffer.write(fortran_array)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
BufferError: ndarray is not C-contiguous
>>> list(memoryview(fortran_array).tobytes())  # C-contiguous order!
[11, 12, 21, 22]
msg247464 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2015-07-27 12:36
Sorry, I'm still not convinced that the C-contiguity explanation is in the right place. The docs have to be terse in order to be useful as a reference, and the explanation at that particular location breaks the flow of reading.  So, please don't commit that.


The glossary update looks good to me. -- C-contiguity could also be
explained in the glossary, but I prefer this explanation from

http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html

"Specify the order of the array. If order is ‘C’ (default), then the array will be in C-contiguous order (last-index varies the fastest). If order is ‘F’, then the returned array will be in Fortran-contiguous order (first-index varies the fastest)."
msg247558 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-07-29 04:34
Patch v3:

* Merged with recent changes
* New glossary entry for “contiguous”, with incoming links from various points
* Removed my definition from buffer.rst

I found the NumPy explanation a bit brief, assuming what you quoted was the extent of it. I used some of that wording, and added a bit more, although it is still not a complete definition. Let me know if you think this version is acceptable.
msg247603 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2015-07-29 18:49
c-contig.v3.patch LGTM.
msg248263 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2015-08-08 12:09
I would commit this, except that I'm not too happy with the use of
the term "bytes-like" in general. Yesterday I mistyped this:

>>> import ctypes
>>> x = ctypes.c_double
>>> m = memoryview(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: memoryview: a bytes-like object is required, not '_ctypes.PyCSimpleType'


The previous error message (changed in #16518) was:

"_ctypes.PyCSimpleType does not support the buffer interface".


Which I find much clearer. Memoryviews (for better or worse,
but PEP-3118 was accepted) are Mini-NumPy-arrays. I'm still not
sure if we should hide that from users.
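For comparison, the sketch below passes both the ctypes type and an instance of it to memoryview. It assumes ctypes is available in the running interpreter; try_memoryview is a hypothetical helper, and the exact wording of the TypeError depends on the Python version.

```python
import ctypes


def try_memoryview(obj):
    """Return a memoryview of obj, or None if obj does not export a buffer."""
    try:
        return memoryview(obj)
    except TypeError:
        return None


# The class itself exports no buffer, which is what triggered the error above.
print(try_memoryview(ctypes.c_double))

# An instance does export one: a zero-dimensional buffer holding a single double.
scalar = try_memoryview(ctypes.c_double(1.5))
print(scalar.ndim, scalar.shape, scalar.nbytes)
```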
msg248264 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-08-08 12:10
The thing is, most users don't know what the buffer protocol is (even NumPy users, probably), while "bytes-like" at least will make some sense - even though I agree it's an imperfect approximation.
msg248265 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2015-08-08 12:29
For end users it's probably adequate. But several committers find
the term confusing (msg236283, msg188484). :)

Anyway, I'm going to commit this since it adds clarity.
msg248266 - (view) Author: Roundup Robot (python-dev) Date: 2015-08-08 12:34
New changeset d1ef54751412 by Stefan Krah in branch '3.5':
Issue #23756: Clarify the terms "contiguous" and "bytes-like object".
https://hg.python.org/cpython/rev/d1ef54751412

New changeset c59b2c4f4cac by Stefan Krah in branch 'default':
Merge #23756.
https://hg.python.org/cpython/rev/c59b2c4f4cac
msg248267 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2015-08-08 12:37
Martin, thanks for the patch!
History
Date User Action Args
2015-08-08 12:37:35 skrah set status: open -> closed
    versions: + Python 3.6, - Python 3.4
    messages: + msg248267
    resolution: fixed
    stage: patch review -> resolved
2015-08-08 12:34:51 python-dev set nosy: + python-dev
    messages: + msg248266
2015-08-08 12:29:05 skrah set messages: + msg248265
2015-08-08 12:10:47 pitrou set messages: + msg248264
2015-08-08 12:09:21 skrah set messages: + msg248263
2015-07-29 18:49:40 skrah set messages: + msg247603
2015-07-29 04:34:28 martin.panter set files: + c-contig.v3.patch
    messages: + msg247558
2015-07-27 12:36:46 skrah set assignee: skrah ->
    messages: + msg247464
2015-05-26 23:20:49 martin.panter set stage: needs patch -> patch review
2015-04-03 11:37:55 martin.panter set files: + c-contig.v2.patch
    messages: + msg239971
2015-04-02 14:57:07 r.david.murray set messages: + msg239919
2015-04-02 11:10:07 skrah set messages: + msg239901
2015-04-02 10:42:47 skrah set assignee: docs@python -> skrah
2015-04-01 21:37:20 martin.panter set messages: + msg239851
2015-04-01 15:06:10 serhiy.storchaka set messages: + msg239819
2015-04-01 14:43:23 r.david.murray set nosy: + pitrou
    messages: + msg239818
2015-04-01 14:24:00 serhiy.storchaka set messages: + msg239817
2015-04-01 14:05:23 r.david.murray set messages: + msg239816
2015-04-01 14:01:45 r.david.murray set messages: + msg239815
2015-04-01 12:35:06 skrah set nosy: + skrah
    messages: + msg239795
2015-04-01 11:44:28 serhiy.storchaka set messages: + msg239784
2015-04-01 11:33:05 martin.panter set files: + c-contig.patch
    keywords: + patch
    messages: + msg239783
2015-03-24 15:00:54 r.david.murray set messages: + msg239136
2015-03-24 09:20:16 serhiy.storchaka set nosy: + serhiy.storchaka, r.david.murray
    messages: + msg239101
    versions: + Python 3.4, Python 3.5
2015-03-24 08:53:56 ezio.melotti set nosy: + ezio.melotti
    type: enhancement
    stage: needs patch
2015-03-24 08:25:17 martin.panter create