Tighten definition of bytes-like objects #67944

vadmium · 2015-03-24T08:25:17Z

BPO	23756
Nosy	@pitrou, @ezio-melotti, @bitdancer, @skrah, @vadmium, @serhiy-storchaka
Files	c-contig.patch c-contig.v2.patch c-contig.v3.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2015-08-08.12:37:35.630>
created_at = <Date 2015-03-24.08:25:17.197>
labels = ['type-feature', 'docs']
title = 'Tighten definition of bytes-like objects'
updated_at = <Date 2015-08-08.12:37:35.627>
user = 'https://github.com/vadmium'

bugs.python.org fields:

activity = <Date 2015-08-08.12:37:35.627>
actor = 'skrah'
assignee = 'none'
closed = True
closed_date = <Date 2015-08-08.12:37:35.630>
closer = 'skrah'
components = ['Documentation']
creation = <Date 2015-03-24.08:25:17.197>
creator = 'martin.panter'
dependencies = []
files = ['38780', '38815', '40051']
hgrepos = []
issue_num = 23756
keywords = ['patch']
message_count = 23.0
messages = ['239097', '239101', '239136', '239783', '239784', '239795', '239815', '239816', '239817', '239818', '239819', '239851', '239901', '239919', '239971', '247464', '247558', '247603', '248263', '248264', '248265', '248266', '248267']
nosy_count = 8.0
nosy_names = ['pitrou', 'ezio.melotti', 'r.david.murray', 'skrah', 'docs@python', 'python-dev', 'martin.panter', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue23756'
versions = ['Python 3.5', 'Python 3.6']

vadmium · 2015-03-24T08:25:16Z

There are moves at documenting and implementing support for “bytes-like” objects in more APIs, such as the “io” module (bpo-20699), http.client (bpo-23740). The glossary definition is currently “An object that supports the Buffer Protocol, like bytes, bytearray or memoryview.” This was originally added for bpo-16518. However after reading bpo-23688, I realized that it should probably not mean absolutely _any_ object supporting the buffer protocol. For instance:

>>> reverse_view = memoryview(b"123")[::-1]
>>> stdout.buffer.write(reverse_view)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
BufferError: memoryview: underlying buffer is not C-contiguous

I think the definition should at least be tightened to only objects with a contiguous buffer, and “contiguous” should be defined (probably in the linked C API page or the memoryview.contiguous flag definition, not the glossary). So far, my understanding is these are contiguous:

A zero-dimensional object, such as a ctypes object
An multi-dimensional array with items stored contiguously in order of increasing indexes. I.e. a_2,2 is stored somewhere after both a_1,2 and a_2,1, and all the strides are positive.

and these are not contiguous:

memoryview(contiguous)[::2], because there are memory gaps between the items
memoryview(contiguous)[::-1], despite there being no gaps nor overlapping items
Views that set the “suboffsets” field (i.e. include pointers to further memory)
Views where different array items overlap each other (e.g. 0 in view.strides)

Perhaps the bytes-like definition should tightened further, to match the above error message, to only “C-contiguous” buffers. I understand that C-contiguous means the strides tuple has to be in non-strict decreasing order, e.g. for 2 × 1 × 3 arrays, strides == (3, 3, 1) is C-contiguous, but strides == (1, 3, 3) is not. This also needs documenting.

I’m not so sure about these, but the definition could be tightened even further:

Require memoryview(x).cast("B") to be supported. Otherwise, native Python code would have to use workarounds like struct.pack_into() to write to the “bytes-like” object. See bpo-15944.
Require len(view) == view.nbytes. This would help in some cases avoid the bug that I have seen of code naively calling len(data), but the downside is ctypes objects would no longer be considered bytes-like objects.

serhiy-storchaka · 2015-03-24T09:20:17Z

Totally agree. Current definition is too wide. Actually in different places the term "bytes-like object" can imply different requirements.

Supports buffer protocol. list isn't.
Contiguous. memoryview()[::2] isn't.
len() returns bytes size. array('I') isn't.
Supported indexing (and slicing) of bytes. array('I') isn't.
Indexing returns integers in the range 0-255. array('b') isn't.
Supports concatenation. memoryview isn't.
Supports common bytes and bytearray methods, such as find() or lower().
A subclass of (bytes, bytearray).
A subclass of bytes.
A bytes itself.

bitdancer · 2015-03-24T15:00:55Z

This not dissimilar to the problem we have with "file like object" or "file object". The requirements on them are not consistent. I'm not sure what the solution is in either case, but for file like objects we have decided to ignore the issue, I think. (ie: what exactly a file like object needs to support in order to work with a give API depends on that API, and we just live with that).

vadmium · 2015-04-01T11:33:05Z

After doing a bit of reading and experimenting, I think we should at least restrict bytes-like objects to “C-contiguous”. Any looser definition risks memoryview(byteslike).tobytes() returning the bytes in a different order than their true in-memory order. Fortran-style contiguous arrays aren’t enough:

>>> import _testbuffer, sys
>>> fortran = memoryview(_testbuffer.ndarray([11, 12, 21, 22], format="B", flags=0, shape=[2, 2], strides=[1, 2], offset=0))
>>> fortran.f_contiguous
True
>>> fortran.c_contiguous
False
>>> fortran.tolist()
[[11, 21], [12, 22]]
>>> tuple(bytes(fortran))
(11, 21, 12, 22)
>>> sys.stdout.buffer.write(fortran)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
BufferError: memoryview: underlying buffer is not C-contiguous

So I am proposing a patch which:

Restricts the bytes-like object definition to C-contiguous buffers
Explains what I think is actually meant by “contiguous” in the C API buffer protocol page. Turns out it is generally a more strict definition than I originally assumed.
Explains why memoryview.tobytes() is out of order for non C-contiguous buffers
Has a couple other fixes taking into acount memoryview.tolist() doesn’t work for zero dimensions, and is nested for more than one dimension

serhiy-storchaka · 2015-04-01T11:44:28Z

+1 for the idea overall and the patch LGTM.

skrah · 2015-04-01T12:35:07Z

I have a somewhat general concern: In the last year or so, issues seem
to expand far beyond the scope that's indicated by the issue title.

For example, I don't particularly care about the definition of
"bytes-like", but the patch contains changes to areas I *do* care
about.

I don't think all of the changes are an improvement: What is
the "flattened length", why does C-contiguous have to be explained?

bitdancer · 2015-04-01T14:01:45Z

I found the explanation of C-contiguous vs Fortran-contiguous helpful (and I've programmed in both of those languages, though granted not much :). However, based on that it is not obvious to me why having a fortran-contiguous buffer prevents it from being used in the bytes-like object contexts (though granted the order might be surprising to someone who is not thinking about the memory ordering and just assuming C).

I don't have much of an opinion on the other non-glossary-entry changes, but at a quick read I'm not sure how much clarity they add, if any.

bitdancer · 2015-04-01T14:05:24Z

Oh, and about the general concern: I agree that this issue was apparently about the glossary entry, so making other changes is suspect and at a minimum requires adding relevant people from the experts list to nosy to get proper review of other proposed changes.

serhiy-storchaka · 2015-04-01T14:24:00Z

What people are needed? The patch looks as great improvement to me.

bitdancer · 2015-04-01T14:43:24Z

Stefan, since he's the current maintainer of the memoryview implementation. Fortunately he spotted the issue :) Maybe Antoine, too; he's done work in this area. I'll add him.

serhiy-storchaka · 2015-04-01T15:06:10Z

See also bpo-23376.

vadmium · 2015-04-01T21:37:21Z

I will to pull the stdtypes.rst changes out into a separate patch and issue, if that will make review easier. I think they are an improvement because the previous version was incorrect and misleading, but they are probably not necessary for people to understand what a C-contiguous bytes-like object is.

I don’t think “flattened length” is explicitly defined anywhere, but it is already used in the memoryview() documentation and elsewhere. I took it to mean the number of elements in the nested list, if you ignore the fact that it is nested; i.e. ignore the extra "[" and "]" delimiters in the repr(), and just count the other values inside them.

The reason for defining C-contiguous was that I originally understood “contiguous” to be much more general than what seems to be meant. I assumed both memoryview(contiguous)[::-1] and a 2×2×2 array with stride=[4, 1, 2] would be contiguous, although neither have the C or Fortran array layout.

I think we need to define C-contiguous well enough for people to understand which standard Python objects are bytes-like objects. Maybe not Fortran-contiguous, because it doesn’t seem relevant to standard Python objects. Considering Serhiy asked if array.array() is always C-contiguous, I may not have done enough to explain that. (I think the answer is always yes.)

David: If a Fortran array was allowed in a bytes-like context without memory copying, the order of the array elements would differ from the order returned by the meoryview.tobytes() method, which essentially is defined to copy them out in C-array or flattend-tolist() order.

skrah · 2015-04-02T11:10:08Z

If you think that the previous version was "incorrect and misleading",
the bar for changes to be accepted seems pretty high for me.

"grep -r" doesn't seem to find "flattened length" in Doc/*.

"An object that supports the :ref:`bufferobject` and is C-contiguous, like :class:`bytes`, :class:`bytearray` or :class:`memoryview`."

This implies that all memoryviews are C-contiguous.

bitdancer · 2015-04-02T14:57:07Z

If a Fortran array was allowed in a bytes-like context without memory copying, the order of the array elements would differ from the order returned by the meoryview.tobytes() method, which essentially is defined to copy them out in C-array or flattend-tolist() order.

I'm still not seeing how this would cause such an object to throw an error if used in a bytes-like context. I presume by the above that you mean that the results of passing the object directly to a bytes like context differs from the results of calling .tobytes() on it and passing *that* to the bytes like context. That's not what your suggested documentation change says, though.

vadmium · 2015-04-03T11:37:54Z

I’m sorry Stefan, I now realize my changes for len(view) were indeed wrong, and the original was much more correct. I still think the tobytes() and maybe tolist() documentation could be improved, but that is a separate issue to the bytes-like definition.

I am posting c-contig.v2.patch. Hopefully you will find it is truer to my original scope :)

Removed changes to stdtypes.rst
Scaled back changes in buffer.rst to only explain “C-contiguous”
Tweaked glossary definition. Not all memoryview() objects are applicable.

David: The result of passing a Fortran array directly in a bytes-like context is typically BufferError. If this were relaxed, then we would get the inconsistency with tobytes().

>>> import _testbuffer, sys
>>> layout = [11, 21, 12, 22]
>>> fortran_array = _testbuffer.ndarray(layout, format="B", flags=0, shape=[2, 2], strides=[1, 2], offset=0)
>>> sys.stdout.buffer.write(fortran_array)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
BufferError: ndarray is not C-contiguous
>>> list(memoryview(fortran_array).tobytes())  # C-contiguous order!
[11, 12, 21, 22]

skrah · 2015-07-27T12:36:46Z

Sorry, I'm still not convinced that the C-contiguity explanation is in the right place. The docs have to be terse in order to be useful as a reference, and the explanation at that particular location breaks the flow of reading. So, please don't commit that.

The glossary update looks good to me. -- C-contiguity could also be
explained in the glossary, but I prefer this explanation from

http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html

"Specify the order of the array. If order is ‘C’ (default), then the array will be in C-contiguous order (last-index varies the fastest). If order is ‘F’, then the returned array will be in Fortran-contiguous order (first-index varies the fastest)."

vadmium · 2015-07-29T04:34:26Z

Patch v3:

Merged with recent changes
New glossary entry for “contiguous”, with incoming links from various points
Removed my definition from buffer.rst

I found the Num Py explanation a bit brief, assuming what you quoted was the extent of it. I used some of that wording, and added a bit more, although it is still not a complete definition. Let me know if you think this version is acceptable.

skrah · 2015-07-29T18:49:41Z

c-contig.v3.patch LGTM.

skrah · 2015-08-08T12:09:20Z

I would commit this, except that I'm not too happy with the use of
the term "bytes-like" in general. Yesterday I mistyped this:

>>> import ctypes
>>> x = ctypes.c_double
>>> m = memoryview(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: memoryview: a bytes-like object is required, not '_ctypes.PyCSimpleType'

The previous error message was (changed in bpo-16518) was:

"_ctypes.PyCSimpleType does not support the buffer interface".

Which I find much clearer. Memoryviews (for better or worse,
but PEP-3118 was accepted) are Mini-NumPy-arrays. I'm still not
sure if we should hide that from users.

pitrou · 2015-08-08T12:10:47Z

The thing is, most users don't know what the buffer protocol is (even Numpy users, probably), while "bytes-like" at least will make some sense - even though I agree it's an imperfect approximation.

skrah · 2015-08-08T12:29:04Z

For end users it's probably adequate. But several committers find
the term confusing (msg236283, msg188484). :)

Anyway, I'm going to commit this since it adds clarity.

python-dev · 2015-08-08T12:34:51Z

New changeset d1ef54751412 by Stefan Krah in branch '3.5':
Issue bpo-23756: Clarify the terms "contiguous" and "bytes-like object".
https://hg.python.org/cpython/rev/d1ef54751412

New changeset c59b2c4f4cac by Stefan Krah in branch 'default':
Merge bpo-23756.
https://hg.python.org/cpython/rev/c59b2c4f4cac

skrah · 2015-08-08T12:37:35Z

Martin, thanks for the patch!

vadmium assigned docspython Mar 24, 2015

vadmium added the docs Documentation in the Doc dir label Mar 24, 2015

ezio-melotti added the type-feature A feature request or enhancement label Mar 24, 2015

skrah mannequin assigned skrah and unassigned docspython Apr 2, 2015

skrah mannequin removed their assignment Jul 27, 2015

skrah mannequin closed this as completed Aug 8, 2015

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tighten definition of bytes-like objects #67944

Tighten definition of bytes-like objects #67944

vadmium commented Mar 24, 2015

vadmium commented Mar 24, 2015

serhiy-storchaka commented Mar 24, 2015

bitdancer commented Mar 24, 2015

vadmium commented Apr 1, 2015

serhiy-storchaka commented Apr 1, 2015

skrah mannequin commented Apr 1, 2015

bitdancer commented Apr 1, 2015

bitdancer commented Apr 1, 2015

serhiy-storchaka commented Apr 1, 2015

bitdancer commented Apr 1, 2015

serhiy-storchaka commented Apr 1, 2015

vadmium commented Apr 1, 2015

skrah mannequin commented Apr 2, 2015

bitdancer commented Apr 2, 2015

vadmium commented Apr 3, 2015

skrah mannequin commented Jul 27, 2015

vadmium commented Jul 29, 2015

skrah mannequin commented Jul 29, 2015

skrah mannequin commented Aug 8, 2015

pitrou commented Aug 8, 2015

skrah mannequin commented Aug 8, 2015

python-dev mannequin commented Aug 8, 2015

skrah mannequin commented Aug 8, 2015

Tighten definition of bytes-like objects #67944

Tighten definition of bytes-like objects #67944

Comments

vadmium commented Mar 24, 2015

vadmium commented Mar 24, 2015

serhiy-storchaka commented Mar 24, 2015

bitdancer commented Mar 24, 2015

vadmium commented Apr 1, 2015

serhiy-storchaka commented Apr 1, 2015

skrah mannequin commented Apr 1, 2015

bitdancer commented Apr 1, 2015

bitdancer commented Apr 1, 2015

serhiy-storchaka commented Apr 1, 2015

bitdancer commented Apr 1, 2015

serhiy-storchaka commented Apr 1, 2015

vadmium commented Apr 1, 2015

skrah mannequin commented Apr 2, 2015

bitdancer commented Apr 2, 2015

vadmium commented Apr 3, 2015

skrah mannequin commented Jul 27, 2015

vadmium commented Jul 29, 2015

skrah mannequin commented Jul 29, 2015

skrah mannequin commented Aug 8, 2015

pitrou commented Aug 8, 2015

skrah mannequin commented Aug 8, 2015

python-dev mannequin commented Aug 8, 2015

skrah mannequin commented Aug 8, 2015