-
-
Notifications
You must be signed in to change notification settings - Fork 29.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tighten definition of bytes-like objects #67944
Comments
There are moves at documenting and implementing support for “bytes-like” objects in more APIs, such as the “io” module (bpo-20699), http.client (bpo-23740). The glossary definition is currently “An object that supports the Buffer Protocol, like bytes, bytearray or memoryview.” This was originally added for bpo-16518. However after reading bpo-23688, I realized that it should probably not mean absolutely _any_ object supporting the buffer protocol. For instance: >>> reverse_view = memoryview(b"123")[::-1]
>>> stdout.buffer.write(reverse_view)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
BufferError: memoryview: underlying buffer is not C-contiguous I think the definition should at least be tightened to only objects with a contiguous buffer, and “contiguous” should be defined (probably in the linked C API page or the memoryview.contiguous flag definition, not the glossary). So far, my understanding is these are contiguous:
and these are not contiguous:
Perhaps the bytes-like definition should tightened further, to match the above error message, to only “C-contiguous” buffers. I understand that C-contiguous means the strides tuple has to be in non-strict decreasing order, e.g. for 2 × 1 × 3 arrays, strides == (3, 3, 1) is C-contiguous, but strides == (1, 3, 3) is not. This also needs documenting. I’m not so sure about these, but the definition could be tightened even further:
|
Totally agree. Current definition is too wide. Actually in different places the term "bytes-like object" can imply different requirements.
|
This not dissimilar to the problem we have with "file like object" or "file object". The requirements on them are not consistent. I'm not sure what the solution is in either case, but for file like objects we have decided to ignore the issue, I think. (ie: what exactly a file like object needs to support in order to work with a give API depends on that API, and we just live with that). |
After doing a bit of reading and experimenting, I think we should at least restrict bytes-like objects to “C-contiguous”. Any looser definition risks memoryview(byteslike).tobytes() returning the bytes in a different order than their true in-memory order. Fortran-style contiguous arrays aren’t enough: >>> import _testbuffer, sys
>>> fortran = memoryview(_testbuffer.ndarray([11, 12, 21, 22], format="B", flags=0, shape=[2, 2], strides=[1, 2], offset=0))
>>> fortran.f_contiguous
True
>>> fortran.c_contiguous
False
>>> fortran.tolist()
[[11, 21], [12, 22]]
>>> tuple(bytes(fortran))
(11, 21, 12, 22)
>>> sys.stdout.buffer.write(fortran)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
BufferError: memoryview: underlying buffer is not C-contiguous So I am proposing a patch which:
|
+1 for the idea overall and the patch LGTM. |
I have a somewhat general concern: In the last year or so, issues seem For example, I don't particularly care about the definition of I don't think all of the changes are an improvement: What is |
I found the explanation of C-contiguous vs Fortran-contiguous helpful (and I've programmed in both of those languages, though granted not much :). However, based on that it is not obvious to me why having a fortran-contiguous buffer prevents it from being used in the bytes-like object contexts (though granted the order might be surprising to someone who is not thinking about the memory ordering and just assuming C). I don't have much of an opinion on the other non-glossary-entry changes, but at a quick read I'm not sure how much clarity they add, if any. |
Oh, and about the general concern: I agree that this issue was apparently about the glossary entry, so making other changes is suspect and at a minimum requires adding relevant people from the experts list to nosy to get proper review of other proposed changes. |
What people are needed? The patch looks as great improvement to me. |
Stefan, since he's the current maintainer of the memoryview implementation. Fortunately he spotted the issue :) Maybe Antoine, too; he's done work in this area. I'll add him. |
See also bpo-23376. |
I will to pull the stdtypes.rst changes out into a separate patch and issue, if that will make review easier. I think they are an improvement because the previous version was incorrect and misleading, but they are probably not necessary for people to understand what a C-contiguous bytes-like object is. I don’t think “flattened length” is explicitly defined anywhere, but it is already used in the memoryview() documentation and elsewhere. I took it to mean the number of elements in the nested list, if you ignore the fact that it is nested; i.e. ignore the extra "[" and "]" delimiters in the repr(), and just count the other values inside them. The reason for defining C-contiguous was that I originally understood “contiguous” to be much more general than what seems to be meant. I assumed both memoryview(contiguous)[::-1] and a 2×2×2 array with stride=[4, 1, 2] would be contiguous, although neither have the C or Fortran array layout. I think we need to define C-contiguous well enough for people to understand which standard Python objects are bytes-like objects. Maybe not Fortran-contiguous, because it doesn’t seem relevant to standard Python objects. Considering Serhiy asked if array.array() is always C-contiguous, I may not have done enough to explain that. (I think the answer is always yes.) David: If a Fortran array was allowed in a bytes-like context without memory copying, the order of the array elements would differ from the order returned by the meoryview.tobytes() method, which essentially is defined to copy them out in C-array or flattend-tolist() order. |
If you think that the previous version was "incorrect and misleading", "grep -r" doesn't seem to find "flattened length" in Doc/*. "An object that supports the :ref:`bufferobject` and is C-contiguous, like :class:`bytes`, :class:`bytearray` or :class:`memoryview`." This implies that all memoryviews are C-contiguous. |
I'm still not seeing how this would cause such an object to throw an error if used in a bytes-like context. I presume by the above that you mean that the results of passing the object directly to a bytes like context differs from the results of calling .tobytes() on it and passing *that* to the bytes like context. That's not what your suggested documentation change says, though. |
I’m sorry Stefan, I now realize my changes for len(view) were indeed wrong, and the original was much more correct. I still think the tobytes() and maybe tolist() documentation could be improved, but that is a separate issue to the bytes-like definition. I am posting c-contig.v2.patch. Hopefully you will find it is truer to my original scope :)
David: The result of passing a Fortran array directly in a bytes-like context is typically BufferError. If this were relaxed, then we would get the inconsistency with tobytes(). >>> import _testbuffer, sys
>>> layout = [11, 21, 12, 22]
>>> fortran_array = _testbuffer.ndarray(layout, format="B", flags=0, shape=[2, 2], strides=[1, 2], offset=0)
>>> sys.stdout.buffer.write(fortran_array)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
BufferError: ndarray is not C-contiguous
>>> list(memoryview(fortran_array).tobytes()) # C-contiguous order!
[11, 12, 21, 22] |
Sorry, I'm still not convinced that the C-contiguity explanation is in the right place. The docs have to be terse in order to be useful as a reference, and the explanation at that particular location breaks the flow of reading. So, please don't commit that. The glossary update looks good to me. -- C-contiguity could also be http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html "Specify the order of the array. If order is ‘C’ (default), then the array will be in C-contiguous order (last-index varies the fastest). If order is ‘F’, then the returned array will be in Fortran-contiguous order (first-index varies the fastest)." |
Patch v3:
I found the Num Py explanation a bit brief, assuming what you quoted was the extent of it. I used some of that wording, and added a bit more, although it is still not a complete definition. Let me know if you think this version is acceptable. |
c-contig.v3.patch LGTM. |
I would commit this, except that I'm not too happy with the use of >>> import ctypes
>>> x = ctypes.c_double
>>> m = memoryview(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: memoryview: a bytes-like object is required, not '_ctypes.PyCSimpleType' The previous error message was (changed in bpo-16518) was: "_ctypes.PyCSimpleType does not support the buffer interface". Which I find much clearer. Memoryviews (for better or worse, |
The thing is, most users don't know what the buffer protocol is (even Numpy users, probably), while "bytes-like" at least will make some sense - even though I agree it's an imperfect approximation. |
New changeset d1ef54751412 by Stefan Krah in branch '3.5': New changeset c59b2c4f4cac by Stefan Krah in branch 'default': |
Martin, thanks for the patch! |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: