Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tighten definition of bytes-like objects #67944

Closed
vadmium opened this issue Mar 24, 2015 · 23 comments
Closed

Tighten definition of bytes-like objects #67944

vadmium opened this issue Mar 24, 2015 · 23 comments
Labels
docs Documentation in the Doc dir type-feature A feature request or enhancement

Comments

@vadmium
Copy link
Member

vadmium commented Mar 24, 2015

BPO 23756
Nosy @pitrou, @ezio-melotti, @bitdancer, @skrah, @vadmium, @serhiy-storchaka
Files
  • c-contig.patch
  • c-contig.v2.patch
  • c-contig.v3.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2015-08-08.12:37:35.630>
    created_at = <Date 2015-03-24.08:25:17.197>
    labels = ['type-feature', 'docs']
    title = 'Tighten definition of bytes-like objects'
    updated_at = <Date 2015-08-08.12:37:35.627>
    user = 'https://github.com/vadmium'

    bugs.python.org fields:

    activity = <Date 2015-08-08.12:37:35.627>
    actor = 'skrah'
    assignee = 'none'
    closed = True
    closed_date = <Date 2015-08-08.12:37:35.630>
    closer = 'skrah'
    components = ['Documentation']
    creation = <Date 2015-03-24.08:25:17.197>
    creator = 'martin.panter'
    dependencies = []
    files = ['38780', '38815', '40051']
    hgrepos = []
    issue_num = 23756
    keywords = ['patch']
    message_count = 23.0
    messages = ['239097', '239101', '239136', '239783', '239784', '239795', '239815', '239816', '239817', '239818', '239819', '239851', '239901', '239919', '239971', '247464', '247558', '247603', '248263', '248264', '248265', '248266', '248267']
    nosy_count = 8.0
    nosy_names = ['pitrou', 'ezio.melotti', 'r.david.murray', 'skrah', 'docs@python', 'python-dev', 'martin.panter', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue23756'
    versions = ['Python 3.5', 'Python 3.6']

    @vadmium
    Copy link
    Member Author

    vadmium commented Mar 24, 2015

    There are moves at documenting and implementing support for “bytes-like” objects in more APIs, such as the “io” module (bpo-20699), http.client (bpo-23740). The glossary definition is currently “An object that supports the Buffer Protocol, like bytes, bytearray or memoryview.” This was originally added for bpo-16518. However after reading bpo-23688, I realized that it should probably not mean absolutely _any_ object supporting the buffer protocol. For instance:

    >>> reverse_view = memoryview(b"123")[::-1]
    >>> stdout.buffer.write(reverse_view)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    BufferError: memoryview: underlying buffer is not C-contiguous

    I think the definition should at least be tightened to only objects with a contiguous buffer, and “contiguous” should be defined (probably in the linked C API page or the memoryview.contiguous flag definition, not the glossary). So far, my understanding is these are contiguous:

    • A zero-dimensional object, such as a ctypes object
    • An multi-dimensional array with items stored contiguously in order of increasing indexes. I.e. a_2,2 is stored somewhere after both a_1,2 and a_2,1, and all the strides are positive.

    and these are not contiguous:

    • memoryview(contiguous)[::2], because there are memory gaps between the items
    • memoryview(contiguous)[::-1], despite there being no gaps nor overlapping items
    • Views that set the “suboffsets” field (i.e. include pointers to further memory)
    • Views where different array items overlap each other (e.g. 0 in view.strides)

    Perhaps the bytes-like definition should tightened further, to match the above error message, to only “C-contiguous” buffers. I understand that C-contiguous means the strides tuple has to be in non-strict decreasing order, e.g. for 2 × 1 × 3 arrays, strides == (3, 3, 1) is C-contiguous, but strides == (1, 3, 3) is not. This also needs documenting.

    I’m not so sure about these, but the definition could be tightened even further:

    • Require memoryview(x).cast("B") to be supported. Otherwise, native Python code would have to use workarounds like struct.pack_into() to write to the “bytes-like” object. See bpo-15944.
    • Require len(view) == view.nbytes. This would help in some cases avoid the bug that I have seen of code naively calling len(data), but the downside is ctypes objects would no longer be considered bytes-like objects.

    @vadmium vadmium added the docs Documentation in the Doc dir label Mar 24, 2015
    @ezio-melotti ezio-melotti added the type-feature A feature request or enhancement label Mar 24, 2015
    @serhiy-storchaka
    Copy link
    Member

    Totally agree. Current definition is too wide. Actually in different places the term "bytes-like object" can imply different requirements.

    • Supports buffer protocol. list isn't.

    • Contiguous. memoryview()[::2] isn't.

    • len() returns bytes size. array('I') isn't.

    • Supported indexing (and slicing) of bytes. array('I') isn't.

    • Indexing returns integers in the range 0-255. array('b') isn't.

    • Supports concatenation. memoryview isn't.

    • Supports common bytes and bytearray methods, such as find() or lower().

    • A subclass of (bytes, bytearray).

    • A subclass of bytes.

    • A bytes itself.

    @bitdancer
    Copy link
    Member

    This not dissimilar to the problem we have with "file like object" or "file object". The requirements on them are not consistent. I'm not sure what the solution is in either case, but for file like objects we have decided to ignore the issue, I think. (ie: what exactly a file like object needs to support in order to work with a give API depends on that API, and we just live with that).

    @vadmium
    Copy link
    Member Author

    vadmium commented Apr 1, 2015

    After doing a bit of reading and experimenting, I think we should at least restrict bytes-like objects to “C-contiguous”. Any looser definition risks memoryview(byteslike).tobytes() returning the bytes in a different order than their true in-memory order. Fortran-style contiguous arrays aren’t enough:

    >>> import _testbuffer, sys
    >>> fortran = memoryview(_testbuffer.ndarray([11, 12, 21, 22], format="B", flags=0, shape=[2, 2], strides=[1, 2], offset=0))
    >>> fortran.f_contiguous
    True
    >>> fortran.c_contiguous
    False
    >>> fortran.tolist()
    [[11, 21], [12, 22]]
    >>> tuple(bytes(fortran))
    (11, 21, 12, 22)
    >>> sys.stdout.buffer.write(fortran)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    BufferError: memoryview: underlying buffer is not C-contiguous

    So I am proposing a patch which:

    • Restricts the bytes-like object definition to C-contiguous buffers
    • Explains what I think is actually meant by “contiguous” in the C API buffer protocol page. Turns out it is generally a more strict definition than I originally assumed.
    • Explains why memoryview.tobytes() is out of order for non C-contiguous buffers
    • Has a couple other fixes taking into acount memoryview.tolist() doesn’t work for zero dimensions, and is nested for more than one dimension

    @serhiy-storchaka
    Copy link
    Member

    +1 for the idea overall and the patch LGTM.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Apr 1, 2015

    I have a somewhat general concern: In the last year or so, issues seem
    to expand far beyond the scope that's indicated by the issue title.

    For example, I don't particularly care about the definition of
    "bytes-like", but the patch contains changes to areas I *do* care
    about.

    I don't think all of the changes are an improvement: What is
    the "flattened length", why does C-contiguous have to be explained?

    @bitdancer
    Copy link
    Member

    I found the explanation of C-contiguous vs Fortran-contiguous helpful (and I've programmed in both of those languages, though granted not much :). However, based on that it is not obvious to me why having a fortran-contiguous buffer prevents it from being used in the bytes-like object contexts (though granted the order might be surprising to someone who is not thinking about the memory ordering and just assuming C).

    I don't have much of an opinion on the other non-glossary-entry changes, but at a quick read I'm not sure how much clarity they add, if any.

    @bitdancer
    Copy link
    Member

    Oh, and about the general concern: I agree that this issue was apparently about the glossary entry, so making other changes is suspect and at a minimum requires adding relevant people from the experts list to nosy to get proper review of other proposed changes.

    @serhiy-storchaka
    Copy link
    Member

    What people are needed? The patch looks as great improvement to me.

    @bitdancer
    Copy link
    Member

    Stefan, since he's the current maintainer of the memoryview implementation. Fortunately he spotted the issue :) Maybe Antoine, too; he's done work in this area. I'll add him.

    @serhiy-storchaka
    Copy link
    Member

    See also bpo-23376.

    @vadmium
    Copy link
    Member Author

    vadmium commented Apr 1, 2015

    I will to pull the stdtypes.rst changes out into a separate patch and issue, if that will make review easier. I think they are an improvement because the previous version was incorrect and misleading, but they are probably not necessary for people to understand what a C-contiguous bytes-like object is.

    I don’t think “flattened length” is explicitly defined anywhere, but it is already used in the memoryview() documentation and elsewhere. I took it to mean the number of elements in the nested list, if you ignore the fact that it is nested; i.e. ignore the extra "[" and "]" delimiters in the repr(), and just count the other values inside them.

    The reason for defining C-contiguous was that I originally understood “contiguous” to be much more general than what seems to be meant. I assumed both memoryview(contiguous)[::-1] and a 2×2×2 array with stride=[4, 1, 2] would be contiguous, although neither have the C or Fortran array layout.

    I think we need to define C-contiguous well enough for people to understand which standard Python objects are bytes-like objects. Maybe not Fortran-contiguous, because it doesn’t seem relevant to standard Python objects. Considering Serhiy asked if array.array() is always C-contiguous, I may not have done enough to explain that. (I think the answer is always yes.)

    David: If a Fortran array was allowed in a bytes-like context without memory copying, the order of the array elements would differ from the order returned by the meoryview.tobytes() method, which essentially is defined to copy them out in C-array or flattend-tolist() order.

    @skrah skrah mannequin assigned skrah and unassigned docspython Apr 2, 2015
    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Apr 2, 2015

    If you think that the previous version was "incorrect and misleading",
    the bar for changes to be accepted seems pretty high for me.

    "grep -r" doesn't seem to find "flattened length" in Doc/*.

    "An object that supports the :ref:`bufferobject` and is C-contiguous, like :class:`bytes`, :class:`bytearray` or :class:`memoryview`."

    This implies that all memoryviews are C-contiguous.

    @bitdancer
    Copy link
    Member

    If a Fortran array was allowed in a bytes-like context without memory copying, the order of the array elements would differ from the order returned by the meoryview.tobytes() method, which essentially is defined to copy them out in C-array or flattend-tolist() order.

    I'm still not seeing how this would cause such an object to throw an error if used in a bytes-like context. I presume by the above that you mean that the results of passing the object directly to a bytes like context differs from the results of calling .tobytes() on it and passing *that* to the bytes like context. That's not what your suggested documentation change says, though.

    @vadmium
    Copy link
    Member Author

    vadmium commented Apr 3, 2015

    I’m sorry Stefan, I now realize my changes for len(view) were indeed wrong, and the original was much more correct. I still think the tobytes() and maybe tolist() documentation could be improved, but that is a separate issue to the bytes-like definition.

    I am posting c-contig.v2.patch. Hopefully you will find it is truer to my original scope :)

    • Removed changes to stdtypes.rst
    • Scaled back changes in buffer.rst to only explain “C-contiguous”
    • Tweaked glossary definition. Not all memoryview() objects are applicable.

    David: The result of passing a Fortran array directly in a bytes-like context is typically BufferError. If this were relaxed, then we would get the inconsistency with tobytes().

    >>> import _testbuffer, sys
    >>> layout = [11, 21, 12, 22]
    >>> fortran_array = _testbuffer.ndarray(layout, format="B", flags=0, shape=[2, 2], strides=[1, 2], offset=0)
    >>> sys.stdout.buffer.write(fortran_array)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    BufferError: ndarray is not C-contiguous
    >>> list(memoryview(fortran_array).tobytes())  # C-contiguous order!
    [11, 12, 21, 22]

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jul 27, 2015

    Sorry, I'm still not convinced that the C-contiguity explanation is in the right place. The docs have to be terse in order to be useful as a reference, and the explanation at that particular location breaks the flow of reading. So, please don't commit that.

    The glossary update looks good to me. -- C-contiguity could also be
    explained in the glossary, but I prefer this explanation from

    http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html

    "Specify the order of the array. If order is ‘C’ (default), then the array will be in C-contiguous order (last-index varies the fastest). If order is ‘F’, then the returned array will be in Fortran-contiguous order (first-index varies the fastest)."

    @skrah skrah mannequin removed their assignment Jul 27, 2015
    @vadmium
    Copy link
    Member Author

    vadmium commented Jul 29, 2015

    Patch v3:

    • Merged with recent changes
    • New glossary entry for “contiguous”, with incoming links from various points
    • Removed my definition from buffer.rst

    I found the Num Py explanation a bit brief, assuming what you quoted was the extent of it. I used some of that wording, and added a bit more, although it is still not a complete definition. Let me know if you think this version is acceptable.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jul 29, 2015

    c-contig.v3.patch LGTM.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Aug 8, 2015

    I would commit this, except that I'm not too happy with the use of
    the term "bytes-like" in general. Yesterday I mistyped this:

    >>> import ctypes
    >>> x = ctypes.c_double
    >>> m = memoryview(x)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: memoryview: a bytes-like object is required, not '_ctypes.PyCSimpleType'

    The previous error message was (changed in bpo-16518) was:

    "_ctypes.PyCSimpleType does not support the buffer interface".

    Which I find much clearer. Memoryviews (for better or worse,
    but PEP-3118 was accepted) are Mini-NumPy-arrays. I'm still not
    sure if we should hide that from users.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 8, 2015

    The thing is, most users don't know what the buffer protocol is (even Numpy users, probably), while "bytes-like" at least will make some sense - even though I agree it's an imperfect approximation.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Aug 8, 2015

    For end users it's probably adequate. But several committers find
    the term confusing (msg236283, msg188484). :)

    Anyway, I'm going to commit this since it adds clarity.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Aug 8, 2015

    New changeset d1ef54751412 by Stefan Krah in branch '3.5':
    Issue bpo-23756: Clarify the terms "contiguous" and "bytes-like object".
    https://hg.python.org/cpython/rev/d1ef54751412

    New changeset c59b2c4f4cac by Stefan Krah in branch 'default':
    Merge bpo-23756.
    https://hg.python.org/cpython/rev/c59b2c4f4cac

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Aug 8, 2015

    Martin, thanks for the patch!

    @skrah skrah mannequin closed this as completed Aug 8, 2015
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    docs Documentation in the Doc dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants