Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bytes.join() should allow arbitrary buffer objects #60162

Closed
pitrou opened this issue Sep 17, 2012 · 27 comments
Closed

bytes.join() should allow arbitrary buffer objects #60162

pitrou opened this issue Sep 17, 2012 · 27 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@pitrou
Copy link
Member

pitrou commented Sep 17, 2012

BPO 15958
Nosy @pitrou, @glyph, @ezio-melotti, @skrah, @serhiy-storchaka
Files
  • issue15958.diff: Patch for bytes + tests
  • issue15958-2.diff
  • issue15958-3.diff
  • bytes_join_buffers.patch
  • bytes_join_buffers2.patch
  • bytes_join_buffers3.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2012-10-16.19:11:34.023>
    created_at = <Date 2012-09-17.18:03:29.979>
    labels = ['interpreter-core', 'type-feature']
    title = 'bytes.join() should allow arbitrary buffer objects'
    updated_at = <Date 2013-01-22.19:36:53.051>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2013-01-22.19:36:53.051>
    actor = 'glyph'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-10-16.19:11:34.023>
    closer = 'pitrou'
    components = ['Interpreter Core']
    creation = <Date 2012-09-17.18:03:29.979>
    creator = 'pitrou'
    dependencies = []
    files = ['27257', '27258', '27259', '27554', '27585', '27594']
    hgrepos = []
    issue_num = 15958
    keywords = ['patch']
    message_count = 27.0
    messages = ['170618', '170965', '170974', '170976', '170978', '170980', '170982', '170983', '170984', '171008', '172824', '172828', '172829', '172834', '172836', '172838', '172839', '172996', '173064', '173067', '173069', '173071', '173075', '173079', '173080', '173254', '173290']
    nosy_count = 7.0
    nosy_names = ['pitrou', 'glyph', 'ezio.melotti', 'Arfrever', 'skrah', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue15958'
    versions = ['Python 3.4']

    @pitrou
    Copy link
    Member Author

    pitrou commented Sep 17, 2012

    This should ideally succeed:

    >>> b''.join([memoryview(b'foo'), b'bar'])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: sequence item 0: expected bytes, memoryview found

    (ditto for bytearray.join)

    @pitrou pitrou added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Sep 17, 2012
    @ezio-melotti
    Copy link
    Member

    Attached patch adds support for memoryviews to bytes.join:

    >>> b''.join([memoryview(b'foo'), b'bar'])
    b'foobar'

    The implementation currently has some duplication, because it does a first pass to calculate the total size to allocate, and another pass to create the result that calculates the individual sizes again. Py_SIZE(item) can't be used here for memoryviews, so during the first pass it's now necessary to check for memoryviews and extract the len from the memoryview buffer, and then do it again during the second pass.
    If necessary this could be optimized/improved.

    I also tried to check for multi-dimensional and non-contiguous buffers, but I didn't manage to obtain a failure so I removed the check for now.
    More tests should probably be added to cover these cases, and possibly the patch should be adjusted accordingly.

    If/when the patch is OK I'll do the same for bytearrays.

    @serhiy-storchaka
    Copy link
    Member

    I think memoryview here is example only, and Antoine had in mind arbitrary buffer objects.

    @ezio-melotti
    Copy link
    Member

    Indeed. Attached new patch.
    Tests still need to be improved; bytearrays are still not changed.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Sep 22, 2012

    We would need to release the buffers and also check for format 'B'.
    With bpo-15958-2.diff this is possible:

    >>> import array
    >>> a = array.array('d', [1.2345])
    >>> b''.join([b'ABC', a])
    b'ABC\x8d\x97n\x12\x83\xc0\xf3?'

    It is unfortunate that a PyBUF_SIMPLE request does not guarantee 'B'.
    I think that's probably a mistake, but it's certainly existing practice.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Sep 22, 2012

    Also, perhaps we can keep a fast path for bytes and bytearray, but I
    didn't time the difference.

    @pitrou
    Copy link
    Member Author

    pitrou commented Sep 22, 2012

    Well, given the following works:

    >>> import array
    >>> a = array.array('d', [1.2345])
    >>> b'' + a
    b'\x8d\x97n\x12\x83\xc0\xf3?'

    It should also work for bytes.join().
    I guess that means I'm against the strict-typedness of memoryviews. As the name suggests, it provides access to some memory area, not some structured array of data.

    @ezio-melotti
    Copy link
    Member

    Attached new refleakless patch.

    @pitrou
    Copy link
    Member Author

    pitrou commented Sep 22, 2012

    Attached new refleakless patch.

    Your approach is dangerous, because the buffers may change size between
    two calls to PyObject_GetBuffer(). I think you should keep the
    Py_buffers alive in an array, and only release them at the end (it may
    also be slightly faster to do so).

    A nit: you are adding a lot of newlines in test_bytes.py.

    @serhiy-storchaka
    Copy link
    Member

    I think you should keep the
    Py_buffers alive in an array, and only release them at the end (it may
    also be slightly faster to do so).

    However allocation of this array may considerably slow down the function. We
    may need the special-case for bytes and bytearray. Stop, and the bytearray (or
    bytearray subclass) can change size between two calls to Py_SIZE()?

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 13, 2012

    Here is a patch with tests.

    @serhiy-storchaka
    Copy link
    Member

    Patch LGTM, however...

    $ ./python -m timeit -s "a=[b'a']*100000"  "b','.join(a)"

    Vanilla: 3.69 msec per loop
    Patched: 11.6 msec per loop

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 13, 2012

    Patch LGTM, however...

    $ ./python -m timeit -s "a=[b'a']*100000" "b','.join(a)"

    Vanilla: 3.69 msec per loop
    Patched: 11.6 msec per loop

    True. It is a bit of a pathological case, though.

    @serhiy-storchaka
    Copy link
    Member

    Here is a patch with restored performance. Is not it too complicated?

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 13, 2012

    The problem with your approach is that the sequence could be mutated while another thread is running (_getbuffer() may release the GIL). Then the pre-computed size gets wrong.

    @serhiy-storchaka
    Copy link
    Member

    The problem with your approach is that the sequence could be mutated while
    another thread is running (_getbuffer() may release the GIL). Then the
    pre-computed size gets wrong.

    Well, then I withdraw my patch.

    But what if the sequence will be mutated and PySequence_Size(seq) will become
    less seqlen? Then using PySequence_Fast_GET_ITEM() will be incorrect.

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 13, 2012

    But what if the sequence will be mutated and PySequence_Size(seq) will become
    less seqlen? Then using PySequence_Fast_GET_ITEM() will be incorrect.

    Perhaps we should detect that case and raise, then.

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 15, 2012

    Here is an updated patch using PySequence_Fast_GET_SIZE to avoid problems when the sequence is resized during iteration.

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 16, 2012

    Here is a new patch checking that the sequence size didn't change. I also refactored the join() implementation into a shared function in stringlib.

    @serhiy-storchaka
    Copy link
    Member

    I added new comments. :-(

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 16, 2012

    I added new comments. :-(

    Thanks. I think I will commit after adding the missing #undef :-)

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 16, 2012

    New changeset 16285c1b4dda by Antoine Pitrou in branch 'default':
    Issue bpo-15958: bytes.join and bytearray.join now accept arbitrary buffer objects.
    http://hg.python.org/cpython/rev/16285c1b4dda

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 16, 2012

    Done now.

    @pitrou pitrou closed this as completed Oct 16, 2012
    @serhiy-storchaka
    Copy link
    Member

    Well done.

    However check at top of Objects/stringlib/join.h does not protect from using
    the file with asciilib or ucs1lib.

    @pitrou
    Copy link
    Member Author

    pitrou commented Oct 16, 2012

    However check at top of Objects/stringlib/join.h does not protect from using
    the file with asciilib or ucs1lib.

    I'm not sure that's a problem. Someone would have to go out of their way
    to use join.h with only UCS1 unicode strings. Also tests would probably
    start failing, since str.join doesn't have the right signature :-)

    @serhiy-storchaka
    Copy link
    Member

    Yet one issue. You forgot to add join.h to BYTESTR_DEPS in Makefile.pre.in.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 18, 2012

    New changeset 388e43bb519d by Antoine Pitrou in branch 'default':
    Followup to issue bpo-15958: add join.h to Makefile dependencies for byte strings
    http://hg.python.org/cpython/rev/388e43bb519d

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants