msg170618 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-09-17 18:03 |
This should ideally succeed:
>>> b''.join([memoryview(b'foo'), b'bar'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: sequence item 0: expected bytes, memoryview found
(ditto for bytearray.join)
|
msg170965 - (view) |
Author: Ezio Melotti (ezio.melotti) * |
Date: 2012-09-22 08:00 |
Attached patch adds support for memoryviews to bytes.join:
>>> b''.join([memoryview(b'foo'), b'bar'])
b'foobar'
The implementation currently has some duplication, because it does a first pass to calculate the total size to allocate, and another pass to create the result that calculates the individual sizes again. Py_SIZE(item) can't be used here for memoryviews, so during the first pass it's now necessary to check for memoryviews and extract the len from the memoryview buffer, and then do it again during the second pass.
If necessary this could be optimized/improved.
I also tried to check for multi-dimensional and non-contiguous buffers, but I didn't manage to obtain a failure so I removed the check for now.
More tests should probably be added to cover these cases, and possibly the patch should be adjusted accordingly.
If/when the patch is OK I'll do the same for bytearrays.
|
msg170974 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-09-22 08:51 |
I think memoryview here is example only, and Antoine had in mind arbitrary buffer objects.
|
msg170976 - (view) |
Author: Ezio Melotti (ezio.melotti) * |
Date: 2012-09-22 09:13 |
Indeed. Attached new patch.
Tests still need to be improved; bytearrays are still not changed.
|
msg170978 - (view) |
Author: Stefan Krah (skrah) * |
Date: 2012-09-22 09:35 |
We would need to release the buffers and also check for format 'B'.
With issue15958-2.diff this is possible:
>>> import array
>>> a = array.array('d', [1.2345])
>>> b''.join([b'ABC', a])
b'ABC\x8d\x97n\x12\x83\xc0\xf3?'
It is unfortunate that a PyBUF_SIMPLE request does not guarantee 'B'.
I think that's probably a mistake, but it's certainly existing practice.
|
msg170980 - (view) |
Author: Stefan Krah (skrah) * |
Date: 2012-09-22 09:43 |
Also, perhaps we can keep a fast path for bytes and bytearray, but I
didn't time the difference.
|
msg170982 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-09-22 10:21 |
Well, given the following works:
>>> import array
>>> a = array.array('d', [1.2345])
>>> b'' + a
b'\x8d\x97n\x12\x83\xc0\xf3?'
It should also work for bytes.join().
I guess that means I'm against the strict-typedness of memoryviews. As the name suggests, it provides access to some memory area, not some structured array of data.
|
msg170983 - (view) |
Author: Ezio Melotti (ezio.melotti) * |
Date: 2012-09-22 10:25 |
Attached new refleakless patch.
|
msg170984 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-09-22 10:31 |
> Attached new refleakless patch.
Your approach is dangerous, because the buffers may change size between
two calls to PyObject_GetBuffer(). I think you should keep the
Py_buffers alive in an array, and only release them at the end (it may
also be slightly faster to do so).
A nit: you are adding a lot of newlines in test_bytes.py.
|
msg171008 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-09-22 17:08 |
> I think you should keep the
> Py_buffers alive in an array, and only release them at the end (it may
> also be slightly faster to do so).
However allocation of this array may considerably slow down the function. We
may need the special-case for bytes and bytearray. Stop, and the bytearray (or
bytearray subclass) can change size between two calls to Py_SIZE()?
|
msg172824 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-10-13 19:24 |
Here is a patch with tests.
|
msg172828 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-10-13 20:30 |
Patch LGTM, however...
$ ./python -m timeit -s "a=[b'a']*100000" "b','.join(a)"
Vanilla: 3.69 msec per loop
Patched: 11.6 msec per loop
|
msg172829 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-10-13 20:38 |
> Patch LGTM, however...
>
> $ ./python -m timeit -s "a=[b'a']*100000" "b','.join(a)"
>
> Vanilla: 3.69 msec per loop
> Patched: 11.6 msec per loop
True. It is a bit of a pathological case, though.
|
msg172834 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-10-13 22:19 |
Here is a patch with restored performance. Is not it too complicated?
|
msg172836 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-10-13 22:23 |
The problem with your approach is that the sequence could be mutated while another thread is running (_getbuffer() may release the GIL). Then the pre-computed size gets wrong.
|
msg172838 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-10-13 22:40 |
> The problem with your approach is that the sequence could be mutated while
> another thread is running (_getbuffer() may release the GIL). Then the
> pre-computed size gets wrong.
Well, then I withdraw my patch.
But what if the sequence will be mutated and PySequence_Size(seq) will become
less seqlen? Then using PySequence_Fast_GET_ITEM() will be incorrect.
|
msg172839 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-10-13 22:47 |
> But what if the sequence will be mutated and PySequence_Size(seq) will become
> less seqlen? Then using PySequence_Fast_GET_ITEM() will be incorrect.
Perhaps we should detect that case and raise, then.
|
msg172996 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-10-15 19:08 |
Here is an updated patch using PySequence_Fast_GET_SIZE to avoid problems when the sequence is resized during iteration.
|
msg173064 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-10-16 17:53 |
Here is a new patch checking that the sequence size didn't change. I also refactored the join() implementation into a shared function in stringlib.
|
msg173067 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-10-16 18:30 |
I added new comments. :-(
|
msg173069 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-10-16 19:04 |
> I added new comments. :-(
Thanks. I think I will commit after adding the missing #undef :-)
|
msg173071 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2012-10-16 19:09 |
New changeset 16285c1b4dda by Antoine Pitrou in branch 'default':
Issue #15958: bytes.join and bytearray.join now accept arbitrary buffer objects.
http://hg.python.org/cpython/rev/16285c1b4dda
|
msg173075 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-10-16 19:11 |
Done now.
|
msg173079 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-10-16 19:23 |
Well done.
However check at top of Objects/stringlib/join.h does not protect from using
the file with asciilib or ucs1lib.
|
msg173080 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-10-16 19:24 |
> However check at top of Objects/stringlib/join.h does not protect from using
> the file with asciilib or ucs1lib.
I'm not sure that's a problem. Someone would have to go out of their way
to use join.h with only UCS1 unicode strings. Also tests would probably
start failing, since str.join doesn't have the right signature :-)
|
msg173254 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-10-18 11:06 |
Yet one issue. You forgot to add join.h to BYTESTR_DEPS in Makefile.pre.in.
|
msg173290 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2012-10-18 19:33 |
New changeset 388e43bb519d by Antoine Pitrou in branch 'default':
Followup to issue #15958: add join.h to Makefile dependencies for byte strings
http://hg.python.org/cpython/rev/388e43bb519d
|
|
Date |
User |
Action |
Args |
2022-04-11 14:57:36 | admin | set | github: 60162 |
2013-01-22 19:36:53 | glyph | set | nosy:
+ glyph
|
2012-10-18 19:33:20 | python-dev | set | messages:
+ msg173290 |
2012-10-18 11:06:50 | serhiy.storchaka | set | messages:
+ msg173254 |
2012-10-16 19:24:55 | pitrou | set | messages:
+ msg173080 |
2012-10-16 19:23:08 | serhiy.storchaka | set | messages:
+ msg173079 |
2012-10-16 19:11:34 | pitrou | set | status: open -> closed resolution: fixed stage: patch review -> resolved |
2012-10-16 19:11:26 | pitrou | set | messages:
+ msg173075 |
2012-10-16 19:09:17 | python-dev | set | nosy:
+ python-dev messages:
+ msg173071
|
2012-10-16 19:04:53 | pitrou | set | messages:
+ msg173069 |
2012-10-16 18:30:02 | serhiy.storchaka | set | messages:
+ msg173067 |
2012-10-16 17:53:29 | pitrou | set | files:
+ bytes_join_buffers3.patch
messages:
+ msg173064 |
2012-10-15 19:08:42 | pitrou | set | files:
+ bytes_join_buffers2.patch
messages:
+ msg172996 |
2012-10-13 23:28:17 | serhiy.storchaka | set | files:
- bytes_join_buffers_2.patch |
2012-10-13 22:47:18 | pitrou | set | messages:
+ msg172839 |
2012-10-13 22:40:44 | serhiy.storchaka | set | messages:
+ msg172838 |
2012-10-13 22:23:25 | pitrou | set | messages:
+ msg172836 |
2012-10-13 22:19:09 | serhiy.storchaka | set | files:
+ bytes_join_buffers_2.patch
messages:
+ msg172834 |
2012-10-13 20:38:17 | pitrou | set | messages:
+ msg172829 |
2012-10-13 20:30:28 | serhiy.storchaka | set | messages:
+ msg172828 |
2012-10-13 19:24:45 | pitrou | set | files:
+ bytes_join_buffers.patch
messages:
+ msg172824 |
2012-09-22 17:08:59 | serhiy.storchaka | set | messages:
+ msg171008 |
2012-09-22 10:31:09 | pitrou | set | messages:
+ msg170984 |
2012-09-22 10:25:20 | ezio.melotti | set | files:
+ issue15958-3.diff
messages:
+ msg170983 |
2012-09-22 10:21:42 | pitrou | set | messages:
+ msg170982 |
2012-09-22 09:43:03 | skrah | set | messages:
+ msg170980 |
2012-09-22 09:35:33 | skrah | set | nosy:
+ skrah messages:
+ msg170978
|
2012-09-22 09:13:56 | ezio.melotti | set | files:
+ issue15958-2.diff
messages:
+ msg170976 |
2012-09-22 08:51:45 | serhiy.storchaka | set | messages:
+ msg170974 |
2012-09-22 08:00:24 | ezio.melotti | set | files:
+ issue15958.diff
nosy:
+ ezio.melotti messages:
+ msg170965
keywords:
+ patch stage: needs patch -> patch review |
2012-09-18 03:23:22 | Arfrever | set | nosy:
+ Arfrever
|
2012-09-17 18:23:48 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka
|
2012-09-17 18:03:30 | pitrou | create | |