msg142650 - (view) |
Author: John O'Connor (jcon) |
Date: 2011-08-22 00:43 |
I noticed there are a few more common cases that can be sped up in bytes.join().
When:
- The sequence length and separator length are both 0
- The separator length is 0 or 1
- The separator string contains only 1 distinct byte
These could also be applied to other sequence types.
for i in {{0..5}}; do
./python -m timeit -s "f=open('/usr/share/dict/words', 'rb'); L=f.readlines()" \
"b' '.join(L)";
done
- BEFORE -
100 loops, best of 3: 3.79 msec per loop
100 loops, best of 3: 3.5 msec per loop
100 loops, best of 3: 3.75 msec per loop
- AFTER -
100 loops, best of 3: 2.81 msec per loop
100 loops, best of 3: 2.81 msec per loop
100 loops, best of 3: 3.03 msec per loop
~20% speedup
Same test on a smaller list (lines from LICENSE file) yields a similar 15-20% speedup.
Same test on L = [b'1', b'2', b'3'] yields 10% speedup
|
msg157790 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-04-08 15:14 |
Honestly, I don't think there's much point in these optimizations. The first one ("The sequence length and separator length are both 0") is probably useless.
|
msg157802 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-04-08 17:59 |
> Honestly, I don't think there's much point in these optimizations. The first one ("The sequence length and separator length are both 0") is probably useless.
This optimization changes the behavior.
... def __repr__(self): return 'B(' + super().__repr__() + ')'
...
>>> B()
B(b'')
>>> B().join([])
b''
With the patch we get B(b'').
Regardless of whether optimization (which is negligible in this case), I don't know whether to consider such side effect desirable or undesirable.
bytes.join need to optimize for the 0- and 1-byte delimiter, but the proposed patch leads to pessimization:
Unpatched:
$ ./python -m timeit -s 'seq=[bytes([i]*100000) for i in range(256)]' 'b"".join(seq)'
10 loops, best of 3: 43.3 msec per loop
Patched:
$ ./python -m timeit -s 'seq=[bytes([i]*100000) for i in range(256)]' 'b" ".join(seq)'
10 loops, best of 3: 70.3 msec per loop
|
msg157803 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-04-08 18:03 |
> Regardless of whether optimization (which is negligible in this case), I don't know whether to consider such side effect desirable or undesirable.
After some thought, I realized that this is an erroneous behavior.
|
msg157804 - (view) |
Author: Terry J. Reedy (terry.reedy) * |
Date: 2012-04-08 18:37 |
Side note: Windows requires that args be quoted with ", not ', to work properly, at least with these args.
Main note: the patched test adds a space to the separator, but that is not enough to account for the difference.
c:\Programs\Python32>python -m timeit -s "seq=[bytes([i]*1000) for i in range(256)]" "b''.join(seq)"
10000 loops, best of 3: 31.7 usec per loop
c:\Programs\Python32>python -m timeit -s "seq=[bytes([i]*1000) for i in range(256)]" "b' '.join(seq)"
10000 loops, best of 3: 34.1 usec per loop
The behavior change is wrong and test_bytes.py seems to need augmentation. It begins with "XXX This is a mess. [...]".
class BaseBytesTest(unittest.TestCase):
def test_join(self)
tests b''.join([]) but as far as I can tell, not b'something'.join([]), the failing case found by Serhiy.
It end with " # XXX more...".
|
msg157806 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-04-08 19:08 |
> Main note: the patched test adds a space to the separator, but that is not enough to account for the difference.
Sorry, I copied the wrong line.
$ ./python -m timeit -s "seq=[bytes([i]*100000) for i in range(256)]" "b' '.join(seq)"
10 loops, best of 3: 45.6 msec per loop
But the conclusions are not affected, for larger byte strings patched
version will be 1.5-2 times slower in limit.
|
msg173324 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-10-19 11:25 |
I adapted a patch to the current code (now it works for bytearrays too),
deleted the behavior change, left optimization only for 0- and 1-bytes
separator, deleted tests (bytes.join tests already exists), wrote a benchmark
and after long experiments has set a condition for this optimization. Now, I
believe, the patch never leads to regression. The patch has been simplified,
it's only 13 lines long.
Thank you, John O'Connor.
|
msg173364 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-10-19 23:43 |
I don't understand the point of this optimization. It looks rather cryptic and very specific. Can you elaborate?
|
msg173375 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-10-20 08:53 |
In a normal loop we copy into the output buffer for each iteration the element and the separator. As you know, the processor can copy large amounts of data several times more efficiently than byte-for-byte. Therefore, filling the buffer by memset more efficient than calling Py_MEMCPY on each iteration if the buffer size is not too large. Experimenting on different machines I found that Py_MEMCPY is more expensive than sequential filling up to 32 bytes (may be even more, but I chose a conservative estimate). And for empty separator we can just omit copying and filling (it saves several checks per iteration).
|
msg173378 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-10-20 09:37 |
Ok, but does it really make a difference and in which cases?
In other words, do you have benchmark numbers? :)
|
msg173398 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-10-20 16:24 |
> Ok, but does it really make a difference and in which cases?
Up to 40% on Athlon and up to 70% on Atom.
> In other words, do you have benchmark numbers? :)
Attached results for AMD Athlon 64 X2 4600+ and Intel Atom N570 @ 1.66GHz
under 32-bit Linux. You can run the benchmark on your machine (requires Posix-
compatible shell, sed awk).
|
msg173402 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-10-20 18:15 |
Ok, here are the benchmark results here with a 1-byte separator:
10 x 10 0.244 usec 0.202 usec +21%
100 x 10 0.325 usec 0.326 usec -0%
1000 x 10 0.691 usec 0.689 usec +0%
10 x 1000 18.2 usec 14.2 usec +28%
100 x 1000 39.8 usec 40.6 usec -2%
1000 x 1000 94.6 usec 96 usec -1%
and with an empty separator:
10 x 10 0.245 usec 0.198 usec +24%
100 x 10 0.335 usec 0.286 usec +17%
1000 x 10 0.637 usec 0.593 usec +7%
10 x 1000 18.9 usec 14.1 usec +34%
100 x 1000 40.3 usec 36.2 usec +11%
1000 x 1000 93.7 usec 96.9 usec -3%
(Core i5 2500K, 64-bit, gcc)
I would say the empty separator case is interesting, because a common use case for bytes.join() is indeed fast concatenation. However, the 1-byte separator case could be dropped, which would simplify the patch and the heuristic.
|
msg173403 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-10-20 18:53 |
Indeed, 1-char separator is more common for strings, but I found several b'\n'.join or b','.join even in stdlib. 3 lines of difference are only 2.3% of Objects/stringlib/join.h.
Here is a patch with dropped the 1-byte separator case.
|
msg173416 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2012-10-20 21:14 |
New changeset bfa715f98c0f by Antoine Pitrou in branch 'default':
Issue #12805: Make bytes.join and bytearray.join faster when the separator is empty.
http://hg.python.org/cpython/rev/bfa715f98c0f
|
msg173417 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-10-20 21:15 |
I have committed the simple optimization patch, thank you!
|
msg173418 - (view) |
Author: John O'Connor (jcon) |
Date: 2012-10-20 21:18 |
I think that's a good compromise and much better than what I had originally.
- John
|
|
Date |
User |
Action |
Args |
2022-04-11 14:57:20 | admin | set | github: 57014 |
2012-10-20 21:18:38 | jcon | set | messages:
+ msg173418 |
2012-10-20 21:15:41 | pitrou | set | status: open -> closed resolution: fixed messages:
+ msg173417
stage: resolved |
2012-10-20 21:14:54 | python-dev | set | nosy:
+ python-dev messages:
+ msg173416
|
2012-10-20 18:54:51 | serhiy.storchaka | set | files:
+ bytes_join_optimization_3.patch |
2012-10-20 18:53:01 | serhiy.storchaka | set | messages:
+ msg173403 |
2012-10-20 18:15:54 | pitrou | set | messages:
+ msg173402 |
2012-10-20 16:24:42 | serhiy.storchaka | set | files:
+ bytes_join_bench.res, bytes_join_bench_2.res
messages:
+ msg173398 |
2012-10-20 09:37:21 | pitrou | set | messages:
+ msg173378 |
2012-10-20 08:53:27 | serhiy.storchaka | set | messages:
+ msg173375 versions:
+ Python 3.4, - Python 3.3 |
2012-10-19 23:43:45 | pitrou | set | messages:
+ msg173364 |
2012-10-19 11:25:31 | serhiy.storchaka | set | files:
+ bytes_join_optimization_2.patch, bytes_join_bench.sh
messages:
+ msg173324 |
2012-04-08 19:08:25 | serhiy.storchaka | set | messages:
+ msg157806 |
2012-04-08 18:37:44 | terry.reedy | set | messages:
+ msg157804 |
2012-04-08 18:03:30 | serhiy.storchaka | set | messages:
+ msg157803 |
2012-04-08 17:59:16 | serhiy.storchaka | set | messages:
+ msg157802 |
2012-04-08 15:14:04 | pitrou | set | messages:
+ msg157790 |
2012-04-07 17:28:07 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka
|
2011-08-28 17:58:31 | terry.reedy | set | nosy:
+ terry.reedy
|
2011-08-22 01:21:44 | jcon | set | title: Optimzations for bytes.join() et. al -> Optimizations for bytes.join() et. al |
2011-08-22 00:43:55 | jcon | create | |