classification
Title: Optimizations for bytes.join() et. al
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, ezio.melotti, haypo, jcon, pitrou, python-dev, serhiy.storchaka, terry.reedy
Priority: normal Keywords: patch

Created on 2011-08-22 00:43 by jcon, last changed 2012-10-20 21:18 by jcon. This issue is now closed.

Files
File name Uploaded Description Edit
bytes_join_optimization.patch jcon, 2011-08-22 00:43 review
bytes_join_optimization_2.patch serhiy.storchaka, 2012-10-19 11:25 Optimization limited by the heuristics review
bytes_join_bench.sh serhiy.storchaka, 2012-10-19 11:25 Benchmark script
bytes_join_bench.res serhiy.storchaka, 2012-10-20 16:24 Benchmark results on AMD Athlon 64 X2 4600+
bytes_join_bench_2.res serhiy.storchaka, 2012-10-20 16:24 Benchmark results on Intel Atom N570 @ 1.66GHz
bytes_join_optimization_3.patch serhiy.storchaka, 2012-10-20 18:54 Only empty separator case left review
Messages (16)
msg142650 - (view) Author: John O'Connor (jcon) Date: 2011-08-22 00:43
I noticed there are a few more common cases that can be sped up in bytes.join().

When:
 - The sequence length and separator length are both 0
 - The separator length is 0 or 1
 - The separator string contains only 1 distinct byte

These could also be applied to other sequence types.

for i in {{0..5}}; do
./python -m timeit -s "f=open('/usr/share/dict/words', 'rb'); L=f.readlines()" \
"b' '.join(L)"; 
done

- BEFORE -
100 loops, best of 3: 3.79 msec per loop
100 loops, best of 3: 3.5 msec per loop
100 loops, best of 3: 3.75 msec per loop
- AFTER -
100 loops, best of 3: 2.81 msec per loop
100 loops, best of 3: 2.81 msec per loop
100 loops, best of 3: 3.03 msec per loop

~20% speedup

Same test on a smaller list (lines from LICENSE file) yields a similar 15-20% speedup.
Same test on L = [b'1', b'2', b'3'] yields 10% speedup
msg157790 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-08 15:14
Honestly, I don't think there's much point in these optimizations. The first one ("The sequence length and separator length are both 0") is probably useless.
msg157802 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-08 17:59
> Honestly, I don't think there's much point in these optimizations. The first one ("The sequence length and separator length are both 0") is probably useless.

This optimization changes the behavior.

...     def __repr__(self): return 'B(' + super().__repr__() + ')'
... 
>>> B()
B(b'')
>>> B().join([])
b''

With the patch we get B(b'').

Regardless of whether optimization (which is negligible in this case), I don't know whether to consider such side effect desirable or undesirable.

bytes.join need to optimize for the 0- and 1-byte delimiter, but the proposed patch leads to pessimization:

Unpatched:
$ ./python -m timeit -s 'seq=[bytes([i]*100000) for i in range(256)]' 'b"".join(seq)'
10 loops, best of 3: 43.3 msec per loop

Patched:
$ ./python -m timeit -s 'seq=[bytes([i]*100000) for i in range(256)]' 'b" ".join(seq)'
10 loops, best of 3: 70.3 msec per loop
msg157803 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-08 18:03
> Regardless of whether optimization (which is negligible in this case), I don't know whether to consider such side effect desirable or undesirable.

After some thought, I realized that this is an erroneous behavior.
msg157804 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-04-08 18:37
Side note: Windows requires that args be quoted with ", not ', to work properly, at least with these args.

Main note: the patched test adds a space to the separator, but that is not enough to account for the difference.

c:\Programs\Python32>python -m timeit -s "seq=[bytes([i]*1000) for i in range(256)]" "b''.join(seq)"
10000 loops, best of 3: 31.7 usec per loop

c:\Programs\Python32>python -m timeit -s "seq=[bytes([i]*1000) for i in range(256)]" "b' '.join(seq)"
10000 loops, best of 3: 34.1 usec per loop

The behavior change is wrong and test_bytes.py seems to need augmentation. It begins with "XXX This is a mess. [...]".

class BaseBytesTest(unittest.TestCase):
    def test_join(self)
tests b''.join([]) but as far as I can tell, not b'something'.join([]), the failing case found by Serhiy.
It end with "        # XXX more...".
msg157806 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-08 19:08
> Main note: the patched test adds a space to the separator, but that is not enough to account for the difference.

Sorry, I copied the wrong line.

$ ./python -m timeit -s "seq=[bytes([i]*100000) for i in range(256)]" "b' '.join(seq)"
10 loops, best of 3: 45.6 msec per loop

But the conclusions are not affected, for larger byte strings patched
version will be 1.5-2 times slower in limit.
msg173324 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-19 11:25
I adapted a patch to the current code (now it works for bytearrays too), 
deleted the behavior change, left optimization only for 0- and 1-bytes 
separator, deleted tests (bytes.join tests already exists), wrote a benchmark 
and after long experiments has set a condition for this optimization. Now, I 
believe, the patch never leads to regression. The patch has been simplified, 
it's only 13 lines long.

Thank you, John O'Connor.
msg173364 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-10-19 23:43
I don't understand the point of this optimization. It looks rather cryptic and very specific. Can you elaborate?
msg173375 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-20 08:53
In a normal loop we copy into the output buffer for each iteration the element and the separator. As you know, the processor can copy large amounts of data several times more efficiently than byte-for-byte. Therefore, filling the buffer by memset more efficient than calling Py_MEMCPY on each iteration if the buffer size is not too large. Experimenting on different machines I found that Py_MEMCPY is more expensive than sequential filling up to 32 bytes (may be even more, but I chose a conservative estimate). And for empty separator we can just omit copying and filling (it saves several checks per iteration).
msg173378 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-10-20 09:37
Ok, but does it really make a difference and in which cases?
In other words, do you have benchmark numbers? :)
msg173398 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-20 16:24
> Ok, but does it really make a difference and in which cases?

Up to 40% on Athlon and up to 70% on Atom.

> In other words, do you have benchmark numbers? :)

Attached results for AMD Athlon 64 X2 4600+ and Intel Atom N570 @ 1.66GHz 
under 32-bit Linux. You can run the benchmark on your machine (requires Posix-
compatible shell, sed awk).
msg173402 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-10-20 18:15
Ok, here are the benchmark results here with a 1-byte separator:

  10 x 10       0.244 usec   0.202 usec   +21%
 100 x 10       0.325 usec   0.326 usec    -0%
1000 x 10       0.691 usec   0.689 usec    +0%
  10 x 1000      18.2 usec    14.2 usec   +28%
 100 x 1000      39.8 usec    40.6 usec    -2%
1000 x 1000      94.6 usec      96 usec    -1%

and with an empty separator:

  10 x 10       0.245 usec   0.198 usec   +24%
 100 x 10       0.335 usec   0.286 usec   +17%
1000 x 10       0.637 usec   0.593 usec    +7%
  10 x 1000      18.9 usec    14.1 usec   +34%
 100 x 1000      40.3 usec    36.2 usec   +11%
1000 x 1000      93.7 usec    96.9 usec    -3%

(Core i5 2500K, 64-bit, gcc)

I would say the empty separator case is interesting, because a common use case for bytes.join() is indeed fast concatenation. However, the 1-byte separator case could be dropped, which would simplify the patch and the heuristic.
msg173403 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-20 18:53
Indeed, 1-char separator is more common for strings, but I found several b'\n'.join or b','.join even in stdlib. 3 lines of difference are only 2.3% of Objects/stringlib/join.h.

Here is a patch with dropped the 1-byte separator case.
msg173416 - (view) Author: Roundup Robot (python-dev) Date: 2012-10-20 21:14
New changeset bfa715f98c0f by Antoine Pitrou in branch 'default':
Issue #12805: Make bytes.join and bytearray.join faster when the separator is empty.
http://hg.python.org/cpython/rev/bfa715f98c0f
msg173417 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-10-20 21:15
I have committed the simple optimization patch, thank you!
msg173418 - (view) Author: John O'Connor (jcon) Date: 2012-10-20 21:18
I think that's a good compromise and much better than what I had originally.

- John
History
Date User Action Args
2012-10-20 21:18:38jconsetmessages: + msg173418
2012-10-20 21:15:41pitrousetstatus: open -> closed
resolution: fixed
messages: + msg173417

stage: resolved
2012-10-20 21:14:54python-devsetnosy: + python-dev
messages: + msg173416
2012-10-20 18:54:51serhiy.storchakasetfiles: + bytes_join_optimization_3.patch
2012-10-20 18:53:01serhiy.storchakasetmessages: + msg173403
2012-10-20 18:15:54pitrousetmessages: + msg173402
2012-10-20 16:24:42serhiy.storchakasetfiles: + bytes_join_bench.res, bytes_join_bench_2.res

messages: + msg173398
2012-10-20 09:37:21pitrousetmessages: + msg173378
2012-10-20 08:53:27serhiy.storchakasetmessages: + msg173375
versions: + Python 3.4, - Python 3.3
2012-10-19 23:43:45pitrousetmessages: + msg173364
2012-10-19 11:25:31serhiy.storchakasetfiles: + bytes_join_optimization_2.patch, bytes_join_bench.sh

messages: + msg173324
2012-04-08 19:08:25serhiy.storchakasetmessages: + msg157806
2012-04-08 18:37:44terry.reedysetmessages: + msg157804
2012-04-08 18:03:30serhiy.storchakasetmessages: + msg157803
2012-04-08 17:59:16serhiy.storchakasetmessages: + msg157802
2012-04-08 15:14:04pitrousetmessages: + msg157790
2012-04-07 17:28:07serhiy.storchakasetnosy: + serhiy.storchaka
2011-08-28 17:58:31terry.reedysetnosy: + terry.reedy
2011-08-22 01:21:44jconsettitle: Optimzations for bytes.join() et. al -> Optimizations for bytes.join() et. al
2011-08-22 00:43:55jconcreate