Issue 12805: Optimizations for bytes.join() et. al

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/57014

classification

Title:	Optimizations for bytes.join() et. al
Type:	performance	Stage:	resolved
Components:	Interpreter Core	Versions:	Python 3.4

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	benjamin.peterson, ezio.melotti, jcon, pitrou, python-dev, serhiy.storchaka, terry.reedy, vstinner
Priority:	normal	Keywords:	patch

Created on 2011-08-22 00:43 by jcon, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
bytes_join_optimization.patch	jcon, 2011-08-22 00:43		review
bytes_join_optimization_2.patch	serhiy.storchaka, 2012-10-19 11:25	Optimization limited by the heuristics	review
bytes_join_bench.sh	serhiy.storchaka, 2012-10-19 11:25	Benchmark script
bytes_join_bench.res	serhiy.storchaka, 2012-10-20 16:24	Benchmark results on AMD Athlon 64 X2 4600+
bytes_join_bench_2.res	serhiy.storchaka, 2012-10-20 16:24	Benchmark results on Intel Atom N570 @ 1.66GHz
bytes_join_optimization_3.patch	serhiy.storchaka, 2012-10-20 18:54	Only empty separator case left	review

Messages (16)
msg142650 - (view)	Author: John O'Connor (jcon)	Date: 2011-08-22 00:43
I noticed there are a few more common cases that can be sped up in bytes.join(). When: - The sequence length and separator length are both 0 - The separator length is 0 or 1 - The separator string contains only 1 distinct byte These could also be applied to other sequence types. for i in {{0..5}}; do ./python -m timeit -s "f=open('/usr/share/dict/words', 'rb'); L=f.readlines()" \ "b' '.join(L)"; done - BEFORE - 100 loops, best of 3: 3.79 msec per loop 100 loops, best of 3: 3.5 msec per loop 100 loops, best of 3: 3.75 msec per loop - AFTER - 100 loops, best of 3: 2.81 msec per loop 100 loops, best of 3: 2.81 msec per loop 100 loops, best of 3: 3.03 msec per loop ~20% speedup Same test on a smaller list (lines from LICENSE file) yields a similar 15-20% speedup. Same test on L = [b'1', b'2', b'3'] yields 10% speedup
msg157790 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-04-08 15:14
Honestly, I don't think there's much point in these optimizations. The first one ("The sequence length and separator length are both 0") is probably useless.
msg157802 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-04-08 17:59
> Honestly, I don't think there's much point in these optimizations. The first one ("The sequence length and separator length are both 0") is probably useless. This optimization changes the behavior. ... def __repr__(self): return 'B(' + super().__repr__() + ')' ... >>> B() B(b'') >>> B().join([]) b'' With the patch we get B(b''). Regardless of whether optimization (which is negligible in this case), I don't know whether to consider such side effect desirable or undesirable. bytes.join need to optimize for the 0- and 1-byte delimiter, but the proposed patch leads to pessimization: Unpatched: $ ./python -m timeit -s 'seq=[bytes([i]100000) for i in range(256)]' 'b"".join(seq)' 10 loops, best of 3: 43.3 msec per loop Patched: $ ./python -m timeit -s 'seq=[bytes([i]100000) for i in range(256)]' 'b" ".join(seq)' 10 loops, best of 3: 70.3 msec per loop
msg157803 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-04-08 18:03
> Regardless of whether optimization (which is negligible in this case), I don't know whether to consider such side effect desirable or undesirable. After some thought, I realized that this is an erroneous behavior.
msg157804 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2012-04-08 18:37
Side note: Windows requires that args be quoted with ", not ', to work properly, at least with these args. Main note: the patched test adds a space to the separator, but that is not enough to account for the difference. c:\Programs\Python32>python -m timeit -s "seq=[bytes([i]1000) for i in range(256)]" "b''.join(seq)" 10000 loops, best of 3: 31.7 usec per loop c:\Programs\Python32>python -m timeit -s "seq=[bytes([i]1000) for i in range(256)]" "b' '.join(seq)" 10000 loops, best of 3: 34.1 usec per loop The behavior change is wrong and test_bytes.py seems to need augmentation. It begins with "XXX This is a mess. [...]". class BaseBytesTest(unittest.TestCase): def test_join(self) tests b''.join([]) but as far as I can tell, not b'something'.join([]), the failing case found by Serhiy. It end with " # XXX more...".
msg157806 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-04-08 19:08
> Main note: the patched test adds a space to the separator, but that is not enough to account for the difference. Sorry, I copied the wrong line. $ ./python -m timeit -s "seq=[bytes([i]*100000) for i in range(256)]" "b' '.join(seq)" 10 loops, best of 3: 45.6 msec per loop But the conclusions are not affected, for larger byte strings patched version will be 1.5-2 times slower in limit.
msg173324 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-10-19 11:25
I adapted a patch to the current code (now it works for bytearrays too), deleted the behavior change, left optimization only for 0- and 1-bytes separator, deleted tests (bytes.join tests already exists), wrote a benchmark and after long experiments has set a condition for this optimization. Now, I believe, the patch never leads to regression. The patch has been simplified, it's only 13 lines long. Thank you, John O'Connor.
msg173364 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-10-19 23:43
I don't understand the point of this optimization. It looks rather cryptic and very specific. Can you elaborate?
msg173375 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-10-20 08:53
In a normal loop we copy into the output buffer for each iteration the element and the separator. As you know, the processor can copy large amounts of data several times more efficiently than byte-for-byte. Therefore, filling the buffer by memset more efficient than calling Py_MEMCPY on each iteration if the buffer size is not too large. Experimenting on different machines I found that Py_MEMCPY is more expensive than sequential filling up to 32 bytes (may be even more, but I chose a conservative estimate). And for empty separator we can just omit copying and filling (it saves several checks per iteration).
msg173378 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-10-20 09:37
Ok, but does it really make a difference and in which cases? In other words, do you have benchmark numbers? :)
msg173398 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-10-20 16:24
> Ok, but does it really make a difference and in which cases? Up to 40% on Athlon and up to 70% on Atom. > In other words, do you have benchmark numbers? :) Attached results for AMD Athlon 64 X2 4600+ and Intel Atom N570 @ 1.66GHz under 32-bit Linux. You can run the benchmark on your machine (requires Posix- compatible shell, sed awk).
msg173402 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-10-20 18:15
Ok, here are the benchmark results here with a 1-byte separator: 10 x 10 0.244 usec 0.202 usec +21% 100 x 10 0.325 usec 0.326 usec -0% 1000 x 10 0.691 usec 0.689 usec +0% 10 x 1000 18.2 usec 14.2 usec +28% 100 x 1000 39.8 usec 40.6 usec -2% 1000 x 1000 94.6 usec 96 usec -1% and with an empty separator: 10 x 10 0.245 usec 0.198 usec +24% 100 x 10 0.335 usec 0.286 usec +17% 1000 x 10 0.637 usec 0.593 usec +7% 10 x 1000 18.9 usec 14.1 usec +34% 100 x 1000 40.3 usec 36.2 usec +11% 1000 x 1000 93.7 usec 96.9 usec -3% (Core i5 2500K, 64-bit, gcc) I would say the empty separator case is interesting, because a common use case for bytes.join() is indeed fast concatenation. However, the 1-byte separator case could be dropped, which would simplify the patch and the heuristic.
msg173403 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-10-20 18:53
Indeed, 1-char separator is more common for strings, but I found several b'\n'.join or b','.join even in stdlib. 3 lines of difference are only 2.3% of Objects/stringlib/join.h. Here is a patch with dropped the 1-byte separator case.
msg173416 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-10-20 21:14
New changeset bfa715f98c0f by Antoine Pitrou in branch 'default': Issue #12805: Make bytes.join and bytearray.join faster when the separator is empty. http://hg.python.org/cpython/rev/bfa715f98c0f
msg173417 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-10-20 21:15
I have committed the simple optimization patch, thank you!
msg173418 - (view)	Author: John O'Connor (jcon)	Date: 2012-10-20 21:18
I think that's a good compromise and much better than what I had originally. - John

History
Date	User	Action	Args
2022-04-11 14:57:20	admin	set	github: 57014
2012-10-20 21:18:38	jcon	set	messages: + msg173418
2012-10-20 21:15:41	pitrou	set	status: open -> closed resolution: fixed messages: + msg173417 stage: resolved
2012-10-20 21:14:54	python-dev	set	nosy: + python-dev messages: + msg173416
2012-10-20 18:54:51	serhiy.storchaka	set	files: + bytes_join_optimization_3.patch
2012-10-20 18:53:01	serhiy.storchaka	set	messages: + msg173403
2012-10-20 18:15:54	pitrou	set	messages: + msg173402
2012-10-20 16:24:42	serhiy.storchaka	set	files: + bytes_join_bench.res, bytes_join_bench_2.res messages: + msg173398
2012-10-20 09:37:21	pitrou	set	messages: + msg173378
2012-10-20 08:53:27	serhiy.storchaka	set	messages: + msg173375 versions: + Python 3.4, - Python 3.3
2012-10-19 23:43:45	pitrou	set	messages: + msg173364
2012-10-19 11:25:31	serhiy.storchaka	set	files: + bytes_join_optimization_2.patch, bytes_join_bench.sh messages: + msg173324
2012-04-08 19:08:25	serhiy.storchaka	set	messages: + msg157806
2012-04-08 18:37:44	terry.reedy	set	messages: + msg157804
2012-04-08 18:03:30	serhiy.storchaka	set	messages: + msg157803
2012-04-08 17:59:16	serhiy.storchaka	set	messages: + msg157802
2012-04-08 15:14:04	pitrou	set	messages: + msg157790
2012-04-07 17:28:07	serhiy.storchaka	set	nosy: + serhiy.storchaka
2011-08-28 17:58:31	terry.reedy	set	nosy: + terry.reedy
2011-08-22 01:21:44	jcon	set	title: Optimzations for bytes.join() et. al -> Optimizations for bytes.join() et. al
2011-08-22 00:43:55	jcon	create