Issue 37512: Error in the documentation about string concatenation

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/81693

classification

Title:	Error in the documentation about string concatenation
Type:	enhancement	Stage:	resolved
Components:	Documentation	Versions:	Python 3.7

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	brett.cannon, dmitriym, docs@python, eric.smith, steven.daprano
Priority:	normal	Keywords:

Created on 2019-07-05 21:33 by dmitriym, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (5)
msg347390 - (view)	Author: Dmitriy (dmitriym)	Date: 2019-07-05 21:33
There is an error in documentation about string concatenation: https://docs.python.org/3/library/stdtypes.html --- Common Sequence Operations [...] 6. Concatenating immutable sequences always results in a new object. This means that building up a sequence by repeated concatenation will have a quadratic runtime cost in the total sequence length. To get a linear runtime cost, you must switch to one of the alternatives below: - if concatenating str objects, you can build a list and use str.join() at the end or else write to an io.StringIO instance and retrieve its value when complete --- It is not true for str objects anymore. Example using timeit module shows that time grows linearly: >>> timeit('a+="a"', setup='a=""', number=10000) 0.0005754000012530014 >>> timeit('a+="a"', setup='a=""', number=100000) 0.005819800004246645 But for bytes it is still right: >>> timeit('a+=b"a"', setup='a=b""', number=10000) 0.0017669000080786645 >>> timeit('a+=b"a"', setup='a=b""', number=100000) 0.20758410000416916
msg347400 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2019-07-05 22:32
It's my understanding that this is a quality of implementation issue, and that in other (non-CPython) implementations, the run time for repeated concatenation may indeed be quadratic. The optimization in CPython relies on knowing the reference count is 1. If CPython were to switch away from reference counting, I would expect the behavior of repeated concatenation to be quadratic again. I'm not sure if the deserves a documentation note or not.
msg347412 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2019-07-06 01:14
Eric is correct that this is a CPython optimization, it is not a language feature. Furthermore, it is an optimization that can be affected by rather subtle factors such as the operating system memory management. Here is a thread demonstrating that code that relied on this optimization ran fast on Linux and slow on Windows, literally hundreds of times slower than other tools: https://mail.python.org/archives/list/python-dev@python.org/thread/EQO6HDZXHFCP4DBEE5Q7MYLHPJMGJ7DQ/ If you prefer the old mailman archives, here are a few relevant posts: https://mail.python.org/pipermail/python-dev/2009-August/091125.html https://mail.python.org/pipermail/python-dev/2009-September/091582.html https://mail.python.org/pipermail/python-dev/2009-September/091592.html Best practice is to not rely on this optimization, but to use str.join. That ensures that: - your code remains fast if run on alternate implementations; - your code remains fast if the OS memory management changes; - your code remains fast if you change your code slightly. For example, I expect both of the following to show the documented quadratic behaviour: - Prepending instead of appending - keeping two or more references to the string being appended to partials = [] current = '' for s in list_of_strings: partials.append(current) current += s There may be others. Perhaps we ought to document this optimization specifically so we can suggest that it is intended for improving performance of casual scripts, and that best practice for professional quality code remains str.join.
msg347428 - (view)	Author: Dmitriy (dmitriym)	Date: 2019-07-06 11:32
Yes, optimization is really not working in case of prepending. In case of multiple references I couldn't get quadratic time grow. Concerning the Windows, yes, the optimization may be not always efficient: >>> timeit('a+="a"', setup='a=""', number=10000) 0.0011690999999984797 >>> timeit('a+="a"', setup='a=""', number=100000) 0.01114439999999206 >>> timeit('a+="a"', setup='a=""', number=1000000) 0.10783829999999739 >>> timeit('a+="a"', setup='a=""', number=10000000) 5.636337499999996 As I understand this is the case related to OS memory management. But on Linux I got fairly predictable results: >>> timeit('a+="a"', setup='a=""', number=10000) 0.0006532900151796639 >>> timeit('a+="a"', setup='a=""', number=100000) 0.006340583000564948 >>> timeit('a+="a"', setup='a=""', number=1000000) 0.06438201799755916 >>> timeit('a+="a"', setup='a=""', number=10000000) 0.6354853530065157 >>> timeit('a+="a"', setup='a=""', number=100000000) 6.365498173021479 Also I have found the mention about optimization in PEP8 https://www.python.org/dev/peps/pep-0008/#programming-recommendations So maybe it would be nice to add some notes or reference to the part of upper PEP in docs about optimizations in CPython to make it more clear.
msg347505 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2019-07-08 19:07
I'm going to close this as we try to avoid documenting CPython-specific details unless absolutely necessary. And in the case of documenting the str type we should avoid that. So thanks for the suggestion, Dmitriy, but we will just leave this as a pleasant surprise for folks who accidentally make the mistake of writing code this way.

History
Date	User	Action	Args
2022-04-11 14:59:17	admin	set	github: 81693
2019-07-08 19:07:58	brett.cannon	set	status: open -> closed nosy: + brett.cannon messages: + msg347505 resolution: rejected stage: resolved
2019-07-06 11:32:57	dmitriym	set	messages: + msg347428
2019-07-06 01:14:12	steven.daprano	set	nosy: + steven.daprano messages: + msg347412
2019-07-05 22:32:02	eric.smith	set	nosy: + eric.smith messages: + msg347400
2019-07-05 21:33:18	dmitriym	create