Issue 37512
Created on 2019-07-05 21:33 by dmitriym, last changed 2022-04-11 14:59 by admin. This issue is now closed.
Messages (5)
msg347390 - (view) | Author: Dmitriy (dmitriym) | Date: 2019-07-05 21:33

There is an error in the documentation about string concatenation: https://docs.python.org/3/library/stdtypes.html

---
Common Sequence Operations [...]
6. Concatenating immutable sequences always results in a new object. This means that building up a sequence by repeated concatenation will have a quadratic runtime cost in the total sequence length. To get a linear runtime cost, you must switch to one of the alternatives below:
- if concatenating str objects, you can build a list and use str.join() at the end or else write to an io.StringIO instance and retrieve its value when complete
---

This is no longer true for str objects. An example using the timeit module shows that the time grows linearly:

    >>> timeit('a+="a"', setup='a=""', number=10000)
    0.0005754000012530014
    >>> timeit('a+="a"', setup='a=""', number=100000)
    0.005819800004246645

But for bytes it is still true:

    >>> timeit('a+=b"a"', setup='a=b""', number=10000)
    0.0017669000080786645
    >>> timeit('a+=b"a"', setup='a=b""', number=100000)
    0.20758410000416916
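The two linear-time alternatives named in the quoted documentation can be sketched as a minimal standard-library example (the `parts` list here is illustrative, not from the issue):

```python
import io

parts = ["alpha", "beta", "gamma"]  # illustrative input pieces

# Alternative 1: collect the pieces in a list and join once at the end.
joined = "".join(parts)

# Alternative 2: write the pieces to an io.StringIO buffer and
# retrieve the accumulated value when complete.
buf = io.StringIO()
for p in parts:
    buf.write(p)

assert joined == buf.getvalue() == "alphabetagamma"
```

Both approaches do a single final allocation proportional to the total length, which is why they stay linear on every Python implementation.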
msg347400 - (view) | Author: Eric V. Smith (eric.smith) * | Date: 2019-07-05 22:32

It's my understanding that this is a quality-of-implementation issue, and that in other (non-CPython) implementations the run time for repeated concatenation may indeed be quadratic. The optimization in CPython relies on knowing that the reference count is 1. If CPython were to switch away from reference counting, I would expect the behavior of repeated concatenation to become quadratic again. I'm not sure whether this deserves a documentation note or not.
msg347412 - (view) | Author: Steven D'Aprano (steven.daprano) * | Date: 2019-07-06 01:14

Eric is correct that this is a CPython optimization; it is not a language feature. Furthermore, it is an optimization that can be affected by rather subtle factors such as the operating system's memory management. Here is a thread demonstrating that code which relied on this optimization ran fast on Linux and slow on Windows, literally hundreds of times slower than other tools:

https://mail.python.org/archives/list/python-dev@python.org/thread/EQO6HDZXHFCP4DBEE5Q7MYLHPJMGJ7DQ/

If you prefer the old Mailman archives, here are a few relevant posts:

https://mail.python.org/pipermail/python-dev/2009-August/091125.html
https://mail.python.org/pipermail/python-dev/2009-September/091582.html
https://mail.python.org/pipermail/python-dev/2009-September/091592.html

Best practice is to *not* rely on this optimization, but to use str.join. That ensures that:

- your code remains fast if run on alternate implementations;
- your code remains fast if the OS memory management changes;
- your code remains fast if you change your code slightly.

For example, I expect both of the following to show the documented quadratic behaviour:

- prepending instead of appending;
- keeping two or more references to the string being appended to:

      partials = []
      current = ''
      for s in list_of_strings:
          partials.append(current)
          current += s

There may be others. Perhaps we ought to document this optimization specifically, so we can suggest that it is intended for improving the performance of casual scripts, and that *best practice* for professional-quality code remains str.join.
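The two patterns Steven lists can be made concrete in a small sketch (the names follow his example; the reference-count explanation in the comments is specific to CPython):

```python
list_of_strings = ["a", "b", "c"]  # illustrative input

# Prepending: the growing string is the right-hand operand of +, so
# CPython cannot resize it in place; each step copies everything
# built so far, and the loop stays quadratic.
prepended = ""
for s in list_of_strings:
    prepended = s + prepended

# Keeping extra references: every partial result is still reachable
# from the list, so the reference count is never 1 and the in-place
# optimization cannot apply.
partials = []
current = ""
for s in list_of_strings:
    partials.append(current)
    current += s

assert prepended == "cba"
assert partials == ["", "a", "ab"]
assert current == "abc"
```

Both loops produce correct results; only their asymptotic cost differs, which is exactly why relying on the optimization is fragile.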
msg347428 - (view) | Author: Dmitriy (dmitriym) | Date: 2019-07-06 11:32

Yes, the optimization really does not work in the case of prepending. In the case of multiple references I could not get quadratic time growth.

Concerning Windows, yes, the optimization may not always be efficient:

    >>> timeit('a+="a"', setup='a=""', number=10000)
    0.0011690999999984797
    >>> timeit('a+="a"', setup='a=""', number=100000)
    0.01114439999999206
    >>> timeit('a+="a"', setup='a=""', number=1000000)
    0.10783829999999739
    >>> timeit('a+="a"', setup='a=""', number=10000000)
    5.636337499999996

As I understand it, this is related to OS memory management. But on Linux I got fairly predictable results:

    >>> timeit('a+="a"', setup='a=""', number=10000)
    0.0006532900151796639
    >>> timeit('a+="a"', setup='a=""', number=100000)
    0.006340583000564948
    >>> timeit('a+="a"', setup='a=""', number=1000000)
    0.06438201799755916
    >>> timeit('a+="a"', setup='a=""', number=10000000)
    0.6354853530065157
    >>> timeit('a+="a"', setup='a=""', number=100000000)
    6.365498173021479

I also found a mention of the optimization in PEP 8: https://www.python.org/dev/peps/pep-0008/#programming-recommendations

So maybe it would be nice to add a note, or a reference to that part of the PEP, to the docs, to make the CPython-specific optimization clearer.
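The measurements above can be reproduced with a short timeit sketch; absolute numbers depend on the OS, allocator, and CPython version, so only the growth trend (linear vs. quadratic in the total) is meaningful:

```python
from timeit import timeit

# Time repeated str concatenation at increasing sizes. Roughly linear
# growth in the totals suggests CPython's in-place resize is active;
# super-linear growth suggests it is not (e.g. other implementations,
# or allocation patterns like those seen on Windows above).
for n in (10_000, 100_000):
    t = timeit('a += "a"', setup='a = ""', number=n)
    print(f"{n:>7} appends: {t:.6f}s")
```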
msg347505 - (view) | Author: Brett Cannon (brett.cannon) * | Date: 2019-07-08 19:07

I'm going to close this, as we try to avoid documenting CPython-specific details unless absolutely necessary, and in the case of documenting the str type we should avoid it. So thanks for the suggestion, Dmitriy, but we will just leave this as a pleasant surprise for folks who accidentally make the mistake of writing code this way.
History

Date | User | Action | Args
---|---|---|---
2022-04-11 14:59:17 | admin | set | github: 81693
2019-07-08 19:07:58 | brett.cannon | set | status: open -> closed; nosy: + brett.cannon; messages: + msg347505; resolution: rejected; stage: resolved
2019-07-06 11:32:57 | dmitriym | set | messages: + msg347428
2019-07-06 01:14:12 | steven.daprano | set | nosy: + steven.daprano; messages: + msg347412
2019-07-05 22:32:02 | eric.smith | set | nosy: + eric.smith; messages: + msg347400
2019-07-05 21:33:18 | dmitriym | create |