Issue 42699: Use `.join(k for k in g)` instead of `.join([k for k in g])`

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/86865

classification

Title:	Use `.join(k for k in g)` instead of `.join([k for k in g])`
Type:	performance	Stage:	resolved
Components:		Versions:	Python 3.10, Python 3.9, Python 3.8, Python 3.7, Python 3.6

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	SamuelMarks, eric.smith, josh.r, kj, rhettinger, samuelmarks, steven.daprano
Priority:	normal	Keywords:

Created on 2020-12-21 01:39 by samuelmarks, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 23875	closed	samuelmarks, 2020-12-21 01:39

Messages (16)
msg383474 - (view)	Author: Samuel Marks (samuelmarks) *	Date: 2020-12-21 01:39
This is an extremely minor improvement. Rather than create a `list`—using a comprehension—then have it consumed by `.join`, one can skip the list construction entirely. (I remember this working from at least Python 2.7… probably earlier also)
msg383475 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2020-12-21 01:41
Do you have any benchmarks to show this is an actual improvement? Often times it is not.
msg383476 - (view)	Author: Samuel Marks (samuelmarks) *	Date: 2020-12-21 01:47
@eric.smith No benchmarks offhand, but I'd expect it to be a very minor improvement (if detectable). If this gets accepted I'll probably do a bunch of little changes like this, to improve things, e.g., replace '%' with '.format' (or f-strings, whatever you prefer), ensure `.iterkeys()`/`.iteritems()` validity, and collapse some obvious `.append` cases with list comprehensions. The idea I'm going off is that when one is debugging their Python code, and it goes across to the Python source, that that Python source code quality is better or equal to the one the higher-level Python developer is creating. Constructing unnecessary lists is one such code quality issue.
msg383483 - (view)	Author: Josh Rosenberg (josh.r) *	Date: 2020-12-21 03:14
This is a pessimization given the current implementation of str.join; it calls PySequence_Fast as the very first step, which is effectively free for a tuple or list input (just reference count manipulation), but must convert a generator expression to a list (which is slower than building the list with a listcomp in the first place). It does this so it can do two passes, one to compute the final length (and max ordinal) of the string, allowing it to allocate just once, and one to build the new string. In theory, it might be rewritten to use PyUnicodeWriter under-the-hood for single-pass operation, but as is, a generator expression is slower than a listcomp for this task.
msg383485 - (view)	Author: Ken Jin (kj) *	Date: 2020-12-21 03:50
Sorry for intruding, but I thought I'd offer some rudimentary, non-scientific benchmarks for this: [MSC v.1928 32 bit (Intel)] on win32 # a debug build of python, no compiler optimizations import timeit # gen comp timeit.timeit("''.join(str(_) for _ in range(1000))", number=10000) 11.154560299999957 # list comp timeit.timeit("''.join([str(_) for _ in range(1000)])", number=10000) 9.987510899999961 The list comp is slightly faster than the gen comp. Interestingly, if one were to use python -m timeit instead, the gen comp would show better results since it has a better 'best of 5' timing. IMO, total time is a more accurate representation than best of 5 since the latter gets skewed by outliers.
msg383486 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2020-12-21 04:37
Sorry Samuel, but this would be a performance degradation. The reason is that the algorithm of str.join makes two passes over the input, so it runs faster when the input is already a list; otherwise, it would have to do the additional work of creating a list.
msg383492 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2020-12-21 07:49
By the way, it is almost always wrong to write "k for k in iterable" when you can just write "iterable" or "list(iterable)". Here are some micro-benchmarks: [steve ~]$ python3.9 -m timeit -s "from string import ascii_letters" "''.join(k for k in ascii_letters)" 100000 loops, best of 5: 2.3 usec per loop [steve ~]$ python3.9 -m timeit -s "from string import ascii_letters" "''.join([k for k in ascii_letters])" 200000 loops, best of 5: 1.57 usec per loop [steve ~]$ python3.9 -m timeit -s "from string import ascii_letters" "''.join(list(ascii_letters))" 500000 loops, best of 5: 749 nsec per loop [steve ~]$ python3.9 -m timeit -s "from string import ascii_letters" "''.join(ascii_letters)" 500000 loops, best of 5: 737 nsec per loop
msg383497 - (view)	Author: Samuel Marks (samuelmarks) *	Date: 2020-12-21 09:19
Yeah I hear ya, was just trying for the most concise issue title. I tend to [over]use `map`, `filter`, `filterfalse` and other `itertools` and `operator` methods in my own codebase. Surprised with that result, that using an explicit list is actually faster. Seems like an obvious* micro-optimisation. *But don't want to say that unless I'm actually maintaining/contributing-to your C code. It's also surprising—last time I checked—that lists are faster to construct than tuples. When I create codebases or maintain other peoples, I try and: - remove all unnecessary mutability (incl. replacing lists with tuples); - flatten `.append` occurrences into generator comprehensions or map; - remove all indentation creating for-loops, replacing with comprehensions or map and functions or lambdas - combine generators rather than concatenate lists; The general idea here is to evaluate at the time of computation, and be lazy everywhere else. So searching the whole codebase for lists and other mutable structures is a good first step. But maybe with efficiency losses, like shown here, means that this would only aid [maybe] in readability, understandability, traceability & whatever other functional -ility; but not performance? :(
msg383499 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2020-12-21 09:32
@samuelmarks: A place where there it is possible to improve performance is with f-strings replacing %-formatting or str.format. This does move significant work to compile time. However, we'd be unlikely to accept a wholesale stdlib change that swaps in f-strings. Instead, if there were specific places where benchmarks showed real world improvements, we should look at those on a case-by-case basis. Also note that replacing %-formatting with .format() is almost always a performance pessimization.
msg383500 - (view)	Author: Samuel Marks (samuelmarks) *	Date: 2020-12-21 09:38
Wait I don't understand why you wouldn't accept a wholesale replacement of all `%` and `format` with f-strings through the entire CPython codebase [master branch]? BTW: Kinda unrelated, but would be great to have perspective on this little project - https://github.com/SamuelMarks/doctrans - I'm interested in wholesale enforcement of code consistency and translating between constructs (e.g., dataclass to/fro argparse to/fro class method; ReST to/fro Google to/fro NumPy dostrings; inline types to/fro types in docstrings; with more planned)
msg383508 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2020-12-21 10:30
> Wait I don't understand why you wouldn't accept a wholesale replacement of all `%` and `format` with f-strings through the entire CPython codebase [master branch]? For such a large change it's difficult to review every single change and ensure it's correct. I'm guessing there are thousands of occurrences. In addition, it runs the risk of breaking any existing pull requests. And in the vast majority of cases it wouldn't make any noticeable performance difference. So why risk the breakage and endure the hassle?
msg383513 - (view)	Author: Samuel Marks (samuelmarks) *	Date: 2020-12-21 11:51
I suppose that's a good justification to never improve/upgrade the syntax and quality of the codebase. In terms of automatic upgrades of the codebase, one could always replicate the approach I use in doctrans—i.e., use of `ast` and/or `inspect`—to automatically upgrade syntax from `%` and `.format` to use f-strings. Would that be acceptable?
msg383514 - (view)	Author: Samuel Marks (samuelmarks) *	Date: 2020-12-21 11:52
EDIT: Just found https://github.com/ikamensh/flynt
msg383518 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2020-12-21 13:22
See https://github.com/ikamensh/flynt#dangers-of-conversion for reasons. Would I like to see all string literal formatting done with f-strings? Yes! Would I accept the risk and hassle of doing it blindly? No.
msg383553 - (view)	Author: Samuel Marks (SamuelMarks) *	Date: 2020-12-21 21:11
There were only 12k occurrences, I'm sure I could manually go through that in an afternoon. Would you accept it then? On Tue, 22 Dec 2020, 12:22 am Eric V. Smith, <report@bugs.python.org> wrote: > > Eric V. Smith <eric@trueblade.com> added the comment: > > See https://github.com/ikamensh/flynt#dangers-of-conversion for reasons. > > Would I like to see all string literal formatting done with f-strings? Yes! > > Would I accept the risk and hassle of doing it blindly? No. > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <https://bugs.python.org/issue42699> > _______________________________________ >
msg383554 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2020-12-21 21:20
On Mon, Dec 21, 2020 at 09:11:48PM +0000, Samuel Marks wrote: > There were only 12k occurrences, I'm sure I could manually go through that > in an afternoon. Would you accept it then? Assuming "an afternoon" is half a work day, so 4 hours, that's 1.2 seconds per occurrence. So no, not even if you were a trusted core developer.

History
Date	User	Action	Args
2022-04-11 14:59:39	admin	set	github: 86865
2020-12-21 21:20:33	steven.daprano	set	messages: + msg383554
2020-12-21 21:11:48	SamuelMarks	set	nosy: + SamuelMarks messages: + msg383553
2020-12-21 13:22:09	eric.smith	set	messages: + msg383518
2020-12-21 11:52:43	samuelmarks	set	messages: + msg383514
2020-12-21 11:51:02	samuelmarks	set	messages: + msg383513
2020-12-21 10:30:13	eric.smith	set	messages: + msg383508
2020-12-21 09:38:03	samuelmarks	set	messages: + msg383500
2020-12-21 09:32:24	eric.smith	set	messages: + msg383499
2020-12-21 09:19:56	samuelmarks	set	messages: + msg383497
2020-12-21 07:49:44	steven.daprano	set	nosy: + steven.daprano messages: + msg383492
2020-12-21 04:37:54	rhettinger	set	status: open -> closed nosy: + rhettinger messages: + msg383486 resolution: not a bug stage: resolved
2020-12-21 03:50:30	kj	set	nosy: + kj messages: + msg383485
2020-12-21 03:14:09	josh.r	set	nosy: + josh.r messages: + msg383483
2020-12-21 01:47:08	samuelmarks	set	messages: + msg383476
2020-12-21 01:41:50	eric.smith	set	nosy: + eric.smith messages: + msg383475
2020-12-21 01:39:34	samuelmarks	create