Issue 3460: PyUnicode_Join could perhaps be simpler



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/47710

classification

Title:	PyUnicode_Join could perhaps be simpler
Type:	performance	Stage:
Components:	Unicode	Versions:	Python 3.0

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	pitrou	Nosy List:	lemburg, pitrou
Priority:	normal	Keywords:	patch

Created on 2008-07-28 19:04 by pitrou, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
strjoin3k.patch	pitrou, 2008-07-29 12:50

Messages (5)
msg70367 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-07-28 19:04
In py3k, PyUnicode_Join inherits some complexity from the 2.x days. However, some of the precautions taken there may no longer be needed. Witness the following comment:

    /* Grrrr.  A codec may be invoked to convert str objects to
     * Unicode, and so it's possible to call back into Python code
     * during PyUnicode_FromObject(), and so it's possible for a sick
     * codec to change the size of fseq (if seq is a list).  Therefore
     * we have to keep refetching the size -- can't assume seqlen
     * is invariant. */

Perhaps dropping this precaution would also allow preallocating the target buffer all at once (like bytes.join does) rather than resizing it incrementally.

Marc-Andre, what do you think?
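As an illustration of the "preallocate all at once" idea raised in this message, here is a minimal Python model of a two-pass join: the first pass type-checks the items and computes the exact output length, and the second pass copies into a buffer that is sized once and never resized. This is only a sketch under stated assumptions. The name `join_preallocated` is hypothetical, the list-of-characters buffer stands in for the C-level character array, and it is not the code from the attached patch.

```python
def join_preallocated(sep, items):
    """Two-pass join: measure first, then copy into a buffer sized once."""
    items = list(items)  # snapshot, so the length cannot change under us

    # Pass 1: type-check every item and compute the exact output length.
    total = 0
    for i, item in enumerate(items):
        if not isinstance(item, str):
            raise TypeError(
                f"sequence item {i}: expected str instance, "
                f"{type(item).__name__} found"
            )
        total += len(item)
    if items:
        total += len(sep) * (len(items) - 1)

    # Pass 2: fill a buffer allocated once; same-length slice assignment
    # overwrites in place and never resizes the list.
    buf = [""] * total
    pos = 0
    for i, item in enumerate(items):
        if i:
            buf[pos:pos + len(sep)] = sep
            pos += len(sep)
        buf[pos:pos + len(item)] = item
        pos += len(item)

    assert pos == total  # every slot was written exactly once
    # Turning the character list back into a str is only a modeling step;
    # a C implementation would already hold the finished buffer here.
    return "".join(buf)
```

The two-pass shape is only safe if nothing can change the sequence between the passes, which is exactly the invariant the thread goes on to discuss.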
msg70381 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2008-07-29 09:06
The comment gives a wrong impression: the problem is not (only) that a codec might be evil; it is that a codec may well execute Python code and thus allow the list to be changed by other threads during the operation. Now, since codecs are no longer invoked in Python 3.x, it is probably safe to assume that no Python code is executed while PyUnicode_Join() is running, but please double-check. It would also be wise to add a sanity check at the end of the loop to verify that the sequence length has indeed not changed (perhaps as an assert).
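The hazard and the suggested end-of-loop sanity check can be modeled in Python. In the sketch below, `join_with_length_check` and its `convert` parameter are hypothetical names: `convert` plays the role of a conversion step that may run arbitrary Python code, and the check after the loop detects that the sequence changed size in the meantime. An explicit exception is used instead of `assert` so the check still fires under `python -O`.

```python
def join_with_length_check(sep, seq, convert=str):
    """Join seq, verifying that its length did not change during the loop."""
    seqlen = len(seq)  # fetched once, as a simplified join would do
    parts = []
    for i in range(seqlen):
        # convert() may execute arbitrary Python code, including code
        # that mutates seq behind our back.
        parts.append(convert(seq[i]))

    # Sanity check at the end of the loop: the size must be unchanged.
    if len(seq) != seqlen:
        raise RuntimeError("sequence changed size during join")
    return sep.join(parts)


def run_demo():
    """Show a well-behaved call and a call with a mutating converter."""
    data = ["a", "b"]

    def sick_convert(item):
        data.append("z")  # mutates the list being joined
        return item

    ok = join_with_length_check("-", ["a", "b"])
    try:
        join_with_length_check("-", data, convert=sick_convert)
        detected = False
    except RuntimeError:
        detected = True
    return ok, detected
```

Calling `run_demo()` returns the normally joined string together with a flag saying whether the mutation was detected.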
msg70385 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-07-29 10:41
Well, the potentially dangerous function would have been PyUnicode_FromObject, but in py3k it only accepts unicode instances (either exact or subclasses). Since we are only interested in the underlying buffer, we can replace those calls with PyUnicode_Check. I'll work on a patch and keep you updated.
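The behavior described here is observable from Python 3 itself: `str.join` accepts `str` instances and subclasses, reads their underlying contents without calling back into `__str__`, and rejects everything else with a `TypeError`. The short check below is a sketch; the class `Loud` and the helper `join_rejects` are hypothetical names introduced only for the demonstration.

```python
class Loud(str):
    """A str subclass whose __str__ deliberately disagrees with its contents."""

    def __str__(self):
        return "SHOUTING"


# join uses the subclass instance's underlying string data, not __str__().
joined = "-".join([Loud("a"), "b"])


def join_rejects(item):
    """Return True if str.join raises TypeError for a list containing item."""
    try:
        "-".join(["a", item])
    except TypeError:
        return True
    return False
```

Here `joined` is a plain `str` built from the stored characters, while `join_rejects(b"b")` and `join_rejects(1)` both report a rejection, since no implicit conversion is attempted.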
msg70388 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-07-29 12:50
Here is a patch. In my measurements it makes str.join() 30% to 50% faster on non-trivial input.
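The message does not say how the measurement was taken. One plausible way to time `str.join` on non-trivial input is with the standard `timeit` module, as sketched below. The helper `time_join` and its default sizes are assumptions made for illustration, and running it on a single interpreter does not reproduce the 30% to 50% figure, which compares a patched build against an unpatched one.

```python
import timeit


def time_join(n_items=1000, item_len=20, repeat=5, number=200):
    """Return the best observed time, in seconds, for one join call."""
    # Input sizes are arbitrary illustration values, not those of the report.
    items = ["x" * item_len for _ in range(n_items)]
    timings = timeit.repeat(
        lambda: ", ".join(items), repeat=repeat, number=number
    )
    # Taking the minimum reduces noise from other processes.
    return min(timings) / number
```

To compare two builds, one would run the same script under each interpreter and compare the returned values.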
msg70863 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-08-07 21:51
I've committed the patch in r65583.

History
Date	User	Action	Args
2022-04-11 14:56:37	admin	set	github: 47710
2008-08-07 21:51:20	pitrou	set	status: open -> closed resolution: fixed messages: + msg70863
2008-07-29 12:50:12	pitrou	set	files: + strjoin3k.patch keywords: + patch messages: + msg70388
2008-07-29 10:41:12	pitrou	set	assignee: pitrou messages: + msg70385
2008-07-29 09:06:28	lemburg	set	messages: + msg70381
2008-07-28 19:04:38	pitrou	create