classification
Title: PyUnicode_Join could perhaps be simpler
Type: performance
Stage:
Components: Unicode
Versions: Python 3.0
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: pitrou
Nosy List: lemburg, pitrou
Priority: normal
Keywords: patch

Created on 2008-07-28 19:04 by pitrou, last changed 2008-08-07 21:51 by pitrou. This issue is now closed.

Files
File name Uploaded Description
strjoin3k.patch pitrou, 2008-07-29 12:50
Messages (5)
msg70367 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2008-07-28 19:04
In py3k, PyUnicode_Join inherits some complexity from the 2.x days.
However, it seems some of the precautions taken there may not be needed
anymore. Witness the following comment:

    /* Grrrr.  A codec may be invoked to convert str objects to
     * Unicode, and so it's possible to call back into Python code
     * during PyUnicode_FromObject(), and so it's possible for a sick
     * codec to change the size of fseq (if seq is a list).  Therefore
     * we have to keep refetching the size -- can't assume seqlen
     * is invariant.
     */

Dropping those precautions might also allow preallocating the target
buffer all at once (like bytes.join does) rather than resizing it
incrementally.
Marc-Andre, what do you think?
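
To make the preallocation idea concrete, here is a rough Python sketch of a two-pass join: the first pass computes the exact result size, the second copies each piece to its final offset in a buffer that is allocated once. It is only an illustration, not the code in unicodeobject.c; the name join_preallocated and the use of a UTF-32 bytearray as a stand-in for the Py_UNICODE buffer are assumptions made for the example.

```python
def join_preallocated(sep, items):
    """Join str items with sep, filling one buffer allocated up front.

    Hypothetical helper: a bytearray of UTF-32 code units stands in
    for the C-level character buffer.
    """
    # Snapshot the input so its length is fixed for both passes.
    raws = [s.encode("utf-32-le") for s in items]
    sep_raw = sep.encode("utf-32-le")

    # Pass 1: compute the exact size of the result in bytes.
    total = sum(len(r) for r in raws) + len(sep_raw) * max(len(raws) - 1, 0)
    buf = bytearray(total)  # single allocation, never resized below

    # Pass 2: copy the separator and each item to their final offsets.
    pos = 0
    for i, raw in enumerate(raws):
        if i:
            buf[pos:pos + len(sep_raw)] = sep_raw
            pos += len(sep_raw)
        buf[pos:pos + len(raw)] = raw
        pos += len(raw)

    assert pos == total  # every byte of the buffer was written
    return buf.decode("utf-32-le")
```

Each slice assignment replaces a range with data of the same length, so the buffer keeps the size computed in the first pass.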
msg70381 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2008-07-29 09:06
The comment gives a wrong impression: The problem is not (only) that a
codec might be evil; it's the fact that a codec may well execute Python
code and thus allow the list to be changed by other threads during the
operation.

Now, since in Python 3.x codecs are no longer being invoked, it is
probably safe to assume that Python code is not being executed while
PyUnicode_Join() is running, but please double-check.

It would also be wise to add a sanity check at the end of the loop to
verify that the sequence length has indeed not changed (perhaps as an
assert).
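
The sanity check could look roughly like the following Python sketch. It records the length once, loops over that many items, and verifies afterwards that the list still has the same length. The names join_checked and convert are made up for illustration; convert stands in for any per-item step that might run arbitrary code. The sketch raises RuntimeError instead of using assert so that the check also fires when assertions are disabled.

```python
def join_checked(sep, items, convert=str):
    """Join a list, verifying its length did not change during the loop.

    Hypothetical helper: `convert` models a per-item conversion that
    could call back into arbitrary code.
    """
    seqlen = len(items)  # fetched once, assumed invariant
    parts = []
    for i in range(seqlen):
        # If the list shrinks here, items[i] raises IndexError instead.
        parts.append(convert(items[i]))

    # Sanity check at the end of the loop.
    if len(items) != seqlen:
        raise RuntimeError("sequence changed size during join")
    return sep.join(parts)
```

With a well-behaved conversion the check is silent; a conversion that appends to the list while the loop runs trips it.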
msg70385 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2008-07-29 10:41
Well, the potentially dangerous function would have been
PyUnicode_FromObject, but in py3k it only accepts unicode instances
(either exact or subclasses). Since we are only interested in the
underlying buffer, we can replace those calls with PyUnicode_Check.
I'll work on a patch and keep you updated.
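
As a loose Python-level analogy for swapping a conversion for a plain type check: the sketch below only tests each item with isinstance (which accepts str and its subclasses) and then joins the underlying string data, without calling any conversion method on the items. The function name join_strict and the error message wording are assumptions for illustration.

```python
def join_strict(sep, items):
    """Join items after a pure type check, with no per-item conversion.

    Hypothetical helper: isinstance plays the role of the C-level
    type check.
    """
    items = list(items)
    for i, item in enumerate(items):
        # Accept exact str instances and subclasses, reject the rest.
        if not isinstance(item, str):
            raise TypeError(
                f"sequence item {i}: expected str, "
                f"{type(item).__name__} found"
            )
    # str.join copies the underlying string data of each item.
    return sep.join(items)
```

A str subclass that overrides __str__ is joined by its underlying value, since no conversion is performed on the items.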
msg70388 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2008-07-29 12:50
Here is a patch. In my measurements, it makes str.join() 30% to 50%
faster on non-trivial input.
msg70863 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2008-08-07 21:51
I've committed the patch in r65583.
History
Date User Action Args
2008-08-07 21:51:20 pitrou set status: open -> closed; resolution: fixed; messages: + msg70863
2008-07-29 12:50:12 pitrou set files: + strjoin3k.patch; keywords: + patch; messages: + msg70388
2008-07-29 10:41:12 pitrou set assignee: pitrou; messages: + msg70385
2008-07-29 09:06:28 lemburg set messages: + msg70381
2008-07-28 19:04:38 pitrou create