Title: str.split creates new string even if pattern not found
Type: performance
Stage: resolved
Components: Interpreter Core
Versions: Python 3.1, Python 2.7
Status: closed
Resolution: out of date
Dependencies:
Superseder: [patch] improve unicode methods: split() rsplit() and replace()
View: 7622
Assigned To: effbot
Nosy List: effbot, flox, georg.brandl, pitrou, rhettinger
Priority: low
Keywords:

Created on 2006-12-11 13:03 by pitrou, last changed 2010-04-01 11:26 by flox. This issue is now closed.

Messages (6)
msg30782 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2006-12-11 13:03

Several string methods avoid allocating a new string when the result of the operation is trivially identical to one of the arguments (e.g. replacing a non-existent substring). However, split() does not exhibit this optimization: it always constructs a new string even when no splitting occurs:

$ python
Python 2.5 (r25:51908, Oct  6 2006, 15:22:41) 
[GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "abcde" * 2
>>> id(s)
>>> id(str(s))
>>> id("" + s)
>>> id(s.strip())
>>> id(s.replace("g", "h"))
>>> [id(x) for x in s.partition("h")]
[3084139400L, 3084271768L, 3084271768L]
>>> [id(x) for x in s.split("h")]
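Re-run on a current CPython 3.x (where this has since been fixed), the same comparison can be made with `is` checks instead of raw id() values. Note that reusing the original object for a no-op operation is a CPython implementation detail, not a language guarantee:

```python
# CPython 3.x sketch: which no-op string operations hand back the
# original object instead of allocating a copy?
s = "abcde" * 2

print(s.replace("g", "h") is s)   # no match: original object reused
print(s.strip() is s)             # nothing to strip: original reused
print(s.partition("h")[0] is s)   # no match: original in first slot
print(s.split("h")[0] is s)       # the case this issue is about
```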
msg30783 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2006-12-12 11:35
Ok, I wrote a patch which partially adds the optimization (the patch is at home; I can't post it right now). I have a few questions though:
 - there is a USE_FAST flag which can bring some speedups when a multicharacter separator is used; however, it is not enabled by default. Is there a reason for this?
 - where is the test suite maintained, and by whom, so that I can propose additional tests for it (namely, tests for unmatched split())?
 - the split() implementation is duplicated between str and unicode (the unicode version having fewer optimizations); would it be useful to "stringlib"-ify split()?
 - rsplit() does much the same thing as split(); has anyone tried to factor out the common parts? Do you see any caveats in doing so?
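In pure Python terms, the optimization such a patch adds boils down to a fast path along these lines (an illustrative sketch with a hypothetical helper name, not the actual C code; a real implementation would reuse the position found during the search instead of scanning twice):

```python
def split_with_fastpath(s, sep):
    """Sketch of split() with a no-match fast path: if the
    separator never occurs, return the original string object
    in a one-element list instead of allocating a copy."""
    if not sep:
        raise ValueError("empty separator")
    if s.find(sep) < 0:
        return [s]           # no new string is allocated
    return s.split(sep)      # defer to the real implementation
```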
msg30784 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2006-12-12 16:21
Sounds like this is best assigned to Fredrik.
msg30785 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2007-04-11 17:09
Dropping the priority.  This pay-off is near zero and likely not worth the cost of making the code more complex than it already is.
msg30786 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2007-04-12 09:19

> Dropping the priority.  This pay-off is near zero and likely not worth the
> cost of making the code more complex than it already is.

No problem!
The more interesting question was actually whether it made sense to factor the split() implementation out into "stringlib" so as to share it between str and unicode.

Also, as for the USE_FAST question you asked on python-dev, I may have an answer: if you enable USE_FAST you'll see that some operations are indeed faster on large strings (say, hundreds or thousands of characters), but they become slower on small strings because of the larger per-call overhead of the search algorithm. Thus USE_FAST could negatively impact Python programs that process a lot of small strings.
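The trade-off can be illustrated at the Python level with a search that preprocesses its input — Boyer-Moore-Horspool here is used purely as an analogy for USE_FAST's fancier search, not as CPython's actual code. Building the shift table is a per-call setup cost that only pays off once the haystack is long enough to amortize it:

```python
def horspool_find(haystack, needle):
    """Boyer-Moore-Horspool substring search (first index or -1).
    The shift table built below is the kind of per-call
    preprocessing overhead that dominates on short haystacks."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # Bad-character shift table: O(m) setup before any scanning.
    shift = {}
    for i, c in enumerate(needle[:-1]):
        shift[c] = m - 1 - i
    i = 0
    while i <= n - m:
        if haystack[i:i + m] == needle:
            return i
        # Skip ahead based on the last character of the window.
        i += shift.get(haystack[i + m - 1], m)
    return -1
```

On a string of a dozen characters the table construction can cost more than an entire naive scan; on kilobyte-sized strings the larger skips win. That asymmetry has the same shape as the USE_FAST behavior described in the message above.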
msg102082 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-04-01 11:26
Fixed with #7622, and the USE_FAST flag has been removed.
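A quick check of the fixed behavior on Python 2.7/3.1 or later (object identity here is a CPython implementation detail, and the fast path applies only to exact str instances, not str subclasses):

```python
s = "abcde" * 2
parts = s.split("h")        # separator does not occur in s
# Equal content either way: a one-element list holding s.
assert parts == [s]
# With the #7622 fix, CPython places the original object in the
# list instead of allocating a copy:
print(parts[0] is s)
```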
Date                 User          Action  Args
2010-04-01 11:26:43  flox          set     superseder: [patch] improve unicode methods: split() rsplit() and replace()
2010-04-01 11:26:14  flox          set     status: open -> closed
                                           nosy: + flox
                                           messages: + msg102082
                                           resolution: out of date
                                           stage: needs patch -> resolved
2009-03-14 01:45:12  pitrou        set     versions: + Python 3.1, Python 2.7, - Python 2.6, Python 2.5, Python 3.0
                                           type: enhancement -> performance
                                           stage: needs patch
2007-09-20 16:47:01  georg.brandl  set     type: enhancement
                                           versions: + Python 2.6, Python 3.0
2006-12-11 13:03:23  pitrou        create