classification
Title: Speedup unicode whitespace and linebreak detection
Type: enhancement Stage:
Components: Interpreter Core Versions: Python 3.0, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: christian.heimes, lemburg, pitrou
Priority: normal Keywords: patch

Created on 2008-01-30 00:44 by pitrou, last changed 2008-01-30 11:33 by christian.heimes. This issue is now closed.

Files
File name Uploaded Description
unispace.patch pitrou, 2008-01-30 00:44
trunk_unispace.patch christian.heimes, 2008-01-30 10:25
Messages (10)
msg61837 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-01-30 00:44
Currently the PyUnicode type uses a function call and several lookups
per character to detect whitespace and linebreaks. This considerably
slows down the split(), rsplit() and splitlines() methods. Since the
overwhelming majority of whitespace and linebreaks are ASCII characters,
it makes sense to have a fast lookup table for the common case. Patch
attached (it also includes another tiny change which helps the compiler
optimize split/rsplit here).

(this may also help other methods like strip() a bit, but in that case
the impact of whitespace detection is probably negligible)
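
For readers who have not opened the patch, the fast-path idea described above can be sketched roughly as follows. This is not the code from unispace.patch: the names (ascii_whitespace, sketch_is_space, sketch_slow_is_space, sketch_count_fields) are hypothetical, the set of ASCII code points in the table is an assumption, and the slow path is a deliberately incomplete stand-in for the general per-character lookup.

```c
#include <stddef.h>

/* 1 marks an ASCII code point treated as whitespace; all other
   entries are zero-initialized. The exact set is an assumption. */
static const unsigned char ascii_whitespace[128] = {
    ['\t'] = 1, ['\n'] = 1, ['\v'] = 1, ['\f'] = 1, ['\r'] = 1,
    [0x1C] = 1, [0x1D] = 1, [0x1E] = 1, [0x1F] = 1,
    [' '] = 1,
};

/* Hypothetical stand-in for the general (slow) lookup; deliberately
   incomplete, it only knows a few non-ASCII space characters. */
int sketch_slow_is_space(unsigned int ch)
{
    switch (ch) {
    case 0x0085: case 0x00A0: case 0x1680:
    case 0x2028: case 0x2029: case 0x3000:
        return 1;
    default:
        return ch >= 0x2000 && ch <= 0x200A;
    }
}

/* Fast path: one comparison and one array load for ASCII input;
   only non-ASCII code points pay for the function call. */
int sketch_is_space(unsigned int ch)
{
    return ch < 128 ? ascii_whitespace[ch] : sketch_slow_is_space(ch);
}

/* Count whitespace-separated fields, scanning the way a split()
   loop would: one whitespace test per character. */
size_t sketch_count_fields(const unsigned int *s, size_t n)
{
    size_t fields = 0;
    int in_field = 0;
    for (size_t i = 0; i < n; i++) {
        if (sketch_is_space(s[i])) {
            in_field = 0;
        } else if (!in_field) {
            in_field = 1;
            fields++;
        }
    }
    return fields;
}
```

Because the whitespace test runs once per character in a loop like sketch_count_fields, replacing a function call with a table load on the ASCII path is where a saving of the kind measured below would come from.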

Some numbers:

# With patch
$ ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.splitlines()"
10000 loops, best of 3: 127 usec per loop
$ ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
1000 loops, best of 3: 457 usec per loop

# Without patch
$ ./python-orig -m timeit -s "s=open('LICENSE', 'r').read()" "s.splitlines()"
10000 loops, best of 3: 175 usec per loop
$ ./python-orig -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
1000 loops, best of 3: 571 usec per loop
msg61847 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-01-30 09:53
Sounds interesting and good!
msg61849 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-01-30 10:09
Nice patch !
msg61850 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-01-30 10:10
This should also be backported to Py2.6.
msg61851 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-01-30 10:24
I agree! The new patch applies cleanly to the trunk. I've fixed some
whitespace and renamed the tables to _Py_ascii_....
msg61852 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-01-30 10:25
Sorry, this patch doesn't contain my current work.
msg61853 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-01-30 10:42
Please make those tables static...

In general, everything that's not needed outside an object file should
be made static to avoid naming conflicts. For static symbols, there's no
need to prefix them with any "Py" indicator.
msg61856 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-01-30 11:07
> Please make those tables static...
> 
> In general, everything that's not needed outside an object file should
> be made static to avoid naming conflicts. For static symbols, there's no
> need to prefix them with any "Py" indicator.

The ascii whitespace table is required for Py_UNICODE_ISSPACE. I can
make the linebreak table static but I can't make the whitespace table
static.
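
The linkage point can be illustrated with a single-file sketch, assuming the whitespace macro indexes the table directly. All names here (DEMO_ISSPACE, demo_ascii_whitespace, demo_slow_is_space, demo_is_linebreak) are hypothetical and are not the identifiers from the patch; the table contents are illustrative subsets.

```c
/* --- what a shared header might declare (hypothetical names) --- */
extern const unsigned char demo_ascii_whitespace[128];
int demo_slow_is_space(unsigned int ch);

/* The macro expands inside every file that uses it, so the table it
   names must have external linkage. Note that ch is evaluated twice. */
#define DEMO_ISSPACE(ch) \
    ((ch) < 128U ? demo_ascii_whitespace[(ch)] : demo_slow_is_space(ch))

/* --- what the implementing .c file might define --- */
const unsigned char demo_ascii_whitespace[128] = {
    ['\t'] = 1, ['\n'] = 1, ['\v'] = 1, ['\f'] = 1, ['\r'] = 1,
    [' '] = 1,
};

/* Only referenced in this file, so it can be static and needs no
   prefix. Illustrative subset of linebreak characters. */
static const unsigned char ascii_linebreak[128] = {
    ['\n'] = 1, ['\r'] = 1,
};

/* Incomplete stand-in for the general lookup. */
int demo_slow_is_space(unsigned int ch)
{
    return ch == 0x00A0 || ch == 0x2028 || ch == 0x2029;
}

int demo_is_linebreak(unsigned int ch)
{
    return ch < 128 && ascii_linebreak[ch];
}
```

If demo_ascii_whitespace were declared static, any other source file expanding DEMO_ISSPACE would fail to link, whereas ascii_linebreak is reached only through a function defined in the same file and can stay private.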
msg61857 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-01-30 11:09
Ok, thanks.
msg61861 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-01-30 11:33
I've applied the patch to the trunk in r60440. It will be merged into
3.0 soonish. Thanks for your work. Keep it going! :)
History
Date User Action Args
2008-01-30 11:33:26 christian.heimes set status: open -> closed, resolution: fixed, messages: + msg61861
2008-01-30 11:09:49 lemburg set messages: + msg61857
2008-01-30 11:07:02 christian.heimes set messages: + msg61856
2008-01-30 10:42:59 lemburg set messages: + msg61853
2008-01-30 10:25:44 christian.heimes set files: - trunk_unispace.patch
2008-01-30 10:25:39 christian.heimes set files: + trunk_unispace.patch, messages: + msg61852
2008-01-30 10:24:46 christian.heimes set files: + trunk_unispace.patch, messages: + msg61851
2008-01-30 10:10:02 lemburg set messages: + msg61850, versions: + Python 2.6
2008-01-30 10:09:38 lemburg set nosy: + lemburg, messages: + msg61849
2008-01-30 09:53:26 christian.heimes set priority: normal, keywords: + patch, type: enhancement, messages: + msg61847, nosy: + christian.heimes
2008-01-30 00:44:34 pitrou create