Message 61837 - Python tracker


Author	pitrou
Recipients	pitrou
Date	2008-01-30.00:44:32
SpamBayes Score	0.0042242007
Marked as misclassified	No
Message-id	<1201653875.04.0.391438542079.issue1970@psf.upfronthosting.co.za>
In-reply-to

Content

Currently the PyUnicode type uses a function call and several lookups
per character to detect whitespace and linebreaks. This considerably
slows down the split(), rsplit() and splitlines() methods. Since the
overwhelming majority of whitespace and linebreaks are ASCII characters,
it makes sense to have a fast lookup table for the common case. Patch
attached (it also includes another tiny change which helps compiler
optimization of split/rsplit here).
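
To illustrate the idea only (the attached patch is C code inside CPython and is not reproduced here), a minimal Python sketch of an ASCII fast path might look like the following. The names ASCII_WHITESPACE, is_whitespace and split_whitespace are hypothetical and do not come from the patch.

```python
# Hypothetical illustration of the fast-path idea; not the actual patch.

# 128-entry table: True where the ASCII code point counts as whitespace
# according to str.isspace().
ASCII_WHITESPACE = [chr(i).isspace() for i in range(128)]


def is_whitespace(ch):
    """Return True if the single character ch is whitespace."""
    code = ord(ch)
    if code < 128:
        # Common case: a single table index, no Unicode database lookup.
        return ASCII_WHITESPACE[code]
    # Rare case: fall back to the general Unicode check.
    return ch.isspace()


def split_whitespace(s):
    """Split s on runs of whitespace, like s.split() with no arguments."""
    words = []
    start = None  # index where the current word began, or None
    for i, ch in enumerate(s):
        if is_whitespace(ch):
            if start is not None:
                words.append(s[start:i])
                start = None
        elif start is None:
            start = i
    if start is not None:
        words.append(s[start:])
    return words
```

In pure Python this is of course slower than the built-in split(); the point is only the shape of the check: a table index for code points below 128, and the general lookup for everything else.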

(This may also help other methods like strip() a bit, but in that case
the impact of whitespace detection is probably negligible.)

Some numbers:

# With patch
$ ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.splitlines()"
10000 loops, best of 3: 127 usec per loop
$ ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
1000 loops, best of 3: 457 usec per loop

# Without patch
$ ./python-orig -m timeit -s "s=open('LICENSE', 'r').read()" "s.splitlines()"
10000 loops, best of 3: 175 usec per loop
$ ./python-orig -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
1000 loops, best of 3: 571 usec per loop
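
For a rough sense of scale, the relative gain can be worked out from the timings above. This is a small sketch that only copies those four numbers; the helper name speedup is an assumption made for illustration.

```python
# Timings copied from the measurements above, in microseconds per loop.
timings = {
    "splitlines": {"with_patch": 127, "without_patch": 175},
    "split": {"with_patch": 457, "without_patch": 571},
}


def speedup(without_patch, with_patch):
    """Ratio of unpatched time to patched time (higher means faster)."""
    return without_patch / with_patch


for name, t in timings.items():
    ratio = speedup(t["without_patch"], t["with_patch"])
    saved = 100 * (1 - t["with_patch"] / t["without_patch"])
    print(f"{name}(): {ratio:.2f}x faster, {saved:.0f}% less time per loop")
```

On these numbers, splitlines() comes out about 1.38x faster and split() about 1.25x faster with the patch.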

History
Date	User	Action	Args
2008-01-30 00:44:35	pitrou	set	spambayes_score: 0.0042242 -> 0.0042242007 recipients: + pitrou
2008-01-30 00:44:35	pitrou	set	spambayes_score: 0.0042242 -> 0.0042242 messageid: <1201653875.04.0.391438542079.issue1970@psf.upfronthosting.co.za>
2008-01-30 00:44:33	pitrou	link	issue1970 messages
2008-01-30 00:44:32	pitrou	create