
Author pitrou
Recipients pitrou
Date 2008-01-30.00:44:32
Message-id <1201653875.04.0.391438542079.issue1970@psf.upfronthosting.co.za>
In-reply-to
Content
Currently the PyUnicode type uses a function call and several lookups
per character to detect whitespace and linebreaks. This considerably
slows down the split(), rsplit() and splitlines() methods. Since the
overwhelming majority of whitespace and linebreaks are ASCII characters,
it makes sense to have a fast lookup table for the common case. Patch
attached (it also includes another tiny change which helps the compiler
optimize split/rsplit here).

(this may also help other methods like strip() a bit, but in that case
the impact of whitespace detection is probably negligible)
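
To illustrate the idea only (this is not the attached C patch), here is a rough Python sketch of an ASCII lookup table with a fallback to the full Unicode check. The names ASCII_WHITESPACE, is_space and split_whitespace are made up for this example.

```python
# Hypothetical sketch of the fast-path idea; not the code from the patch.

# Precomputed table: one boolean per ASCII code point (0..127).
ASCII_WHITESPACE = tuple(chr(i).isspace() for i in range(128))


def is_space(ch):
    """Return True if the single character ch is whitespace."""
    code = ord(ch)
    if code < 128:
        # Common case: a single indexed lookup, no database query.
        return ASCII_WHITESPACE[code]
    # Rare case: fall back to the full Unicode-aware check.
    return ch.isspace()


def split_whitespace(s):
    """Split s on runs of whitespace, like s.split() with no arguments."""
    words = []
    start = None  # index where the current word began, if any
    for i, ch in enumerate(s):
        if is_space(ch):
            if start is not None:
                words.append(s[start:i])
                start = None
        elif start is None:
            start = i
    if start is not None:
        words.append(s[start:])
    return words
```

In pure Python this will not be faster than the built-in split(); it only shows the control flow. The gain described above comes from doing the table lookup in C instead of a function call per character.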

Some numbers:

# With patch
$ ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.splitlines()"
10000 loops, best of 3: 127 usec per loop
$ ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
1000 loops, best of 3: 457 usec per loop

# Without patch
$ ./python-orig -m timeit -s "s=open('LICENSE', 'r').read()" "s.splitlines()"
10000 loops, best of 3: 175 usec per loop
$ ./python-orig -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
1000 loops, best of 3: 571 usec per loop
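
For reference, the relative gain can be worked out from the timings above. A small sketch (the helper name speedup and the dict layout are just for this example; the figures are the ones reported in this message):

```python
# Timings in usec per loop, copied from the timeit runs above.
timings = {
    "splitlines": {"without_patch": 175, "with_patch": 127},
    "split": {"without_patch": 571, "with_patch": 457},
}


def speedup(before_usec, after_usec):
    """Ratio of the old time to the new time (values above 1 mean faster)."""
    return before_usec / after_usec


for name, t in timings.items():
    ratio = speedup(t["without_patch"], t["with_patch"])
    saved = 100 * (1 - t["with_patch"] / t["without_patch"])
    print(f"{name}(): {ratio:.2f}x faster, {saved:.0f}% less time per call")
```

That is roughly 1.38x for splitlines() and 1.25x for split() on this particular file and machine.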