Message 248179 - Python tracker


Author	serhiy.storchaka
Recipients	pitrou, serhiy.storchaka, vstinner
Date	2015-08-07.06:11:24
Message-id	<1438927886.02.0.407236930114.issue24821@psf.upfronthosting.co.za>

Content
Substring search in strings is highly optimized for the common case. However, for some input data, searching in a non-ASCII string becomes unexpectedly slow. Compare:

$ ./python -m timeit -s 's = "АБВГД"*10**4' -- '"є" in s'
100000 loops, best of 3: 11.7 usec per loop
$ ./python -m timeit -s 's = "АБВГД"*10**4' -- '"Є" in s'
1000 loops, best of 3: 769 usec per loop

This happens because the low byte of the code point of the Ukrainian capital letter Є (U+0404) matches the high byte of the code points of most Cyrillic letters (U+04xx). There are similar issues with some other scripts.
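
The byte collision can be illustrated with a small model. This is only a sketch, not the actual search code: it assumes each character is stored as a 2-byte little-endian unit (as UTF-16-LE encodes these characters) and that the search pre-filters candidate positions by scanning for a single byte of the needle. The helper name low_byte_hits is hypothetical.

```python
# Illustrative model only: assumes 2-byte little-endian storage per character
# and a pre-filter that scans the raw bytes for the needle's low byte.

def low_byte_hits(needle: str, haystack: str) -> int:
    """Count bytes of the 2-byte-encoded haystack equal to the needle's low byte."""
    raw = haystack.encode("utf-16-le")
    return raw.count(ord(needle) & 0xFF)

haystack = "АБВГД"  # U+0410..U+0414, so every high byte is 0x04

# "Є" is U+0404: its low byte 0x04 collides with every high byte.
print(hex(ord("Є")), low_byte_hits("Є", haystack))  # 0x404 5
# "є" is U+0454: its low byte 0x54 does not occur in the haystack at all.
print(hex(ord("є")), low_byte_hits("є", haystack))  # 0x454 0
```

Under these assumptions, every character of the haystack produces a false candidate for Є, while є produces none, which is consistent with the timing difference above.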

I think we should use a more robust optimization.

History
Date	User	Action	Args
2015-08-07 06:11:26	serhiy.storchaka	set	recipients: + serhiy.storchaka, pitrou, vstinner
2015-08-07 06:11:26	serhiy.storchaka	set	messageid: <1438927886.02.0.407236930114.issue24821@psf.upfronthosting.co.za>
2015-08-07 06:11:25	serhiy.storchaka	link	issue24821 messages
2015-08-07 06:11:24	serhiy.storchaka	create