This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Large regex handling very slow on Linux
Type: resource usage Stage: resolved
Components: Regular Expressions Versions: Python 2.6
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: exarkun, ezio.melotti, michael.foord, mrabarnett, olivers, vstinner
Priority: normal Keywords:

Created on 2010-03-04 23:14 by olivers, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
regextest.py olivers, 2010-03-04 23:14
Messages (9)
msg100434 - (view) Author: Oliver Sturm (olivers) Date: 2010-03-04 23:14
The code in regextest.py (attached) uses a large regex to analyze a piece of text. I have tried this test program on two Macs, using the standard Python distributions.

On a MacBook, 2.4 GHz dual core, Snow Leopard with Python 2.6.1, it takes 0.08 seconds
On a MacPro, 2.something GHz 8 core, Leopard with Python 2.5.1, it takes 0.09 seconds

Now I've also tried it on several Linux machines, all of them running Ubuntu. They are all extremely slow. The machine I've been testing with now is a 2.0 GHz dual core machine with Python 2.5.2. A test run of the program takes 97.5 (that's not a typo) seconds on this machine, 1200 times as long as on the Macs.
msg100435 - (view) Author: Jean-Paul Calderone (exarkun) * (Python committer) Date: 2010-03-04 23:25
I think it's likely that the test program does drastically different things on Linux than it does on OS X:

Python 2.6.4 (r264:75706, Dec  7 2009, 18:45:15)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import regextest
>>> len(regextest.makew())
93455
>>>

Compare to:

Python 2.6.1 (r261:67515, Jul  7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import regextest
>>> len(regextest.makew())
47672
>>> 


If I modify it to use 65535 (sys.maxunicode on OS X) and then run it on Linux, it completes quite quickly.
msg100436 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-03-04 23:27
Results on Linux (Debian Sid) with different Python versions:
 * Python 2.4.6: 14112.8 ms
 * Python 2.5.5: 14246.7 ms
 * Python 2.6.4+: 14753.4 ms
 * Python trunk (2.7a3+): 69.3 ms

It looks like re engine was optimized in trunk :-)

Note: I replaced stopwatch by time.time() and used 100 iterations instead of 500 iterations. Values are not important, only the ratio between the different results.

--

exarkun> I think it's likely that the test program does drastically
exarkun> different things on Linux than it does on OS X

sys.maxunicode should be different on the two OS.
msg100437 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-03-04 23:29
Ooops, my benchmark was wrong. It looks like the result depends sys.maxunicode:

$ python2.4 -c "import sys; print sys.maxunicode"
1114111
$ python2.5 -c "import sys; print sys.maxunicode"
1114111
$ python2.6 -c "import sys; print sys.maxunicode"
1114111
$ ./python -c "import sys; print sys.maxunicode"
65535

The last on (./python) is Python trunk.
msg100438 - (view) Author: Michael Foord (michael.foord) * (Python committer) Date: 2010-03-04 23:33
So is it reasonable / unavoidable that UCS4 builds should be 1200 times slower at regex handling?
msg100439 - (view) Author: Jean-Paul Calderone (exarkun) * (Python committer) Date: 2010-03-04 23:37
> So is it reasonable / unavoidable that UCS4 builds should be 1200 times slower at regex handling?

No, but it's probably reasonable / unavoidable that a more complex regex should be some number of times slower than a simpler regex.

On Linux, the regex being constructed is more complex.  On OS X, it's simpler.  The reason for the difference is that the Linux build is UCS4, but that's only because the unicode character width is being used as part of the function that constructs the regular expression.

If you take this variable out of that function, so that it returns the same string regardless of the width of a unicode character, then performance evens out.
msg100441 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-03-05 00:00
A workaround could be using [^\W\d], but this includes some extra chars in the categories Pc, Nl, and No that maybe you don't want. Generate a list of chars in these 3 categories and add them in the regex should be cheaper though.
Since this is not a bug of Python I'll close this issue.
msg100447 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-03-05 02:14
This is a proof that you can have an equivalent regex without including all the 'letter chars' (tested on both narrow and wide builds):
>>> s = u''.join(unichr(c) for c in range(sys.maxunicode))
>>> diff = set(re.findall(u'[^\W\d]', s, re.U)) ^ set(re.findall(u'[%s_-]' % makew(), s, re.U))
>>> diff.remove('-')
>>> re.findall(u'(?:[^\W\d%s]|-)' % ''.join(diff), s, re.U) == re.findall(u'[%s_-]' % makew(), s, re.U)
True

(I don't like the way I included the '-' but I couldn't find anything better.)
It looks however that most of the time is spent during the findall and from a quick benchmark it seems that my regex is slower (even if it's shorter and it compiles faster).
msg100485 - (view) Author: Michael Foord (michael.foord) * (Python committer) Date: 2010-03-05 15:16
Interestingly, the code olivers is using was originally written by Martin v. Loewis:

http://www.velocityreviews.com/forums/t646421-unicode-regex-and-hindi-language.html

In response to a still open bug report on \w in the Python re module:

http://bugs.python.org/issue1693050
History
Date User Action Args
2022-04-11 14:56:58adminsetgithub: 52311
2010-03-05 15:16:21michael.foordsetmessages: + msg100485
2010-03-05 02:14:30ezio.melottisetmessages: + msg100447
2010-03-05 00:00:31ezio.melottisetstatus: open -> closed
resolution: not a bug
messages: + msg100441

stage: resolved
2010-03-04 23:37:27exarkunsetmessages: + msg100439
2010-03-04 23:33:57michael.foordsetmessages: + msg100438
2010-03-04 23:29:03vstinnersetmessages: + msg100437
2010-03-04 23:27:41vstinnersetnosy: + vstinner
messages: + msg100436
2010-03-04 23:25:33exarkunsetnosy: + exarkun
messages: + msg100435
2010-03-04 23:17:36ezio.melottisetpriority: normal
nosy: + ezio.melotti, mrabarnett

versions: + Python 2.6, - Python 2.5
2010-03-04 23:14:44oliverscreate