Issue 8064: Large regex handling very slow on Linux

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/52311

classification

Title:	Large regex handling very slow on Linux
Type:	resource usage	Stage:	resolved
Components:	Regular Expressions	Versions:	Python 2.6

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	exarkun, ezio.melotti, michael.foord, mrabarnett, olivers, vstinner
Priority:	normal	Keywords:

Created on 2010-03-04 23:14 by olivers, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
regextest.py	olivers, 2010-03-04 23:14

Messages (9)
msg100434 - (view)	Author: Oliver Sturm (olivers)	Date: 2010-03-04 23:14
The code in regextest.py (attached) uses a large regex to analyze a piece of text. I have tried this test program on two Macs, using the standard Python distributions. On a MacBook, 2.4 GHz dual core, Snow Leopard with Python 2.6.1, it takes 0.08 seconds On a MacPro, 2.something GHz 8 core, Leopard with Python 2.5.1, it takes 0.09 seconds Now I've also tried it on several Linux machines, all of them running Ubuntu. They are all extremely slow. The machine I've been testing with now is a 2.0 GHz dual core machine with Python 2.5.2. A test run of the program takes 97.5 (that's not a typo) seconds on this machine, 1200 times as long as on the Macs.
msg100435 - (view)	Author: Jean-Paul Calderone (exarkun) *	Date: 2010-03-04 23:25
I think it's likely that the test program does drastically different things on Linux than it does on OS X: Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15) [GCC 4.4.1] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import regextest >>> len(regextest.makew()) 93455 >>> Compare to: Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51) [GCC 4.2.1 (Apple Inc. build 5646)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import regextest >>> len(regextest.makew()) 47672 >>> If I modify it to use 65535 (sys.maxunicode on OS X) and then run it on Linux, it completes quite quickly.
msg100436 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-03-04 23:27
Results on Linux (Debian Sid) with different Python versions: * Python 2.4.6: 14112.8 ms * Python 2.5.5: 14246.7 ms * Python 2.6.4+: 14753.4 ms * Python trunk (2.7a3+): 69.3 ms It looks like re engine was optimized in trunk :-) Note: I replaced stopwatch by time.time() and used 100 iterations instead of 500 iterations. Values are not important, only the ratio between the different results. -- exarkun> I think it's likely that the test program does drastically exarkun> different things on Linux than it does on OS X sys.maxunicode should be different on the two OS.
msg100437 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-03-04 23:29
Ooops, my benchmark was wrong. It looks like the result depends sys.maxunicode: $ python2.4 -c "import sys; print sys.maxunicode" 1114111 $ python2.5 -c "import sys; print sys.maxunicode" 1114111 $ python2.6 -c "import sys; print sys.maxunicode" 1114111 $ ./python -c "import sys; print sys.maxunicode" 65535 The last on (./python) is Python trunk.
msg100438 - (view)	Author: Michael Foord (michael.foord) *	Date: 2010-03-04 23:33
So is it reasonable / unavoidable that UCS4 builds should be 1200 times slower at regex handling?
msg100439 - (view)	Author: Jean-Paul Calderone (exarkun) *	Date: 2010-03-04 23:37
> So is it reasonable / unavoidable that UCS4 builds should be 1200 times slower at regex handling? No, but it's probably reasonable / unavoidable that a more complex regex should be some number of times slower than a simpler regex. On Linux, the regex being constructed is more complex. On OS X, it's simpler. The reason for the difference is that the Linux build is UCS4, but that's only because the unicode character width is being used as part of the function that constructs the regular expression. If you take this variable out of that function, so that it returns the same string regardless of the width of a unicode character, then performance evens out.
msg100441 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-03-05 00:00
A workaround could be using [^\W\d], but this includes some extra chars in the categories Pc, Nl, and No that maybe you don't want. Generate a list of chars in these 3 categories and add them in the regex should be cheaper though. Since this is not a bug of Python I'll close this issue.
msg100447 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-03-05 02:14
This is a proof that you can have an equivalent regex without including all the 'letter chars' (tested on both narrow and wide builds): >>> s = u''.join(unichr(c) for c in range(sys.maxunicode)) >>> diff = set(re.findall(u'[^\W\d]', s, re.U)) ^ set(re.findall(u'[%s_-]' % makew(), s, re.U)) >>> diff.remove('-') >>> re.findall(u'(?:[^\W\d%s]\|-)' % ''.join(diff), s, re.U) == re.findall(u'[%s_-]' % makew(), s, re.U) True (I don't like the way I included the '-' but I couldn't find anything better.) It looks however that most of the time is spent during the findall and from a quick benchmark it seems that my regex is slower (even if it's shorter and it compiles faster).
msg100485 - (view)	Author: Michael Foord (michael.foord) *	Date: 2010-03-05 15:16
Interestingly, the code olivers is using was originally written by Martin v. Loewis: http://www.velocityreviews.com/forums/t646421-unicode-regex-and-hindi-language.html In response to a still open bug report on \w in the Python re module: http://bugs.python.org/issue1693050

History
Date	User	Action	Args
2022-04-11 14:56:58	admin	set	github: 52311
2010-03-05 15:16:21	michael.foord	set	messages: + msg100485
2010-03-05 02:14:30	ezio.melotti	set	messages: + msg100447
2010-03-05 00:00:31	ezio.melotti	set	status: open -> closed resolution: not a bug messages: + msg100441 stage: resolved
2010-03-04 23:37:27	exarkun	set	messages: + msg100439
2010-03-04 23:33:57	michael.foord	set	messages: + msg100438
2010-03-04 23:29:03	vstinner	set	messages: + msg100437
2010-03-04 23:27:41	vstinner	set	nosy: + vstinner messages: + msg100436
2010-03-04 23:25:33	exarkun	set	nosy: + exarkun messages: + msg100435
2010-03-04 23:17:36	ezio.melotti	set	priority: normal nosy: + ezio.melotti, mrabarnett versions: + Python 2.6, - Python 2.5
2010-03-04 23:14:44	olivers	create