classification
Title: Search is to long with regex like ^(.+|dontmatch)*$
Type: Stage:
Components: Regular Expressions Versions: Python 2.4
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: niemeyer Nosy List: haypo, niemeyer, orivej, pitrou, rsc
Priority: normal Keywords:

Created on 2005-09-21 03:09 by haypo, last changed 2008-09-09 11:35 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
bug2.py haypo, 2005-09-21 03:09 Test file (bug2.py)
bug.c haypo, 2005-09-21 03:12 Test file in C language
Messages (6)
msg26345 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2005-09-21 03:09
Hi,

I don't know if it's a bug or a feature, but Python
looks very slow to process the regex ^(.+|D)*A$ for
string like "xxxxxxB" (add x to run python slower).

I found this bug when a try to match strings like "abc"
or "a\"b" or "\"abc"  or "" with the regex:
>>\"((?:[^\"]+|\\\")*)\"<<.

I know that ".+*" construction in not ... optimal, but
Python should optimize regex like (.+|something)*.

Haypo
msg26346 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2005-09-21 03:12
Logged In: YES 
user_id=365388

I try the same regex with Linux Regex library, and it looks
fast (but, can I compare C with Python ?) and not "buggy".

See bug.c file, and type gcc -o bug bug.c && time ./bug (I
get 0.000s for user ...).

Victor
msg26347 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2005-09-21 03:18
Logged In: YES 
user_id=365388

The "bug" only occurs when the regex doesn't match, else
it's very fast to match the string.

Haypo
msg26348 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2005-09-21 11:24
Logged In: YES 
user_id=365388

First fix: (a+|b)* can be replaced by (a|b)*. Other
optimizations : (a|b+)* and (a+|b+) can be replaced by (a|b)*.

For the case (a*|b)+ ... Wait, I'm thinking on the problem
:-) Hum, looks like (?:(a|b)+)?. Im' not sure.

This optimizatons should be done in compile() function.

Haypo
msg72846 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-09-09 11:30
It's not a bug, it's a feature. I wrote a library 
(http://hachoir.org/wiki/hachoir-regex) to "optimize" regex, so you 
can close this issue.
msg72848 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-09-09 11:35
Fair enough :)
History
Date User Action Args
2008-09-09 11:35:35pitrousetstatus: open -> closed
nosy: + pitrou
resolution: wont fix
messages: + msg72848
2008-09-09 11:30:09hayposetmessages: + msg72846
2008-04-24 21:10:58rscsetnosy: + rsc
2008-01-29 12:33:34orivejsetnosy: + orivej
2005-09-21 03:09:18haypocreate