Message 29011 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tim.peters
Recipients
Date	2006-07-02.21:40:54
SpamBayes Score
Marked as misclassified
Message-id
In-reply-to

Content
Logged In: YES user_id=31435 Since this isn't going to change, I'm closing this. I don't know exactly what your regexp is intended to match, but I expect this will help speed it enormously: inside a group, one of the alternatives is the negated character class (NCC): [^{}%] One of the other alternatives starts with "{" and another with "%". That's very good, because those three alternatives are mutually exclusive based on just the current character in the target string. However, yet another alternative starts with a backslash, and it's thus ambiguous whether the backslash should be matched by that alternative or by the NCC. Because this is a backtracking engine, and the NCC is the first alternative, it tries the NCC first and won't try the backslash alternative unless it's impossible to find a match having tried the NCC. That can cause exponential-time failing-match behavior all by itself. If it's the case that a backslash in this context is _always_ supposed to match the \\. alternative, then adding a backslash to the NCC removes the ambiguity and greatly speeds (at least) failing matches: [^{}%\\] Then which alternative is supposed to match is entirely determined by the current character in the target string, so when backtracking on failure all other alternatives fail at once, and backtracking continues with at worst insignificant delay. Adding a backslash to the "inner" NCC helps a little, but adding one to the "outer" NCC too is very effective: n=0: 0:00:00 n=1: 0:00:00 n=2: 0:00:00 ... n=97: 0:00:00 n=98: 0:00:00 n=99: 0:00:00 See Friedl's book for much more along these lines.

Logged In: YES 
user_id=31435

Since this isn't going to change, I'm closing this.

I don't know exactly what your regexp is intended to match,
but I expect this will help speed it enormously:  inside a
group, one of the alternatives is the negated character
class (NCC):

[^{}%]

One of the other alternatives starts with "{" and another
with "%".  That's very good, because those three
alternatives are mutually exclusive based on just the
current character in the target string.

However, yet another alternative starts with a backslash,
and it's thus ambiguous whether the backslash should be
matched by that alternative or by the NCC.  Because this is
a backtracking engine, and the NCC is the first alternative,
it tries the NCC first and won't try the backslash
alternative unless it's impossible to find a match having
tried the NCC.  That can cause exponential-time
failing-match behavior all by itself.

If it's the case that a backslash in this context is
_always_ supposed to match the

\\.

alternative, then adding a backslash to the NCC removes the
ambiguity and greatly speeds (at least) failing matches:

[^{}%\\]

Then which alternative is supposed to match is entirely
determined by the current character in the target string, so
when backtracking on failure all other alternatives fail at
once, and backtracking continues with at worst insignificant
delay.

Adding a backslash to the "inner" NCC helps a little, but
adding one to the "outer" NCC too is very effective:

n=0: 0:00:00
n=1: 0:00:00
n=2: 0:00:00
...
n=97: 0:00:00
n=98: 0:00:00
n=99: 0:00:00

See Friedl's book for much more along these lines.

History
Date	User	Action	Args
2007-08-23 14:40:58	admin	link	issue1515829 messages
2007-08-23 14:40:58	admin	create