classification
Title: Check for signals during regular expression matches
Type: Stage:
Components: Interpreter Core Versions: Python 2.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: facundobatista Nosy List: effbot, facundobatista, gvanrossum, joshhoyt, loewis, schmir
Priority: normal Keywords: patch

Created on 2003-11-21 06:29 by joshhoyt, last changed 2008-02-05 04:13 by gvanrossum. This issue is now closed.

Files
File name Uploaded Description Edit
_sre.c.patch joshhoyt, 2003-11-21 06:29 Allow signal handlers to run during _sre searches
patch schmir, 2007-11-02 21:35
patch.speedup schmir, 2007-11-02 22:04
sre_exception.diff facundobatista, 2008-01-04 18:06
Messages (12)
msg44910 - (view) Author: Josh Hoyt (joshhoyt) Date: 2003-11-21 06:29
This patch adds a call to PyErr_CheckSignals to
SRE_MATCH so that signal handlers can be invoked during
long regular expression matches. It also adds a new
error return value indicating that an exception
occurred in a signal handler during the match, allowing
exceptions in the signal handler to propagate up to the
main loop.

Rationale:

Regular expressions can run away inside of the C code.
There is no way for Python code to stop the C code from
running away, so we attempted to use setrlimit to
interrupt the process when the CPU usage exceeded a limit.

When the signal was received, the signal function was
triggered. The sre code does not allow the main loop to
run, so the triggered handlers were not called until
the regular expression finished, if ever. This
behaviour makes the interruption by the signal useless
for the purposes of constraining the running time of
regular expression matches.

I am unsure whether the PyErr_CheckSignals is
lightweight enough to be called inside of the for loop
in SRE_MATCH, so I took the conservative approach and
only checked on recursion or match invocation. I
believe that the performance hit from this check would
not be prohibitive inside of the loop, since
PyErr_CheckSignals does very little work unless there
is a signal to handle.
msg44911 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2003-11-22 15:14
Logged In: YES 
user_id=21627

Can you give an example for a SRE matching that is so slow
that the user may press Ctrl-C?
msg44912 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-11-13 23:28
Logged In: YES 
user_id=21627

Fredrik, what do you think about this patch?
msg57056 - (view) Author: Ralf Schmitt (schmir) Date: 2007-11-02 16:51
here is an example (from http://swtch.com/~rsc/regexp/regexp1.html)

python -c 'import re; num=25; r=re.compile("a?"*num+"a"*num);
r.match("a"*num)'

At work I have seen a real world case of a regular expression which ran
for minutes rendering the application unresponsive. We would have been
glad to be able to interrupt the regular expression engine and getting a
traceback.
msg57067 - (view) Author: Ralf Schmitt (schmir) Date: 2007-11-02 21:35
I'm attaching a working patch against 2.5.1 and a short test program.


#! /usr/bin/env python

import signal
import re
import time


def main():
    num=28 # need more than 60s on a 2.4Ghz core 2
    r=re.compile("a?"*num+"a"*num)

    signal.signal(signal.SIGALRM, signal.default_int_handler)
    signal.alarm(1)
    stime = time.time()
    try:
        r.match("a"*num)
    except KeyboardInterrupt:
        assert time.time()-stime<3
    else:
        raise RuntimeError("no keyboard interrupt")

if __name__=='__main__':
    main()
msg57068 - (view) Author: Ralf Schmitt (schmir) Date: 2007-11-02 22:04
hm. just noticed that calling PyErr_CheckSignals slows down regular
expression matching considerably (50%).

I'm adding another patch, which only checks every 4096th iteration for
signals.
msg59246 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-01-04 18:06
Couldn't apply cleanly the patch, as it appears to be a diff in other
format.

Anyway, applied it by hand, and now I attach the correct svn diff.

The test cases run ok with this change, and the problem is solved.

Regarding the delay introduced, I tested it with:

  $ ./python timeit.py -s "import re;r=re.compile('a?a?a?a?a?aaaaa')"
"r.match('aaaaa')"

Trunk:
  100000 loops, best of 3: 5.4 usec per loop
  100000 loops, best of 3: 5.32 usec per loop
  100000 loops, best of 3: 5.41 usec per loop

Patch applied:
  100000 loops, best of 3: 7.28 usec per loop
  100000 loops, best of 3: 6.79 usec per loop
  100000 loops, best of 3: 7.00 usec per loop

I don't like that. Anyway, I do NOT trust for timing the system where
I'm making the timing, so you may get different results.

Suggestions?
msg59247 - (view) Author: Ralf Schmitt (schmir) Date: 2008-01-04 18:48
./python Lib/timeit.py -n 1000000 -s "import
re;r=re.compile('a?a?a?a?a?aaaaa')" "r.match('aaaaa')" gives me for

Trunk:
1000000 loops, best of 3: 3.02 usec per loop
1000000 loops, best of 3: 2.99 usec per loop
1000000 loops, best of 3: 3.01 usec per loop

Patched:
1000000 loops, best of 3: 3.04 usec per loop
1000000 loops, best of 3: 3.04 usec per loop
1000000 loops, best of 3: 3.14 usec per loop

which would be ok, I guess.

(This is on a 64bit debian testing with gcc 4.2.3).

Can you test with the following:

if ((0 == (sigcount & 0xffffffff)) && PyErr_CheckSignals())

(i.e. the code will (nearly) not even call PyErr_CheckSignals).

I guess this is some c compiler optimization issue (seems like mine does
a better job at optimizing :)
msg59267 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-01-04 23:11
Mind if I assign this to Facundo? Facundo, if you wish to pass this on,
just unassign it.
msg59600 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-01-09 14:22
Retried it in a platform where I trust timing, and it proved ok.

So, problem solved, no performance impact, all tests pass ok. Commited
in r59862.

Thank you all!
msg61925 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-01-31 20:07
I think this is worth backporting to 2.5.2.  This and r60054 are the
*only* differences between _sre.c in 2.5.2 and 2.6.
msg62062 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-02-05 04:13
Backported to 2.5.2 as r60576.  (The other deltas are not backported.)
History
Date User Action Args
2008-02-05 04:13:26gvanrossumsetmessages: + msg62062
2008-01-31 20:07:17gvanrossumsetmessages: + msg61925
2008-01-10 01:55:51christian.heimeslinkissue1448325 superseder
2008-01-09 14:22:46facundobatistasetstatus: open -> closed
resolution: fixed
messages: + msg59600
2008-01-04 23:11:28gvanrossumsetassignee: effbot -> facundobatista
messages: + msg59267
nosy: + gvanrossum
2008-01-04 18:48:22schmirsetmessages: + msg59247
2008-01-04 18:06:30facundobatistasetfiles: + sre_exception.diff
nosy: + facundobatista
type: enhancement ->
messages: + msg59246
versions: + Python 2.4, - Python 2.6, Python 2.5
2008-01-04 15:49:29christian.heimessettype: enhancement
versions: + Python 2.6, Python 2.5, - Python 2.4
2007-11-02 22:04:29schmirsetfiles: + patch.speedup
messages: + msg57068
2007-11-02 21:35:40schmirsetfiles: + patch
messages: + msg57067
2007-11-02 16:51:27schmirsetnosy: + schmir
messages: + msg57056
2003-11-21 06:29:21joshhoytcreate