This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author ostkamp
Recipients ostkamp
Date 2007-09-13.08:00:55
SpamBayes Score 0.080608144
Marked as misclassified No
Message-id <1189670456.94.0.669298411185.issue1160@psf.upfronthosting.co.za>
In-reply-to
Content
Hello,

a medium size regular expression crashes Python 2.5.1 as follows:

Traceback (most recent call last):
  File "./regtest.py", line 14, in <module>
    m = rematch(pats)
  File "./regtest.py", line 12, in rematch
    return re.compile(pat).match
  File "/export/home/ostkamp/local/lib/python2.5/re.py", line 180, in
compile
    return _compile(pattern, flags)
  File "/export/home/ostkamp/local/lib/python2.5/re.py", line 231, in
_compile
    p = sre_compile.compile(pattern, flags)
  File "/export/home/ostkamp/local/lib/python2.5/sre_compile.py", line
530, in compile
    groupindex, indexgroup
OverflowError: regular expression code size limit exceeded


This is apparently caused by some code in Modules/_sre.c and
Modules/sre.h as follows:

        self->code[i] = (SRE_CODE) value;
        if ((unsigned long) self->code[i] != value) {
            PyErr_SetString(PyExc_OverflowError,
                            "regular expression code size limit exceeded");
            break;
        }

An 'unsigned int' value is unnecessarily squeezed into an 'unsigned
short' field defined in sre.h:

#ifdef Py_UNICODE_WIDE
#define SRE_CODE Py_UCS4
#else
#define SRE_CODE unsigned short
#endif

On all systems I'm working on (SuSE Linux SLES 9, Solaris 8 etc.) the
else case of the ifdef applies which chooses 'unsigned short'.

I don't understand the relationship between 'unicode' and what is
apparently the size of the regular expression stack here.

Some experiments have shown that changing the 'unsigned short' to
'unsigned long' and rebuilding Python fixes the problem.

Here is a test program to reproduce the error:

#!/usr/bin/env python
import re, random, sys
def randhexstring():
    return "".join(["%04x" % random.randint(0, 0xffff) for x in range(20)])
pats = [randhexstring() for x in range(1000)]
def rematch(pats):
    pat = '(?:%s)' % '|'.join(pats)
    return re.compile(pat).match
m = rematch(pats)
count = 0
for s in pats * 100:
    if m(s):
        count += 1
print count



Regards

Guido
History
Date User Action Args
2007-09-13 08:00:57ostkampsetspambayes_score: 0.0806081 -> 0.080608144
recipients: + ostkamp
2007-09-13 08:00:56ostkampsetspambayes_score: 0.0806081 -> 0.0806081
messageid: <1189670456.94.0.669298411185.issue1160@psf.upfronthosting.co.za>
2007-09-13 08:00:56ostkamplinkissue1160 messages
2007-09-13 08:00:55ostkampcreate