classification
Title: RE '*.?' cores if len of found string exceeds 10000
Type: Stage:
Components: Regular Expressions Versions: Python 2.2
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: effbot Nosy List: effbot, josiahcarlson, rwhent
Priority: critical Keywords:

Created on 2004-10-26 12:55 by rwhent, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (5)
msg22865 - (view) Author: Rob (rwhent) Date: 2004-10-26 12:55
Whilst parsing some extremely long strings I found that
re.match causes segmentation faults on Solaris 2.8
when the regex contains '.*?' and the part of the
string matched by that portion of the regex gets close
to 10000 chars (actually it seemed to be exactly
8192 chars).

This is the regex used:

    if re.match('^.*?\[\s*[A-Za-z_0-9]+\s*\].*',string): 

This regex looks for '[alphaNum_]' present in a large
string

When it failed the string was 8192 chars long with no
matching '[alphaNum_]' present. If I reduce the length
of the string below 8192 it works ok.

This is a major issue for my application, as some of the
strings to be parsed are very large. I saw some discussion
of a similar issue on another bulletin board.

msg22866 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2004-10-26 13:20
The max recursion limit problem in the re module is well-known.  
Until this limitation in the implementation is removed, to work 
around it check

http://www.python.org/dev/doc/devel/lib/module-re.html
http://python.org/sf/493252
msg22867 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2004-10-26 13:24
btw, if you're searching for things, why not use the "search" 
method?

if re.search('\[\s*[A-Za-z_0-9]+\s*\]', string):

(also, "[A-Za-z_0-9]" is better spelled "\w")
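
A minimal sketch of that suggestion, assuming a current Python 3 interpreter rather than the 2.2 release this report was filed against. The helper name has_bracketed_word and the sample inputs are made up for illustration; re.ASCII is passed so that "\w" stays equivalent to "[A-Za-z_0-9]" on str input.

```python
import re

# Compiled once; re.ASCII keeps \w limited to [A-Za-z0-9_] on Python 3 str input.
BRACKETED_WORD = re.compile(r'\[\s*\w+\s*\]', re.ASCII)

def has_bracketed_word(text):
    """Return True if text contains '[word]', with optional whitespace inside the brackets."""
    return BRACKETED_WORD.search(text) is not None

# Hypothetical inputs: a long string with no match, and the same string with a match appended.
long_miss = 'x' * 20000
long_hit = long_miss + '[ abc_123 ]'
print(has_bracketed_word(long_miss), has_bracketed_word(long_hit))
```

Since search scans for the pattern anywhere in the string, the leading '^.*?' and trailing '.*' of the original expression are not needed.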
msg22868 - (view) Author: Josiah Carlson (josiahcarlson) * (Python triager) Date: 2004-10-30 15:44
In the case of this particular search, you could write your
own little searcher.  The following could likely be done
better, but this is a quick 5-minute job that won't core on
you unless something is really wrong with Python, and may be
a reasonable stopgap until someone re-does the regular
expression library.

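import string

def find_thing(s):
    sp = 0
    # Set-like dict of the characters allowed between the brackets.
    d = dict.fromkeys(list(string.letters+string.digits+'_'))
    while sp < len(s):
        # Find the next '[' at or after sp.
        start = None
        for i in xrange(sp, len(s)):
            if s[i] == '[':
                start = i
                break
        if start is None:
            return
        # Scan the characters following the '['.
        for i in xrange(start+1, len(s)):
            if s[i] in d:
                continue
            elif s[i] == ']':
                return s[start:i+1]
            else:
                # Not a match; resume the search from this character.
                sp = i
                break
        else:
            # Ran off the end of the string without a closing ']';
            # without this return the while loop would never terminate.
            return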

It returns None on failure to find, and the string otherwise.
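
For anyone reading this on a current interpreter, here is a rough Python 3 rendering of the same hand-written scan, offered as a sketch rather than a drop-in replacement. It assumes str input and uses str.find to locate each '['; the WORD_CHARS name is made up for illustration. Like the scan above, and unlike the regex, it does not allow whitespace inside the brackets. It does require at least one character between them, as the '+' in the regex does.

```python
import string

# Characters allowed between the brackets: ASCII letters, digits, underscore.
WORD_CHARS = frozenset(string.ascii_letters + string.digits + '_')

def find_thing(s):
    """Return the first '[word]' substring in s, or None if there is none."""
    start = s.find('[')
    while start != -1:
        i = start + 1
        # Advance over the run of word characters after the '['.
        while i < len(s) and s[i] in WORD_CHARS:
            i += 1
        # Require at least one word character and a closing ']'.
        if i < len(s) and s[i] == ']' and i > start + 1:
            return s[start:i + 1]
        # Resume at the character that ended the run; it may itself be '['.
        start = s.find('[', i)
    return None
```

Each pass resumes strictly after the previous '[', so the loop always terminates, including on input that ends inside an unclosed bracket.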
msg22869 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2005-02-14 11:35
closing, due to lack of feedback.  suggested workarounds
should solve the problem.
History
Date                 User    Action  Args
2022-04-11 14:56:07  admin   set     github: 41082
2004-10-26 12:55:58  rwhent  create