Regular expressions with 0 to 65536 repetitions raises OverflowError #57378

Closed
techmaurice mannequin opened this issue Oct 13, 2011 · 28 comments
Labels: extension-modules (C modules in the Modules dir), release-blocker, stdlib (Python modules in the Lib dir), topic-regex, type-bug (An unexpected behavior, bug, or error)

Comments

@techmaurice
Mannequin

techmaurice mannequin commented Oct 13, 2011

BPO 13169
Nosy @birkenfeld, @vstinner, @larryhastings, @benjaminp, @ezio-melotti, @serhiy-storchaka, @mcgfeller
Files
  • re_maxrepeat.patch
  • re_maxrepeat2.patch
  • re_maxrepeat3.patch
  • re_maxrepeat4-2.7.patch
  • re_maxrepeat4-3.2.patch
  • re_maxrepeat4.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2013-04-04.11:15:43.211>
    created_at = <Date 2011-10-13.16:30:27.230>
    labels = ['extension-modules', 'expert-regex', 'type-bug', 'library', 'release-blocker']
    title = 'Regular expressions with 0 to 65536 repetitions raises OverflowError'
    updated_at = <Date 2013-04-04.11:15:43.209>
    user = 'https://bugs.python.org/techmaurice'

    bugs.python.org fields:

    activity = <Date 2013-04-04.11:15:43.209>
    actor = 'georg.brandl'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2013-04-04.11:15:43.211>
    closer = 'georg.brandl'
    components = ['Extension Modules', 'Library (Lib)', 'Regular Expressions']
    creation = <Date 2011-10-13.16:30:27.230>
    creator = 'techmaurice'
    dependencies = []
    files = ['28808', '28810', '28814', '28919', '28920', '28921']
    hgrepos = []
    issue_num = 13169
    keywords = ['patch']
    message_count = 28.0
    messages = ['145469', '145471', '145475', '145506', '145547', '152412', '154625', '154653', '180499', '180505', '180516', '180521', '180543', '181026', '182224', '182226', '182290', '182307', '182308', '186013', '186018', '186020', '186021', '186022', '186023', '186024', '186027', '186028']
    nosy_count = 11.0
    nosy_names = ['georg.brandl', 'vstinner', 'larry', 'benjamin.peterson', 'ezio.melotti', 'mrabarnett', 'Arfrever', 'python-dev', 'techmaurice', 'serhiy.storchaka', 'Martin.Gfeller']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue13169'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3', 'Python 3.4']

    @techmaurice
    Mannequin Author

    techmaurice mannequin commented Oct 13, 2011

    Regular expressions with a repetition range of 0 to 65536 or more make Python crash with an "OverflowError: regular expression code size limit exceeded" exception.
    An upper bound of 65535 repetitions does not trigger this issue.

    Tested and confirmed this with versions 2.7.1 and 3.2.2.

    C:\Python27>python.exe
    Python 2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import re
    >>> re.search('(?s)\A.{0,65535}test', 'test')
    <_sre.SRE_Match object at 0x00B4E4B8>
    >>> re.search('(?s)\A.{0,65536}test', 'test')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python27\lib\re.py", line 142, in search
        return _compile(pattern, flags).search(string)
      File "C:\Python27\lib\re.py", line 243, in _compile
        p = sre_compile.compile(pattern, flags)
      File "C:\Python27\lib\sre_compile.py", line 523, in compile
        groupindex, indexgroup
    OverflowError: regular expression code size limit exceeded
    >>>
    
    C:\Python32>python.exe
    Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import re
    >>> re.search('(?s)\A.{0,65535}test', 'test')
    <_sre.SRE_Match object at 0x00A6F250>
    >>> re.search('(?s)\A.{0,65536}test', 'test')
    Traceback (most recent call last):
      File "C:\Python32\lib\functools.py", line 176, in wrapper
        result = cache[key]
    KeyError: (<class 'str'>, '(?s)\\A.{0,65536}test', 0)
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python32\lib\re.py", line 158, in search
        return _compile(pattern, flags).search(string)
      File "C:\Python32\lib\re.py", line 255, in _compile
        return _compile_typed(type(pattern), pattern, flags)
      File "C:\Python32\lib\functools.py", line 180, in wrapper
        result = user_function(*args, **kwds)
      File "C:\Python32\lib\re.py", line 267, in _compile_typed
        return sre_compile.compile(pattern, flags)
      File "C:\Python32\lib\sre_compile.py", line 514, in compile
        groupindex, indexgroup
    OverflowError: regular expression code size limit exceeded
    >>>

    @techmaurice techmaurice mannequin added the type-crash and stdlib labels Oct 13, 2011
    @briancurtin
    Member

    I might be missing something, but what's the issue? 65535 is the limit, and doing 65536 gives a clear overflow exception (no crash).

    @briancurtin briancurtin added the type-bug label and removed the type-crash label Oct 13, 2011
    @briancurtin briancurtin changed the title from "Regular expressions with 0 to 65536 repetitions and above makes Python crash" to "Regular expressions with 0 to 65536 repetitions raises OverflowError" Oct 13, 2011
    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Oct 13, 2011

    The quantifiers use 65535 to represent no upper limit, so ".{0,65535}" is equivalent to ".*".

    For example:

    >>> re.match(".*", "x" * 100000).span()
    (0, 100000)
    >>> re.match(".{0,65535}", "x" * 100000).span()
    (0, 100000)

    but:

    >>> re.match(".{0,65534}", "x" * 100000).span()
    (0, 65534)
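
The behaviour shown above can be probed on any interpreter. This is a minimal sketch: the helper name `treats_65535_as_unbounded` is made up for illustration, and the result depends on the Python version (interpreters that include the fix for this issue are expected to report False).

```python
import re

def treats_65535_as_unbounded():
    """Return True if '{0,65535}' behaves like '*' on this interpreter."""
    text = "x" * 100000
    # With a real upper bound the match stops after 65535 characters;
    # if 65535 is used as the "no limit" marker it runs to the end of the text.
    end = re.match(".{0,65535}", text).span()[1]
    return end == len(text)

print(treats_65535_as_unbounded())
```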

    @techmaurice
    Mannequin Author

    techmaurice mannequin commented Oct 14, 2011

    So if I understand correctly, the maximum of 65535 repetitions is by design?

    I tried a workaround: placing the repetition inside a capturing group and repeating the group, which is perfectly legal in Perl regular expressions:

    $mystring = "test";
    if($mystring =~ m/^(.{0,32766}){0,3}test/s) { print "Yes\n"; }
    (32766 being the max repetitions in Perl)

    Unfortunately, in Python this does not work and raises a "nothing to repeat" sre_constants error:
    re.search('(?s)\A(.{0,65535}){0,3}test', 'test')

    This, however, works and yields up to 65536 repetitions of any character (DOTALL):
    re.search('(?s)\A.{0,65535}.{0,1}test', 'test')

    In the end this more or less solves my problem, but it requires extra logic in my script and complicates things unnecessarily.

    A suggestion might be to make repetitions of repeats possible?
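
The chained-quantifier workaround above can be generalized. This is a minimal sketch, not part of any patch in this thread: `split_repeat` and its parameters are hypothetical names, and it assumes `atom` is a single regex item such as `.` or a character class. The default chunk is 65534 rather than 65535 because, on the affected versions, 65535 is treated as "no upper limit".

```python
import re

def split_repeat(atom, upper, chunk=65534):
    """Build 'atom{0,upper}' as a chain of quantifiers, each bounded by chunk.

    Assumes `atom` is a single regex item (for example '.' or '[a-z]').
    """
    if upper < 0 or chunk < 1:
        raise ValueError("upper must be >= 0 and chunk must be >= 1")
    full, rest = divmod(upper, chunk)
    # `full` quantifiers of size `chunk`, plus one smaller quantifier for the rest.
    parts = ["%s{0,%d}" % (atom, chunk)] * full
    if rest:
        parts.append("%s{0,%d}" % (atom, rest))
    return "".join(parts)

pattern = r"(?s)\A" + split_repeat(".", 65536) + "test"
print(pattern)
print(re.search(pattern, "test") is not None)
```

Because every piece may match zero times, the chain accepts anything from 0 to `upper` repetitions in total.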

    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Oct 14, 2011

    The limit is an implementation detail. The pattern is compiled into codes which are then interpreted, and it just happens that the codes are (usually) 16 bits, giving a range of 0..65535, but it uses 65535 to represent no limit and doesn't warn if you actually write 65535.

    There's an alternative regex implementation here:

    http://pypi.python.org/pypi/regex
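
The 16-bit code word and its "no limit" marker can be modelled in a few lines. This is a toy model under stated assumptions, not the actual engine code: `CODE_BITS`, `encode_upper_bound` and `decode_upper_bound` are illustrative names only.

```python
CODE_BITS = 16                      # assumed width of one code word
MAXCODE = (1 << CODE_BITS) - 1      # 65535, the largest value that fits
UNBOUNDED = MAXCODE                 # the same value doubles as "no limit"

def encode_upper_bound(upper):
    """Store a quantifier's upper bound in one code word (None means no limit)."""
    if upper is None:
        return UNBOUNDED
    if upper > MAXCODE:
        raise OverflowError("regular expression code size limit exceeded")
    return upper  # note: 65535 silently collides with UNBOUNDED

def decode_upper_bound(code):
    """Interpret a stored code word; the marker decodes to None (no limit)."""
    return None if code == UNBOUNDED else code

print(decode_upper_bound(encode_upper_bound(65534)))
print(decode_upper_bound(encode_upper_bound(65535)))
```

In this model an explicit bound of 65535 comes back as "no limit", and 65536 cannot be stored at all, which mirrors the two symptoms reported in this issue.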

    @vstinner
    Member

    Issue bpo-13914 has been marked as a duplicate of this issue.

    @ezio-melotti
    Member

    Matthew, do you think this should be documented somewhere or that the behavior should be changed (e.g. raising a warning when 65535 is used)?
    If not I'll just close the issue.

    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Feb 29, 2012

    Ideally, it should raise an exception (or a warning) because the behaviour is unexpected.

    @serhiy-storchaka
    Member

    Now RuntimeError is raised in this case.

    Here is a patch, which:

    1. Increases the limit of repeat numbers to 4G (SRE_CODE is now at least 32 bits wide).
    2. Raises re.error exception if this limit is exceeded.
    3. Fixes some minor related things.

    @serhiy-storchaka serhiy-storchaka added the extension-modules and topic-regex labels Jan 23, 2013
    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Jan 24, 2013

    IMHO, I don't think that MAXREPEAT should be defined in sre_constants.py _and_ SRE_MAXREPEAT defined in sre_constants.h. (In the latter case, why is it in decimal?)

    I think that it should be defined in one place, namely sre_constants.h, perhaps as:

    #define SRE_MAXREPEAT ~(SRE_CODE)0

    and then imported into sre_constants.py.

    That'll reduce the chance of an inadvertent mismatch, and it's the C code that's imposing the limit on the number of repeats, not the Python code.

    @serhiy-storchaka
    Member

    "(In the latter case, why is it in decimal?)"

    Because SRE_MAXREPEAT is generated (like all of sre_constants.h) from
    sre_constants.py (note the changes at the end of sre_constants.py).

    I agree that SRE_MAXREPEAT is imposed by a limitation of the C code and that it
    would be better to define it in C. But we can't just import a C define into
    Python; that requires more code.

    @serhiy-storchaka
    Member

    Patch updated to address Ezio's and Matthew's comments. MAXREPEAT is now defined in the C code. It is lowered to 2G on 32-bit platforms so that repetition numbers fit into Py_ssize_t. The condition for raising an exception is now more complex: if the repetition number overflows Py_ssize_t, it is treated the same as an infinite bound and no exception is raised (i.e. an exception is never raised on 32-bit platforms). Tests added.
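
The limits mentioned in this thread (4G, lowered to 2G on 32-bit platforms) can be read as the smaller of two maxima: the largest unsigned code word and the largest signed size. The sketch below is only arithmetic under that assumption; `model_maxrepeat` and its parameters are illustrative names, not the C code from the patch.

```python
def model_maxrepeat(code_bits=32, ssize_bits=64):
    """Largest repeat count that fits both an unsigned code word and a signed size."""
    code_max = (1 << code_bits) - 1          # all bits set, like ~(SRE_CODE)0
    ssize_max = (1 << (ssize_bits - 1)) - 1  # largest value of a signed size type
    return min(code_max, ssize_max)

print(model_maxrepeat(32, 64))  # 4294967295 ("4G")
print(model_maxrepeat(32, 32))  # 2147483647 ("2G")
```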

    @serhiy-storchaka
    Member

    Patch updated to address Ezio's comments. Tests simplified and optimized a little, as Ezio suggested. Added a test for implementation-dependent behavior (I hope it will go away some day).

    @serhiy-storchaka serhiy-storchaka self-assigned this Jan 31, 2013
    @serhiy-storchaka
    Member

    Here are patches for 2.7 and 3.2, and an updated patch for 3.3+
    (test_repeat_minmax_overflow_maxrepeat is changed).

    @python-dev
    Mannequin

    python-dev mannequin commented Feb 16, 2013

    New changeset c1b3d25882ca by Serhiy Storchaka in branch '2.7':
    Issue bpo-13169: The maximal repetition number in a regular expression has been
    http://hg.python.org/cpython/rev/c1b3d25882ca

    New changeset 472a7c652cbd by Serhiy Storchaka in branch '3.2':
    Issue bpo-13169: The maximal repetition number in a regular expression has been
    http://hg.python.org/cpython/rev/472a7c652cbd

    New changeset b78c321ee9a5 by Serhiy Storchaka in branch '3.3':
    Issue bpo-13169: The maximal repetition number in a regular expression has been
    http://hg.python.org/cpython/rev/b78c321ee9a5

    New changeset ca0307905cd7 by Serhiy Storchaka in branch 'default':
    Issue bpo-13169: The maximal repetition number in a regular expression has been
    http://hg.python.org/cpython/rev/ca0307905cd7

    @serhiy-storchaka
    Member

    I have committed simplified patches. They don't change the exception type from OverflowError to re.error (but the error message is now more helpful), and they don't make the code clever enough to avoid raising an exception when a repetition number exceeds sys.maxsize.

    @Arfrever
    Mannequin

    Arfrever mannequin commented Feb 17, 2013

    Some third-party modules (e.g. epydoc) refer to sre_constants.MAXREPEAT.
    Please add 'from _sre import MAXREPEAT' to Lib/sre_constants.py for compatibility.
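
Third-party code that needs the constant could also guard the lookup itself. This is a hedged sketch rather than what epydoc or the standard library does: `get_maxrepeat` is a hypothetical helper, and the fallback of 65535 is simply the old 16-bit limit described earlier in this thread.

```python
def get_maxrepeat(default=65535):
    """Best-effort lookup of the regex engine's maximum repeat count."""
    try:
        import _sre  # CPython-specific; may be missing on other implementations
    except ImportError:
        return default
    # Older builds have no MAXREPEAT attribute; fall back to the 16-bit limit.
    return getattr(_sre, "MAXREPEAT", default)

print(get_maxrepeat())
```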

    @Arfrever Arfrever mannequin reopened this Feb 17, 2013
    @serhiy-storchaka
    Member

    Thank you for the report, Arfrever. I'll look at how epydoc uses MAXREPEAT. Maybe it requires larger changes.

    @python-dev
    Mannequin

    python-dev mannequin commented Feb 18, 2013

    New changeset a80ea934da9a by Serhiy Storchaka in branch '2.7':
    Fix issue bpo-13169: Reimport MAXREPEAT into sre_constants.py.
    http://hg.python.org/cpython/rev/a80ea934da9a

    New changeset a6231ed7bff4 by Serhiy Storchaka in branch '3.2':
    Fix issue bpo-13169: Reimport MAXREPEAT into sre_constants.py.
    http://hg.python.org/cpython/rev/a6231ed7bff4

    New changeset 88c04657c9f1 by Serhiy Storchaka in branch '3.3':
    Fix issue bpo-13169: Reimport MAXREPEAT into sre_constants.py.
    http://hg.python.org/cpython/rev/88c04657c9f1

    New changeset 3dd5be5c4794 by Serhiy Storchaka in branch 'default':
    Fix issue bpo-13169: Reimport MAXREPEAT into sre_constants.py.
    http://hg.python.org/cpython/rev/3dd5be5c4794

    @mcgfeller
    Mannequin

    mcgfeller mannequin commented Apr 4, 2013

    I see (under Windows) the same symptoms as reported for Debian under http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=704084.

    Python refuses to start.

    2.7.4.rc1 Windows 32-bit.

    @vstinner
    Member

    vstinner commented Apr 4, 2013

    "Python refuses to start. 2.7.4.rc1 Windows 32-bit."

    Uh-oh. I'm reopening the issue and setting its priority to release blocker.

    @birkenfeld
    Member

    "Python refuses to start." is not a very good description.

    • What script are you running/module are you importing?
    • What is the traceback/error message?

    @mcgfeller
    Mannequin

    mcgfeller mannequin commented Apr 4, 2013

    @georg, the referenced Debian issue (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=704084) already contains the stack.

    @birkenfeld
    Member

    And this happens when you simply start Python, not executing any code?

    Can you start with "python -S", then do "import _sre", and see if it has a _sre.MAXREPEAT attribute?
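
The check requested above can also be run as a small script. This is a sketch only: `sre_report` is a made-up name, and the version and executable path are included on the assumption that they help spot a mismatch between the standard library and the interpreter binary actually loaded.

```python
import sys
import _sre  # CPython's regex engine module

def sre_report():
    """Collect the facts needed to diagnose a stdlib/binary mismatch."""
    return {
        "version": sys.version.split()[0],
        "executable": sys.executable,
        "has_maxrepeat": hasattr(_sre, "MAXREPEAT"),
    }

print(sre_report())
```

Saved under a hypothetical name such as `report.py`, it can be started with `python -S report.py` so that the `site` module is not imported first.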

    @ezio-melotti
    Member

    IIRC I saw a similar issue a few days ago, and the cause was that they did something wrong while porting the rc to Debian, but I don't remember the details. If I'm not mistaken, they also fixed it shortly after.

    @birkenfeld
    Member

    Just tested with 2.7.4rc1 32bit on Windows 7; no problem here.

    I suspect your 2.7.4rc1 install picks up a python27.dll from an earlier version.

    @mcgfeller
    Mannequin

    mcgfeller mannequin commented Apr 4, 2013

    Sorry for passing on my confusion, and thanks for your help!

    There was indeed an old python.dll lying in one of the places Windows likes to put DLLs. Deleting it resolved the problem.

    Thanks again, and sorry for taking up your valuable time.
    Best regards, Martin

    @birkenfeld
    Member

    Thanks for the confirmation!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022