Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"fnmatch" exponential execution time #84660

Closed
kleshni mannequin opened this issue May 3, 2020 · 13 comments
Closed

"fnmatch" exponential execution time #84660

kleshni mannequin opened this issue May 3, 2020 · 13 comments
Assignees
Labels
3.9 only security fixes performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@kleshni
Copy link
Mannequin

kleshni mannequin commented May 3, 2020

BPO 40480
Nosy @tim-one, @nedbat, @asottile
PRs
  • bpo-40480 "fnmatch" exponential execution time #19908
  • bpo-40480: restore ability to join fnmatch.translate() results #20049
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/tim-one'
    closed_at = <Date 2020-05-06.02:31:54.930>
    created_at = <Date 2020-05-03.09:37:07.339>
    labels = ['library', '3.9', 'performance']
    title = '"fnmatch" exponential execution time'
    updated_at = <Date 2020-05-12.14:54:46.462>
    user = 'https://bugs.python.org/kleshni'

    bugs.python.org fields:

    activity = <Date 2020-05-12.14:54:46.462>
    actor = 'tim.peters'
    assignee = 'tim.peters'
    closed = True
    closed_date = <Date 2020-05-06.02:31:54.930>
    closer = 'tim.peters'
    components = ['Library (Lib)']
    creation = <Date 2020-05-03.09:37:07.339>
    creator = 'kleshni'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 40480
    keywords = ['patch']
    message_count = 13.0
    messages = ['367963', '367982', '367983', '368002', '368075', '368222', '368645', '368655', '368683', '368686', '368695', '368714', '368734']
    nosy_count = 4.0
    nosy_names = ['tim.peters', 'nedbat', 'Anthony Sottile', 'kleshni']
    pr_nums = ['19908', '20049']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue40480'
    versions = ['Python 3.9']

    @kleshni
    Copy link
    Mannequin Author

    kleshni mannequin commented May 3, 2020

    Hello. The following code hangs:

    import fnmatch
    fnmatch.fnmatchcase("a" * 32, "*" * 16 + "b")

    Measurements show that its execution time rises exponentially with the number of asterisks. Bash and SQLite 3 process this glob instantly.

    This is because "fnmatchcase" generates a regular expression with repeated dots:

    (?s:.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*b)\\Z

    It's equivalent to:

    (?s:.*b)\\Z

    But works in exponential time. So the solution is to replace multiple asterisks with a single one, so the latter regexp is generated instead.

    @kleshni kleshni mannequin added 3.8 only security fixes stdlib Python modules in the Lib dir performance Performance or resource usage labels May 3, 2020
    @kleshni
    Copy link
    Mannequin Author

    kleshni mannequin commented May 3, 2020

    So the solution is to replace multiple asterisks with a single one
    No, this is not a solution. This glob also hangs:

    fnmatch.fnmatchcase("a" * 32, "*a" * 16 + "b")

    And again, Bash and SQLite 3 work perfectly.

    @tim-one
    Copy link
    Member

    tim-one commented May 3, 2020

    Note that doctest has the same kind of potential problem with matching ellipsis (0 or more characters) in expected output blocks. Backtracking isn't needed at all to resolve patterns of that limited kind, but I don't think Python's re module supports enough gimmicks to disable backtracking.

    So instead doctest has its own

    _ellipsis_match()

    function to do it in worst-case linear time.

    @tim-one
    Copy link
    Member

    tim-one commented May 4, 2020

    "The trick" with this kind of pattern is that a * should consume as little as possible to allow the next fixed portion to match. Apply that at every stage, and it succeeds or it doesn't. Backtracking can't change that outcome. For, e.g., the shell pattern:

    a*bc*d*

    a regexp to do this without backtracking would be like this, IF Python's re engine supported "atomic groups" (but it doesn't):

    a(?>.*?bc)(?>.*?d).*\Z

    The same effect can be gotten in a more convoluted way, via positive lookahead assertions and backreferencing (which Python's re engine does support):

    a(?=(.*?bc))\1(?=(.*?d))\2.*\Z

    Assertions are "one and done": if the overall match fails, assertions don't try alternatives.

    So that's _a_ way to proceed. I'm unclear on that it's worth it, though. Stuff like "*a*a*a*a*a*a*" is just hard to swallow as a shell pattern that would occur in real life.

    @tim-one
    Copy link
    Member

    tim-one commented May 4, 2020

    Changed version to 3.9, because anything done would change the regexp generated, and fnmatch.translate()` makes that regexp visible.

    @tim-one tim-one added 3.9 only security fixes and removed 3.8 only security fixes labels May 4, 2020
    @tim-one
    Copy link
    Member

    tim-one commented May 6, 2020

    New changeset b9c46a2 by Tim Peters in branch 'master':
    bpo-40480 "fnmatch" exponential execution time (GH-19908)
    b9c46a2

    @tim-one tim-one closed this as completed May 6, 2020
    @tim-one tim-one self-assigned this May 6, 2020
    @tim-one tim-one closed this as completed May 6, 2020
    @tim-one tim-one self-assigned this May 6, 2020
    @nedbat
    Copy link
    Member

    nedbat commented May 11, 2020

    This change has caused a problem for coverage.py. The full details are here: nedbat/coveragepy#988 (comment)

    Coverage.py combines fnmatch-produced regexes by joining them with pipes into one larger regex. The \<g1> group in the regexes now makes that larger regex invalid.

    @tim-one
    Copy link
    Member

    tim-one commented May 11, 2020

    Ned, would it be possible to rewrite code of the form:

        if giant pasted regexp matches:
        
    to:
    
        if any(p matches for p in patterns):

    That should work under any version of Python.

    There's no guarantee that regexps _can_ be pasted together and still work, so I can't call this change "a bug". That pasting regexps together "worked" before was an implementation accident.

    I'd be happy to change it anyway, except I know of no way to use Python's re engine without backreferences that can avoid exponential-time behavior in some cases. In some other regexp engines, yes (e.g., as the code comments note, in those that support "atomic grouping"), but not in Python's. Nor does Python's re engine support reusing backreference names or numbers.

    So I know of no way to restore the ability to paste regexps together that wouldn't reintroduce the possiblity of exponential time failure :-(

    @asottile
    Copy link
    Mannequin

    asottile mannequin commented May 11, 2020

    one way might be to give the groups "unique" names (perhaps hashing the input string?) ((this is what I attempted to do in a little bit of code which tried to "backport" (group)*+ and (group)++))

    @tim-one
    Copy link
    Member

    tim-one commented May 12, 2020

    I don't want something probabilistic. Fix it or don't ;-)

    One thing that would work, but at the cost of non-determinism: do the same as now, but obtain the number part of the group name by applying next() to a module-global private instance of itertools.count(). That will keep the numbers increasing "forever", and across calls. The point to using .count() is that it's atomic (i.e., won't repeat a number if multiple threads happen to be constructing regexps simultaneously).

    It's a darned silly amount of effort, though ;-)

    @tim-one
    Copy link
    Member

    tim-one commented May 12, 2020

    New changeset b1b4c79 by Tim Peters in branch 'master':
    bpo-40480: restore ability to join fnmatch.translate() results (GH-20049)
    b1b4c79

    @nedbat
    Copy link
    Member

    nedbat commented May 12, 2020

    Wow, thanks Tim. To be honest, I was coming around to your original point of view that it was too much to promise that these regexes could be combined the way I'm doing it. If we have to undo this latest change, I can live with it.

    @tim-one
    Copy link
    Member

    tim-one commented May 12, 2020

    Ned, I'm happy to do this. While the ability to join wasn't documented, it's not an unreasonable expectation. I'm not sure it's possible to fail _unless_ the regexps use named groups (and/or numbered backreferences) - and nobody in their right mind would expect regexps for such simple patterns to do such a thing ;-)

    So chances seem decent the regression your user stumbled into wouldn't be the only one to pop up. The fix was actually quite easy (and thanks to Anthony for nudging me in that direction!). The annoying part was writing a test given that the precise group names generated are no longer predictable :-(

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes performance Performance or resource usage stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants