Issue43686
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2021-04-01 03:11 by alegrigoriev, last changed 2022-04-11 14:59 by admin. This issue is now closed.
Messages (3) | |||
---|---|---|---|
msg389949 - (view) | Author: Alexander Grigoriev (alegrigoriev) | Date: 2021-04-01 03:11 | |
Certain patterns and input strings cause re.match to take exponentially longer time. This can be expected because of recursive nature of matching some patterns, but it can take surprisingly long time with some combination of a pattern and input data of moderate length. For example: import re import time t1 = time.monotonic() re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_ {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n') t2 = time.monotonic() print(r"re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_ {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n'): %s s" % (t2-t1)) t1 = time.monotonic() re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_D {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n') t2 = time.monotonic() print(r"re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_D {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n'): %s s" % (t2-t1)) t1 = time.monotonic() re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_DI {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n') t2 = time.monotonic() print(r"re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_DI {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n'): %s s" % (t2-t1)) t1 = time.monotonic() re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_DIS {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n') t2 = time.monotonic() print(r"re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_DIS {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n'): %s s" % (t2-t1)) t1 = time.monotonic() re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_DISP {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n') t2 = time.monotonic() print(r"re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_DISP {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n'): %s s" % (t2-t1)) t1 = time.monotonic() re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_DISPL {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n') t2 = time.monotonic() print(r"re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_DISPL {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n'): %s s" % (t2-t1)) t1 = time.monotonic() re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_DISPLA {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n') t2 = time.monotonic() print(r"re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_DISPLA {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n'): %s s" % (t2-t1)) t1 = time.monotonic() re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_DISPLAY {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n') t2 = time.monotonic() print(r"re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_DISPLAY {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n'): %s s" % (t2-t1)) Does: re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_ {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n'): 0.2190000000409782 s re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_D {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n'): 0.4529999999795109 s re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_DI {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n'): 0.9529999999795109 s re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_DIS {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n'): 1.797000000020489 s re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_DISP {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n'): 3.6090000001713634 s re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_DISPL {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n'): 7.125 s re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_DISPLA {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n'): 14.890999999828637 s re.match(rb'((?: |\t)*)((?:.*?(?:\\(?: |\t)+)?)*)(\s*)$', b'#define EVENT_DISPLAY {\t\\\r\n\tif {\t\\\r\n\t\tmnt_disp;\t\\\r\n\t} }\r\n'): 29.688000000081956 s Perhaps re.match needs to be optimized and/or rewritten in low level language (C) |
|||
msg389953 - (view) | Author: Dennis Sweeney (Dennis Sweeney) * | Date: 2021-04-01 04:12 | |
It's well-known that regular expressions can take exponential time. You can try searching this bug tracker for "re exponential". Common suggestions are to try a third-party module, or to write better regexes where possible. Note that the important bits of the re module are already implemented in C: https://github.com/python/cpython/blob/master/Modules/_sre.c |
|||
msg390809 - (view) | Author: Gregory P. Smith (gregory.p.smith) * | Date: 2021-04-12 01:45 | |
Try something like https://pypi.org/project/pyre2/ or _maybe_ https://pypi.org/project/regex/ (I didn't think that one tried to do a breadth first approach though so maybe not) I'm marking this a duplicate of the age old issue around degenerate patterns in re. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:59:43 | admin | set | github: 87852 |
2021-04-12 01:45:43 | gregory.p.smith | set | status: open -> closed superseder: the re module can perform poorly: O(2**n) versus O(n**2) nosy: + gregory.p.smith messages: + msg390809 resolution: duplicate stage: resolved |
2021-04-01 04:12:25 | Dennis Sweeney | set | nosy:
+ Dennis Sweeney messages: + msg389953 |
2021-04-01 03:11:36 | alegrigoriev | create |