classification
Title: Loop in re (regular expression) processing
Type: resource usage Stage:
Components: Library (Lib), Regular Expressions Versions: Python 3.5
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: Walter Farrell, ezio.melotti, gdr@garethrees.org, mrabarnett, vstinner
Priority: normal Keywords:

Created on 2016-11-14 16:24 by Walter Farrell, last changed 2016-11-15 10:03 by vstinner. This issue is now closed.

Messages (4)
msg280790 - (view) Author: Walter Farrell (Walter Farrell) Date: 2016-11-14 16:24
Given:
  pattern = r"(^|[^\\])<(pm [^ ]+( +|'[^']*'|\"[^\"]*\"|[^>]+)+)>"
  s = "<b>Bain, F. W.</b> <pm href 'Digit of the moon, and other"

import re
  m = re.search(pattern, s)

The re.search call seems to loop, never returning (or, at least never returning as long as I was willing to wait).

Note that with a ">" added to the end of s, it returns quickly with a match. Without the ">" it should fail, but instead seems to loop.

(If I use the regex module instead of re, it fails properly and quickly.)

Python 3.5.1, Windows 10:
Python 3.5.1 (v3.5.1:37a07cee5969, Dec  6 2015, 01:54:25) [MSC v.1900 64 bit (AMD64)] on win32
msg280813 - (view) Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2016-11-14 20:55
This is a well-known gotcha with backtracking regexp implementations. The problem is that in the alternation "( +|'[^']*'|\"[^\"]*\"|[^>]+)" there are some characters (space, apostrophe, double quotes) that match multiple alternatives (for example a space matches both " +" and "[^>]+"). This causes the regexp engine to have to backtrack for each ambiguous character to try out the other alternatives, leading to runtime that's exponential in the number of ambiguous characters.

Linear behaviour can be restored if you make the alternation unambiguous, like this: ( +|'[^']*'|\"[^\"]*\"|[^>'\"]+)
msg280814 - (view) Author: Walter Farrell (Walter Farrell) Date: 2016-11-14 21:09
Thanks, Gareth. That does work.

Interesting that regex does still seem to work linearly with the original
version, but your version seems cleaner.

On Mon, Nov 14, 2016 at 3:55 PM, Gareth Rees <report@bugs.python.org> wrote:

>
> Gareth Rees added the comment:
>
> This is a well-known gotcha with backtracking regexp implementations. The
> problem is that in the alternation "( +|'[^']*'|\"[^\"]*\"|[^>]+)" there
> are some characters (space, apostrophe, double quotes) that match multiple
> alternatives (for example a space matches both " +" and "[^>]+"). This
> causes the regexp engine to have to backtrack for each ambiguous character
> to try out the other alternatives, leading to runtime that's exponential in
> the number of ambiguous characters.
>
> Linear behaviour can be restored if you make the alternation unambiguous,
> like this: ( +|'[^']*'|\"[^\"]*\"|[^>'\"]+)
>
> ----------
> nosy: +Gareth.Rees
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue28690>
> _______________________________________
>
msg280829 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-15 10:03
It's not really a bug, but more a trap of regular expressions. It seems like you fixed your issue, so I close it.
History
Date User Action Args
2016-11-15 10:03:49vstinnersetstatus: open -> closed

nosy: + vstinner
messages: + msg280829

resolution: not a bug
2016-11-14 21:09:42Walter Farrellsetmessages: + msg280814
2016-11-14 20:55:25gdr@garethrees.orgsetnosy: + gdr@garethrees.org
messages: + msg280813
2016-11-14 16:24:47Walter Farrellcreate