
Author tim.peters
Recipients ezio.melotti, lpd, mrabarnett, tim.peters
Date 2012-11-27.00:42:17
Message-id <1353976938.17.0.842131609007.issue16563@psf.upfronthosting.co.za>
In-reply-to
Content
There's actually enormous backtracking here.  Try this much shorter regexp and you'll see much the same behavior:

re_utf8 = r'^([\x00-\x7f]+)*$'

That's the original re_utf8 with all but the first alternative removed.

Looks like passing s[0:34] "works" because the slice drops the trailing \x8d, which is what prevents the regexp from matching the whole string.  When the regexp cannot match the whole string, the engine has to try all the futile combinations implied by the nested quantifiers before it can report failure, and that takes a very long time.  As the much simpler re_utf8 above shows, it's the nested quantifiers that matter here, not the alternatives in the regexp.
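A rough way to see the effect, offered as a sketch rather than a benchmark: the script below times a failing match of the reduced pattern against a run of ASCII characters followed by \x8d. The names NESTED, FLAT and time_failed_match are made up for illustration, FLAT is one possible single-quantifier rewrite and not something proposed in this issue, and the script assumes Python 3 str patterns. A run of n characters can be split into consecutive non-empty groups in 2**(n-1) ways, so the nested pattern's time should roughly double with each extra character, while the flat pattern fails almost immediately.

```python
import re
import time

# The reduced pattern from this message: a '+' nested inside a '*'.
NESTED = re.compile(r'^([\x00-\x7f]+)*$')
# Hypothetical rewrite with a single quantifier; it accepts the same strings.
FLAT = re.compile(r'^[\x00-\x7f]*$')


def time_failed_match(pattern, n):
    """Return seconds spent failing to match n ASCII chars plus a trailing \\x8d."""
    s = 'a' * n + '\x8d'
    start = time.perf_counter()
    result = pattern.match(s)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing \x8d can never be matched
    return elapsed


if __name__ == '__main__':
    # Keep n small: each extra character roughly doubles the nested time.
    for n in (12, 14, 16, 18):
        nested = time_failed_match(NESTED, n)
        flat = time_failed_match(FLAT, n)
        print(f'n={n:2d}  nested={nested:.4f}s  flat={flat:.6f}s')
```

Exact timings depend on the machine and the Python version, so none are quoted here; the growth pattern is the point.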
History
Date                 User        Action  Args
2012-11-27 00:42:18  tim.peters  set     recipients: + tim.peters, lpd, ezio.melotti, mrabarnett
2012-11-27 00:42:18  tim.peters  set     messageid: <1353976938.17.0.842131609007.issue16563@psf.upfronthosting.co.za>
2012-11-27 00:42:18  tim.peters  link    issue16563 messages
2012-11-27 00:42:17  tim.peters  create