classification
Title: re module: number of named groups is limited to 100 max
Type: enhancement Stage: resolved
Components: Library (Lib), Regular Expressions Versions: Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: ezio.melotti, haypo, mrabarnett, pitrou, python-dev, r.david.murray, serhiy.storchaka, yselivanov
Priority: normal Keywords: patch

Created on 2014-09-18 17:39 by yselivanov, last changed 2014-09-29 20:15 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description
re_maxgroups.patch serhiy.storchaka, 2014-09-18 20:36
re_maxgroups_dynamic.patch serhiy.storchaka, 2014-09-21 20:49
Messages (10)
msg227055 - Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2014-09-18 17:39
While writing a lexer for JavaScript, I managed to hit the limit on the number of named groups in a single regexp, which is 100. The check is in the compile() function in sre_compile.py, and there is even an XXX comment on it.

Unfortunately, I'm not an expert in this module, so I'm not sure whether this check can be lifted, or at least whether the number can be bumped to 200 or 500 (why is it 100, by the way?).

Please share your thoughts.
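
For context, a minimal sketch of the kind of lexer that runs into this limit: every token type becomes one named group in a single alternation, so the group count grows with the token table. The token names and patterns below are hypothetical and much smaller than a real JavaScript lexer; this is not the reporter's actual code.

```python
import re

# Hypothetical token table; a real JavaScript lexer would have far more
# entries, each contributing one named group to the combined pattern.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_$][A-Za-z0-9_$]*"),
    ("OP", r"[=+\-*/]"),
    ("WS", r"\s+"),
]

# One named group per token type, joined into a single alternation.
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token_name, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError("unexpected character %r at %d" % (text[pos], pos))
        # m.lastgroup is the name of the named group that matched.
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize("x = 42 + y1"))
```

With this approach the number of named groups equals the number of token types, which is how a large token table can exceed 100 groups.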
msg227058 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-09-18 18:04
It is 100 to avoid a syntactic ambiguity between numbered groups and octal numbers, if I remember correctly.  I can't remember if that constraint still applies in Python 3, where the octal notation was made more strict in general.
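
The ambiguity can be illustrated with a short sketch based on the documented behaviour of the re module: a backslash followed by one or two digits is a group reference, while a backslash followed by three octal digits is a character escape.

```python
import re

# \1 refers back to group 1, so the pattern matches a doubled "a".
backref = re.fullmatch(r"(a)\1", "aa")

# \101 is three octal digits, so it is read as the character with
# octal value 101 (decimal 65, "A"), not as a reference to group 101.
octal = re.fullmatch(r"\101", "A")

print(backref is not None, octal is not None)
```

Because three-digit forms are already taken by octal escapes, the \N notation has no room for group numbers above 99.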
msg227060 - Author: Matthew Barnett (mrabarnett) * Date: 2014-09-18 18:54
In the regex module, I borrowed the \g<...> escape from .sub's replacement string to provide an alternative way to refer to a group in a pattern, and that let me remove the limit.
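
For reference, a small sketch of the \g<...> escape as it exists in the standard re module, where it is valid only in the replacement string of sub(). Using it inside a pattern is specific to the third-party regex module and is not shown here.

```python
import re

# \g<N> names a group explicitly in a replacement string.
swapped = re.sub(r"(\w+) (\w+)", r"\g<2> \g<1>", "hello world")

# \g<1>0 means "group 1 followed by a literal 0"; a bare \10 would be
# read as a reference to group 10 instead.
padded = re.sub(r"(a)", r"\g<1>0", "a")

print(swapped, padded)
```

The explicit delimiters are what remove the digit-count ambiguity that the plain \N form has.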
msg227063 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-18 20:36
There are two reasons for this limitation. The first reason is the one mentioned by David: there is no syntax to backreference a group with a number > 99 (although there is such a syntax for conditional groups and for substitutions). The second reason is that the current implementation of the regexp engine uses an array of constant size for groups.

Here is a patch which increases the static limit to 1000 groups. It also allows specifying an arbitrary group number in the form "(?P=number)". This is consistent with the syntax for conditional groups and for substitutions.
msg227064 - Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2014-09-18 20:53
Serhiy,

This is awesome!

Is it possible to split the patch in two, and commit the one that just increases the groups limit to 3.4 as well?

Thank you
msg227066 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-18 21:13
This is definitely not a bug fix. Maybe Matthew will commit it to the regex module, and then you could use regex instead of re.
msg227237 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-21 20:49
Here is a patch which removes the static limit. It is much more complicated than the first patch, and I would prefer to apply the first patch first. Aren't 1000 groups enough for everyone?
msg227635 - Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2014-09-26 16:51
I'm fine with either one, Serhiy. The static one looks good to me.
msg227820 - Author: Roundup Robot (python-dev) Date: 2014-09-29 19:50
New changeset 0b85ea4bd1af by Serhiy Storchaka in branch 'default':
Issue #22437: Number of capturing groups in regular expression is no longer
https://hg.python.org/cpython/rev/0b85ea4bd1af
msg227825 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-29 20:15
Thank you, Antoine, for your review.

To avoid a discrepancy between re and regex (and other engines), I have committed only part of the dynamic patch, without adding support for backreferences with an index over 99. This limit is unlikely to be reached in a hand-written regular expression, and in a generated regular expression you can use named groups.
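
A minimal sketch of the named-group workaround for generated patterns, assuming a Python version with this change applied (3.5 or later). The group names g1..g120 and the counts are made up for illustration.

```python
import re

N_GROUPS = 120  # illustrative; above the old static limit of 100
TARGET = 110    # group to reference; a numeric backreference cannot exceed 99

# Generate 120 named groups, each capturing exactly one character,
# followed by a named backreference to group g110.
pattern = "".join("(?P<g%d>.)" % i for i in range(1, N_GROUPS + 1))
pattern += "(?P=g%d)" % TARGET
compiled = re.compile(pattern)

chars = [chr(ord("a") + i % 26) for i in range(N_GROUPS)]
text = "".join(chars)

# The trailing character must repeat whatever group g110 captured.
good = compiled.fullmatch(text + chars[TARGET - 1])
bad = compiled.fullmatch(text + "!")

print(compiled.groups, good is not None, bad is None)
```

The (?P=name) form sidesteps the two-digit restriction because the reference is by name rather than by number.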

I found that backreference syntax is one of the most inconsistent things in regular expressions. There are at least 8 different variants (\N, \gN, \g<N>, \g{N}, \k<N>, \k'N', \k{N}, (?P=N)), and \g<N> has a different meaning in Perl.
History
Date User Action Args
2014-09-29 20:15:38 serhiy.storchaka set status: open -> closed; resolution: fixed; messages: + msg227825; stage: patch review -> resolved
2014-09-29 19:50:49 python-dev set nosy: + python-dev; messages: + msg227820
2014-09-26 16:51:05 yselivanov set messages: + msg227635
2014-09-21 20:49:46 serhiy.storchaka set files: + re_maxgroups_dynamic.patch; messages: + msg227237
2014-09-18 21:13:27 serhiy.storchaka set messages: + msg227066
2014-09-18 20:53:23 yselivanov set messages: + msg227064
2014-09-18 20:36:43 serhiy.storchaka set assignee: serhiy.storchaka; stage: patch review; versions: + Python 3.5
2014-09-18 20:36:02 serhiy.storchaka set files: + re_maxgroups.patch; keywords: + patch; messages: + msg227063
2014-09-18 18:54:25 mrabarnett set messages: + msg227060
2014-09-18 18:04:00 r.david.murray set nosy: + r.david.murray; messages: + msg227058
2014-09-18 17:39:42 yselivanov create