This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Use named constants internally in the re module
Type: enhancement Stage: resolved
Components: Regular Expressions Versions: Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: effbot, ezio.melotti, gvanrossum, mrabarnett, pitrou, python-dev, rhettinger, serhiy.storchaka, terry.reedy
Priority: normal Keywords: patch

Created on 2014-09-17 16:09 by serhiy.storchaka, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
re_named_consts.patch serhiy.storchaka, 2014-09-17 16:09 review
re_opcode_enums.patch serhiy.storchaka, 2014-09-19 08:00
Messages (15)
msg227008 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-17 16:09
Regular expression parser parses a pattern to a tree, marking nodes by string identifiers. Regular expression compiler converts this three into plain list of integers. Node's identifiers are transformed to sequential integers. Resulting list is not human readable. Proposed patch converts string constants in the sre_constants module to named integer constants. These constants doesn't need converting to integers, because they are already integers, and when printed they looks human-friendly. Now intermediate result of regular expression compiler looks much more readable.

Example.

>>> import re, sre_compile, sre_parse
>>> sre_compile._code(sre_parse.parse('[a-z_][a-z_0-9]+', re.I), re.I)

Before patch:

[17, 4, 0, 2, 2147483647, 16, 7, 27, 97, 122, 19, 95, 0, 29, 16, 1, 2147483647, 16, 11, 10, 0, 67043328, 2147483648, 134217726, 0, 0, 0, 0, 0, 1, 1]

After patch:

[INFO, 4, 0, 2, MAXREPEAT, IN_IGNORE, 7, RANGE, 97, 122, LITERAL, 95, FAILURE, REPEAT_ONE, 16, 1, MAXREPEAT, IN_IGNORE, 11, CHARSET, 0, 67043328, 2147483648, 134217726, 0, 0, 0, 0, FAILURE, SUCCESS, SUCCESS]

This patch also affects debugging output when regular expression is compiled with re.DEBUG (identifiers are uppercased and MAXREPEAT is displayed instead of 2147483647 in repeat statements).

Besides debugging output these changes are invisible for ordinal user. They are needed only for developing and debugging the re module itself. The patch doesn't affect performance and almost not affects memory consumption.
msg227010 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2014-09-17 16:16
Love it!
msg227035 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-18 09:58
If there are no objections I'll commit the patch soon.
msg227049 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2014-09-18 15:03
Hm. Could you not use the new Enum class?
msg227051 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-18 16:14
Answering Guido's question about the Enum class. No, it is not appropriate here. It has too cumbersome repr (<OPCODES.IN_IGNORE: 16> instead of IN_IGNORE). Enum function syntax can't by used because it enumerates values from 1. We need three Enum subclasses for three groups of constants, and fourth class for MAXREPEAT, and fifth base abstract class. To fit the Enum class to our need we need more boilerplate code than to implement minimal needed functionality from scratch. And I'm not sure that this will not create circular dependency.
msg227059 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2014-09-18 18:26
I think you are too casual in rejecting a standard approach over a custom
clever hack. Making the values enums gives them a standard interface that
goes beyond what you implemented, and just the fact that we can say "these
are IntEnum instances" specifies a lot more of the semantics than pointing
to your class hack.

 It's easy to make a subclass of IntEnum that overrides __repr__ and
__str__. It's also easy enough to override the constructor to make the
values start at 0 (though it would be a nice feature if we could add a
keyword arg to the EnumMeta.__call__() definition to override the start
position). I don't really care that this is more code.

I don't see where there would be a circular dependency; the enum module
doesn't import the re module.

There is one thing that might be less convenient: defining an enum doesn't
automatically make the values globals. But wouldn't the code be better if
the values weren't globals?

Finally. You objected to adding __all__ in the code review. That too,
suggests a somewhat casual attitude. This code may be maintained by your
grandchildren. Give them something future-proof, please.

On Thu, Sep 18, 2014 at 9:14 AM, Serhiy Storchaka <report@bugs.python.org>
wrote:

>
> Serhiy Storchaka added the comment:
>
> Answering Guido's question about the Enum class. No, it is not appropriate
> here. It has too cumbersome repr (<OPCODES.IN_IGNORE: 16> instead of
> IN_IGNORE). Enum function syntax can't by used because it enumerates values
> from 1. We need three Enum subclasses for three groups of constants, and
> fourth class for MAXREPEAT, and fifth base abstract class. To fit the Enum
> class to our need we need more boilerplate code than to implement minimal
> needed functionality from scratch. And I'm not sure that this will not
> create circular dependency.
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue22434>
> _______________________________________
>
msg227083 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-19 08:00
Here is a patch using enums. I still think enums are superfluous here. Advanced the Enum class features (pickling, access by name, type checking) are not needed - these constants don't leaked in outer word.

> I don't see where there would be a circular dependency; the enum module
> doesn't import the re module.

Right now there is no circular dependency. But the enum module imports collections which imports a lot of other modules, some of which import other modules. In future some of indirectly imported module can import the re module.

> There is one thing that might be less convenient: defining an enum doesn't
> automatically make the values globals. But wouldn't the code be better if
> the values weren't globals?

I afraid this will make the code of parser and compiler less readable and slower. In any case the sre_constants module itself is a namespace.
msg227214 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-09-21 12:12
I think I agree with Serhiy. Augmenting the import dependencies from the re module can have side effects. While the alternative (a custom Int subclass) isn't extremely pretty, it does the job.

NamedIntConstant should be private, though.
msg227218 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2014-09-21 15:48
OK, then if you want to review Serhiy's original patch that would be great.

On Sun, Sep 21, 2014 at 5:12 AM, Antoine Pitrou <report@bugs.python.org>
wrote:

>
> Antoine Pitrou added the comment:
>
> I think I agree with Serhiy. Augmenting the import dependencies from the
> re module can have side effects. While the alternative (a custom Int
> subclass) isn't extremely pretty, it does the job.
>
> NamedIntConstant should be private, though.
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue22434>
> _______________________________________
>
msg227220 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-21 16:50
> NamedIntConstant should be private, though.

Agree. I'll add underscore to NamedIntConstant and makecode before committing 
(as in enum patch).
msg230865 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-11-08 15:39
Could you please make a review of any patch Antoine? This would help me to debug re engine. It doesn't matter which patch apply, with good chance all this will be changed before 3.5 release and may be not once.
msg230878 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-11-09 02:52
I reviewed re_named_consts.patch and it looks great (I especially like the removal of superfluous OPCODES dictionary lookups and improved repr for the integer codes).

Since the op codes are singletons, you can use identity tests instead of 
equality checks in sre_parse.py:

  -      if op == "in":
  +      if op == IN:

Also, I'll echo the suggestion to make NamedIntConstant private with a leading underscore.

Nice work.
msg230896 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-11-09 18:45
Thank you Raymond for the review.

> I especially like the removal of superfluous OPCODES dictionary lookups

The disadvantage is that now backporting patches to old branches is harder.

> Since the op codes are singletons, you can use identity tests instead of
> equality checks in sre_parse.py:

This is deliberate. There is small change that the opcodes could became an 
integers in future (after adding introspection to pattern object, or make it 
copyable, or allowing serialization of compiled codes). "is" instead of "==" 
can cause subtle bug. Note that "is" is used during parsing, but getwidth() is 
called from sre_compile.py.
msg230941 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-11-10 08:23
New changeset fc7dbba57869 by Serhiy Storchaka in branch 'default':
Issue #22434: Constants in sre_constants are now named constants (enum-like).
https://hg.python.org/cpython/rev/fc7dbba57869
msg231043 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-11-11 19:23
> Since the op codes are singletons, you can use identity tests instead of
> equality checks in sre_parse.py:

Please ignore my reply in previous message. Op codes are always tested for 
identity in sre_compile.py, so I have applied your suggestion in sre_parse.py 
(30a6c74ad87f).
History
Date User Action Args
2022-04-11 14:58:08adminsetgithub: 66624
2014-11-11 19:23:50serhiy.storchakasetmessages: + msg231043
2014-11-10 08:23:23python-devsetnosy: + python-dev
messages: + msg230941
2014-11-09 19:42:48serhiy.storchakasetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2014-11-09 18:45:04serhiy.storchakasetmessages: + msg230896
2014-11-09 02:54:03rhettingersetnosy: + effbot
2014-11-09 02:52:26rhettingersetnosy: + rhettinger
messages: + msg230878
2014-11-08 15:39:15serhiy.storchakasetmessages: + msg230865
2014-09-21 16:50:45serhiy.storchakasetmessages: + msg227220
2014-09-21 15:48:37gvanrossumsetmessages: + msg227218
2014-09-21 12:12:36pitrousetmessages: + msg227214
2014-09-19 18:52:16terry.reedysetnosy: + terry.reedy
2014-09-19 08:00:14serhiy.storchakasetfiles: + re_opcode_enums.patch

messages: + msg227083
2014-09-18 18:26:11gvanrossumsetmessages: + msg227059
2014-09-18 16:14:49serhiy.storchakasetmessages: + msg227051
2014-09-18 15:03:26gvanrossumsetmessages: + msg227049
2014-09-18 09:58:40serhiy.storchakasetassignee: serhiy.storchaka
messages: + msg227035
2014-09-17 16:16:55gvanrossumsetnosy: + gvanrossum
messages: + msg227010
2014-09-17 16:09:48serhiy.storchakacreate