
classification
Title: Reorganize the re module sources
Type:
Stage: patch review
Components: Library (Lib), Regular Expressions
Versions: Python 3.11
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Anthony Sottile, dom1310df, ezio.melotti, gvanrossum, malin, mrabarnett, serhiy.storchaka, vstinner
Priority: normal
Keywords: patch

Created on 2022-03-29 15:54 by serhiy.storchaka, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL Status Linked
PR 32177 merged serhiy.storchaka, 2022-03-29 16:15
PR 32188 malin, 2022-03-30 08:52
PR 32290 merged serhiy.storchaka, 2022-04-03 17:25
PR 32298 merged serhiy.storchaka, 2022-04-04 07:33
Messages (26)
msg416268 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-03-29 15:54
I proposed this several years ago on the Python-Dev mailing list, and the change was approved in general. The reorganization was deferred because there were several known bugs in the RE engine (fixes for which could potentially be backported) and there were unmerged patches waiting for review. Now the patch for atomic groups has been merged and the bugs have been fixed (thanks to Ma Lin).

Both the C code and the Python code for the re module are spread across several files in the Modules and Lib directories. This makes it difficult to work with all the related files, because they are intermixed with the source files of other modules.

The following changes are planned:

1. Convert the re module into a package. Make sre_* modules its submodules.
2. Move C sources for the _sre module into a separate directory.
3. Extract the code for generating definitions of C constants from definitions of Python constants into a separate script and add it to the Tools/scripts directory (there are precedents: generate_token.py, etc.).
msg416294 - (view) Author: Dominic Davis-Foster (dom1310df) Date: 2022-03-29 21:08
Could the sre_parse and sre_constants modules be kept with public names (i.e. without the leading underscore) but within the re namespace? I use them to tokenize and then syntax highlight regular expressions.

I did a quick search and found a few other users of the modules:

* pydoctor uses them for regex syntax highlighting[1], although it has its own copy of the sre_parse source rather than importing from stdlib.
* lark uses sre_parse to find minimum and maximum length of matching strings[2]
* sre_yield uses them to determine all strings that will match a regex[3]

The whole modules don't necessarily need exposing, but certainly sre_parse.parse, sre_parse.parse_template, and the opcodes from sre_constants would be the most useful.


[1] https://github.com/twisted/pydoctor/blob/c86273dffade5455890570142c8b7b068f5dffd1/pydoctor/epydoc/markup/_pyval_repr.py#L776
[2] https://github.com/lark-parser/lark/blob/85ea92ebf4e983e9997f9953a9c1463bb3d1c6cc/lark/utils.py#L120
[3] https://github.com/google/sre_yield/blob/3af063a0054c4646608b43b941fbfcbe4e01214a/sre_yield/__init__.py
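For readers unfamiliar with these use cases, here is a minimal sketch of what tokenizing a pattern and computing its match-length bounds looks like. It assumes the private module names that Python 3.11 ended up with (re._parser, re._constants) and falls back to the old sre_* names; top_level_tokens and width_bounds are hypothetical helper names, and everything here relies on undocumented internals that may change.

```python
# Assumption: re._parser / re._constants exist on Python 3.11+;
# older versions only provide the top-level sre_* modules.
try:
    from re import _parser as sre_parse
    from re import _constants as sre_constants
except ImportError:
    import sre_parse
    import sre_constants


def top_level_tokens(pattern):
    """Return the top-level (opcode, argument) pairs of a parsed pattern."""
    # SubPattern.data is the internal list of parsed items (undocumented).
    return list(sre_parse.parse(pattern).data)


def width_bounds(pattern):
    """Return the (min, max) length of strings the pattern can match."""
    return sre_parse.parse(pattern).getwidth()


tokens = top_level_tokens("ab")      # two LITERAL items
bounds = width_bounds("a{2,5}b")     # 2..5 "a" characters plus one "b"
```

A syntax highlighter would walk the items recursively (repeat and group opcodes carry nested sub-patterns as arguments), which is where the opcodes from sre_constants come in.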
msg416320 - (view) Author: Ma Lin (malin) * Date: 2022-03-30 03:26
Please don't merge too close to the 3.11 beta1 release date; I'll submit PRs after this is merged.
msg416328 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-03-30 07:23
It turns out that pip uses sre_constants in its copy of pyparsing. The problem is already fixed in upstream pyparsing and should soon be fixed in pip. We still need to keep sre_constants and maybe other sre_* modules, but deprecate them.

> Could the sre_parse and sre_constants modules be kept with public names (i.e. without the leading underscore) but within the re namespace?

It is a good idea that will minimize breakage in the short term. You can write "from re import sre_parse", and it would work in old and new versions because sre_parse and sre_compile were imported in the re module. This trick does not work with sre_constants; for that you still need try/except.

But code that depends on these modules is fragile and can break in other ways.
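A rough sketch of the try/except fallback for sre_constants, written as a small helper. The name import_first is hypothetical, and the new location re._constants is an assumption based on the layout that was merged for Python 3.11; the old name is tried only if the new one is missing, so no deprecation warning is triggered on newer versions.

```python
import importlib


def import_first(*names):
    """Import and return the first module in names that can be imported."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too, e.g. when re is not a package.
            continue
    raise ImportError(f"none of these modules could be imported: {names}")


# Prefer the new private submodule, fall back to the old top-level module.
sre_constants = import_first("re._constants", "sre_constants")
```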

> Please don't merge too close to the 3.11 beta1 release date; I'll submit PRs after this is merged.

I am going to implement step 2 only after merging your changes for issue23689.
msg416471 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-04-01 08:17
sre_constants, sre_compile and sre_parse are not tested and are not documented. I don't consider them as public API currently.

If someone has good reason to use them, IMO we must clearly define which exact API is needed, properly document and test it.

If we expose something, I don't think that the API would be exposed as re.sre_xxx.xxx, but as re.xxx. 

I suggest hiding the sre_xxx submodules by adding an underscore to their names. Moreover, the "sre_" prefix is now redundant. I suggest renaming:

* sre_constants => re._constants
* sre_compile => re._compile
* sre_parse => re._parse
msg416497 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2022-04-01 16:03
I don't mind reorganizing this, but I would insist that we keep code using old undocumented things (like the sre_* modules) working for several releases, using the standard deprecation approach.
msg416502 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-04-01 17:50
Modules with old names are kept (deprecated). The questions are:

1. Should we keep the sre_ prefix in new submodules? Should we prefix them with underscores?
2. Should we keep only non-underscored names in the sre_* modules, or underscored names too?
msg416523 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2022-04-01 23:53
1. If we're reorganizing anyway, I see no reason to keep the old names.
2. For maximum backwards compatibility, I'd say keep as much as you can, as long as keeping it won't interfere with the reorganization.
msg416543 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-04-02 08:35
New changeset 1be3260a90f16aae334d993aecf7b70426f98013 by Serhiy Storchaka in branch 'main':
bpo-47152: Convert the re module into a package (GH-32177)
https://github.com/python/cpython/commit/1be3260a90f16aae334d993aecf7b70426f98013
msg416545 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-04-02 11:59
$ ls Lib/re/
_compiler.py  _constants.py  __init__.py  _parser.py

Thanks, that's a nice enhancement!

Serhiy: Would you mind explicitly documenting the 3 deprecated modules in What's New in Python 3.11?
https://docs.python.org/dev/whatsnew/3.11.html#deprecated
msg416547 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-04-02 12:01
Is the "import _locale" still used in re/__init__.py? I cannot see any reference to it in the code, and test_re still passes if it's removed.

The last reference to the _locale module has been removed in 2017 by the commit 898ff03e1e7925ecde3da66327d3cdc7e07625ba.

diff --git a/Lib/re/__init__.py b/Lib/re/__init__.py
index c47a2650e3..b887722bbb 100644
--- a/Lib/re/__init__.py
+++ b/Lib/re/__init__.py
@@ -124,10 +124,6 @@
 import enum
 from . import _compiler, _parser
 import functools
-try:
-    import _locale
-except ImportError:
-    _locale = None
 
 
 # public symbols
msg416548 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-04-02 12:05
It's funny to still see mentions of "experimental stuff" in Python 3.11 (2022), when this "experimental stuff" has been there for 20 years.

*Maybe* it's time to consider that re.template() and re.Scanner are no longer experimental? Maybe change their status to alpha or beta? :-D


commit 770617b23e286f1147f9480b5f625e88e7badd50
Author: Fredrik Lundh <fredrik@pythonware.com>
Date:   Sun Jan 14 15:06:11 2001 +0000

    SRE fixes for 2.1 alpha:

+# sre extensions (experimental, don't rely on these)
+T = TEMPLATE = sre_compile.SRE_FLAG_TEMPLATE # disable backtracking


commit 7cafe4d7e466996d5fc32e871fe834e0e0c94282
Author: Fredrik Lundh <fredrik@pythonware.com>
Date:   Sun Jul 2 17:33:27 2000 +0000

    - actually enabled charset anchors in the engine (still not
      used by the code generator)
    
    - changed max repeat value in engine (to match earlier array fix)
    
    - added experimental "which part matched?" mechanism to sre; see
      http://hem.passagen.se/eff/2000_07_01_bot-archive.htm#416954
      or python-dev for details.


+# experimental stuff (see python-dev discussions for details)
+
+class Scanner:
(...)
msg416551 - (view) Author: Ma Lin (malin) * Date: 2022-04-02 12:46
In the `Modules` folder, there are _sre.c/sre.h/sre_constants.h/sre_lib.h files. Will they be put into a folder?
msg416557 - (view) Author: Anthony Sottile (Anthony Sottile) * Date: 2022-04-02 15:06
would it be possible to expose `parse_template` -- or at least some way to validate that a regex replacement string is correct prior to executing the replacement?

I'm currently using that for my text editor: https://github.com/asottile/babi/blob/d37d7d698d560aef7c6a0d1ec0668672e039bd9a/babi/screen.py#L501
msg416563 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-04-02 16:05
> Is the "import _locale" still used in re/__init__.py? I cannot see any reference to it in the code, and test_re still passes if it's removed.

It is true.

> *Maybe* it's time to consider that re.template() and re.Scanner are no longer experimental? Maybe change their status to alpha or beta? :-D

First we need to find the original discussions for these features (which may not be easy) and decide whether we want to finish them or remove them.

> In the `Modules` folder, there are _sre.c/sre.h/sre_constants.h/sre_lib.h files. Will they be put into a folder?

It is step 2.

> would it be possible to expose `parse_template` -- or at least some way to validate that a regex replacement string is correct prior to executing the replacement?

Maybe, in some form. Currently you can precompile a pattern, but for a replacement string you rely on an LRU cache. It is slower, and limited by the fixed size of the cache. I think it would be worth adding a function for compiling a replacement string. sub() etc. should accept both a string and a precompiled template object. It is a separate issue.
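Until such a function exists, one possible workaround that uses only public API is to run the substitution against an empty string and catch the error. This is a sketch, not an official recipe: check_replacement is a hypothetical name, and it relies on the observed CPython behavior that a replacement string containing backslashes is parsed before any matching happens, which is not a documented guarantee.

```python
import re


def check_replacement(pattern, repl):
    """Return None if repl is a usable template for pattern, else an error message."""
    compiled = re.compile(pattern)
    try:
        # The template is parsed up front, so the empty input is enough
        # to surface syntax errors and bad group references.
        compiled.sub(repl, "")
    except (re.error, IndexError) as exc:
        # re.error: bad escape or invalid numeric group reference.
        # IndexError: unknown group name in \g<name>.
        return str(exc)
    return None


ok = check_replacement(r"(\w+)", r"<\1>")
bad_number = check_replacement(r"(\w+)", r"\2")
bad_name = check_replacement(r"(\w+)", r"\g<name>")
```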
msg416591 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-04-02 21:49
Old python-dev discussions on re.Scanner from 2000 to 2004:

* "[Python-Dev] A standard lexer?" (July 2000)
  https://mail.python.org/archives/list/python-dev@python.org/message/MQ4OMCVIVRJWNGHYGI3OUVZQPN5NNNAU/
  thread: https://mail.python.org/archives/list/python-dev@python.org/thread/DLMYLYW3QRAAIZDEL3VA7M3TTUWMSPPB/#MQ4OMCVIVRJWNGHYGI3OUVZQPN5NNNAU

* "Scanner" (May 2001)
  https://mail.python.org/archives/list/python-dev@python.org/thread/7FGWHTFA2JT23TMVQXLGZLSKG7EGM44Q/#SVQBSSDWPYVHPRS363RWXWGKJTSEYQDP

* "iterator support for SRE?" (Oct 2001):
  https://mail.python.org/archives/list/python-dev@python.org/thread/IPJJX6MEW4ATOWHSRKLITL4CAZXDEJ5I/#IPJJX6MEW4ATOWHSRKLITL4CAZXDEJ5I

* "should sre.Scanner be exposed through re and documented?" (April 2003)
  https://mail.python.org/archives/list/python-dev@python.org/thread/BHVWYZVMDUJZIJMSSBAAXEH3JI7MTOIJ/#DDFDBY4D6OITPWO26Q5XPBFU7A5X6LXN

* "pre-PEP: Complete, Structured Regular Expression Group Matching" (Aug 2004)
  https://mail.python.org/archives/list/python-dev@python.org/thread/5M4YIZ2UFZF5AEWT3CGG74ZHERC6JV3B/#SNURCRGEYANPQVVQFZTY3LTXE2TFEKEP
  Search for "sre.Scanner".

  See also: "Using Regular Expressions for Lexical Analysis" (Feb 2002) by Fredrik Lundh
  https://web.archive.org/web/20200220172033/http://effbot.org/zone/xml-scanner.htm
msg416593 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-04-02 21:50
See also bpo-40259: "re.Scanner groups".
msg416595 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-04-02 21:52
The re.template() function and the re.TEMPLATE flag are not documented and not tested.

The re.Scanner class is not documented but has a test_scanner() test in test_re.
msg416615 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-04-03 11:01
There are two very different classes with similar names: _sre.SRE_Scanner and re.Scanner. The former is used to implement the Pattern.finditer() method, but it could be used in other cases. The latter is an experimental implementation of a generalized lexer using the former class. Both are undocumented. It is difficult to document Pattern.scanner() and _sre.SRE_Scanner because the class name contains an implementation-specific prefix, and without the prefix it would conflict with re.Scanner.

But let's leave all of that to a separate issue.

The original discussion about TEMPLATE was lost. Initially it only affected repetition operators, but now using them with TEMPLATE is an error.
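To make the difference concrete, here is a small sketch that exercises both. Both interfaces are undocumented, so treat this as an illustration of current CPython behavior rather than a supported API; the token names and sample inputs are made up for the example.

```python
import re

# 1. The low-level scanner object returned by Pattern.scanner().
#    Each call to search() continues from where the previous match ended.
low_level = re.compile(r"\d+").scanner("a1 b22 c333")
numbers = []
while (m := low_level.search()) is not None:
    numbers.append(m.group())

# 2. re.Scanner: a lexer built from (pattern, action) pairs.
#    An action receives (scanner, matched_text); None means "skip this token".
lexer = re.Scanner([
    (r"[a-zA-Z_]\w*", lambda scanner, text: ("NAME", text)),
    (r"\d+", lambda scanner, text: ("INT", int(text))),
    (r"\s+", None),
])
# scan() returns the collected tokens and the unmatched remainder.
tokens, remainder = lexer.scan("x 42 y!")
```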
msg416657 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-04-04 07:53
New changeset 1578f06c1c69fbbb942b90bfbacd512784b599fa by Serhiy Storchaka in branch 'main':
bpo-47152: Move sources of the _sre module into a subdirectory (GH-32290)
https://github.com/python/cpython/commit/1578f06c1c69fbbb942b90bfbacd512784b599fa
msg416659 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-04-04 09:00
New changeset ff2cf1d7d5fb25224f3ff2e0c678d36f78e1f3cb by Serhiy Storchaka in branch 'main':
bpo-47152: Remove unused import in re (GH-32298)
https://github.com/python/cpython/commit/ff2cf1d7d5fb25224f3ff2e0c678d36f78e1f3cb
msg416667 - (view) Author: Ma Lin (malin) * Date: 2022-04-04 12:26
Match.regs is an undocumented attribute, it seems it has existed since 1991. 
Can it be removed?

https://github.com/python/cpython/blob/ff2cf1d7d5fb25224f3ff2e0c678d36f78e1f3cb/Modules/_sre/sre.c#L2871
msg416676 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2022-04-04 17:04
For reference, I also implemented .regs in the regex module for compatibility, but I've never used it myself. I had to do some investigating to find out what it did!

It returns a tuple of the spans of the groups.

Perhaps I might have used it if it didn't have such a cryptic name and/or was documented.
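A quick illustration of what .regs holds, compared with the documented Match.span(). The pattern and input are arbitrary; the attribute itself is undocumented, so this only shows current behavior.

```python
import re

m = re.match(r"(a)(b)?", "a")

# One (start, end) pair per group, with group 0 (the whole match) first.
# A group that did not participate in the match is reported as (-1, -1).
regs = m.regs

# The same information assembled from the documented API.
spans = tuple(m.span(i) for i in range(m.re.groups + 1))
```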
msg416693 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-04-04 19:39
> Match.regs is an undocumented attribute, it seems it has existed since 1991. 
Can it be removed?

It was kept for compatibility with the pre-SRE implementation of the re module. It was an implementation detail in the original Python code, but I am sure that some code still uses it. If we are going to remove it, it needs to be deprecated first.
msg416747 - (view) Author: Ma Lin (malin) * Date: 2022-04-05 04:26
> cryptic name

In very early versions, "mark" was called register/region.
https://github.com/python/cpython/blob/v1.0.1/Modules/regexpr.h#L48-L52

If spans are accessed repeatedly, it is faster than Match.span().
Maybe consider renaming it and making it a public attribute.
msg416761 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-04-05 08:10
See issue47211 for removing re.TEMPLATE.
History
Date  User  Action  Args
2022-04-11 14:59:57  admin  set  github: 91308
2022-04-05 08:10:39  serhiy.storchaka  set  messages: + msg416761
2022-04-05 04:26:39  malin  set  messages: + msg416747
2022-04-04 19:39:40  serhiy.storchaka  set  messages: + msg416693
2022-04-04 17:04:09  mrabarnett  set  messages: + msg416676
2022-04-04 12:26:44  malin  set  messages: + msg416667
2022-04-04 09:00:57  serhiy.storchaka  set  messages: + msg416659
2022-04-04 07:53:35  serhiy.storchaka  set  messages: + msg416657
2022-04-04 07:33:44  serhiy.storchaka  set  pull_requests: + pull_request30357
2022-04-03 17:25:49  serhiy.storchaka  set  pull_requests: + pull_request30351
2022-04-03 11:01:28  serhiy.storchaka  set  messages: + msg416615
2022-04-02 21:52:52  vstinner  set  messages: + msg416595
2022-04-02 21:50:36  vstinner  set  messages: + msg416593
2022-04-02 21:49:15  vstinner  set  messages: + msg416591
2022-04-02 16:05:08  serhiy.storchaka  set  messages: + msg416563
2022-04-02 15:06:17  Anthony Sottile  set  nosy: + Anthony Sottile; messages: + msg416557
2022-04-02 12:46:17  malin  set  messages: + msg416551
2022-04-02 12:05:55  vstinner  set  messages: + msg416548
2022-04-02 12:01:28  vstinner  set  messages: + msg416547
2022-04-02 11:59:25  vstinner  set  messages: + msg416545
2022-04-02 08:35:27  serhiy.storchaka  set  messages: + msg416543
2022-04-01 23:53:02  gvanrossum  set  messages: + msg416523
2022-04-01 17:50:47  serhiy.storchaka  set  messages: + msg416502
2022-04-01 16:03:36  gvanrossum  set  messages: + msg416497
2022-04-01 08:17:23  vstinner  set  nosy: + vstinner; messages: + msg416471
2022-03-30 08:52:05  malin  set  pull_requests: + pull_request30266
2022-03-30 07:23:55  serhiy.storchaka  set  messages: + msg416328
2022-03-30 03:26:27  malin  set  messages: + msg416320
2022-03-29 21:08:35  dom1310df  set  nosy: + dom1310df; messages: + msg416294
2022-03-29 16:15:37  serhiy.storchaka  set  keywords: + patch; stage: patch review; pull_requests: + pull_request30255
2022-03-29 15:54:32  serhiy.storchaka  create