Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reorganize the re module sources #91308

Closed
serhiy-storchaka opened this issue Mar 29, 2022 · 31 comments
Closed

Reorganize the re module sources #91308

serhiy-storchaka opened this issue Mar 29, 2022 · 31 comments
Labels
3.11 only security fixes stdlib Python modules in the Lib dir topic-regex

Comments

@serhiy-storchaka
Copy link
Member

BPO 47152
Nosy @gvanrossum, @vstinner, @ezio-melotti, @serhiy-storchaka, @animalize, @asottile
PRs
  • bpo-47152: Convert the re module into a package #32177
  • [WIP] bpo-23689: re module, fix memory leak when a match is terminated by a signal #32188
  • bpo-47152: Move sources of the _sre module into a subdirectory #32290
  • bpo-47152: Remove unused import in re #32298
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2022-03-29.15:54:32.145>
    labels = ['expert-regex', 'library', '3.11']
    title = 'Reorganize the re module sources'
    updated_at = <Date 2022-04-05.08:10:39.512>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2022-04-05.08:10:39.512>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)', 'Regular Expressions']
    creation = <Date 2022-03-29.15:54:32.145>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 47152
    keywords = ['patch']
    message_count = 26.0
    messages = ['416268', '416294', '416320', '416328', '416471', '416497', '416502', '416523', '416543', '416545', '416547', '416548', '416551', '416557', '416563', '416591', '416593', '416595', '416615', '416657', '416659', '416667', '416676', '416693', '416747', '416761']
    nosy_count = 8.0
    nosy_names = ['gvanrossum', 'vstinner', 'ezio.melotti', 'mrabarnett', 'serhiy.storchaka', 'malin', 'Anthony Sottile', 'dom1310df']
    pr_nums = ['32177', '32188', '32290', '32298']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue47152'
    versions = ['Python 3.11']

    @serhiy-storchaka
    Copy link
    Member Author

    I proposed it several years ago on the Python-Dev mailing list and that change was approved in general. The reorganization was deferred because there were several known bugs in the RE engine (fixes for which could potentially be backported) and there were not merged patches waiting for review. Now the patch for atomic groups was merged and bugs was fixed (thanks to Ma Lin).

    Both the C code and the Python code for the re module are distributed on few files, which lie down in directories Modules and Lib. It makes difficult to work with all related files because they are intermixed with source files of different modules.

    The following changes are planned:

    1. Convert the re module into a package. Make sre_* modules its submodules.
    2. Move C sources for the _sre module into a separate directory.
    3. Extract the code for generating definitions of C constants from definitions of Python constants into a separate script and add it in the Tools/scripts directory (there are precedences: generate_token.py, etc).

    @serhiy-storchaka serhiy-storchaka added 3.11 only security fixes stdlib Python modules in the Lib dir topic-regex labels Mar 29, 2022
    @dom1310df
    Copy link
    Mannequin

    dom1310df mannequin commented Mar 29, 2022

    Could the sre_parse and sre_constants modules be kept with public names (i.e. without the leading underscore) but within the re namespace? I use them to tokenize and then syntax highlight regular expressions.

    I did a quick search and found a few other users of the modules:

    • pydoctor uses them for regex syntax highlighting[1], although it has its own copy of the sre_parse source rather than importing from stdlib.
    • lark uses sre_parse to find minimum and maximum length of matching strings[2]
    • sre_yield uses them to determine all strings that will match a regex[3]

    The whole modules don't necessarily need exposing, but certainly sre_parse.parse, sre_parse.parse_template, and the opcodes from sre_constants would be the most useful.

    [1] https://github.com/twisted/pydoctor/blob/c86273dffade5455890570142c8b7b068f5dffd1/pydoctor/epydoc/markup/_pyval_repr.py#L776
    [2] https://github.com/lark-parser/lark/blob/85ea92ebf4e983e9997f9953a9c1463bb3d1c6cc/lark/utils.py#L120
    [3] https://github.com/google/sre_yield/blob/3af063a0054c4646608b43b941fbfcbe4e01214a/sre_yield/__init__.py

    @animalize
    Copy link
    Mannequin

    animalize mannequin commented Mar 30, 2022

    Please don't merge too close to the 3.11 beta1 release date, I'll submit PRs after this merged.

    @serhiy-storchaka
    Copy link
    Member Author

    It turns out that pip uses sre_constants in its copy of pyparsing. The problem is already fixed in the upstream of pyparsing and soon should be fixed in pip. We still need to keep sre_constants and maybe other sre_* modules, but deprecate them.

    Could the sre_parse and sre_constants modules be kept with public names (i.e. without the leading underscore) but within the re namespace?

    It is a good idea which will allow to minimize breakage in short term. You can write "from re import sre_parse", and it would work in old and new versions because sre_parse and sre_compile were imported in the re module. This trick does not work with sre_constants, you still need try/except.

    But the code that depends on these modules is fragile and can be broken by other ways.

    Please don't merge too close to the 3.11 beta1 release date, I'll submit PRs after this merged.

    I am going to implement step 2 only after merging your changes for bpo-23689.

    @vstinner
    Copy link
    Member

    vstinner commented Apr 1, 2022

    sre_constants, sre_compile and sre_parse are not tested and are not documented. I don't consider them as public API currently.

    If someone has good reason to use them, IMO we must clearly define which exact API is needed, properly document and test it.

    If we expose something, I don't think that the API would be exposed as re.sre_xxx.xxx, but as re.xxx.

    I suggest to hide sre_xxx submodules by adding an underscore to their name. Moreover, the "sre_" prefix is now redundant. I suggest renaming:

    • sre_constants => re._constants
    • sre_compile => re._compile
    • sre_parse => re._parse

    @gvanrossum
    Copy link
    Member

    I don't mind reorganizing this, but I would insist that we keep code using old undocumented things (like the sre_* modules) working for several releases, using the standard deprecation approach.

    @serhiy-storchaka
    Copy link
    Member Author

    Modules with old names are kept (deprecated). The questions are:

    1. Should we keep the sre_ prefix in new submodules? Should we prefix them with underscores?
    2. Should we keep only non-underscored names in the sre_* modules or undescored names too?

    @gvanrossum
    Copy link
    Member

    1. If we're reorganizing anyway, I see no reason to keep the old names.
    2. For maximum backwards compatibility, I'd say keep as much as you can, as long as keeping it won't interfere with the reorganization.

    @serhiy-storchaka
    Copy link
    Member Author

    New changeset 1be3260 by Serhiy Storchaka in branch 'main':
    bpo-47152: Convert the re module into a package (GH-32177)
    1be3260

    @vstinner
    Copy link
    Member

    vstinner commented Apr 2, 2022

    $ ls Lib/re/
    _compiler.py  _constants.py  __init__.py  _parser.py

    Thanks, that's a nice enhancement!

    Serhiy: Would you mind to explicitly document the 3 deprecated modules in What's New in Python 3.11?
    https://docs.python.org/dev/whatsnew/3.11.html#deprecated

    @vstinner
    Copy link
    Member

    vstinner commented Apr 2, 2022

    Is the "import _locale" still used in re/init.py? It cannot see any reference to it in the code and test_re still if it's removed.

    The last reference to the _locale module has been removed in 2017 by the commit 898ff03.

    diff --git a/Lib/re/__init__.py b/Lib/re/__init__.py
    index c47a2650e3..b887722bbb 100644
    --- a/Lib/re/__init__.py
    +++ b/Lib/re/__init__.py
    @@ -124,10 +124,6 @@
     import enum
     from . import _compiler, _parser
     import functools
    -try:
    -    import _locale
    -except ImportError:
    -    _locale = None
     
     
     # public symbols

    @vstinner
    Copy link
    Member

    vstinner commented Apr 2, 2022

    It's funny to still see mentions of "experimental stuff" in Python 3.11 (2022), whereas these "experimental stuff" are there for 20 years.

    Maybe it's time to consider that re.template() and re.Scanner are no longer experimental? Maybe change their status to alpha or beta? :-D

    commit 770617b23e286f1147f9480b5f625e88e7badd50
    Author: Fredrik Lundh \<fredrik@pythonware.com\>
    Date:   Sun Jan 14 15:06:11 2001 +0000
    
        SRE fixes for 2.1 alpha:
    
    +# sre extensions (experimental, don't rely on these)
    +T = TEMPLATE = sre_compile.SRE_FLAG_TEMPLATE # disable backtracking
    
    
    commit 7cafe4d7e466996d5fc32e871fe834e0e0c94282
    Author: Fredrik Lundh \<fredrik@pythonware.com\>
    Date:   Sun Jul 2 17:33:27 2000 +0000
    
        - actually enabled charset anchors in the engine (still not
          used by the code generator)
        
        - changed max repeat value in engine (to match earlier array fix)
        
        - added experimental "which part matched?" mechanism to sre; see
          http://hem.passagen.se/eff/2000_07_01_bot-archive.htm#416954
          or python-dev for details.
    
    
    +# experimental stuff (see python-dev discussions for details)
    +
    +class Scanner:
    (...)
    

    @animalize
    Copy link
    Mannequin

    animalize mannequin commented Apr 2, 2022

    In Modules folder, there are _sre.c/sre.h/sre_constants.h/sre_lib.h files. Will them be put into a folder?

    @asottile
    Copy link
    Mannequin

    asottile mannequin commented Apr 2, 2022

    would it be possible to expose parse_template -- or at least some way to validate that a regex replacement string is correct prior to executing the replacement?

    I'm currently using that for my text editor: https://github.com/asottile/babi/blob/d37d7d698d560aef7c6a0d1ec0668672e039bd9a/babi/screen.py#L501

    @serhiy-storchaka
    Copy link
    Member Author

    Is the "import _locale" still used in re/init.py? It cannot see any reference to it in the code and test_re still if it's removed.

    It is true.

    *Maybe* it's time to consider that re.template() and re.Scanner are no longer experimental? Maybe change their status to alpha or beta? :-D

    First we need to find original discussions for these features (it may be not easy) and decide whether we want to finish them or remove.

    In Modules folder, there are _sre.c/sre.h/sre_constants.h/sre_lib.h files. Will them be put into a folder?

    It is step 2.

    would it be possible to expose parse_template -- or at least some way to validate that a regex replacement string is correct prior to executing the replacement?

    Maybe, in some form. Currently you can precompile a pattern, but for a replacement string you rely on a LRU cache. It is slower, and limited by the fixed size of the cache. I think it would be worth to add a function for compiling a replacement string. sub() etc should accept both string and a precompiled template object. It is a separate issue.

    @vstinner
    Copy link
    Member

    vstinner commented Apr 2, 2022

    See also bpo-40259: "re.Scanner groups".

    @vstinner
    Copy link
    Member

    vstinner commented Apr 2, 2022

    The re.template() function and the re.TEMPLATE functions are not documented and not tested.

    The re.Scanner class is not documented but has a test_scanner() test in test_re.

    @serhiy-storchaka
    Copy link
    Member Author

    There are two very different classes with similar names: _sre.SRE_Scanner and re.Scanner. The former is used to implement the Pattern.finditer() method, but it could be used in other cases. The latter is an experimental implementation of generalized lexer using the former class. Both are undocumented. It is difficult to document Pattern.scanner() and _sre.SRE_Scanner because the class name contains implementation-specific prefix, and without it it would conflict with re.Scanner.

    But let leave it all to a separate issue.

    The original discussion about TEMPLATE was lost. Initially it only affected repetition operators, but now using them with TEMPLATE is error.

    @serhiy-storchaka
    Copy link
    Member Author

    New changeset 1578f06 by Serhiy Storchaka in branch 'main':
    bpo-47152: Move sources of the _sre module into a subdirectory (GH-32290)
    1578f06

    @serhiy-storchaka
    Copy link
    Member Author

    New changeset ff2cf1d by Serhiy Storchaka in branch 'main':
    bpo-47152: Remove unused import in re (GH-32298)
    ff2cf1d

    @animalize
    Copy link
    Mannequin

    animalize mannequin commented Apr 4, 2022

    Match.regs is an undocumented attribute, it seems it has existed since 1991.
    Can it be removed?

    {"regs", (getter)match_regs_get, (setter)NULL},

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Apr 4, 2022

    For reference, I also implemented .regs in the regex module for compatibility, but I've never used it myself. I had to do some investigating to find out what it did!

    It returns a tuple of the spans of the groups.

    Perhaps I might have used it if it didn't have such a cryptic name and/or was documented.

    @serhiy-storchaka
    Copy link
    Member Author

    Match.regs is an undocumented attribute, it seems it has existed since 1991.
    Can it be removed?

    It was kept for compatibility with the pre-SRE implementation of the re module. It was an implementation detail in the original Python code, but I am sure that somebody still uses it. I am sure some code still use it. If we are going to remove it, it needs to be deprecated first.

    jpuhlman pushed a commit to MontaVista-OpenSourceTechnology/poky that referenced this issue Sep 19, 2022
    Source: poky
    MR: 121051
    Type: Integration
    Disposition: Merged from poky
    ChangeID: a18a014
    Description:
    
    As reported by Martin Jansa <Martin.Jansa@gmail.com>:
    
    bitbake/lib/bb/cooker.py:16: DeprecationWarning: module 'sre_constants' is deprecated
      import sre_constants
    
    it's deprecated since 3.11 with:
    
      python/cpython#91308
    
    The correct replacement for our usage is re.error so use that instead.
    
    (Bitbake rev: c98007217b8e40f1abfdcba709185dc5ddbcd0c2)
    
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    (cherry picked from commit 3c0cd401472ffee06d5a93bdba566cb033851fcf)
    Signed-off-by: Steve Sakoman <steve@sakoman.com>
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    Signed-off-by: Jeremy A. Puhlman <jpuhlman@mvista.com>
    slyon added a commit to slyon/rstr that referenced this issue Nov 8, 2022
    The undocumented sre_parse module got deprecated in Python 3.11 which leads
    to build/test failures on Linux distributions making use of this version, e.g.
    Ubuntu 23.04.
    
    See bpo-47152: python/cpython#91308
    slyon added a commit to slyon/rstr that referenced this issue Nov 8, 2022
    The undocumented sre_parse module got deprecated in Python 3.11 which leads
    to build/test failures on Linux distributions making use of this version, e.g.
    Ubuntu 23.04.
    
    See bpo-47152: python/cpython#91308
    slyon added a commit to slyon/rstr that referenced this issue Nov 8, 2022
    The undocumented sre_parse module got deprecated in Python 3.11 which leads
    to build/test failures on Linux distributions making use of this version, e.g.
    Ubuntu 23.04.
    
    See bpo-47152: python/cpython#91308
    fruch added a commit to fruch/scylla-cqlsh that referenced this issue Dec 28, 2022
    as part of cleanup in pyhton source code,
    some of the regex code is using internal resources
    that was renamed.
    
    Ref: python/cpython#91308
    fruch added a commit to fruch/scylla-cqlsh that referenced this issue Dec 28, 2022
    as part of cleanup in pyhton source code,
    some of the regex code is using internal resources
    that was renamed.
    
    Ref: python/cpython#91308
    fruch added a commit to scylladb/scylla-cqlsh that referenced this issue Dec 28, 2022
    as part of cleanup in pyhton source code,
    some of the regex code is using internal resources
    that was renamed.
    
    Ref: python/cpython#91308
    rpurdie added a commit to yoctoproject/poky that referenced this issue Feb 22, 2023
    As reported by Martin Jansa <Martin.Jansa@gmail.com>:
    
    bitbake/lib/bb/cooker.py:16: DeprecationWarning: module 'sre_constants' is deprecated
      import sre_constants
    
    it's deprecated since 3.11 with:
    
      python/cpython#91308
    
    The correct replacement for our usage is re.error so use that instead.
    
    (Bitbake rev: a4cd5b0b4b355b7b75fb48c61289700e3e908b2a)
    
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    Signed-off-by: Steve Sakoman <steve@sakoman.com>
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    halstead pushed a commit to openembedded/bitbake that referenced this issue Feb 22, 2023
    As reported by Martin Jansa <Martin.Jansa@gmail.com>:
    
    bitbake/lib/bb/cooker.py:16: DeprecationWarning: module 'sre_constants' is deprecated
      import sre_constants
    
    it's deprecated since 3.11 with:
    
      python/cpython#91308
    
    The correct replacement for our usage is re.error so use that instead.
    
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    Signed-off-by: Steve Sakoman <steve@sakoman.com>
    jpuhlman pushed a commit to MontaVista-OpenSourceTechnology/poky that referenced this issue Feb 24, 2023
    Source: poky
    MR: 125004
    Type: Integration
    Disposition: Merged from poky
    ChangeID: 86e2430d3f40433f978667f15ab6d20d0663e56d
    Description:
    
    As reported by Martin Jansa <Martin.Jansa@gmail.com>:
    
    bitbake/lib/bb/cooker.py:16: DeprecationWarning: module 'sre_constants' is deprecated
      import sre_constants
    
    it's deprecated since 3.11 with:
    
      python/cpython#91308
    
    The correct replacement for our usage is re.error so use that instead.
    
    (Bitbake rev: a4cd5b0b4b355b7b75fb48c61289700e3e908b2a)
    
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    Signed-off-by: Steve Sakoman <steve@sakoman.com>
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    Signed-off-by: Jeremy A. Puhlman <jpuhlman@mvista.com>
    shr-project pushed a commit to shr-distribution/bitbake that referenced this issue Feb 24, 2023
    As reported by Martin Jansa <Martin.Jansa@gmail.com>:
    
    bitbake/lib/bb/cooker.py:16: DeprecationWarning: module 'sre_constants' is deprecated
      import sre_constants
    
    it's deprecated since 3.11 with:
    
      python/cpython#91308
    
    The correct replacement for our usage is re.error so use that instead.
    
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    shr-project pushed a commit to shr-distribution/bitbake that referenced this issue Feb 24, 2023
    As reported by Martin Jansa <Martin.Jansa@gmail.com>:
    
    bitbake/lib/bb/cooker.py:16: DeprecationWarning: module 'sre_constants' is deprecated
      import sre_constants
    
    it's deprecated since 3.11 with:
    
      python/cpython#91308
    
    The correct replacement for our usage is re.error so use that instead.
    
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    shr-project pushed a commit to shr-distribution/bitbake that referenced this issue Feb 24, 2023
    As reported by Martin Jansa <Martin.Jansa@gmail.com>:
    
    bitbake/lib/bb/cooker.py:16: DeprecationWarning: module 'sre_constants' is deprecated
      import sre_constants
    
    it's deprecated since 3.11 with:
    
      python/cpython#91308
    
    The correct replacement for our usage is re.error so use that instead.
    
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    johanneskoester added a commit to snakemake/snakemake that referenced this issue Mar 1, 2023
    - `sre_constants` is deprecated, the import statement was updated
    accordingly
    
    ### Description
    
    The `re` module sources are being reorganized, see more information:
    python/cpython#91308
    Small fix accommodates changes for all python versions.
    
    Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
    openstack-mirroring pushed a commit to openstack/openstack that referenced this issue Mar 27, 2023
    * Update openstacksdk from branch 'master'
      to 6c76a8cf7b962f9da293e04f7443ca4c805d7428
      - Merge "Remove usage of deprecated `sre_constants` module"
      - Remove usage of deprecated `sre_constants` module
        
        Remove usage of undocumented `sre_constants`
        module deprecated in `python3.11` and use `re` module instead.
        
        As dicsussed in python/cpython#91308, `sre_constants` is undocumented and it's usage was deprecated starting with `python3.11`, where it causes a deprecation warning. https://docs.python.org/3/whatsnew/3.11.html#modules
        
        Importing `sre_constants` for exception handling of invalid regular expressions in not necessary, as the same exception class is exposed through the `re` module.
        
        Change-Id: Ifd9cccf504a5493683152178ebef9183f30b7f4c
    openstack-mirroring pushed a commit to openstack/openstacksdk that referenced this issue Mar 27, 2023
    Remove usage of undocumented `sre_constants`
    module deprecated in `python3.11` and use `re` module instead.
    
    As dicsussed in python/cpython#91308, `sre_constants` is undocumented and it's usage was deprecated starting with `python3.11`, where it causes a deprecation warning. https://docs.python.org/3/whatsnew/3.11.html#modules
    
    Importing `sre_constants` for exception handling of invalid regular expressions in not necessary, as the same exception class is exposed through the `re` module.
    
    Change-Id: Ifd9cccf504a5493683152178ebef9183f30b7f4c
    bjmc added a commit to leapfrogonline/rstr that referenced this issue Apr 18, 2023
    * xeger: avoid deprecated sre_parse module in python-3.11+
    
    The undocumented sre_parse module got deprecated in Python 3.11 which leads
    to build/test failures on Linux distributions making use of this version, e.g.
    Ubuntu 23.04.
    
    See bpo-47152: python/cpython#91308
    
    * tox: enable Python 3.11
    
    ---------
    
    Co-authored-by: Brendan McCollam <brendan@mccoll.am>
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 only security fixes stdlib Python modules in the Lib dir topic-regex
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants