Preparation for advanced set syntax in regular expressions #74534

Closed
serhiy-storchaka opened this issue May 12, 2017 · 8 comments
Assignees
Labels
3.7 (EOL) end of life stdlib Python modules in the Lib dir topic-regex type-feature A feature request or enhancement

Comments

@serhiy-storchaka
Member

BPO 30349
Nosy @rhettinger, @ezio-melotti, @bitdancer, @serhiy-storchaka, @timgraham, @pombredanne
PRs
  • bpo-30349: Raise FutureWarning for nested sets and set operations #1553
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2017-11-16.10:39:01.603>
    created_at = <Date 2017-05-12.08:14:13.370>
    labels = ['expert-regex', 'type-feature', 'library', '3.7']
    title = 'Preparation for advanced set syntax in regular expressions'
    updated_at = <Date 2021-09-21.09:51:26.713>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2021-09-21.09:51:26.713>
    actor = 'pombredanne'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2017-11-16.10:39:01.603>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)', 'Regular Expressions']
    creation = <Date 2017-05-12.08:14:13.370>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 30349
    keywords = []
    message_count = 8.0
    messages = ['293532', '303757', '306349', '311682', '311684', '311688', '402299', '402303']
    nosy_count = 7.0
    nosy_names = ['rhettinger', 'ezio.melotti', 'mrabarnett', 'r.david.murray', 'serhiy.storchaka', 'Tim.Graham', 'pombredanne']
    pr_nums = ['1553']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue30349'
    versions = ['Python 3.7']

    @serhiy-storchaka
    Member Author

    Currently the re module supports only simple sets. They can include literal characters, character ranges, and some simple character classes, and they support negation. The Unicode standard [1] defines set operations (union, intersection, difference and symmetric difference) and nested sets. Some regular expression engines implement these features; for example, the regex module supports all TR18 features except non-nested POSIX character classes.

    If we replace the re module with the regex module, or add support for these features to the re module and enable this syntax by default, some code will break. It is very unlikely that a regular expression contains doubled characters ('--', '||', '&&' or '~~'), but nested sets use just '[', and an unescaped '[' does occur in character sets in regular expressions (even the stdlib contains several occurrences).

    The proposed patch adds a FutureWarning that is emitted when a possibly breaking set construct ('--', '||', '&&', '~~' or '[') occurs in a regular expression. We need one or two releases with a warning before changing the syntax. The patch also makes re.escape() escape '&' and '~' and fixes several regular expressions in the stdlib.

    Alternatively, support for the new set syntax could be enabled by a special flag.

    I'm not sure that support for set operations and nested sets is necessary. It complicates the syntax of regular expressions (which is already not simple). Currently set operations can be emulated with lookarounds:

    [set1||set2] -- (?:[set1]|[set2])
    [set1&&set2] -- (?=[set1])[set2]
    [set1--set2] -- (?![set2])[set1] or (?=[set1])[^set2]
    [set1~~set2] -- recursively expand [[set1||set2]--[set1&&set2]]
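The lookaround emulations can be checked with the current re module. This is a sketch under assumptions: the helper names are hypothetical, and each argument is the body of a simple set (what goes between the brackets), not a complete pattern.

```python
import re


def union(set1, set2):
    # Either set matches the next character.
    return f"(?:[{set1}]|[{set2}])"


def intersection(set1, set2):
    # The next character must be in set1 (lookahead) and in set2.
    return f"(?=[{set1}])[{set2}]"


def difference(set1, set2):
    # The next character must not be in set2 (negative lookahead)
    # but must be in set1.
    return f"(?![{set2}])[{set1}]"


def symmetric_difference(set1, set2):
    # In exactly one of the two sets.
    return f"(?:{difference(set1, set2)}|{difference(set2, set1)})"


text = "a1b2c3xyz"
print(re.findall(union("a-c", "1-3"), text))
print(re.findall(intersection("a-z", "a-c1-3"), text))
print(re.findall(difference("a-z", "a-c"), text))
print(re.findall(symmetric_difference("a-c1", "1-3"), text))
```

Each emulation consumes exactly one character, like a set does, because the lookahead itself has zero width.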

    [1] http://unicode.org/reports/tr18/#Subtraction_and_Intersection

    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) end of life label May 12, 2017
    @serhiy-storchaka serhiy-storchaka self-assigned this May 12, 2017
    @serhiy-storchaka serhiy-storchaka added stdlib Python modules in the Lib dir topic-regex type-feature A feature request or enhancement labels May 12, 2017
    @serhiy-storchaka
    Member Author

    Made the warning for '[' be emitted only at the start of a set. This significantly decreases the breakage of other code. I think we can get by without implicit union of nested sets, like in [[0-9][:Latin:]]. This can be written as [||[0-9]||[:Latin:]].
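A small sketch of the narrowed rule, assuming Python 3.7 or later: a '[' directly after the opening bracket warns, while a '[' later in the set does not. The helper name `nested_set_warnings` is hypothetical.

```python
import re
import warnings


def nested_set_warnings(pattern):
    """Return the 'Possible nested set' warnings raised while compiling.

    Hypothetical helper for illustration; not part of the re module.
    """
    re.purge()  # a cached pattern would not be parsed again
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        re.compile(pattern)
    return [str(w.message) for w in caught
            if issubclass(w.category, FutureWarning)
            and str(w.message).startswith("Possible nested set")]


print(nested_set_warnings("[[a]"))  # '[' right after the opening bracket
print(nested_set_warnings("[a[]"))  # '[' later in the set
```

Both patterns match the same two characters, so moving a literal '[' away from the start of the set (or escaping it) is one way to silence the warning.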

    @serhiy-storchaka
    Member Author

    New changeset 05cb728 by Serhiy Storchaka in branch 'master':
    bpo-30349: Raise FutureWarning for nested sets and set operations (#1553)
    05cb728

    @timgraham
    Mannequin

    timgraham mannequin commented Feb 5, 2018

    It might be worth adding part of the problematic regex to the warning message. For Django's tests, I see an error like "FutureWarning: Possible nested set at position 17 return re.compile(res).match". It took some effort to track down the source.

    A partial traceback is:
      File "/home/tim/code/django/django/core/management/commands/loaddata.py", line 247, in find_fixtures
        for candidate in glob.iglob(glob.escape(path) + '*'):
      File "/home/tim/code/cpython/Lib/glob.py", line 72, in _iglob
        for name in glob_in_dir(dirname, basename, dironly):
      File "/home/tim/code/cpython/Lib/glob.py", line 83, in _glob1
        return fnmatch.filter(names, pattern)
      File "/home/tim/code/cpython/Lib/fnmatch.py", line 52, in filter
        match = _compile_pattern(pat)
      File "/home/tim/code/cpython/Lib/fnmatch.py", line 46, in _compile_pattern
        return re.compile(res).match
      File "/home/tim/code/cpython/Lib/re.py", line 240, in compile
        return _compile(pattern, flags)
      File "/home/tim/code/cpython/Lib/re.py", line 292, in _compile
        p = sre_compile.compile(pattern, flags)
      File "/home/tim/code/cpython/Lib/sre_compile.py", line 764, in compile
        p = sre_parse.parse(p, flags)
      File "/home/tim/code/cpython/Lib/sre_parse.py", line 930, in parse
        p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
      File "/home/tim/code/cpython/Lib/sre_parse.py", line 426, in _parse_sub
        not nested and not items))
      File "/home/tim/code/cpython/Lib/sre_parse.py", line 816, in _parse
        p = _parse_sub(source, state, sub_verbose, nested + 1)
      File "/home/tim/code/cpython/Lib/sre_parse.py", line 426, in _parse_sub
        not nested and not items))
      File "/home/tim/code/cpython/Lib/sre_parse.py", line 524, in _parse
        FutureWarning, stacklevel=nested + 6
    FutureWarning: Possible nested set at position 17

    As an aside, I'm not sure how to fix the warning in Django. It comes from the test added in django/django@98df288 where a path like 'tests/fixtures/fixtures/fixture_with[special]chars' is run through glob.escape() which creates 'tests/fixtures/fixtures/fixture_with[[]special]chars'.
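The path handling described above can be reproduced with the stdlib alone. This is a sketch, not Django's code: the ".json" suffix is an assumption for the demo, and whether compiling the translated pattern still emits the warning depends on the Python version in use.

```python
import fnmatch
import glob

path = "tests/fixtures/fixtures/fixture_with[special]chars"

# glob.escape() wraps each special character ('*', '?', '[') in brackets,
# so the literal '[' becomes the one-character set '[[]'.
escaped = glob.escape(path)
print(escaped)

# The glob pattern is translated to a regular expression before matching;
# the '[[]' set is where a '[' ends up at the start of a regex set.
pattern = escaped + "*"
print(fnmatch.translate(pattern))

# The escaped pattern still matches the original file name.
print(fnmatch.fnmatchcase(path + ".json", pattern))
```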

    @serhiy-storchaka
    Member Author

    Good catch! fnmatch.translate() can produce a pattern which emits a warning when compiled. Could you please open a separate issue for this?

    @timgraham
    Mannequin

    timgraham mannequin commented Feb 5, 2018

    Okay, I created bpo-32775.

    @pombredanne
    Mannequin

    pombredanne mannequin commented Sep 21, 2021

    FWIW, this warning is annoying because it is hard to fix in the case where the regexes are sourced from data: the warning message does not include the regex at fault; it should, otherwise the warning is noisy and ineffective IMHO.

    @pombredanne
    Mannequin

    pombredanne mannequin commented Sep 21, 2021

    Sorry, my comment was at best nonsensical gibberish!

    I meant to say that this warning message should include the actual regex at fault. Otherwise it is hard to fix when the regex in question comes from some data structure like a list: the line number where the warning occurs is then not enough to fix the issue, and the code needs to be instrumented first to catch the warning, which is rather heavy-handed just to handle a warning.
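The kind of instrumentation meant here could look roughly like the following sketch. The function name `compile_reporting` and the sample patterns are hypothetical; the approach records warnings per pattern so that each message can be paired with the regex that caused it.

```python
import re
import warnings


def compile_reporting(patterns):
    """Compile each pattern and return (compiled, problems).

    `problems` pairs each offending pattern with its FutureWarning text.
    Hypothetical helper for illustration; not part of the re module.
    """
    compiled, problems = [], []
    re.purge()  # cached patterns would skip parsing and hide the warning
    for pattern in patterns:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            compiled.append(re.compile(pattern))
        for w in caught:
            if issubclass(w.category, FutureWarning):
                problems.append((pattern, str(w.message)))
    return compiled, problems


# Hypothetical patterns, standing in for regexes loaded from data.
patterns = [r"\d+", r"[[]x", r"[a-z]+"]
compiled, problems = compile_reporting(patterns)
for pattern, message in problems:
    print(f"{message}: {pattern!r}")
```

Recording per pattern, rather than around the whole loop, is what ties a warning back to its regex; the warning text alone only carries a position.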

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022