classification
Title: Preparation for advanced set syntax in regular expressions
Type: enhancement Stage: patch review
Components: Library (Lib), Regular Expressions Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: ezio.melotti, mrabarnett, r.david.murray, rhettinger, serhiy.storchaka
Priority: normal Keywords:

Created on 2017-05-12 08:14 by serhiy.storchaka, last changed 2017-10-05 10:26 by serhiy.storchaka.

Pull Requests
URL Status Linked Edit
PR 1553 open serhiy.storchaka, 2017-05-12 08:20
Messages (2)
msg293532 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-05-12 08:14
Currently the re module supports only simple sets. They can include literal characters, character ranges and some simple character classes, and they support negation. The Unicode standard [1] defines set operations (union, intersection, difference and symmetric difference) and nested sets. Some regular expression engines have implemented these features; for example, the regex module supports all TR18 features except non-nested POSIX character classes.

If we replace the re module with the regex module, or add support for these features to the re module and enable this syntax by default, some existing code will break. A regular expression is very unlikely to contain doubled characters ('--', '||', '&&' or '~~'), but nested sets use just '[', and an unescaped '[' does occur in character sets in existing regular expressions (even the stdlib contains several occurrences).

The proposed patch adds a FutureWarning that is emitted when a set construct that may break ('--', '||', '&&', '~~' or '[') occurs in a regular expression. We need one or two releases with the warning before changing the syntax. The patch also makes re.escape() escape '&' and '~' and fixes several regular expressions in the stdlib.
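For illustration, a minimal sketch of how the warning can be observed, assuming an interpreter that includes this patch. `future_warnings` is a hypothetical helper written for this example, not part of the re module.

```python
import re
import warnings


def future_warnings(pattern):
    """Compile `pattern` and return the FutureWarning messages it triggered.

    Hypothetical helper for illustration only.
    """
    re.purge()  # drop cached patterns so the parser really runs again
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        re.compile(pattern)
    return [str(w.message) for w in caught
            if issubclass(w.category, FutureWarning)]


# An ordinary set versus a set containing a doubled '&'.
print(future_warnings(r"[a-z]"))
print(future_warnings(r"[a-z&&aeiou]"))

# When a set is built from arbitrary characters, re.escape() keeps
# '&' and '~' literal, so the doubled characters are not seen as operators.
chars = "&&~~"
safe = "[" + re.escape(chars) + "]"
print(safe, future_warnings(safe))
```

Code that builds sets from user input is the case where re.escape() matters most, since the author of the pattern cannot see the doubled characters in advance.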

Alternatively, support for the new set syntax could be enabled by a special flag.

I'm not sure that support for set operations and nested sets is necessary. It complicates the syntax of regular expressions (which is already not simple). Currently set operations can be emulated with lookarounds:

[set1||set2] -> (?:[set1]|[set2])
[set1&&set2] -> [set1](?<=[set2]) or (?=[set1])[set2]
[set1--set2] -> [set1](?<![set2]) or [set1](?<=[^set2]) or (?=[set1])[^set2]
[set1~~set2] -> recursively expand [[set1||set2]--[set1&&set2]]
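The emulations above can be sketched as small pattern builders that work with the current re module. The function names are hypothetical, and the arguments are set bodies (the text between the brackets) that must not start with '^'. The symmetric difference is expanded here with a negative lookahead, which is only one possible way to do the recursive expansion.

```python
import re
import string


def union(s1, s2):
    # [set1||set2]
    return f"(?:[{s1}]|[{s2}])"


def intersection(s1, s2):
    # [set1&&set2], lookahead form
    return f"(?=[{s1}])[{s2}]"


def difference(s1, s2):
    # [set1--set2], lookahead form with a negated set
    return f"(?=[{s1}])[^{s2}]"


def symmetric_difference(s1, s2):
    # [[set1||set2]--[set1&&set2]]: reject the intersection, then
    # accept anything in the union.
    return f"(?!{intersection(s1, s2)}){union(s1, s2)}"


def matched(pattern, text):
    """Return the characters of `text` matched by a one-character pattern."""
    return "".join(re.findall(pattern, text))


letters = string.ascii_lowercase
print(matched(union("a-c", "x-z"), letters))
print(matched(intersection("a-m", "h-z"), letters))
print(matched(difference("a-m", "h-z"), letters))
print(matched(symmetric_difference("a-m", "h-z"), letters))
```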

[1] http://unicode.org/reports/tr18/#Subtraction_and_Intersection
msg303757 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-10-05 10:26
The warning for '[' is now emitted only at the start of a set. This significantly decreases the breakage of other code. I think we can get by without implicit union of nested sets, as in [_[0-9][:Latin:]]. This can be written as [_||[0-9]||[:Latin:]].
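A short sketch of the narrowed rule, again assuming an interpreter with the patch applied. `nested_set_warnings` is a hypothetical helper; the example only checks where a literal '[' triggers the warning and confirms that the current meaning of the pattern is unchanged.

```python
import re
import warnings


def nested_set_warnings(pattern):
    """Return FutureWarning messages raised while compiling `pattern`.

    Hypothetical helper for illustration only.
    """
    re.purge()  # make sure the pattern is parsed, not taken from the cache
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        re.compile(pattern)
    return [str(w.message) for w in caught
            if issubclass(w.category, FutureWarning)]


# '[' directly after the opening bracket looks like a nested set.
print(nested_set_warnings(r"[[0-9]"))

# '[' later in the set is left alone.
print(nested_set_warnings(r"[0-9[]"))

# Either way, '[' is still an ordinary member of the set today.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    print(bool(re.fullmatch(r"[[0-9]", "[")))
```

Moving the '[' away from the start of the set, or escaping it as '\[', keeps the current meaning and avoids the warning.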
History
Date User Action Args
2017-10-05 10:26:52 serhiy.storchaka set messages: + msg303757
2017-05-12 08:20:39 serhiy.storchaka set pull_requests: + pull_request1650
2017-05-12 08:14:13 serhiy.storchaka create