classification
Title: Checking if two regexes are equal should test if they are functionally equivalent
Type: enhancement Stage: resolved
Components: Regular Expressions Versions: Python 3.9
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: ammar2, boris, ezio.melotti, mrabarnett, ronaldoussoren, serhiy.storchaka
Priority: normal Keywords:

Created on 2019-11-01 06:22 by boris, last changed 2020-12-07 09:27 by serhiy.storchaka. This issue is now closed.

Messages (6)
msg355791 - (view) Author: Борис Верховский (boris) * Date: 2019-11-01 06:22
re.compile('([-_.a-zA-Z0-9]+)') == re.compile(r'([-\w.]+)') 

should return True because those are the same regex (\w is a-z, A-Z, 0-9 and the underscore). 

If you want to check if two regexes are identical you would compare the original strings, before re.compile.
msg355792 - (view) Author: Ammar Askar (ammar2) * (Python committer) Date: 2019-11-01 07:07
The notion of equivalent regular expressions does exist but is way more complicated than the simple example you described.

For example:

r"a|b" is the same as r"[ab]",
r"^aa*$" is the same as r"^a+$"

Implementing this properly would probably require a significant amount of effort, and just implementing simple equivalence for character classes would be really surprising.

Could you explain the use case and motivation behind this request?
msg355793 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-11-01 07:20
These two regexes are not the same.

>>> re.compile('([-_.a-zA-Z0-9]+)').match('ä')
>>> re.compile(r'([-\w.]+)').match('ä')
<re.Match object; span=(0, 1), match='ä'>

As Ammar said checking that two regexes always matches the same is very difficult problem. It is the problem of determining if two programs are the same.
msg355795 - (view) Author: Борис Верховский (boris) * Date: 2019-11-01 08:16
I saw two Python regexes, one derived from a regex in the standard library. There was a comment saying that they're interchangeable and I wanted to check if they were actually the same without having to read what all the regex symbols mean. Plus a true comparison would be more correct.

Besides that I don't have a real use case. There's a few people asking about doing this if you Google for "check if two regexes are equivalent". Some for Python specifically.

@serhiy I read what \w meant from the first Google result and got it wrong. Sounds like a good argument for why Python should be able to do this for me :)

Regexes are not Turing complete, so not quite. If I understand it correctly, comparing them is somewhere between comparing two graphs and comparing two programs (which is generally impossible). Theoretically it's a decidable problem, but the extra logic that Python's implementation has (for dealing with unicode and whatever) makes it hard, but it should still be theoretically possible, unless Python's regexes are somehow Turing complete.
msg355797 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-11-01 08:43
Python's regular expressions are not actually regular. Lookarounds and groups make things more complex. Even it it is possible to build an ambiguous graph, its size and the time of building can be too large for practical use. Dealing with unicode is only minor part of the problem.

I suggest you first to implement this feature as a third-party module on PyPI. After this we can consider including it in the stdlib.
msg382604 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2020-12-06 19:57
https://math.stackexchange.com/questions/46975/how-to-prove-two-regular-expressions-are-identical-in-mathematical-way contains some references to proofs that checking if two regular expressions are equivalent is PSPACE-complete.  Another answer in that question mentions that its at least exponential in both space and time.  Either means that comparing two REs gets infeasible very fast, which makes this unsuited for an equality checking.

I propose closing this issue.
History
Date User Action Args
2020-12-07 09:27:22serhiy.storchakasetstatus: open -> closed
resolution: rejected
stage: resolved
2020-12-06 19:57:26ronaldoussorensetnosy: + ronaldoussoren
messages: + msg382604
2019-11-01 08:43:53serhiy.storchakasetmessages: + msg355797
2019-11-01 08:16:40borissetmessages: + msg355795
2019-11-01 07:20:02serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg355793
2019-11-01 07:07:07ammar2setnosy: + ammar2
messages: + msg355792
2019-11-01 06:22:35boriscreate