classification
Title: re does not honor matching trailing multiple periods
Type: Stage: resolved
Components: Library (Lib) Versions: Python 3.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: bsaner, eric.smith, serhiy.storchaka
Priority: normal Keywords:

Created on 2019-07-14 19:27 by bsaner, last changed 2019-07-14 22:35 by eric.smith. This issue is now closed.

Files
File name Uploaded Description
example.py bsaner, 2019-07-14 19:27 exhibition of bug
Messages (8)
msg347933 - (view) Author: brent s. (bsaner) Date: 2019-07-14 19:27
(Sorry for the title; not quite sure how to summarize this)

SO! Have I got an interesting one for you.

ISSUE:
In release 3.7.3 (and possibly later), given a string such as 'a.b.', a pattern such as '\.*$' will successfully *match* any number of trailing periods. HOWEVER, when attempting to substitute those with actual character(s), the re module chokes. See the attached example.py.

NOTES:
- This *is a regression* from 2.6.6, 2.7.16, and 3.6.7 (other releases were not tested). This behaviour does not occur on those versions.
msg347934 - (view) Author: brent s. (bsaner) Date: 2019-07-14 19:29
Sorry- by "chokes", I mean "substitutes in multiple replacements".
msg347935 - (view) Author: brent s. (bsaner) Date: 2019-07-14 19:34
WORKAROUND:

Obviously, str.rstrip('.') still works, but this is of course quite inflexible compared to a regex pattern.
msg347936 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2019-07-14 19:39
'\.' is an invalid escape sequence. Could you try it with a raw string?

Also, it's not really clear to me what you're seeing, vs. what you expect to see. For one example that you think is incorrect, could you show what you get vs. what you expect to get? And, if that's different on different python versions, could you show what each version does?
msg347937 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-07-14 20:00
This change was intentional and documented. It fixed an old bug in the Python implementation of RE and removed the discrepancy with other RE engines.

The pattern r'\.*$' matches not only a sequence of dots at the end of the line, but also an empty string at the end of the line. If this is not what you want, use r'\.+$'.
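
A minimal sketch of the difference between the two patterns, assuming Python 3.7 or later. The helper name `show` and the replacement character are illustrative and are not taken from example.py:

```python
import re

def show(pattern, text, repl="X"):
    """Return text with every match of pattern replaced by repl."""
    return re.sub(pattern, repl, text)

# r'\.*$' matches the run of trailing dots AND the empty string that is
# left at the very end, so Python 3.7+ makes two replacements here.
star = show(r'\.*$', 'a.b..')

# r'\.+$' requires at least one dot, so there is no empty match to replace.
plus = show(r'\.+$', 'a.b..')

print(star)  # a.bXX on Python 3.7+
print(plus)  # a.bX
```

Note that r'\.+$' does nothing to a string with no trailing dot, so it is not a drop-in replacement when the goal is to append a dot as well.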
msg347942 - (view) Author: brent s. (bsaner) Date: 2019-07-14 21:17
"'\.' is an invalid escape sequence. Could you try it with a raw string?"

Well, a valid regex escape, but right. Point taken. I am under the impression, however, that given the value in ptrn (in example.py) is already a string, it should be interpreted as a raw string in the re.compile(), no? Because otherwise it'd be a dickens of a time getting a regex pattern that's dynamic/programmatically assigned to a name, since there's no raw(), str.raw(), or str.encode('raw').

They both evaluate to the same, for what it's worth:

>>> repr('\.+$')
"'\\\\.+$'"
>>> repr(r'\.+$')
"'\\\\.+$'"
>>> ptrn = '\.+$'
>>> repr(ptrn)
"'\\\\.+$'"

So.

"Also, it's not really clear to me what you're seeing, vs. what you expect to see. For one example that you think is incorrect, could you show what you get vs. what you expect to get? And, if that's different on different python versions, could you show what each version does?"

The comment from Serhiy clarifies that this was indeed something that was changed. You can see the difference pretty easily by just calling the example.py between python2 and python3.

--

"This change was intentional and documented. It fixed old bug in the Python implementation of RE and removed the discrepancy with other RE engines."

Okay, so I'm not going insane. That's good. Do you have the bug ID it fixes and where it's documented? Do you know which other RE engines were doing this? Because GNU sed, for instance, does not behave like this - it behaves as the "pre-bugfix" behaviour did:

$ echo 'a.b.' | sed -e 's/\.*$/./g'
a.b.
$ echo 'a.b...' | sed -e 's/\.*$/./g'
a.b.
$ echo 'a.b' | sed -e 's/\.*$/./g'
a.b.

"The pattern r'\.*$' matches not only a sequence of dots at the end of the line, but also an empty string at the end of the line. If this is not what you want, use r'\.+$'."

Right; it's to guarantee there is one and only one period at the end of a line, whether there is no period, one period, or many periods in the original string (think e.g. enforcing RFC1025-compatible FQDNs, for instance).
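
One possible way to get "exactly one trailing period" on both old and new versions, offered as a sketch and not as the recommended fix from this issue: limit re.sub to a single replacement with count=1, or avoid the regex entirely. Both function names below are hypothetical.

```python
import re

def one_trailing_dot(name):
    """Return name ending in exactly one period (hypothetical helper)."""
    # count=1 stops after the leftmost match, so the extra empty match at
    # the very end of the string is never substituted.
    return re.sub(r'\.*$', '.', name, count=1)

def one_trailing_dot_plain(name):
    """Same result without a regex: strip all trailing dots, add one."""
    return name.rstrip('.') + '.'

for s in ('a.b', 'a.b.', 'a.b..', 'a.b...'):
    print(s, '->', one_trailing_dot(s), one_trailing_dot_plain(s))
```

With count=1 the result does not depend on how adjacent empty matches are handled, which is the part of re.sub that changed in 3.7.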
msg347943 - (view) Author: brent s. (bsaner) Date: 2019-07-14 21:31
Oh for pete's sake. I wish I could edit comments.

Eric-

To make it clear:

*****

VERSION: 2.7.16 (default, Mar 11 2019, 18:59:25) 
[GCC 8.2.1 20181127]
PATTERN: \.*$

BEFORE: a.b
WITHOUT: a.b
DUMMY: a.bX
AFTER: a.b.
RSTRIP: a.b
==
BEFORE: a.b.
WITHOUT: a.b
DUMMY: a.bX
AFTER: a.b.
RSTRIP: a.b
==
BEFORE: a.b..
WITHOUT: a.b
DUMMY: a.bX
AFTER: a.b.
RSTRIP: a.b
==
BEFORE: a.b...
WITHOUT: a.b
DUMMY: a.bX
AFTER: a.b.
RSTRIP: a.b
==

*****

VERSION: 3.7.3 (default, Jun 24 2019, 04:54:02) 
[GCC 9.1.0]
PATTERN: \.*$

BEFORE: a.b
WITHOUT: a.b
DUMMY: a.bX
AFTER: a.b.
RSTRIP: a.b
==
BEFORE: a.b.
WITHOUT: a.b
DUMMY: a.bXX
AFTER: a.b..
RSTRIP: a.b
==
BEFORE: a.b..
WITHOUT: a.b
DUMMY: a.bXX
AFTER: a.b..
RSTRIP: a.b
==
BEFORE: a.b...
WITHOUT: a.b
DUMMY: a.bXX
AFTER: a.b..
RSTRIP: a.b
==


Note the differences between versions for cases a.b., a.b.., and a.b... ("BEFORE: ..." lines). Compare their "AFTER" and "DUMMY" lines between python2 and python3.



Serhiy-

Apologies; I meant RFC1035; I typo'd that. But as shown above, the difference is pretty distinct (and inconsistent with GNU sed behaviour).
msg347944 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2019-07-14 22:35
Sorry. '\.' will be invalid in the future. I got ahead of myself.

$ python3 -Werror -q
>>> '\.'
  File "<stdin>", line 1
SyntaxError: invalid escape sequence \.


Not that it would have affected your issue, so I apologize for the red herring. But "switch to raw strings when you have backslashes in a regex" is always my first reaction.
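
A short sketch of the raw-string point, under the assumption that the interpreter is Python 3.6 or later (where the unrecognized escape produces a warning). "Raw" is only a way of spelling a literal in source code, so a pattern built at run time never needs converting; the variable names here are illustrative.

```python
import re
import warnings

# A raw-string literal and a doubled-backslash literal build the same str.
escaped = '\\.+$'
raw = r'\.+$'
same = (escaped == raw)

# A pattern assembled at run time needs no raw() conversion; re.escape()
# quotes literal text that should not be treated as regex syntax.
dynamic = re.escape('.') + '+$'

# Compiling source that contains the unescaped form '\.' emits a warning
# (DeprecationWarning or SyntaxWarning, depending on the Python version).
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    value = eval(compile(r"'\.+$'", '<demo>', 'eval'))

print(same, dynamic == raw, value == raw)
print([w.category.__name__ for w in caught])
```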
History
Date User Action Args
2019-07-14 22:35:13 eric.smith set messages: + msg347944
2019-07-14 21:31:19 bsaner set messages: + msg347943
2019-07-14 21:17:55 bsaner set messages: + msg347942
2019-07-14 20:00:42 serhiy.storchaka set status: open -> closed; nosy: + serhiy.storchaka; messages: + msg347937; resolution: not a bug; stage: resolved
2019-07-14 19:39:26 eric.smith set nosy: + eric.smith; messages: + msg347936
2019-07-14 19:34:49 bsaner set messages: + msg347935
2019-07-14 19:29:11 bsaner set messages: + msg347934
2019-07-14 19:27:30 bsaner create