classification
Title: Misleading/inaccurate documentation about unknown escape sequences in regular expressions
Type: enhancement
Stage: resolved
Components: Documentation, Regular Expressions
Versions: Python 3.7, Python 3.6, Python 3.5
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: Rosuav, barry, docs@python, ebarry, ezio.melotti, lelit, mrabarnett, ned.deily, nedbat, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2016-10-15 11:00 by lelit, last changed 2019-02-25 16:30 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 11920 merged serhiy.storchaka, 2019-02-18 15:17
PR 12029 merged serhiy.storchaka, 2019-02-25 16:18
Messages (14)
msg278716 - (view) Author: Lele Gaifax (lelit) * Date: 2016-10-15 11:00
Python 3.6+ is stricter about escape sequences in string literals.

The documentation needs some improvement to clarify the change: for example, https://docs.python.org/3.6/library/re.html#re.sub first says that “Unknown escapes such as \& are left alone” and then, in the “Changed in” section below, states that “[in Py3.6] Unknown escapes consisting of '\' and an ASCII letter now are errors”.

When such changes are made, the documentation usually describes the “new”/“current” behaviour, and the history section mentions when and how some detail changed.

See this thread for details: https://mail.python.org/pipermail/python-list/2016-October/715462.html
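
The two statements quoted from the re.sub() documentation can be checked side by side. The following is only a minimal sketch, assuming Python 3.7 or later (where the letter case is an error rather than a warning); the sample text and the variable names are made up for illustration.

```python
import re

# A backslash followed by a non-letter is not a recognised template
# escape, so it is copied into the result unchanged.
kept = re.sub(r"and", r"\&", "salt and pepper")
print(kept)  # salt \& pepper

# A backslash followed by an ASCII letter that has no defined meaning
# in a replacement template is rejected.
try:
    re.sub(r"and", r"\q", "salt and pepper")
except re.error as exc:
    error_message = str(exc)
else:
    error_message = None
print(error_message)
```

So both sentences can be true at once, but only because "unknown escapes" means something narrower in the first sentence than in the second.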
msg278749 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-10-16 08:04
Thank you for your report Lele. Agreed, the documentation looks misleading.

Do you want to provide clearer wording?
msg281499 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-22 19:08
Maybe just remove the phrase "Unknown escapes such as \& are left alone"?
msg281500 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2016-11-22 19:10
I disagree that the documentation is at fault.  This is known to break existing code, e.g. http://bugs.python.org/msg281496

I think it's not correct to change the documentation but leave the error-raising behavior for 3.6 because the deprecation was never documented in 3.5 so this will look like a gratuitous regression.  issue27030 for reference.

I also question whether it makes sense for such escapes to be illegal in the repl argument of re.sub().  I could understand for this limitation in the pattern argument, but that's not what's causing the error.
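
The pattern/repl distinction raised here can be illustrated with a small sketch, assuming Python 3.7 or later. `\d` is a defined class in the pattern but has no meaning in the replacement template; the sample string is invented, and the two ways of getting a literal `\d` into the output are workarounds, not something proposed in this issue.

```python
import re

# In the pattern, \d is a defined special sequence (any digit).
digits_masked = re.sub(r"\d", "#", "room 42")

# In the replacement template, \d has no defined meaning and is rejected.
try:
    re.sub(r"\d", r"\d", "room 42")
    repl_rejected = False
except re.error:
    repl_rejected = True

# Doubling the backslash in the template yields a literal backslash.
literal_by_escaping = re.sub(r"\d", r"\\d", "room 42")

# A callable repl bypasses template processing: its return value is
# inserted as-is.
literal_by_callable = re.sub(r"\d", lambda m: r"\d", "room 42")

print(digits_masked, repl_rejected, literal_by_escaping, literal_by_callable)
```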
msg281501 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-22 19:16
The deprecation was documented in 3.5.

https://docs.python.org/3.5/library/re.html#re.sub

Deprecated since version 3.5, will be removed in version 3.6: Unknown escapes consist of '\' and ASCII letter now raise a deprecation warning and will be forbidden in Python 3.6.
msg281502 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-22 19:28
The reason for disallowing some undefined escapes is the same as in pattern strings: this would allow us to introduce new special escape sequences. For example:

* \N{...} for named character escape.
* Perl and extended PCRE use \L and \U for lower- and upper-casing the replacement. \U is already used for another purpose, but you get the idea.

Of course, the need for new special escape sequences in a template string is much less than in a pattern string.
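
As an illustration of why reserving letter escapes matters: to my knowledge, `\N{...}` was later accepted in pattern strings (from Python 3.8). The sketch below assumes Python 3.8 or later; the sample text is arbitrary.

```python
import re

# \N{...} in a *pattern* names a single Unicode character (Python 3.8+).
match = re.search(r"\N{DEGREE SIGN}", "21\u00b0C")
found = match.group(0) if match else None
print(found == "\u00b0")  # True
```

Had `\N` been silently treated as a literal in earlier versions, the same pattern would have changed meaning between releases.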
msg281504 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2016-11-22 19:42
@Barry: repl already supports some escapes, e.g. \g<name> for named groups, although not \xXX et al, so deprecating unknown escapes like in the pattern makes sense to me.

BTW, the regex module already supports \xXX, \N{XXX}, etc.
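
A short sketch of the point about repl, assuming Python 3.7 or later: `\g<name>` is a defined template escape, while `\xXX` is not. The group names and input are invented for the example.

```python
import re

# \g<name> refers to a named group in the replacement template.
swapped = re.sub(
    r"(?P<first>\w+) (?P<second>\w+)",
    r"\g<second> \g<first>",
    "hello world",
)
print(swapped)  # world hello

# \xXX is understood in patterns but not in the re module's templates.
try:
    re.sub(r"x", r"\x41", "x")
    hex_accepted = True
except re.error:
    hex_accepted = False
print(hex_accepted)  # False
```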
msg281512 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2016-11-22 20:28
On Nov 22, 2016, at 07:28 PM, Serhiy Storchaka wrote:

>The reason for disallowing some undefined escapes is the same as in pattern
>strings: this would allow us to introduce new special escape sequences.

I'll note that technically speaking, you can still introduce new escapes for
repl without breaking the documented contract.  All the docs say is that
"unknown escapes such as \& are left alone", but that doesn't list what are
unknown escapes.  So if new escapes are added in Python 3.7, and they are
transformed in repl, that would be allowed.

I'll also note that not *all* unknown sequences are rejected now, only
backslashes followed by an ASCII letter.  So \& is still probably left alone,
while \s is now rejected.  That does add to the confusion, although the
deprecation note in the re.sub() documentation does document the new behavior
correctly.

On Nov 22, 2016, at 07:55 PM, R. David Murray wrote:

>There is still the argument that we shouldn't break 2.7 compatibility
>unnecessarily until 2.7 is out of maintenance.  That is: warnings are good,
>removals are bad.  (I haven't read through this issue, so I may be off base.)

This is also a reasonable argument, but not one I've thought about since I'm
using Python 2 only rarely these days.

On Nov 22, 2016, at 07:34 PM, Serhiy Storchaka wrote:

>If you insist I could revert converting warnings to errors (only in
>replacement string or all?) in 3.6.

pattern is a regular expression string so it already follows the syntax as
described in §6.2.1 Regular Expression Syntax.  But I think a reading of that
section (and the "special sequences" bit that follows) could also argue that
unknown escapes shouldn't throw an error.

>But I think they should be left as errors in 3.7. The earlier we make undefined
>escapes errors, the earlier we can define new special escape sequences
>without confusing users. It is bad if the escape sequence is valid in two
>Python versions but has different meaning.

Perhaps so, but I do think this is a tricky question from a compatibility
point of view.  One possible option, although it's late in the cycle, would
be to introduce a new flag so the user could tell re exactly what behavior
they want.  The default would have to be backward compatible (i.e. leave
unknown sequences alone), but there could be say an re.STRICTESCAPES flag that
would cause the error to be thrown.
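
No re.STRICTESCAPES flag exists in the re module. Purely as a sketch of the "leave unknown sequences alone" default described above, a user-level helper could pre-process the template before passing it to re.sub(). The name `lenient_template` and the set of known letters are assumptions made for this example, checked only against the template escapes I am aware of (`\a \b \f \n \r \t \v \g`); it assumes Python 3.7 or later.

```python
import re

# Letters that already have a meaning after a backslash in a template
# (assumed list; \g introduces a group reference).
_KNOWN_LETTERS = set("abfnrtvg")


def lenient_template(repl):
    """Hypothetical helper: keep unknown letter escapes as literal text."""

    def fix(m):
        ch = m.group(1)
        if ch.isascii() and ch.isalpha() and ch not in _KNOWN_LETTERS:
            # Double the backslash so the template yields '\' + ch literally.
            return "\\\\" + ch
        return m.group(0)

    # Scanning backslash pairs left to right keeps an already escaped
    # backslash (two backslashes in a row) intact.
    return re.sub(r"\\(.)", fix, repl, flags=re.DOTALL)


print(re.sub("x", lenient_template(r"\s"), "axb"))  # a\sb
```

Omitting the helper gives the strict behaviour, which is roughly the choice the proposed flag would have offered.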
msg281943 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2016-11-29 03:55
Where do we stand on this issue?  At the moment, 3.6.0 is on track to be released as is.
msg281947 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-29 05:02
I think we should discuss this on Python-Dev.
msg282573 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2016-12-06 22:30
Note that 1b162d6e3d01 in Issue27030 (for 3.6.0rc1) has changed the behavior for re.sub replacement templates to produce a deprecation warning in 3.6 while still being treated as an error in 3.7.
msg306364 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-16 15:01
Barry, could you please improve the documentation about unknown escape sequences in regular expressions? My skills are not enough for this.
msg336535 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-02-25 15:58
New changeset a180b007d96fe68b32f11dec720fbd0cd5b6758a by Serhiy Storchaka in branch 'master':
bpo-28450: Fix and improve the documentation for unknown escapes in RE. (GH-11920)
https://github.com/python/cpython/commit/a180b007d96fe68b32f11dec720fbd0cd5b6758a
msg336539 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-02-25 16:28
New changeset 95fc8e687c487ecf97f4b1b98dfc0c05e3c9cbff by Serhiy Storchaka in branch '3.7':
[3.7] bpo-28450: Fix and improve the documentation for unknown escapes in RE. (GH-11920). (GH-12029)
https://github.com/python/cpython/commit/95fc8e687c487ecf97f4b1b98dfc0c05e3c9cbff
History
Date User Action Args
2019-02-25 16:30:13 serhiy.storchaka set status: open -> closed
resolution: fixed
stage: patch review -> resolved
2019-02-25 16:28:55 serhiy.storchaka set messages: + msg336539
2019-02-25 16:18:04 serhiy.storchaka set pull_requests: + pull_request12060
2019-02-25 15:58:33 serhiy.storchaka set messages: + msg336535
2019-02-18 15:17:11 serhiy.storchaka set keywords: + patch
stage: needs patch -> patch review
pull_requests: + pull_request11945
2019-02-14 16:47:40 serhiy.storchaka link issue35846 superseder
2017-11-16 15:01:37 serhiy.storchaka set messages: + msg306364
2016-12-06 22:30:30 ned.deily set messages: + msg282573
2016-11-29 05:02:15 serhiy.storchaka set messages: + msg281947
2016-11-29 03:55:31 ned.deily set nosy: + ned.deily
messages: + msg281943
2016-11-22 21:01:50 ebarry set nosy: + ebarry
2016-11-22 20:28:47 barry set messages: + msg281512
2016-11-22 19:42:41 mrabarnett set messages: + msg281504
2016-11-22 19:28:56 serhiy.storchaka set messages: + msg281502
2016-11-22 19:16:45 serhiy.storchaka set messages: + msg281501
2016-11-22 19:10:59 barry set nosy: + barry
messages: + msg281500
2016-11-22 19:08:06 serhiy.storchaka set messages: + msg281499
2016-10-16 08:12:09 serhiy.storchaka set nosy: + ezio.melotti
components: + Regular Expressions
2016-10-16 08:04:14 serhiy.storchaka set versions: + Python 3.5, Python 3.7
type: enhancement
nosy: + nedbat, serhiy.storchaka, Rosuav, mrabarnett
title: Misleading/inaccurate documentation about unknown escape sequences -> Misleading/inaccurate documentation about unknown escape sequences in regular expressions
messages: + msg278749
stage: needs patch
2016-10-15 11:00:13 lelit create