Misleading/inaccurate documentation about unknown escape sequences in regular expressions #72636

Closed
lelit mannequin opened this issue Oct 15, 2016 · 14 comments
Labels
3.7 (EOL) end of life docs Documentation in the Doc dir topic-regex type-feature A feature request or enhancement

Comments

@lelit
Mannequin

lelit mannequin commented Oct 15, 2016

BPO 28450
Nosy @warsaw, @nedbat, @ned-deily, @ezio-melotti, @Rosuav, @serhiy-storchaka, @lelit, @Vgr255
PRs
  • bpo-28450: Fix and improve the documentation for unknown escapes in RE. #11920
  • [3.7] bpo-28450: Fix and improve the documentation for unknown escapes in RE. (GH-11920). #12029
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2019-02-25.16:30:13.925>
    created_at = <Date 2016-10-15.11:00:13.128>
    labels = ['expert-regex', 'type-feature', '3.7', 'docs']
    title = 'Misleading/inaccurate documentation about unknown escape sequences in regular expressions'
    updated_at = <Date 2019-02-25.16:30:13.925>
    user = 'https://github.com/lelit'

    bugs.python.org fields:

    activity = <Date 2019-02-25.16:30:13.925>
    actor = 'serhiy.storchaka'
    assignee = 'docs@python'
    closed = True
    closed_date = <Date 2019-02-25.16:30:13.925>
    closer = 'serhiy.storchaka'
    components = ['Documentation', 'Regular Expressions']
    creation = <Date 2016-10-15.11:00:13.128>
    creator = 'lelit'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 28450
    keywords = ['patch']
    message_count = 14.0
    messages = ['278716', '278749', '281499', '281500', '281501', '281502', '281504', '281512', '281943', '281947', '282573', '306364', '336535', '336539']
    nosy_count = 10.0
    nosy_names = ['barry', 'nedbat', 'ned.deily', 'ezio.melotti', 'mrabarnett', 'docs@python', 'Rosuav', 'serhiy.storchaka', 'lelit', 'abarry']
    pr_nums = ['11920', '12029']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue28450'
    versions = ['Python 3.5', 'Python 3.6', 'Python 3.7']

    @lelit
    Mannequin Author

    lelit mannequin commented Oct 15, 2016

    Python 3.6+ is stricter about escape sequences in string literals.

    The documentation needs some improvement to clarify the change. For example, https://docs.python.org/3.6/library/re.html#re.sub first says that “Unknown escapes such as \& are left alone”, then, in the “Changed in” section below, states that “[in Py3.6] Unknown escapes consisting of '\' and an ASCII letter now are errors”.

    When such changes are made, the documentation usually describes the “new”/“current” behaviour, and the history section mentions when and how some detail changed.

    See this thread for details: https://mail.python.org/pipermail/python-list/2016-October/715462.html
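
A minimal sketch of the two behaviours the report contrasts, assuming Python 3.7 or later (on 3.6 the second call only emits a DeprecationWarning). The strings and variable names are illustrative, not taken from the thread.

```python
import re

# A backslash followed by a non-letter with no special meaning (such as \&)
# is copied into the result unchanged.
kept = re.sub(r"a", r"\&", "abc")
print(kept)

# A backslash followed by an ASCII letter that is not a known escape
# (here \q) is rejected with re.error on Python 3.7+.
try:
    re.sub(r"a", r"\q", "abc")
    rejected = False
except re.error as exc:
    rejected = True
    print("rejected:", exc)
```

So both sentences quoted from the documentation can be true at once, but only if "unknown escapes" in the first one is read as excluding ASCII letters.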

    @lelit lelit mannequin assigned docspython Oct 15, 2016
    @lelit lelit mannequin added the docs Documentation in the Doc dir label Oct 15, 2016
    @serhiy-storchaka
    Member

    Thank you for your report, Lele. Agreed, the documentation looks misleading.

    Do you want to propose clearer wording?

    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) end of life label Oct 16, 2016
    @serhiy-storchaka serhiy-storchaka changed the title Misleading/inaccurate documentation about unknown escape sequences Misleading/inaccurate documentation about unknown escape sequences in regular expressions Oct 16, 2016
    @serhiy-storchaka serhiy-storchaka added type-feature A feature request or enhancement topic-regex labels Oct 16, 2016
    @serhiy-storchaka
    Member

    Maybe just remove the phrase "Unknown escapes such as \& are left alone"?

    @warsaw
    Member

    warsaw commented Nov 22, 2016

    I disagree that the documentation is at fault. This is known to break existing code, e.g. http://bugs.python.org/msg281496

    I think it's not correct to change the documentation but leave the error-raising behavior for 3.6, because the deprecation was never documented in 3.5, so this will look like a gratuitous regression. See bpo-27030 for reference.

    I also question whether it makes sense for such escapes to be illegal in the repl argument of re.sub(). I could understand this limitation for the pattern argument, but that's not what's causing the error.

    @serhiy-storchaka
    Member

    The deprecation was documented in 3.5.

    https://docs.python.org/3.5/library/re.html#re.sub

    Deprecated since version 3.5, will be removed in version 3.6: Unknown escapes consist of '\' and ASCII letter now raise a deprecation warning and will be forbidden in Python 3.6.

    @serhiy-storchaka
    Member

    The reason for disallowing some undefined escapes is the same as in pattern strings: this would allow us to introduce new special escape sequences. For example:

    • \N{...} for named character escape.
    • Perl and extended PCRE use \L and \U for lower- and upper-casing the replacement. \U is already used for another purpose, but you get the idea.

    Of course, the need for new special escape sequences in a template string is much less than in a pattern string.

    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Nov 22, 2016

    @barry: repl already supports some escapes, e.g. \g<name> for named groups, although not \xXX et al, so deprecating unknown escapes like in the pattern makes sense to me.

    BTW, the regex module already supports \xXX, \N{XXX}, etc.
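
A short sketch of the asymmetry mentioned above, assuming Python 3.7 or later and the standard `re` module (not the third-party `regex` module): `\g<name>` is accepted in a replacement template, while `\xXX` is understood in a pattern but not in a template. The group name `word` and the sample strings are illustrative.

```python
import re

# \g<name> is a documented escape in replacement templates.
wrapped = re.sub(r"(?P<word>\w+)", r"[\g<word>]", "spam")

# \x41 is valid in a pattern, where it matches "A" ...
pattern_ok = re.fullmatch(r"\x41", "A") is not None

# ... but in a replacement template \x is an unknown escape made of a
# backslash and an ASCII letter, so it is rejected.
try:
    re.sub(r"a", r"\x41", "a")
    repl_rejected = False
except re.error:
    repl_rejected = True
```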

    @warsaw
    Member

    warsaw commented Nov 22, 2016

    On Nov 22, 2016, at 07:28 PM, Serhiy Storchaka wrote:

    > The reason for disallowing some undefined escapes is the same as in pattern
    > strings: this would allow us to introduce new special escape sequences.

    I'll note that technically speaking, you can still introduce new escapes for
    repl without breaking the documented contract. All the docs say is that
    "unknown escapes such as \& are left alone", but that doesn't list which
    escapes are unknown. So if new escapes are added in Python 3.7, and they are
    transformed in repl, that would be allowed.

    I'll also note that not *all* unknown sequences are rejected now, only
    backslashes followed by an ASCII letter. So \& is still probably left alone,
    while \s is now rejected. That does add to the confusion, although the
    deprecation note in the re.sub() documentation does document the new behavior
    correctly.

    On Nov 22, 2016, at 07:55 PM, R. David Murray wrote:

    > There is still the argument that we shouldn't break 2.7 compatibility
    > unnecessarily until 2.7 is out of maintenance. That is: warnings are good,
    > removals are bad. (I haven't read through this issue, so I may be off base.)

    This is also a reasonable argument, but not one I've thought about since I'm
    using Python 2 only rarely these days.

    On Nov 22, 2016, at 07:34 PM, Serhiy Storchaka wrote:

    > If you insist I could revert converting warnings to errors (only in
    > replacement string or all?) in 3.6.

    pattern is a regular expression string so it already follows the syntax as
    described in $6.2.1 Regular Expression Syntax. But I think a reading of that
    section (and the "special sequences" bit that follows) could also argue that
    unknown escapes shouldn't throw an error.

    > But I think they should be left as errors in 3.7. The earlier we make
    > undefined escapes errors, the earlier we can define new special escape
    > sequences without confusing users. It is bad if the escape sequence is
    > valid in two Python versions but has different meaning.

    Perhaps so, but I do think this is a tricky question from a compatibility
    point of view. One possible option, although it's late in the cycle, would
    be to introduce a new flag so the user could tell re exactly what behavior
    they want. The default would have to be backward compatible (i.e. leave
    unknown sequences alone), but there could be say an re.STRICTESCAPES flag that
    would cause the error to be thrown.
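
The `re.STRICTESCAPES` flag above is only a suggestion made in this comment and does not exist in `re`. As a hedged sketch of what code can do without such a flag, assuming Python 3.7 or later: a replacement string that must be inserted literally can be passed through a callable, or have its backslashes doubled, so the template parser never sees an unknown escape. The sample path and variable names are illustrative.

```python
import re

literal = r"C:\some\path"  # contains \s and \p, both unknown escapes

# Option 1: a callable replacement's return value is used verbatim;
# no template escape processing is applied to it.
out1 = re.sub(r"PATH", lambda m: literal, "dir=PATH")

# Option 2: double each backslash so the template parser turns every
# "\\" pair back into a single literal backslash.
out2 = re.sub(r"PATH", literal.replace("\\", r"\\"), "dir=PATH")

print(out1)
print(out2)
```

Both forms are workarounds on the caller's side; neither changes how `re` treats unknown escapes in templates.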

    @ned-deily
    Member

    Where do we stand on this issue? At the moment, 3.6.0 is on track to be released as is.

    @serhiy-storchaka
    Member

    I think we should discuss this on Python-Dev.

    @ned-deily
    Member

    Note that 1b162d6e3d01 in bpo-27030 (for 3.6.0rc1) has changed the behavior for re.sub replacement templates to produce a deprecation warning in 3.6 while still being treated as an error in 3.7.

    @serhiy-storchaka
    Member

    Barry, could you please improve the documentation about unknown escape sequences in regular expressions? My skills are not enough for this.

    @serhiy-storchaka
    Member

    New changeset a180b00 by Serhiy Storchaka in branch 'master':
    bpo-28450: Fix and improve the documentation for unknown escapes in RE. (GH-11920)
    a180b00

    @serhiy-storchaka
    Member

    New changeset 95fc8e6 by Serhiy Storchaka in branch '3.7':
    [3.7] bpo-28450: Fix and improve the documentation for unknown escapes in RE. (GH-11920). (GH-12029)
    95fc8e6

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022