Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve some re error messages using regex for hints #66560

Closed
serhiy-storchaka opened this issue Sep 8, 2014 · 27 comments
Closed

Improve some re error messages using regex for hints #66560

serhiy-storchaka opened this issue Sep 8, 2014 · 27 comments
Assignees
Labels
stdlib Python modules in the Lib dir topic-regex type-feature A feature request or enhancement

Comments

@serhiy-storchaka
Copy link
Member

BPO 22364
Nosy @rhettinger, @terryjreedy, @pitrou, @ezio-melotti, @stevendaprano, @serhiy-storchaka
Dependencies
  • bpo-22578: Add additional attributes to re.error
  • bpo-22581: Other mentions of the buffer protocol
  • Files
  • re_errors_regex.patch
  • re_errors.patch
  • re_errors_diff.txt
  • re_errors_2.patch
  • regex_errors.diff
  • re_errors_3.patch
  • regex_errors2.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2015-03-25.19:05:44.319>
    created_at = <Date 2014-09-08.12:22:56.243>
    labels = ['expert-regex', 'type-feature', 'library']
    title = 'Improve some re error messages using regex for hints'
    updated_at = <Date 2015-03-25.19:05:44.318>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2015-03-25.19:05:44.318>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2015-03-25.19:05:44.319>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)', 'Regular Expressions']
    creation = <Date 2014-09-08.12:22:56.243>
    creator = 'serhiy.storchaka'
    dependencies = ['22578', '22581']
    files = ['37167', '38035', '38036', '38080', '38171', '38229', '38230']
    hgrepos = []
    issue_num = 22364
    keywords = ['patch']
    message_count = 27.0
    messages = ['226575', '226576', '226599', '226635', '226641', '226790', '226847', '227065', '227084', '227119', '227130', '227132', '228599', '230464', '230481', '230965', '231057', '235532', '235534', '235678', '236188', '236201', '236257', '236549', '236551', '236954', '239279']
    nosy_count = 9.0
    nosy_names = ['rhettinger', 'terry.reedy', 'pitrou', 'ezio.melotti', 'mrabarnett', 'steven.daprano', 'BreamoreBoy', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue22364'
    versions = ['Python 3.5']

    @serhiy-storchaka
    Copy link
    Member Author

    In some cases standard re module and third-party regex modules raise exceptions with different error messages.

    1. re.match(re.compile('.'), 'A', re.I)

    re: Cannot process flags argument with a compiled pattern
    regex: can't process flags argument with a compiled pattern

    1. re.compile('(?P<foo_123')

    re: unterminated name
    regex: missing >

    1. re.compile('(?P<foo_123>a)(?P=foo_123')

    re: unterminated name
    regex: missing )

    1. regex.sub('(?P<a>x)', r'\g<a', 'xx')

    re: unterminated group name
    regex: missing >

    1. re.sub('(?P<a>x)', r'\g<', 'xx')

    re: unterminated group name
    regex: bad group name

    1. re.sub('(?P<a>x)', r'\g<a a>', 'xx')

    re: bad character in group name
    regex: bad group name

    1. re.sub('(?P<a>x)', r'\g<-1>', 'xx')

    re: negative group number
    regex: bad group name

    1. re.compile('(?P<foo_123>a)(?P=!)')

    re: bad character in backref group name '!'
    regex: bad group name

    1. re.sub('(?P<a>x)', r'\g', 'xx')

    re: missing group name
    regex: missing <

    1. re.compile('a\\')
      re.sub('x', '\\', 'x')

    re: bogus escape (end of line)
    regex: bad escape

    1. re.compile(r'\1')

    re: bogus escape: '\1'
    regex: unknown group

    1. re.compile('[a-')

    re: unexpected end of regular expression
    regex: bad set

    1. re.sub(b'.', 'b', b'c')

    re: expected bytes, bytearray, or an object with the buffer interface, str found
    regex: expected bytes instance, str found

    1. re.compile(r'\w', re.UNICODE | re.ASCII)

    re: ASCII and UNICODE flags are incompatible
    regex: ASCII, LOCALE and UNICODE flags are mutually incompatible

    1. re.compile('(abc')

    re: unbalanced parenthesis
    regex: missing )

    1. re.compile('abc)')

    re: unbalanced parenthesis
    regex: trailing characters in pattern

    1. re.compile(r'((.)\1+)')

    re: cannot refer to open group
    regex: can't refer to an open group

    Looks as in one case re messages are better, and in other cases regex messages are better. In any case it would be good to unify error messages in both modules.

    @serhiy-storchaka serhiy-storchaka added stdlib Python modules in the Lib dir topic-regex type-feature A feature request or enhancement labels Sep 8, 2014
    @serhiy-storchaka
    Copy link
    Member Author

    1. re.compile(r'.???')

    re: multiple repeat
    regex: nothing to repeat at position 3

    @stevendaprano
    Copy link
    Member

    I'm dubious about this issue. It suggests that the wording of the exceptions is part of the API of the two modules.

    If the idea is just to copy the best error messages from one module to the other, then I guess there is no harm. But if the idea is to guarantee to keep the two modules' messages in sync, then I think it is unnecessary and harmful.

    @serhiy-storchaka
    Copy link
    Member Author

    Yes, the idea of this issue is to enhance the re module (and the regex module if Matthew will) be picking the best error messages (or writing a new one).

    @serhiy-storchaka serhiy-storchaka self-assigned this Sep 9, 2014
    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Sep 9, 2014

    re: Cannot process flags argument with a compiled pattern
    regex: can't process flags argument with a compiled pattern

    Error messages usually start with a lowercase letter, and I think that all the other ones in the re module do.

    By the way, which is preferred, "cannot" or "can't"? The regex module always uses "can't", but re module uses "cannot" except for "TypeError: can't use a bytes pattern on a string-like object", I think.

    Also, you said that one of the re module's messages was better, but didn't say which! Did you mean this one?

    re: expected bytes, bytearray, or an object with the buffer interface, str found
    regex: expected bytes instance, str found

    @serhiy-storchaka
    Copy link
    Member Author

    By the way, which is preferred, "cannot" or "can't"? The regex module always
    uses "can't", but re module uses "cannot" except for "TypeError: can't use
    a bytes pattern on a string-like object", I think.

    It's interesting question. Grepping in CPython sources got results:

    Cannot 210
    cannot 865
    Can't 216
    can't 796

    Lowercase wins uppercase with score 4:1 and short and long forms are
    equivalent.

    I left the decision to English speakers.

    Also, you said that one of the re module's messages was better, but didn't
    say which! Did you mean this one?
    > re: expected bytes, bytearray, or an object with the buffer interface,
    > str found
    > regex: expected bytes instance, str found

    Both are not good. re variant is too verbose, but it is more correct.

    May be 6, 7, 8, 10, 11, 16, 18 are better in re.

    @terryjreedy
    Copy link
    Member

    I prefer "cannot" for error messages. "Can't" is an informal version of "cannot", used in speech, dialog representing speech, and 'informal' writing. It looks wrong to me in this context.

    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented Sep 18, 2014

    How can anything that's in the stdlib be unified with something that's not in the stdlib and currently has no prospects of getting in the stdlib?

    @serhiy-storchaka
    Copy link
    Member Author

    The regex module is potential candidate for replacement of the re module.

    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented Sep 19, 2014

    The key word is "potential". I do not believe that any changes should be made to the re module until such time as there is a fully approved PEP for the regex module and that work has actually started on getting it into the stdlib. Surely backward compatibility also comes into this?

    @terryjreedy
    Copy link
    Member

    Steven and Mark are correct that a tracker patch cannot change a 3rd party module. On the other hand, we are free to improve error messages in new versions. And we are free to borrow ideas from 3rd part modules. I changed the title accordingly.

    (Back compatibility comes into play in not making message enhancements in bugfix releases even though message details are not part of the documented API. People who write code that depends on those details, and doctexts need not so depend, should expect to revise for new versions. I expect that some of our re tests would need to be changed.)

    Re and regex are a bit special in that regex is the only re replacement (that I know of) and is (almost) a drop-in replacement. So some people *are*, on their own, replacing re with regex by installing regex (easy with pip) and adding 'import regex as re' at the top of their code.

    Serhiy suggested either picking the best or writing a new one, I think a new one combining both would be best in many of the cases. As a user, I like "name missing terminal '>'" for #2 (is there an adjective for a name in this context?) and for #4, "group name missing terminal '>'". (Note that we usually quote literals, as in #8.) For #12, I would like a parallel construction "set expression missing terminal ']'" if that is possible. But the currently vague re message "unexpected end of regular expression" might be raised as a point where the specific information is lost and only the general version is correct.

    As for #14, either UNICODE and LOCALE *are* compatible (for re) or this is buggy.
    >>> import re
    >>> re.compile(r'\w', re.UNICODE | re.LOCALE)
    re.compile('\\w', re.LOCALE|re.UNICODE)

    @terryjreedy terryjreedy changed the title Unify error messages of re and regex Improve some re error messages using regex for hints Sep 19, 2014
    @stevendaprano
    Copy link
    Member

    On Fri, Sep 19, 2014 at 08:41:57PM +0000, Mark Lawrence wrote:

    I do not believe that any changes should be made to the re module
    until such time as there is a fully approved PEP [....]

    Why is this so controversial? We're not talking about functional changes
    to the re module, we're talking about improving error messages. Firstly,
    the actual wording of error messages are not part of the API and are
    subject to change without notice. Secondly, nobody is talking about
    keeping the two modules syncronised on an on-going basis. This is just
    to improve the re error messages using regex as inspiration.

    @stevendaprano stevendaprano changed the title Improve some re error messages using regex for hints Unify error messages of re and regex Sep 19, 2014
    @serhiy-storchaka
    Copy link
    Member Author

    As for #14, either UNICODE and LOCALE *are* compatible (for re) or this is buggy.

    This is buggy (bpo-22407).

    @serhiy-storchaka serhiy-storchaka changed the title Unify error messages of re and regex Improve some re error messages using regex for hints Oct 5, 2014
    @ezio-melotti
    Copy link
    Member

    +1 on the idea.

    @rhettinger
    Copy link
    Contributor

    +1

    @serhiy-storchaka
    Copy link
    Member Author

    Here is a patch which makes re error messages match regex. It doesn't look to me that all these changes are enhancements.

    @terryjreedy
    Copy link
    Member

    I already said we should either stick with what we have if better (and gave examples, including sticking with 'cannot') or possibly combine the best of both if we can improve on both. 13 should use 'bytes-like' (already changed?). There is no review button.

    @serhiy-storchaka
    Copy link
    Member Author

    Here is a patch which unify and improves re error messages. Added tests for all parsing errors. Now error message always points on the start of affected component, i.e. on the start of bad escape, group name or unterminated subpattern.

    @serhiy-storchaka
    Copy link
    Member Author

    re_errors_diff.txt contains differences for all tested error messages.

    @serhiy-storchaka
    Copy link
    Member Author

    Updated patch addresses Ezio's comments.

    @serhiy-storchaka
    Copy link
    Member Author

    Here is a patch for regex which makes some error messages be the same as in re with re_errors_2.patch. You could apply it to regex if new error messages look better than old error messages. Otherwise we could change re error messages to match regex, or discuss better variants.

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Feb 18, 2015

    Some error messages use the indefinite article:

    "expected a bytes-like object, %.200s found"
    "cannot use a bytes pattern on a string-like object"
    "cannot use a string pattern on a bytes-like object"
    

    but others don't:

    "expected string instance, %.200s found"
    "expected str instance, %.200s found"
    

    Messages tend to be abbreviated, so I think that it would be better to just omit the article.

    I don't think that the error message "bad repeat interval" is an improvement (Why is it "bad"? What is an "interval"?). I think that saying that the min is greater than the max is clearer.

    @serhiy-storchaka
    Copy link
    Member Author

    Messages tend to be abbreviated, so I think that it would be better to just
    omit the article.

    I agree, but this is came from standard error messages which are not
    consistent. I opened a thread on Python-Dev.

    "expected a bytes-like object" and "expected str instance" are standard error
    messages raised in bytes.join and str.join, not in re. We could change them
    though.

    I don't think that the error message "bad repeat interval" is an improvement
    (Why is it "bad"? What is an "interval"?). I think that saying that the min
    is greater than the max is clearer.

    Agree. I'll change this in re. What message is better in case of overflow: "the
    repetition number is too large" (in re) or "repeat count too big" (in regex)?

    @serhiy-storchaka
    Copy link
    Member Author

    Updated patch borrows the error message about min > max from regex.

    @serhiy-storchaka
    Copy link
    Member Author

    Removed changing TypeError errors and "bad repeat interval" error in updated regex patch.

    @serhiy-storchaka
    Copy link
    Member Author

    Could anyone please make a review? This patch is a prerequisite of other patches.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 25, 2015

    New changeset 068365acbe73 by Serhiy Storchaka in branch 'default':
    Issue bpo-22364: Improved some re error messages using regex for hints.
    https://hg.python.org/cpython/rev/068365acbe73

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir topic-regex type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants