Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different error messages for same error - invalid string prefixes #84427

Closed
lysnikolaou opened this issue Apr 10, 2020 · 41 comments
Closed

Different error messages for same error - invalid string prefixes #84427

lysnikolaou opened this issue Apr 10, 2020 · 41 comments
Labels
3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

@lysnikolaou
Copy link
Contributor

BPO 40246
Nosy @gvanrossum, @terryjreedy, @vstinner, @ericvsmith, @encukou, @serhiy-storchaka, @hroncok, @lysnikolaou, @pablogsal, @aeros
PRs
  • bpo-40246: Report a different error message for invalid string prefixes #19476
  • bpo-40246: Revert reporting of invalid string prefixes #19888
  • [3.9] bpo-40246: Fix test_fstring when run with the old parser #20402
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-05-26.00:12:34.805>
    created_at = <Date 2020-04-10.18:05:58.302>
    labels = ['interpreter-core', 'type-bug', '3.9']
    title = 'Different error messages for same error - invalid string prefixes'
    updated_at = <Date 2020-05-26.00:12:34.804>
    user = 'https://github.com/lysnikolaou'

    bugs.python.org fields:

    activity = <Date 2020-05-26.00:12:34.804>
    actor = 'lys.nikolaou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-05-26.00:12:34.805>
    closer = 'lys.nikolaou'
    components = ['Interpreter Core']
    creation = <Date 2020-04-10.18:05:58.302>
    creator = 'lys.nikolaou'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 40246
    keywords = ['patch']
    message_count = 41.0
    messages = ['366143', '366146', '366164', '366165', '366187', '366197', '366202', '366215', '366260', '367586', '367587', '367588', '367591', '367592', '367597', '367599', '367611', '367628', '367649', '367658', '367660', '367664', '367672', '367689', '367690', '367691', '367693', '367698', '367977', '368038', '368039', '368040', '368045', '368062', '368191', '368919', '368920', '368925', '369077', '369217', '369931']
    nosy_count = 10.0
    nosy_names = ['gvanrossum', 'terry.reedy', 'vstinner', 'eric.smith', 'petr.viktorin', 'serhiy.storchaka', 'hroncok', 'lys.nikolaou', 'pablogsal', 'aeros']
    pr_nums = ['19476', '19888', '20402']
    priority = 'normal'
    resolution = 'wont fix'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue40246'
    versions = ['Python 3.9']

    @lysnikolaou
    Copy link
    Contributor Author

    While testing pegen, we found this out:

    Python 3.9.0a5+ (heads/pegen:502dfb719e, Apr 10 2020, 20:57:05) 
    [GCC 9.2.1 20191008] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> fur''
      File "<stdin>", line 1
        fur''
           ^
    SyntaxError: invalid syntax
    >>> eval("fur''")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 1
        fur''
           ^
    SyntaxError: unexpected EOF while parsing

    @lysnikolaou lysnikolaou added 3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Apr 10, 2020
    @lysnikolaou
    Copy link
    Contributor Author

    After some investigating, I found out that this is happening because when we execute fur'' in the interactive interpreter there is an implicit newline, which means that EOF does not get reached.

    OTOH there is no newline when passing it as a string inside an eval call. After the tokenizer encounters the first quote, it calls tok_nextc twice more, in order to differentiate between a single-quoted and a triple-quoted string, thus reaching EOF and proapgating an EOF error up through the perrdetail struct.

    @gvanrossum
    Copy link
    Member

    So a slightly shorter example uses ru''. This is an error because you can't combine the r prefix and the u prefix (in fact you can't combine anything with the r prefix).

    I declare that this is a bug to report EOF here and the code should *first* check for a valid combination (e.g. fr'' or rf'') and only *then* try to distinguish between single and triple quotes.

    @lysnikolaou
    Copy link
    Contributor Author

    At the moment, if the prefix is invalid, it gets tokenized as a NAME node and the string following is tokenized as a STRING node with no prefix.

    ╰─ echo "ur''" | ./python -m tokenize
    1,0-1,2: NAME 'ur'
    1,2-1,4: STRING "''"
    1,4-1,5: NEWLINE '\n'
    2,0-2,0: ENDMARKER ''

    ╰─ echo "f''" | ./python -m tokenize
    1,0-1,3: STRING "f''"
    1,3-1,4: NEWLINE '\n'
    2,0-2,0: ENDMARKER ''

    @gvanrossum
    Copy link
    Member

    But that is the tokenize.py module. It does not have 100% the same behavior as the builtin tokenizer.c module. In this case it probably doesn't.

    @lysnikolaou
    Copy link
    Contributor Author

    In my understanding of the C code that's what the C tokenizer is doing as well.

    Here's the relevant snippet of the tokenizer (

    if (is_potential_identifier_start(c)) {
    ): When the tokenizer sees a valid identifier start, it goes into a loop that checks for a valid combination of string prefixes. If the combination is valid and it sees a quote directly after that, it goto's to the STRING-handling code. If not, then it breaks out of the loop and returns a NAME node.

    Am I missing something?

    @lysnikolaou
    Copy link
    Contributor Author

    I have working code that checks if there is a quotation mark right after an identifier. Here is an example:

    ╰─ ./python
    Python 3.9.0a5+ (heads/pegen-dirty:502dfb719e, Apr 11 2020, 14:43:12) 
    [GCC 9.2.1 20191008] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> ur''
      File "<stdin>", line 1
        ur''
         ^
    SyntaxError: invalid string prefix

    One observation about this is that it has precedence over an EOL error:

    >>> ur'
      File "<stdin>", line 1
        ur'
         ^
    SyntaxError: invalid string prefix

    Would that be acceptable?

    @gvanrossum
    Copy link
    Member

    yes

    On Sat, Apr 11, 2020 at 04:46 Lysandros Nikolaou <report@bugs.python.org>
    wrote:

    Lysandros Nikolaou <lisandrosnik@gmail.com> added the comment:

    I have working code that checks if there is a quotation mark right after
    an identifier. Here is an example:

    ╰─ ./python
    Python 3.9.0a5+ (heads/pegen-dirty:502dfb719e, Apr 11 2020, 14:43:12)
    [GCC 9.2.1 20191008] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> ur''
    File "<stdin>", line 1
    ur''
    ^
    SyntaxError: invalid string prefix

    One observation about this is that it has precedence over an EOL error:

    >>> ur'
    File "<stdin>", line 1
    ur'
    ^
    SyntaxError: invalid string prefix

    Would that be acceptable?

    ----------


    Python tracker <report@bugs.python.org>
    <https://bugs.python.org/issue40246\>


    --
    --Guido (mobile)

    @pablogsal
    Copy link
    Member

    New changeset 41d5b94 by Lysandros Nikolaou in branch 'master':
    bpo-40246: Report a better error message for invalid string prefixes (GH-19476)
    41d5b94

    @lysnikolaou
    Copy link
    Contributor Author

    On bpo-40334 @vstinner reported that this change broke some code in turtledemo. The code looks like this:

    print(dict(state=clear, bg="#d00" if clear else"#fca"))

    Due to the absence of a space between else and ", the else keyword is now interpreted as an invalid string prefix. Is that acceptable?

    @vstinner
    Copy link
    Member

    I reopen the issue.

    The following code works on Python 3.8, but fails with SyntaxError on Python 3.9.0a6 with the old and the new parser (see bpo-40431):

        clear = "NORMAL"
        print(dict(state=clear, bg="#d00" if clear else"#fca"))

    Well, the code looks like a typo error... but it worked previously.

    Not sure if we should keep the SyntaxError or not. Fixing the code is trivial: see PR 19777 attached to bpo-40431; replace >else"#fca"< with >else "#fca"<.

    Note: I first reported the issue to https://bugs.python.org/issue40334#msg367580

    @vstinner vstinner reopened this Apr 29, 2020
    @vstinner vstinner reopened this Apr 29, 2020
    @pablogsal
    Copy link
    Member

    I personally think the new behaviour is more consistent and more likely to catch errors.

    @vstinner
    Copy link
    Member

    I personally think the new behaviour is more consistent and more likely to catch errors.

    Would it be possible to emit a SyntaxWarning in 3.9 and only emit a SyntaxError in 3.10?

    @gvanrossum
    Copy link
    Member

    Personally I don't think that else"stuff" should become a syntax error. It's no different from other edge cases where for some reason the space seems optional, like 1and 2.

    @vstinner
    Copy link
    Member

    I think that a linter can be very pedantic on such code. My concern is about backward compatibility.

    The >else"#fca"< code was added 6 years ago (commit 8b95d5e) and nobody complained, it "just worked". If we keep the SyntaxError, random projects will be broken, but I don't see much benefits for the users.

    In short, I concur with Guido :-)

    @lysnikolaou
    Copy link
    Contributor Author

    A possible solution would be to only emit a SyntaxError if the NAME directly preceding a STRING token contains one of the valid string prefixes (either one of 'f', 'r', 'u', 'b'). This would still output a nicer error message, but would not break code like the one of the example. What do you think about this?

    @aeros
    Copy link
    Contributor

    aeros commented Apr 29, 2020

    For addressing the backwards compatibility concern, I think we should just convert it to a SyntaxWarning for cases like the above to indicate that it's not really correct syntax, but not harmful enough to justify code breakage. I think it fits the documented description of SyntaxWarning well, which is to address "dubious syntax".

    Lysandros Nikolaou wrote:

    A possible solution would be to only emit a SyntaxError if the NAME directly preceding a STRING token contains one of the valid string prefixes (either one of 'f', 'r', 'u', 'b'). This would still output a nicer error message, but would not break code like the one of the example. What do you think about this?

    That would certainly help to minimize the breakage, so I'd be in favor of that over a SyntaxError for all invalid prefixes. But, I'm not certain that it's additionally harmful if an invalid string prefix proceeds a valid one. Is there any additional harm, other than from a visual perspective?

    @serhiy-storchaka
    Copy link
    Member

    It's no different from other edge cases where for some reason the space seems optional, like 1and 2.

    The parser is not consistent here. 0or[] is an error, while 0and[] and 1or[] are valid. See https://mail.python.org/archives/list/python-dev@python.org/message/D2WPCITHG2LBQAP7DBTC6CY26WQUBAKP/

    A possible solution would be to only emit a SyntaxError if the NAME directly preceding a STRING token contains one of the valid string prefixes (either one of 'f', 'r', 'u', 'b').

    This would not help for if'a'<=x<='z'. And we will get more breakage when add more string prefixes.

    We should either require a whitespace between an identifier and a string literal (with SyntaxWarning in meantime) or allow to omit it.

    @encukou
    Copy link
    Member

    encukou commented Apr 29, 2020

    reportlab reported failures on code like:

    norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')

    Note that or has a r in it.

    @ericvsmith
    Copy link
    Member

    From earlier in this issue: https://bugs.python.org/msg366164

    So a slightly shorter example uses ru''. This is an error because you can't combine the r prefix and the u prefix (in fact you can't combine anything with the r prefix).

    That's not true: 'rf' is a valid prefix: rf'{1}'. And so are all of the wacky combinations rF, Rf, RF, fr, fR, Fr, FR.

    @serhiy-storchaka
    Copy link
    Member

    I think it was a typo. You cannot combine anything with the u prefix.

    @ericvsmith
    Copy link
    Member

    I think you're right, since rb obviously works, too. I just wanted to make sure we're covering all the bases. There's some code in either the stdlib or the test suite where I generate all of the valid prefixes (it's something like 80 prefixes). I can dig it up if anyone needs it.

    @lysnikolaou
    Copy link
    Contributor Author

    What's also possible is to handle keywords at tokenizer level and return a different token type for each one of them (like with async and await), which is currently handled at parser level. This would enable us to allow reserved keywords right before a STRING token, thus covering all the possible broken code cases, but continue disallowing arbitrary NAMEs, which would mean better error-reporting in the case of invalid prefixes.

    @pablogsal
    Copy link
    Member

    What's also possible is to handle keywords at tokenizer level

    Sadly, the tokenizer is unaware of keywords (maybe except async and await because their history as soft keywords) as that distinction usually is up to the parser as the parser is the one who knows about the grammar. Introducing keywords in the tokenizer would couple them too much IMHO apart from other possible problems.

    @gvanrossum
    Copy link
    Member

    If it's all handled by the tokenizer, how come it's different in the new
    parser?

    Anyway, it seems we really have to support what the old parser supported
    here. I don't know exactly what rules it uses. Maybe only look for a string
    quote following a name that can definitely be a string prefix?

    @pablogsal
    Copy link
    Member

    If it's all handled by the tokenizer, how come it's different in the newparser?

    Is not different in the new parser: both parsers have analogous behaviour now:

    ~/github/python/master master
    ❯ ./python
    Python 3.9.0a6+ (heads/master:84724dd239, Apr 29 2020, 20:26:52)
    [GCC 9.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
      File "<stdin>", line 1
        norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
                                             ^
    SyntaxError: invalid string prefix
    >>>
    
    ~/github/python/master master
    ❯ ./python -X oldparser
    Python 3.9.0a6+ (heads/master:84724dd239, Apr 29 2020, 20:26:52)
    [GCC 9.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
      File "<stdin>", line 1
        norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
                                           ^
    SyntaxError: invalid string prefix

    This issue is exclusively due to the changes in #19476 if I understand correctly.

    @pablogsal
    Copy link
    Member

    Indeed, reverting commit 41d5b94 makes it work with both parsers:

    ~/github/python/master master*
    ❯ ./python -X oldparser
    Python 3.9.0a6+ (heads/master-dirty:84724dd239, Apr 29 2020, 20:29:53)
    [GCC 9.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
    >>>
    
    ~/github/python/master master*
    ❯ ./python
    Python 3.9.0a6+ (heads/master-dirty:84724dd239, Apr 29 2020, 20:29:53)
    [GCC 9.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> norm=lambda m: m+(m and(m[-1]!='\n'and'\n'or'')or'\n')
    >>>

    @serhiy-storchaka
    Copy link
    Member

    The original issue was about different error messages in REPL and eval(). But it is not related to string prefixes. We have the same difference without involving strings:

    >>> a b
      File "<stdin>", line 1
        a b
          ^
    SyntaxError: invalid syntax
    >>> eval("a b")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 1
        a b
          ^
    SyntaxError: unexpected EOF while parsing

    I suggest to revert this change and close the issue.

    @hroncok
    Copy link
    Mannequin

    hroncok mannequin commented May 3, 2020

    We also get:

    File "/usr/lib/python3.9/site-packages/dnf/comps.py", line 111
    return'.'.join(lcl)
    ^
    SyntaxError: invalid string prefix

    @pablogsal
    Copy link
    Member

    New changeset 846d8b2 by Lysandros Nikolaou in branch 'master':
    bpo-40246: Revert reporting of invalid string prefixes (GH-19888)
    846d8b2

    @lysnikolaou
    Copy link
    Contributor Author

    The revert is in. Now the question is if we want to take additional action to address the original issue of this.

    @hroncok
    Copy link
    Mannequin

    hroncok mannequin commented May 4, 2020

    I will soon come back with what Fedora package were affected by the problem. That could give some data about how to handle this.

    @hroncok
    Copy link
    Mannequin

    hroncok mannequin commented May 4, 2020

    Not that many:

    cpython itself (fixed via PR)
    demjson (fixed via PR)
    asn1crypto (fixed via PR)
    dnf (fixed via PR)
    freeipa (fixed via PR)

    I gave up sending PRs at this point.

    waf
    weasyprint
    virt-who
    thrift
    salt
    wxpython4
    rosdistro
    mne
    pycairo
    libstoragemgmt (possibly via bundled lsm/external/xmltodict)
    dput-ng
    ddupdate
    aubio (via bundled waf)

    @vstinner
    Copy link
    Member

    vstinner commented May 4, 2020

    Lysandros:

    The revert is in. Now the question is if we want to take additional action to address the original issue of this.

    If someone cares of that, I suggest to open an issue in pylint, pyflakes and similar tools to emit a warning in linters.

    @lysnikolaou
    Copy link
    Contributor Author

    Ok, I'm closing this, after consulting with Guido.

    @vstinner
    Copy link
    Member

    The following change broke test_fstring when using the old parser. I reopen the issue.

    commit 846d8b2 (refs/bisect/bad)
    Author: Lysandros Nikolaou <lisandrosnik@gmail.com>
    Date: Mon May 4 14:32:18 2020 +0300

    bpo-40246: Revert reporting of invalid string prefixes (GH-19888)
    
    Due to backwards compatibility concerns regarding keywords immediately followed by a string without whitespace between them (like in `bg="#d00" if clear else"#fca"`) will fail to parse,
    commit 41d5b94af44e34ac05d4cd57460ed104ccf96628 has to be reverted.
    
    $ ./python -X oldparser -m test -v test_fstring 
    (...)

    ======================================================================
    FAIL: test_invalid_string_prefixes (test.test_fstring.TestCase) (str='BF""')
    ----------------------------------------------------------------------
    File "<string>", line 1
    BF""
    ^
    SyntaxError: invalid syntax

    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/home/vstinner/python/master/Lib/test/test_fstring.py", line 29, in assertAllRaise
        eval(str)
    AssertionError: "unexpected EOF while parsing" does not match "invalid syntax (<string>, line 1)"
    (...)

    @vstinner vstinner reopened this May 15, 2020
    @vstinner
    Copy link
    Member

    The error can be seen on the new cool AMD64 Arch Linux VintageParser 3.x, vintage is the new cool:
    https://buildbot.python.org/all/#builders/648/builds/185

    @gvanrossum
    Copy link
    Member

    We should run the tests with the old parser in at least one build.

    @terryjreedy
    Copy link
    Member

    The >else"#fca"< code was added 6 years ago (commit 8b95d5e)

    That was my typo in turtledemo.__main__ (similar lines before had the space) and I wish that it had been caught then.

    Why did compileall in the test suite not catch this SyntaxError before 3.6.0a6 was released? Does it intentionally skip .__main__ files? Should I add a test_turtledemo file (with checks for the presence of tkinter and turtle)?

    @vstinner
    Copy link
    Member

    "make install" ignores compileall errors no? If I recall correctly, only an error is logged into stdlib, "make install" doesn't fail.

    @pablogsal
    Copy link
    Member

    New changeset 6cb0ad2 by Lysandros Nikolaou in branch '3.9':
    bpo-40246: Fix test_fstring when run with the old parser (GH-20402)
    6cb0ad2

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    9 participants