Author steve
Recipients ezio.melotti, steve, vstinner
Date 2018-03-02.23:32:49
Message-id <1520033569.84.0.467229070634.issue32987@psf.upfronthosting.co.za>
In-reply-to
Content
Here is an example involving the Unicode character MIDDLE DOT (·, U+00B7). The line

ab·cd = 7

is valid Python 3 code and is accepted by the CPython interpreter. However, tokenize.py rejects it: it reports the middle dot as an error token. Here is an example you can run to see this:

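    import tokenize
    from io import BytesIO

    # Source line with U+00B7 MIDDLE DOT inside the identifier
    test = 'ab·cd = 7'.encode('utf-8')

    # tokenize.tokenize() takes a readline callable that returns bytes
    for tok in tokenize.tokenize(BytesIO(test).readline):
        print(tok)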
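
To make the offending token easier to spot, the same run can be reduced to token type names and token text. This is only a sketch: `token_summary` is a hypothetical helper name, not part of the tokenize module, and the exact output for the middle dot may depend on the Python version in use.

```python
import tokenize
from io import BytesIO


def token_summary(source):
    """Return (token type name, token text) pairs for a source string.

    Hypothetical helper for illustration; not part of the tokenize module.
    """
    readline = BytesIO(source.encode("utf-8")).readline
    return [(tokenize.tok_name[tok.type], tok.string)
            for tok in tokenize.tokenize(readline)]


if __name__ == "__main__":
    # "\u00b7" is MIDDLE DOT, so this is the line from the report.
    for type_name, text in token_summary("ab\u00b7cd = 7"):
        print(type_name, repr(text))
```

If the behaviour described above is present, one of the printed rows should have the type name ERRORTOKEN with the middle dot as its text.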

For reference, the official definition of identifiers is at

https://docs.python.org/3.6/reference/lexical_analysis.html#identifiers

and details about MIDDLE DOT are at

https://www.unicode.org/Public/10.0.0/ucd/PropList.txt

MIDDLE DOT has the "Other_ID_Continue" property, so I think the interpreter is behaving correctly (i.e. consistent with the documented spec), while tokenize.py is wrong.
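
The interpreter side of this claim can be checked without tokenize at all. The sketch below uses only standard-library calls. The final `\w+` check is included because MIDDLE DOT is punctuation rather than a letter or digit; whether tokenize.py relies on such a pattern for names is an assumption here, not something stated in this report.

```python
import re
import unicodedata

name = "ab\u00b7cd"  # the identifier from the report; \u00b7 is MIDDLE DOT

# str.isidentifier() applies the language's definition of identifiers.
print(name.isidentifier())          # True

# Executing the assignment binds the name, as the interpreter does.
namespace = {}
exec(name + " = 7", namespace)
print(namespace[name])              # 7

# MIDDLE DOT is in general category Po (other punctuation).
print(unicodedata.name("\u00b7"), unicodedata.category("\u00b7"))

# Assumption for illustration: a name pattern built on \w+ cannot match it.
print(re.fullmatch(r"\w+", name))   # None
```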
History
Date                 User   Action  Args
2018-03-02 23:32:49  steve  set     recipients: + steve, vstinner, ezio.melotti
2018-03-02 23:32:49  steve  set     messageid: <1520033569.84.0.467229070634.issue32987@psf.upfronthosting.co.za>
2018-03-02 23:32:49  steve  link    issue32987 messages
2018-03-02 23:32:49  steve  create