Author eryksun
Recipients Andrew Ushakov, eryksun, serhiy.storchaka, terry.reedy
Date 2021-04-13.09:37:26
Message-id <1618306646.43.0.0976888456209.issue38755@roundup.psfhosted.org>
In-reply-to
Content
> P.S. No problems with Python 3.8.5 and Ubuntu 20.04.2 LTS.

The issue is that the line length is limited to BUFSIZ, so a longer line gets split, and here the split falls inside the UTF-8 sequence b'\xe2\x96\x91'. BUFSIZ is only 512 bytes in Windows. It's 8192 bytes in Linux, so there you need a line that's 16 times longer in order to reproduce the error. For example:

    $ stat -c "%s" test.py 
    8194
    $ python3.9 test.py
    SyntaxError: Non-UTF-8 code starting with '\xe2' in file 
    /home/someone/test.py on line 1, but no encoding declared; see 
    http://python.org/dev/peps/pep-0263/ for details
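
A file of that size can be generated with a short script. This is only a sketch: the layout (one comment line padded with ASCII so that the three-byte sequence begins two bytes before the BUFSIZ boundary) is an assumption chosen to match the 8194-byte size above, and `build_source` and its `bufsiz` parameter are hypothetical names, not anything from the tokenizer.

```python
# Hypothetical helper: build a one-line source file whose only non-ASCII
# character straddles the assumed buffer boundary.
BLOCK = "\u2591".encode("utf-8")  # b'\xe2\x96\x91'


def build_source(bufsiz: int = 8192) -> bytes:
    """Return a single comment line of bufsiz + 2 bytes.

    Layout: '#', ASCII padding, the 3-byte sequence, then a newline.
    The sequence starts at offset bufsiz - 2, so a split after either
    bufsiz - 1 or bufsiz bytes lands inside it.
    """
    padding = b"a" * (bufsiz - 3)
    return b"#" + padding + BLOCK + b"\n"


if __name__ == "__main__":
    # Writes test.py into the current directory (8194 bytes by default).
    with open("test.py", "wb") as f:
        f.write(build_source())
```

Passing `bufsiz=512` should give the corresponding 514-byte file for Windows, under the same layout assumption.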

This has been fixed by a rewrite of the tokenizer (bpo-25643); the PR was recently merged into the main branch for 3.10a7+.

Maybe a minimal backport to keep reading up to "\n" can be applied to 3.8 and 3.9.
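
The idea of reading up to "\n" before decoding can be illustrated in pure Python. This is a sketch of the concept only, not the C tokenizer code: `read_whole_line` is a hypothetical name, and using `readline(bufsiz - 1)` to stand in for an fgets-style bounded read is an assumption.

```python
import io


def read_whole_line(stream, bufsiz):
    """Collect bounded reads until a newline (or EOF) is seen.

    Each readline(bufsiz - 1) call returns at most bufsiz - 1 bytes,
    so a long line arrives in several chunks. Joining the chunks
    before decoding keeps multi-byte UTF-8 sequences intact.
    """
    parts = []
    while True:
        chunk = stream.readline(bufsiz - 1)
        parts.append(chunk)
        if not chunk or chunk.endswith(b"\n"):
            break
    return b"".join(parts)


# Tiny stand-in for the failing file: with bufsiz = 8 the first bounded
# read ends on the lead byte b'\xe2' of the 3-byte sequence.
data = b"#aaaaa" + "\u2591".encode("utf-8") + b"\n"

first_chunk = io.BytesIO(data).readline(8 - 1)
try:
    first_chunk.decode("utf-8")
    chunk_decodes = True
except UnicodeDecodeError:
    chunk_decodes = False

line = read_whole_line(io.BytesIO(data), 8).decode("utf-8")
```

Here decoding the first chunk alone fails, while decoding the joined line succeeds.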
History
Date                 User     Action  Args
2021-04-13 09:37:26  eryksun  set     recipients: + eryksun, terry.reedy, serhiy.storchaka, Andrew Ushakov
2021-04-13 09:37:26  eryksun  set     messageid: <1618306646.43.0.0976888456209.issue38755@roundup.psfhosted.org>
2021-04-13 09:37:26  eryksun  link    issue38755 messages
2021-04-13 09:37:26  eryksun  create