classification
Title: Long unicode string causes SyntaxError: Non-UTF-8 code starting with '\xe2' in file ..., but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Type: behavior Stage: needs patch
Components: Interpreter Core Versions: Python 3.9, Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Andrew Ushakov, eryksun, serhiy.storchaka, terry.reedy, vstinner
Priority: normal Keywords:

Created on 2019-11-09 12:26 by Andrew Ushakov, last changed 2022-04-11 14:59 by admin.

Files
File name Uploaded Description Edit
tst112.py Andrew Ushakov, 2019-11-09 12:26
Messages (7)
msg356298 - (view) Author: Andrew Ushakov (Andrew Ushakov) Date: 2019-11-09 12:26
A not very long Unicode comment (#, a space, and then 170 or more repetitions of the UTF-8 character ░, i.e. b'\xe2\x96\x91'.decode())

# ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

causes a syntax error:

SyntaxError: Non-UTF-8 code starting with '\xe2' in file tst112.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

The Python file is attached. A second example is similar, but here a Unicode string of similar length is used as an argument to the print function.

print('\n░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░')

A similar issue, Issue34979, was submitted one year ago...
msg356709 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2019-11-15 19:45
I think that this should be closed as a duplicate of #34979 and this example posted there, with the OS and python version included.

On Windows, with 3.7, 3.8.0, and master, neither the posted comment, the one in the file, nor the initial statement in #34979 gives the SyntaxError.
msg356715 - (view) Author: Andrew Ushakov (Andrew Ushakov) Date: 2019-11-15 20:16
> On Windows, with 3.7, 3.8.0, and master, neither the posted comment, the one in the file, nor the initial statement in #34979 gives the SyntaxError.

Just tried again on my corporate laptop with the downloaded file from this site:

Microsoft Windows [Version 10.0.16299.1451]
(c) 2017 Microsoft Corporation. All rights reserved.

D:\Downloads>py
Python 3.8.0 (tags/v3.8.0:fa919fd, Oct 14 2019, 19:37:50) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> quit()

D:\Downloads>py tst112.py
  File "tst112.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xe2' in file tst112.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

d:\Downloads>py -3.7
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> quit()

d:\Downloads>py -3.7 tst112.py
  File "tst112.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xe2' in file tst112.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
msg390931 - (view) Author: Andrew Ushakov (Andrew Ushakov) Date: 2021-04-13 07:09
Just tested again:

D:\Downloads>py
Python 3.9.4 (tags/v3.9.4:1f2e308, Apr  4 2021, 13:27:16) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> quit()

D:\Downloads>py tst112.py
SyntaxError: Non-UTF-8 code starting with '\xe2' in file D:\Downloads\tst112.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

P.S. No problems with Python 3.8.5 and Ubuntu 20.04.2 LTS.
msg390942 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-04-13 09:37
> P.S. No problems with Python 3.8.5 and Ubuntu 20.04.2 LTS.

The issue is that the line length is limited to BUFSIZ, which ends up splitting the UTF-8 sequence b'\xe2\x96\x91'. BUFSIZ is only 512 bytes in Windows. It's 8192 bytes in Linux, in which case you need a line that's 16 times longer in order to reproduce the error. For example:

    $ stat -c "%s" test.py 
    8194
    $ python3.9 test.py
    SyntaxError: Non-UTF-8 code starting with '\xe2' in file 
    /home/someone/test.py on line 1, but no encoding declared; see 
    http://python.org/dev/peps/pep-0263/ for details
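
To illustrate the arithmetic behind the 170-repetition threshold, here is a minimal Python sketch. It is not CPython's tokenizer code: it only models a reader that takes at most `chunk_size` bytes of a line at a time and validates each chunk as UTF-8 on its own. The helper names (`comment_line`, `first_bad_offset`) and the chunk size of 511 (BUFSIZ - 1, assuming an fgets-style read into a 512-byte buffer) are assumptions made for this example.

```python
# Illustrative model only; this is NOT the CPython tokenizer.
# Assumption: the reader validates each fixed-size chunk of a line separately.

BLOCK = "░"  # U+2591, encoded in UTF-8 as b'\xe2\x96\x91'


def comment_line(repetitions: int) -> bytes:
    """Build the reported line: '#', a space, N block characters, newline."""
    return ("# " + BLOCK * repetitions + "\n").encode("utf-8")


def first_bad_offset(line: bytes, chunk_size: int):
    """Return the byte offset (within the line) of the first byte that cannot
    be decoded when each chunk is decoded on its own, or None if all chunks
    are valid UTF-8."""
    for offset in range(0, len(line), chunk_size):
        chunk = line[offset:offset + chunk_size]
        try:
            chunk.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the index of the offending byte inside the chunk.
            return offset + exc.start
    return None


if __name__ == "__main__":
    chunk_size = 511  # assumed: BUFSIZ - 1 with BUFSIZ = 512
    for n in (169, 170):
        line = comment_line(n)
        bad = first_bad_offset(line, chunk_size)
        if bad is None:
            print(f"{n} repetitions ({len(line)} bytes): every chunk decodes")
        else:
            print(f"{n} repetitions ({len(line)} bytes): "
                  f"split sequence starts at byte {bad} (0x{line[bad]:02x})")
```

Under these assumptions, 169 repetitions give a 510-byte line that fits in a single chunk, while 170 repetitions give 513 bytes, so the first chunk ends after only two of the three bytes of the last ░ and the undecodable sequence starts with 0xe2.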

This has been fixed in a rewrite of the tokenizer (bpo-25643), for which the PR was recently merged into the main branch for 3.10a7+.

Maybe a minimal backport to keep reading up to "\n" can be applied to 3.8 and 3.9.
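
The "keep reading up to "\n"" idea can be sketched in Python as follows. This is only an illustration of the approach, not the proposed backport and not the code from bpo-25643; the function name `read_full_line` and the 511-byte chunk size are hypothetical.

```python
import io


def read_full_line(stream, chunk_size: int) -> bytes:
    """Read one complete line from a binary stream.

    Chunks of at most chunk_size bytes are accumulated until one ends with a
    newline or the stream is exhausted, so the caller never decodes a
    partial line.
    """
    parts = []
    while True:
        # readline(size) returns at most `size` bytes and stops after b"\n".
        chunk = stream.readline(chunk_size)
        if not chunk:
            break  # end of file
        parts.append(chunk)
        if chunk.endswith(b"\n"):
            break  # the line is complete
    return b"".join(parts)


if __name__ == "__main__":
    source = ("# " + "░" * 170 + "\n" + "x = 1\n").encode("utf-8")
    stream = io.BytesIO(source)
    line = read_full_line(stream, 511)  # 511 is an assumed chunk size
    # Decoding happens only after the whole line has been collected.
    print(len(line), "bytes,", len(line.decode("utf-8")), "characters")
```

Because decoding is deferred until the newline has been seen, a multi-byte sequence that straddles a chunk boundary is reassembled before validation.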
msg391018 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-04-13 23:52
The bpo-14811 issue was fixed in Python 3.10 by bpo-25643, but is not fixed in Python 3.8 and 3.9.
msg391019 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-04-13 23:54
In 2012, I wrote detect_truncate.patch in bpo-14811. Does someone want to convert it to a PR for Python 3.9?
History
Date | User | Action | Args
2022-04-11 14:59:23 | admin | set | github: 82936
2021-04-13 23:54:10 | vstinner | set | messages: + msg391019
2021-04-13 23:52:57 | vstinner | set | nosy: + vstinner; messages: + msg391018
2021-04-13 09:37:59 | eryksun | set | stage: test needed -> needs patch; versions: - Python 3.7
2021-04-13 09:37:26 | eryksun | set | nosy: + eryksun; messages: + msg390942
2021-04-13 07:09:58 | Andrew Ushakov | set | messages: + msg390931; versions: + Python 3.7, Python 3.9
2019-11-15 20:16:55 | Andrew Ushakov | set | messages: + msg356715
2019-11-15 19:45:45 | terry.reedy | set | nosy: + terry.reedy; messages: + msg356709; type: behavior; stage: test needed
2019-11-09 12:43:01 | serhiy.storchaka | set | nosy: + serhiy.storchaka
2019-11-09 12:26:48 | Andrew Ushakov | create |