This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")
Type: behavior
Stage: resolved
Components: Interpreter Core, Unicode
Versions: Python 3.9, Python 3.8
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: Python tokenizer rewriting (View: 25643)
Assigned To:
Nosy List: BTaskaya, brian.curtin, eryksun, ezio.melotti, hynek, lys.nikolaou, pablogsal, pitrou, r.david.murray, serhiy.storchaka, tim.golden, v+python, vstinner
Priority: normal
Keywords: patch

Created on 2012-05-15 04:31 by v+python, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name | Uploaded | Description
t33a.py | v+python, 2012-05-15 04:31 | test case demonstrating bug
detect_truncate.patch | vstinner, 2012-05-16 07:00 |
Messages (20)
msg160679 - (view) Author: Glenn Linderman (v+python) * Date: 2012-05-15 04:31
t33a.py demonstrates a compilation problem.  OK, it has a long line, but making it one space longer (add a space after the left parenthesis) makes it work... so it must not be line length alone.  Rather, since the error is about a bad UTF-8 character starting with \xc3, it seems that the UTF-8 decoder might play a role.  I was surprised that I could reduce the test case by removing all the lines before and after these 3: the original failure was in a much longer file to which I added this line.

Originally detected in 3.2.2, I upgraded to 3.2.3 and the problem still occurred.
msg160686 - (view) Author: Glenn Linderman (v+python) * Date: 2012-05-15 06:25
Forgot to mention that I was running on Windows, 64-bit.
msg160688 - (view) Author: Hynek Schlawack (hynek) * (Python committer) Date: 2012-05-15 06:45
Would you mind adding more information like the full traceback? By saying "compilation error", I presume you mean the compilation of the t33a.py file into byte code (and not compilation of Python itself)?

I can't reproduce it with either the vanilla 3.2.3 on OS X or Ubuntu's 3.2.

My only suspicion is that the platform default encoding has bitten you. Does it also crash if you add "# -*- coding: utf-8 -*-" as the first line?
msg160697 - (view) Author: Glenn Linderman (v+python) * Date: 2012-05-15 08:54
There is no traceback.  Here is the text of the Syntax error.

d:\my\im\infiles>c:\python32\python.exe d:\my\py\t33a.py -h
  File "d:\my\py\t33a.py", line 2
SyntaxError: Non-UTF-8 code starting with '\xc3' in file d:\my\py\t33a.py on line 3, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

My understanding is Python 3 uses utf-8 as the default encoding for source files -- unless there is an encoding line; and I've set my emacs to save all .py files as utf-8-unix (meaning with no CR, if you aren't an emacs user).

I verified with a hex dump that the encoding in the file is UTF-8, but you are welcome to check as well; that is the file I uploaded.

So your testing would seem to indicate it is a platform specific bug.  Try running it on Windows, then.

Further, if it were the platform default encoding, adding a space wouldn't cure it. The encoding of the file would still be UTF-8, and the platform default encoding would still be the same, whatever you think it might be (but I think it is UTF-8 for source text), so adding a space would not affect an encoding mismatch.
msg160701 - (view) Author: Hynek Schlawack (hynek) * (Python committer) Date: 2012-05-15 09:45
You are right: it is the file system encoding that is platform dependent, not the source file encoding.

This space-after-parenthesis trigger is odd; I'm adding the Windows guys to the ticket. Please also tell us your exact version of Windows.
msg160705 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-15 10:23
I tried to reproduce but failed to compile a Windows Python - see issue14813.
msg160706 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-15 10:40
I can reproduce it on Linux. Minimal example:

$ ./python -c "open('longline.py', 'w').write('#' + repr('\u00A1' * 4096) + '\n')"
$ ./python longline.py
  File "longline.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xc2' in file longline.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
msg160708 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-15 10:42
And for Python 2.7 too.
msg160709 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-15 10:49
The function decoding_fgets (Parser/tokenizer.c) reads a line into a buffer of fixed size 8192 (the line is truncated to 8191 bytes) and then fails because the line is cut in the middle of a multibyte UTF-8 character.
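
The truncation described above can be imitated in pure Python. This is only a sketch, not the code in Parser/tokenizer.c: the helpers read_truncated() and decodes_cleanly() are hypothetical names introduced for illustration, and the sketch assumes that a single read keeps at most 8191 bytes of the line, as the message states.

```python
BUF_SIZE = 8192  # fixed buffer size named in the message above


def read_truncated(line: bytes, buf_size: int = BUF_SIZE) -> bytes:
    """Mimic a single fgets()-style read: keep at most buf_size - 1 bytes."""
    return line[: buf_size - 1]


def decodes_cleanly(data: bytes) -> bool:
    """Return True if `data` is valid UTF-8 on its own."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same line that the minimal example writes to longline.py.
line = ("#" + repr("\u00a1" * 4096) + "\n").encode("utf-8")
chunk = read_truncated(line)

print(len(line), len(chunk))   # 8196 8191
print(decodes_cleanly(line))   # True
print(decodes_cleanly(chunk))  # False
```

The whole line is valid UTF-8, but the 8191-byte chunk ends on the lead byte 0xC2 of a two-byte character, so decoding the chunk on its own fails.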
msg160767 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-05-15 21:35
By the way, Glenn, what you posted as "the syntax error" (which it was) *is* the traceback.  A syntax error on the file directly being compiled will only have one line in the traceback.
msg160772 - (view) Author: Glenn Linderman (v+python) * Date: 2012-05-15 22:31
Thanks, David, for the clarification. I had been mentally separating syntax errors from other errors.
msg160807 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-16 07:00
> The function decoding_fgets (Parser/tokenizer.c) reads a line into
> a buffer of fixed size 8192 (the line is truncated to 8191 bytes) and
> then fails because the line is cut in the middle of a multibyte UTF-8 character.

It looks like BUFSIZ is much smaller than 8192 on Windows: it's maybe only 1024 bytes.

Attached patch detects when a line is truncated (longer than the internal buffer).

A better solution may be to reallocate the buffer if the string is longer than the buffer (write a universal fgets which allocates the buffer while the line is read). Most functions parsing Python source code use a dynamic buffer. For example "import module" now reads the whole file content before parsing it (see FileLoader.get_data() in Lib/importlib/_bootstrap.py).

At the very least, we should use a longer buffer on Windows (e.g. use 8192 on all platforms?).

I only found two functions parsing a Python file line by line: PyRun_InteractiveOneFlags() and PyRun_FileExFlags(). There are many variants of these functions (e.g. PyRun_InteractiveOne and PyRun_File). These functions are part of the C Python API and are used by programs to execute Python code when Python is embedded in a program.

PS: As noticed by Serhiy Storchaka, the bug is not specific to Windows. It's just that the internal buffer is much smaller on Windows.
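
The "universal fgets" idea from this message, a reader that grows its buffer until the whole line has arrived, might look roughly like the following in Python. It is a sketch of the approach only, not the attached patch and not the eventual C fix; read_full_line() and its chunk_size parameter are hypothetical names.

```python
import io


def read_full_line(stream, chunk_size=512):
    """Read one complete line from a binary stream, growing the buffer as needed."""
    buf = bytearray()
    while True:
        # readline(n) returns at most n bytes and stops right after b"\n".
        chunk = stream.readline(chunk_size)
        if not chunk:
            break  # end of file
        buf += chunk
        if chunk.endswith(b"\n"):
            break  # the line is complete
    return bytes(buf)


# A line far longer than chunk_size, followed by a short one.
source = ("#" + "\u00a1" * 4096 + "\nnext\n").encode("utf-8")
stream = io.BytesIO(source)

first = read_full_line(stream)
print(len(first))                   # 8194
print(first.decode("utf-8")[:3])    # decodes without error
print(read_full_line(stream))       # b'next\n'
```

Because decoding happens only after the newline has been seen, a chunk boundary that falls inside a multibyte character no longer matters.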
msg165841 - (view) Author: Hynek Schlawack (hynek) * (Python committer) Date: 2012-07-19 14:18
Are we going to fix this before 3.3? Any objections to Victor's patch?
msg167154 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-08-01 18:02
> Are we going to fix this before 3.3? Any objections to Victor's patch?

detect_truncate.patch raises an error if a line is longer than BUFSIZ, whereas Python supports lines longer than BUFSIZ bytes (it's just that the encoding cookie is ignored if line 1 or 2 is longer than BUFSIZ bytes). So my patch is not correct.
msg390969 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-04-13 14:57
I don't get any error executing the t33a.py script
msg390974 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-04-13 16:14
> I don't get any error executing the t33a.py script

The second line in t33a.py is 1618 bytes. The standard I/O BUFSIZ in Linux is 8192 bytes, but it's only 512 bytes in Windows. The latest alpha release, 3.10a7, includes your rewrite of the tokenizer, and in that case t33a.py no longer fails in Windows.
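
The buffer sizes quoted above also suggest why adding a single space made the original error disappear: the extra byte shifts where the cut falls. The sketch below uses made-up lines, not the contents of t33a.py, and assumes that each read keeps at most BUFSIZ - 1 bytes; cut_splits_character() is a hypothetical helper.

```python
BUFSIZ = 512  # Windows value quoted in the message above


def cut_splits_character(line: bytes, cut: int) -> bool:
    """Return True if cutting before byte offset `cut` lands inside a UTF-8 character."""
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return cut < len(line) and (line[cut] & 0xC0) == 0x80


# Hypothetical lines; the second has one extra space after the parenthesis.
tight = ("f(" + "a" * 508 + "\u00e9)\n").encode("utf-8")
spaced = ("f( " + "a" * 508 + "\u00e9)\n").encode("utf-8")

cut = BUFSIZ - 1  # assumption: a read returns at most BUFSIZ - 1 bytes
print(cut_splits_character(tight, cut))   # True
print(cut_splits_character(spaced, cut))  # False
```

In the first line the cut separates the lead byte 0xC3 from its continuation byte; in the second, the whole character moves past the cut. The sketch checks only the first cut, while a longer line has further chunk boundaries that can hit the same problem.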
msg390975 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-04-13 16:42
> no longer fails in Windows.

So that means we can close the issue, no?
msg390978 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-04-13 17:07
With https://bugs.python.org/issue14811#msg160706 I get a SyntaxError on Python 3.7, 3.8, 3.9 and 3.10.0a6. But I don't get an error on the master branch (Python 3.10.0a7+).

Eryk:
> The latest alpha release, 3.10a7, includes your rewrite of the tokenizer, and in that case t33a.py no longer fails in Windows.

Oh ok, this issue was fixed by the following commit which is part of v3.10.0a7 release:

commit 261a452a1300eeeae1428ffd6e6623329c085e2c
Author: Pablo Galindo <Pablogsal@gmail.com>
Date:   Sun Mar 28 23:48:05 2021 +0100

    bpo-25643: Refactor the C tokenizer into smaller, logical units (GH-25050)
msg391015 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-04-13 23:29
> So that means we can close the issue, no?

This is a bug in 3.8 and 3.9, which need a fix that keeps reading until "\n" is seen on the line. I arrived at this issue via bpo-38755, in case you think it should be addressed there, but it's the same bug that's reported here.
msg391017 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-04-13 23:35
Ok, let's continue the discussion on https://bugs.python.org/issue38755
History
Date | User | Action | Args
2022-04-11 14:57:30 | admin | set | github: 59016
2021-04-13 23:35:33 | pablogsal | set | messages: + msg391017
2021-04-13 23:29:45 | eryksun | set | messages: + msg391015
2021-04-13 17:07:04 | vstinner | set | status: open -> closed; superseder: Python tokenizer rewriting; messages: + msg390978; resolution: duplicate; stage: needs patch -> resolved
2021-04-13 16:42:16 | pablogsal | set | messages: + msg390975
2021-04-13 16:14:07 | eryksun | set | nosy: + eryksun; messages: + msg390974
2021-04-13 14:57:45 | pablogsal | set | messages: + msg390969
2021-04-13 14:10:37 | vstinner | set | nosy: + lys.nikolaou, pablogsal, BTaskaya
2021-04-13 10:15:57 | eryksun | set | versions: + Python 3.8, Python 3.9, - Python 2.7, Python 3.2, Python 3.3, Python 3.4
2012-11-04 17:05:41 | serhiy.storchaka | set | stage: needs patch; versions: + Python 3.4
2012-08-01 18:02:59 | vstinner | set | messages: + msg167154
2012-07-19 14:18:22 | hynek | set | messages: + msg165841
2012-05-16 07:00:29 | vstinner | set | files: + detect_truncate.patch; components: + Interpreter Core, - Windows; title: Syntax error on long UTF-8 lines -> decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with..."); keywords: + patch; nosy: + vstinner; messages: + msg160807
2012-05-15 22:31:48 | v+python | set | messages: + msg160772
2012-05-15 21:35:42 | r.david.murray | set | nosy: + r.david.murray; messages: + msg160767
2012-05-15 10:51:34 | serhiy.storchaka | set | title: compile fails - UTF-8 character decoding -> Syntax error on long UTF-8 lines
2012-05-15 10:49:22 | serhiy.storchaka | set | messages: + msg160709
2012-05-15 10:42:53 | serhiy.storchaka | set | messages: + msg160708; versions: + Python 2.7
2012-05-15 10:40:20 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg160706
2012-05-15 10:23:18 | pitrou | set | versions: + Python 3.3; nosy: + pitrou; messages: + msg160705; components: + Windows
2012-05-15 09:45:50 | hynek | set | nosy: + tim.golden, brian.curtin; messages: + msg160701; components: - Interpreter Core; type: compile error -> behavior
2012-05-15 08:54:43 | v+python | set | messages: + msg160697
2012-05-15 06:45:54 | hynek | set | nosy: + hynek; messages: + msg160688
2012-05-15 06:32:37 | ezio.melotti | set | nosy: + ezio.melotti; components: + Unicode
2012-05-15 06:25:49 | v+python | set | messages: + msg160686
2012-05-15 04:31:40 | v+python | create |