Issue 14811: decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/59016

classification

Title:	decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")
Type:	behavior	Stage:	resolved
Components:	Interpreter Core, Unicode	Versions:	Python 3.9, Python 3.8

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	Python tokenizer rewriting (issue 25643)
Assigned To:		Nosy List:	BTaskaya, brian.curtin, eryksun, ezio.melotti, hynek, lys.nikolaou, pablogsal, pitrou, r.david.murray, serhiy.storchaka, tim.golden, v+python, vstinner
Priority:	normal	Keywords:	patch

Created on 2012-05-15 04:31 by v+python, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
t33a.py	v+python, 2012-05-15 04:31	test case demonstrating bug
detect_truncate.patch	vstinner, 2012-05-16 07:00

Messages (20)
msg160679 - (view)	Author: Glenn Linderman (v+python) *	Date: 2012-05-15 04:31
t33a.py demonstrates a compilation problem. OK, it has a long line, but making it one space longer (add a space after the left parenthesis) makes it work... so it must not be line length alone. Rather, since the error is about a bad UTF-8 character starting with \xc3, it seems that the UTF-8 decoder might play a role. I was surprised that I could reduce the test case by removing all the lines before and after these 3: the original failure was in a much longer file to which I added this line. Originally detected in 3.2.2, I upgraded to 3.2.3 and the problem still occurred.
msg160686 - (view)	Author: Glenn Linderman (v+python) *	Date: 2012-05-15 06:25
Forgot to mention that I was running on Windows, 64-bit.
msg160688 - (view)	Author: Hynek Schlawack (hynek) *	Date: 2012-05-15 06:45
Would you mind adding more information, like the full traceback? By saying "compilation error", I presume you mean the compilation of the t33a.py file into byte code (and not compilation of Python itself)? I can reproduce it neither with the vanilla 3.2.3 on OS X nor with Ubuntu's 3.2. My only suspicion is that the platform default encoding has bitten you; does it also crash if you add "# -*- coding: utf-8 -*-" as the first line?
msg160697 - (view)	Author: Glenn Linderman (v+python) *	Date: 2012-05-15 08:54
There is no traceback. Here is the text of the syntax error:
    d:\my\im\infiles>c:\python32\python.exe d:\my\py\t33a.py -h
      File "d:\my\py\t33a.py", line 2
    SyntaxError: Non-UTF-8 code starting with '\xc3' in file d:\my\py\t33a.py on line 3, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
My understanding is Python 3 uses utf-8 as the default encoding for source files -- unless there is an encoding line; and I've set my emacs to save all .py files as utf-8-unix (meaning with no CR, if you aren't an emacs user). I verified with a hex dump that the encoding in the file is UTF-8, but you are welcome to also, that is the file I uploaded.
So your testing would seem to indicate it is a platform specific bug. Try running it on Windows, then.
Further, if it were the platform default encoding, adding a space wouldn't cure it... the encoding of the file would still be UTF-8, and the platform default encoding would still be the same whatever you think it might be (but I think it is UTF-8 for source text), so adding a space would not effect an encoding mismatch.
msg160701 - (view)	Author: Hynek Schlawack (hynek) *	Date: 2012-05-15 09:45
You are right, file system encoding was platform dependent, not file encoding. This space-after-parentheses trigger is odd; I'm adding the Windows guys to the ticket. Please tell us also your exact version of Windows.
msg160705 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-05-15 10:23
I tried to reproduce but failed to compile a Windows Python - see issue14813.
msg160706 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-15 10:40
I can reproduce it on Linux. Minimal example:
    $ ./python -c "open('longline.py', 'w').write('#' + repr('\u00A1' * 4096) + '\n')"
    $ ./python longline.py
      File "longline.py", line 1
    SyntaxError: Non-UTF-8 code starting with '\xc2' in file longline.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
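For reference, the one-liner in this message can be expanded into a standalone script. This is only a sketch and not part of the original report: the helper names `write_long_line_script` and `run_script` are made up for illustration, and the file is written with an explicit UTF-8 encoding rather than the platform default. Whether the SyntaxError appears depends on the interpreter version, as discussed later in this thread.

```python
import os
import subprocess
import sys
import tempfile


def write_long_line_script(path, count=4096):
    """Write a one-line script: a comment holding `count` copies of U+00A1."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("#" + repr("\u00a1" * count) + "\n")
    return os.path.getsize(path)


def run_script(path):
    """Run the script with the current interpreter; return (returncode, stderr)."""
    proc = subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
        errors="replace",
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "longline.py")
        size = write_long_line_script(script)
        code, err = run_script(script)
        print(f"{size} bytes written, exit code {code}")
        if "Non-UTF-8 code" in err:
            print("reproduced: the long line was rejected as non-UTF-8")
```

The generated line is longer than 8191 bytes, which is the property the reproduction relies on.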
msg160708 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-15 10:42
And for Python 2.7 too.
msg160709 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-15 10:49
Function decoding_fgets (Parser/tokenizer.c) reads line in buffer of fixed size 8192 (line truncated to size 8191) and then fails because line is cut in the middle of a multibyte UTF-8 character.
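The effect described here can be illustrated in pure Python without touching the tokenizer. The sketch below builds the same bytes as the reproducer line and keeps only the first 8191 bytes, as a fixed 8192-byte buffer would; the constant name `CHUNK` is an assumption for illustration, and the snippet only models the truncation, not the actual C code.

```python
# Bytes of the reproducer line: '#', a quote, 4096 two-byte characters
# (U+00A1 encodes as C2 A1), a closing quote and a newline.
line = ("#" + repr("\u00a1" * 4096) + "\n").encode("utf-8")

CHUNK = 8191  # bytes that fit in an 8192-byte buffer besides the terminator

head = line[:CHUNK]  # what a single fixed-size read would keep

try:
    head.decode("utf-8")
    split_inside_char = False
    error_start = None
except UnicodeDecodeError as exc:
    # The chunk ends with a lead byte whose continuation byte was cut off.
    split_inside_char = True
    error_start = exc.start

print(len(line), split_inside_char, error_start, hex(head[-1]))
```

The truncated chunk ends with a lone 0xC2 lead byte, which matches the '\xc2' named in the SyntaxError of the reproducer.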
msg160767 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-05-15 21:35
By the way, Glenn, what you posted as "the syntax error" (which it was) is the traceback. A syntax error on the file directly being compiled will only have one line in the traceback.
msg160772 - (view)	Author: Glenn Linderman (v+python) *	Date: 2012-05-15 22:31
Thanks, David, for the clarification. I had been mentally separating syntax errors from other errors.
msg160807 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-16 07:00
> Function decoding_fgets (Parser/tokenizer.c) reads line in buffer
> of fixed size 8192 (line truncated to size 8191) and then fails
> because line is cut in the middle of a multibyte UTF-8 character.
It looks like BUFSIZ is much smaller than 8192 on Windows: it's maybe only 1024 bytes.
Attached patch detects when a line is truncated (longer than the internal buffer).
A better solution is maybe to reallocate the buffer if the string is longer than the buffer (write a universal fgets which allocates the buffer while the line is read). Most functions parsing Python source code use a dynamic buffer. For example "import module" now reads the whole file content before parsing it (see FileLoader.get_data() in Lib/importlib/_bootstrap.py).
At least, we should use a longer buffer on Windows (ex: use 8192 on all platforms?).
I only found two functions parsing a Python file line by line: PyRun_InteractiveOneFlags() and PyRun_FileExFlags(). There are many variants of these functions (ex: PyRun_InteractiveOne and PyRun_File). These functions are part of the C Python API and used by programs to execute Python code when Python is embedded in a program.
PS: As noticed by Serhiy Storchaka, the bug is not specific to Windows. It's just that the internal buffer is much smaller on Windows.
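The "universal fgets" idea mentioned in this message can be sketched in Python. This is not the patch attached to the issue and not the code that was eventually merged; `read_full_line` is a hypothetical helper, and `stream.readline(size)` merely stands in for a bounded `fgets()` call. The point is only that the caller keeps appending chunks until a newline or end of file is seen, so the line is never decoded in a truncated state.

```python
import io


def read_full_line(stream, chunk_size=512):
    """Read one complete line from a binary stream in bounded chunks.

    Each readline(chunk_size) call returns at most chunk_size bytes, like a
    fixed-size fgets() buffer; the pieces are accumulated until the line ends.
    """
    parts = []
    while True:
        chunk = stream.readline(chunk_size)
        if not chunk:  # end of file
            break
        parts.append(chunk)
        if chunk.endswith(b"\n"):  # the whole line has been read
            break
    return b"".join(parts)


# Usage: a line much longer than the chunk size is still returned whole,
# so decoding it as UTF-8 does not hit a split multibyte character.
long_line = ("# " + "\u00a1" * 1000 + "\n").encode("utf-8")
source = io.BytesIO(long_line + b"x = 1\n")
first = read_full_line(source, chunk_size=511)
text = first.decode("utf-8")
```

Joining the chunks as bytes before decoding is the design choice that matters here: decoding each chunk separately would reintroduce the original problem.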
msg165841 - (view)	Author: Hynek Schlawack (hynek) *	Date: 2012-07-19 14:18
Are we going to fix this before 3.3? Any objections to Victor's patch?
msg167154 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-08-01 18:02
> Are we going to fix this before 3.3? Any objections to Victor's patch?
detect_truncate.patch is now raising an error if a line is longer than BUFSIZ, whereas Python supports lines longer than BUFSIZ bytes (it's just that the encoding cookie is ignored if line 1 or 2 is longer than BUFSIZ bytes). So my patch is not correct.
msg390969 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2021-04-13 14:57
I don't get any error executing the t33a.py script
msg390974 - (view)	Author: Eryk Sun (eryksun) *	Date: 2021-04-13 16:14
> I don't get any error executing the t33a.py script
The second line in t33a.py is 1618 bytes. The standard I/O BUFSIZ in Linux is 8192 bytes, but it's only 512 bytes in Windows. The latest alpha release, 3.10a7, includes your rewrite of the tokenizer, and in that case t33a.py no longer fails in Windows.
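These numbers are consistent with the original observation that adding a single space made the error go away: one extra byte shifts where the buffer boundary falls. The sketch below uses a made-up line, not the contents of t33a.py, and assumes 511 usable bytes per read (a 512-byte buffer minus the terminator); `cut_is_clean` and `make_line` are hypothetical helpers.

```python
CHUNK = 511  # assumed usable bytes of a 512-byte buffer


def cut_is_clean(data, offset):
    """Return True if data[:offset] does not end inside a UTF-8 character."""
    if offset >= len(data):
        return True
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (data[offset] & 0xC0) != 0x80


def make_line(opening):
    """Build a hypothetical long line with one two-byte character (U+00E9)."""
    return (opening + "a" * 505 + "\u00e9" + "b" * 100 + ")\n").encode("utf-8")


original = make_line("x = (")   # the character's second byte sits at offset 511
shifted = make_line("x = ( ")   # one extra space moves its first byte to 511

print(cut_is_clean(original, CHUNK), cut_is_clean(shifted, CHUNK))
```

In the first line the cut separates the 0xC3 lead byte from its continuation byte; in the second the cut falls between two complete characters, so each piece is valid UTF-8 on its own.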
msg390975 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2021-04-13 16:42
> no longer fails in Windows.
So that means we can close the issue, no?
msg390978 - (view)	Author: STINNER Victor (vstinner) *	Date: 2021-04-13 17:07
With https://bugs.python.org/issue14811#msg160706 I get a SyntaxError on Python 3.7, 3.8, 3.9 and 3.10.0a6. But I don't get an error on the master branch (Python 3.10.0a7+).
Eryk:
> The latest alpha release, 3.10a7, includes your rewrite of the tokenizer, and in that case t33a.py no longer fails in Windows.
Oh ok, this issue was fixed by the following commit which is part of v3.10.0a7 release:
    commit 261a452a1300eeeae1428ffd6e6623329c085e2c
    Author: Pablo Galindo <Pablogsal@gmail.com>
    Date: Sun Mar 28 23:48:05 2021 +0100
        bpo-25643: Refactor the C tokenizer into smaller, logical units (GH-25050)
msg391015 - (view)	Author: Eryk Sun (eryksun) *	Date: 2021-04-13 23:29
> So that means we can close the issue, no?
This is a bug in 3.8 and 3.9, which need the fix to keep reading until "\n" is seen on the line. I arrived at this issue via bpo-38755 if you think it should be addressed there, but it's the same bug that's reported here.
msg391017 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2021-04-13 23:35
Ok, let's continue the discussion on https://bugs.python.org/issue38755

History
Date	User	Action	Args
2022-04-11 14:57:30	admin	set	github: 59016
2021-04-13 23:35:33	pablogsal	set	messages: + msg391017
2021-04-13 23:29:45	eryksun	set	messages: + msg391015
2021-04-13 17:07:04	vstinner	set	status: open -> closed superseder: Python tokenizer rewriting messages: + msg390978 resolution: duplicate stage: needs patch -> resolved
2021-04-13 16:42:16	pablogsal	set	messages: + msg390975
2021-04-13 16:14:07	eryksun	set	nosy: + eryksun messages: + msg390974
2021-04-13 14:57:45	pablogsal	set	messages: + msg390969
2021-04-13 14:10:37	vstinner	set	nosy: + lys.nikolaou, pablogsal, BTaskaya
2021-04-13 10:15:57	eryksun	set	versions: + Python 3.8, Python 3.9, - Python 2.7, Python 3.2, Python 3.3, Python 3.4
2012-11-04 17:05:41	serhiy.storchaka	set	stage: needs patch versions: + Python 3.4
2012-08-01 18:02:59	vstinner	set	messages: + msg167154
2012-07-19 14:18:22	hynek	set	messages: + msg165841
2012-05-16 07:00:29	vstinner	set	files: + detect_truncate.patch components: + Interpreter Core, - Windows title: Syntax error on long UTF-8 lines -> decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...") keywords: + patch nosy: + vstinner messages: + msg160807
2012-05-15 22:31:48	v+python	set	messages: + msg160772
2012-05-15 21:35:42	r.david.murray	set	nosy: + r.david.murray messages: + msg160767
2012-05-15 10:51:34	serhiy.storchaka	set	title: compile fails - UTF-8 character decoding -> Syntax error on long UTF-8 lines
2012-05-15 10:49:22	serhiy.storchaka	set	messages: + msg160709
2012-05-15 10:42:53	serhiy.storchaka	set	messages: + msg160708 versions: + Python 2.7
2012-05-15 10:40:20	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg160706
2012-05-15 10:23:18	pitrou	set	versions: + Python 3.3 nosy: + pitrou messages: + msg160705 components: + Windows
2012-05-15 09:45:50	hynek	set	nosy: + tim.golden, brian.curtin messages: + msg160701 components: - Interpreter Core type: compile error -> behavior
2012-05-15 08:54:43	v+python	set	messages: + msg160697
2012-05-15 06:45:54	hynek	set	nosy: + hynek messages: + msg160688
2012-05-15 06:32:37	ezio.melotti	set	nosy: + ezio.melotti components: + Unicode
2012-05-15 06:25:49	v+python	set	messages: + msg160686
2012-05-15 04:31:40	v+python	create