classification
Title: decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")
Type: behavior
Stage: needs patch
Components: Interpreter Core, Unicode
Versions: Python 3.2, Python 3.3, Python 3.4, Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: brian.curtin, ezio.melotti, hynek, pitrou, r.david.murray, serhiy.storchaka, tim.golden, v+python, vstinner
Priority: normal
Keywords: patch

Created on 2012-05-15 04:31 by v+python, last changed 2012-11-04 17:05 by serhiy.storchaka.

Files
File name Uploaded Description
t33a.py v+python, 2012-05-15 04:31 test case demonstrating bug
detect_truncate.patch vstinner, 2012-05-16 07:00
Messages (14)
msg160679 - (view) Author: Glenn Linderman (v+python) * Date: 2012-05-15 04:31
t33a.py demonstrates a compilation problem.  OK, it has a long line, but making it one space longer (add a space after the left parenthesis) makes it work... so it must not be line length alone.  Rather, since the error is about a bad UTF-8 character starting with \xc3, it seems that the UTF-8 decoder might play a role.  I was surprised that I could reduce the test case by removing all the lines before and after these 3: the original failure was in a much longer file to which I added this line.

Originally detected in 3.2.2, I upgraded to 3.2.3 and the problem still occurred.
msg160686 - (view) Author: Glenn Linderman (v+python) * Date: 2012-05-15 06:25
Forgot to mention that I was running on Windows, 64-bit.
msg160688 - (view) Author: Hynek Schlawack (hynek) * (Python committer) Date: 2012-05-15 06:45
Would you mind adding more information like the full traceback? By saying "compilation error", I presume you mean the compilation of the t33a.py file into byte code (and not compilation of Python itself)?

I can reproduce it neither with vanilla 3.2.3 on OS X nor with Ubuntu's 3.2.

My only suspicion is that the platform default encoding has bitten you. Does it also fail if you add "# -*- coding: utf-8 -*-" as the first line?
msg160697 - (view) Author: Glenn Linderman (v+python) * Date: 2012-05-15 08:54
There is no traceback.  Here is the text of the Syntax error.

d:\my\im\infiles>c:\python32\python.exe d:\my\py\t33a.py -h
  File "d:\my\py\t33a.py", line 2
SyntaxError: Non-UTF-8 code starting with '\xc3' in file d:\my\py\t33a.py on line 3, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

My understanding is Python 3 uses utf-8 as the default encoding for source files -- unless there is an encoding line; and I've set my emacs to save all .py files as utf-8-unix (meaning with no CR, if you aren't an emacs user).

I verified with a hex dump that the encoding in the file is UTF-8, but you are welcome to also, that is the file I uploaded.

So your testing would seem to indicate it is a platform specific bug.  Try running it on Windows, then.

Further, if the platform default encoding were the cause, adding a space wouldn't cure it: the file would still be encoded as UTF-8, and the platform default encoding would still be whatever it is (though I think the default for source text is UTF-8), so adding a space would not affect an encoding mismatch.
msg160701 - (view) Author: Hynek Schlawack (hynek) * (Python committer) Date: 2012-05-15 09:45
You are right: it is the file system encoding that is platform dependent, not the source file encoding.

This space-after-parenthesis trigger is odd; I'm adding the Windows guys to the ticket. Please also tell us your exact version of Windows.
msg160705 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-15 10:23
I tried to reproduce but failed to compile a Windows Python - see issue14813.
msg160706 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-15 10:40
I can reproduce it on Linux. Minimal example:

$ ./python -c "open('longline.py', 'w').write('#' + repr('\u00A1' * 4096) + '\n')"
$ ./python longline.py
  File "longline.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xc2' in file longline.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
msg160708 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-15 10:42
And for Python 2.7 too.
msg160709 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-15 10:49
The function decoding_fgets() (Parser/tokenizer.c) reads each line into a fixed-size buffer of 8192 bytes, so a longer line is truncated to 8191 bytes. Decoding then fails when the line is cut in the middle of a multibyte UTF-8 character.
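
To illustrate the truncation described above, here is a minimal sketch in Python rather than C. It does not call decoding_fgets(); it only emulates a fixed-size buffer by slicing bytes. BUFFER_SIZE and decodes_cleanly are names made up for this illustration, and 8192 is the buffer size reported in this message, not a value read from any particular build.

```python
# Illustration only: emulate a fixed-size fgets() buffer in pure Python.
# BUFFER_SIZE is an assumption taken from the message above.
BUFFER_SIZE = 8192  # fgets() stores at most BUFFER_SIZE - 1 bytes per call


def decodes_cleanly(data):
    """Return True if the bytes are valid UTF-8 on their own."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


# The same line as in the minimal example: each U+00A1 takes two bytes (c2 a1).
line = ('#' + repr('\u00A1' * 4096) + '\n').encode('utf-8')
chunk = line[:BUFFER_SIZE - 1]  # what a single truncated read would hold

# Shifting the content by one byte moves the cut onto a character boundary.
shifted = ('# ' + repr('\u00A1' * 4096) + '\n').encode('utf-8')
shifted_chunk = shifted[:BUFFER_SIZE - 1]

print(len(line))                       # 8196 bytes in total
print(decodes_cleanly(line))           # True: the whole line is valid UTF-8
print(decodes_cleanly(chunk))          # False: cut after a lone lead byte
print(hex(chunk[-1]))                  # 0xc2, the byte named in the error
print(decodes_cleanly(shifted_chunk))  # True: the cut falls between characters
```

The last case is consistent with the original report that adding a single space made the error disappear: whether the failure occurs depends on where the cut lands, not on the line length alone.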
msg160767 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-05-15 21:35
By the way, Glenn, what you posted as "the syntax error" (which it was) *is* the traceback.  A syntax error on the file directly being compiled will only have one line in the traceback.
msg160772 - (view) Author: Glenn Linderman (v+python) * Date: 2012-05-15 22:31
Thanks, David, for the clarification. I had been mentally separating syntax errors from other errors.
msg160807 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-16 07:00
> Function decoding_fgets (Parser/tokenizer.c) reads line in buffer
> of fixed size 8192 (line truncated to size 8191) and then fails
> because line is cut in the middle of a multibyte UTF-8 character.

It looks like BUFSIZ is much smaller than 8192 on Windows: it may be only 1024 bytes.

The attached patch detects when a line is truncated (that is, longer than the internal buffer).

A better solution may be to reallocate the buffer if the line is longer than the buffer (write a universal fgets which grows the buffer while the line is read). Most functions parsing Python source code use a dynamic buffer. For example, "import module" now reads the whole file content before parsing it (see FileLoader.get_data() in Lib/importlib/_bootstrap.py).
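
The idea of a line reader that grows its buffer can be sketched as follows, in Python rather than the C of Parser/tokenizer.c. This is only an illustration of the approach, not a proposed patch: read_full_line and chunk_size are hypothetical names, and the 1024-byte chunk mirrors the BUFSIZ guessed above for Windows.

```python
import io


def read_full_line(stream, chunk_size=8192):
    """Read one complete line from a binary stream, however long it is.

    Hypothetical helper: it collects fixed-size pieces until a newline or
    EOF is seen, so the caller never receives a line cut mid-character.
    """
    pieces = []
    while True:
        piece = stream.readline(chunk_size)  # returns at most chunk_size bytes
        pieces.append(piece)
        if not piece or piece.endswith(b'\n'):
            break
    return b''.join(pieces)


# A first line far longer than the chunk size, followed by a short line.
source = ('#' + repr('\u00A1' * 4096) + '\n').encode('utf-8')
stream = io.BytesIO(source + b'x = 1\n')

first = read_full_line(stream, chunk_size=1024)
second = read_full_line(stream, chunk_size=1024)
at_eof = read_full_line(stream, chunk_size=1024)

text = first.decode('utf-8')  # decodes without error: nothing was truncated
print(len(first), second, at_eof)
```

Decoding happens only after the whole line has been assembled, which is what keeps a multibyte character from being split across two reads.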

At the very least, we should use a longer buffer on Windows (e.g. 8192 bytes on all platforms?).

I only found two functions parsing a Python file line by line: PyRun_InteractiveOneFlags() and PyRun_FileExFlags(). There are many variants of these functions (e.g. PyRun_InteractiveOne and PyRun_File). These functions are part of the Python C API and are used by programs to execute Python code when Python is embedded in a program.

PS: As noticed by Serhiy Storchaka, the bug is not specific to Windows. It's just that the internal buffer is much smaller on Windows.
msg165841 - (view) Author: Hynek Schlawack (hynek) * (Python committer) Date: 2012-07-19 14:18
Are we going to fix this before 3.3? Any objections to Victor's patch?
msg167154 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-08-01 18:02
> Are we going to fix this before 3.3? Any objections to Victor's patch?

detect_truncate.patch raises an error whenever a line is longer than BUFSIZ, but Python does support lines longer than BUFSIZ bytes (the only limitation is that the encoding cookie is ignored if line 1 or 2 is longer than BUFSIZ bytes). So my patch is not correct.
History
Date User Action Args
2012-11-04 17:05:41 serhiy.storchaka set stage: needs patch
versions: + Python 3.4
2012-08-01 18:02:59 vstinner set messages: + msg167154
2012-07-19 14:18:22 hynek set messages: + msg165841
2012-05-16 07:00:29 vstinner set files: + detect_truncate.patch
components: + Interpreter Core, - Windows
title: Syntax error on long UTF-8 lines -> decoding_fgets() truncates long lines and fails with a SyntaxError("Non-UTF-8 code starting with...")
keywords: + patch
nosy: + vstinner
messages: + msg160807
2012-05-15 22:31:48 v+python set messages: + msg160772
2012-05-15 21:35:42 r.david.murray set nosy: + r.david.murray
messages: + msg160767
2012-05-15 10:51:34 serhiy.storchaka set title: compile fails - UTF-8 character decoding -> Syntax error on long UTF-8 lines
2012-05-15 10:49:22 serhiy.storchaka set messages: + msg160709
2012-05-15 10:42:53 serhiy.storchaka set messages: + msg160708
versions: + Python 2.7
2012-05-15 10:40:20 serhiy.storchaka set nosy: + serhiy.storchaka
messages: + msg160706
2012-05-15 10:23:18 pitrou set versions: + Python 3.3
nosy: + pitrou
messages: + msg160705
components: + Windows
2012-05-15 09:45:50 hynek set nosy: + tim.golden, brian.curtin
messages: + msg160701
components: - Interpreter Core
type: compile error -> behavior
2012-05-15 08:54:43 v+python set messages: + msg160697
2012-05-15 06:45:54 hynek set nosy: + hynek
messages: + msg160688
2012-05-15 06:32:37 ezio.melotti set nosy: + ezio.melotti
components: + Unicode
2012-05-15 06:25:49 v+python set messages: + msg160686
2012-05-15 04:31:40 v+python create