Message 160807 - Python tracker


Author	vstinner
Recipients	brian.curtin, ezio.melotti, hynek, pitrou, r.david.murray, serhiy.storchaka, tim.golden, v+python, vstinner
Date	2012-05-16.07:00:28
Message-id	<1337151630.09.0.953377453553.issue14811@psf.upfronthosting.co.za>
In-reply-to

Content
> Function decoding_fgets (Parser/tokenizer.c) reads line in buffer
> of fixed size 8192 (line truncated to size 8191) and then fails
> because line is cut in the middle of a multibyte UTF-8 character.
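To illustrate the failure described in the quote: when a read stops at a fixed byte count, the last bytes in the buffer may be the start of a multibyte UTF-8 character whose remaining bytes are still in the file. The sketch below is a hypothetical helper (utf8_incomplete_tail is not a CPython function) that reports how many trailing bytes belong to such an incomplete sequence. It assumes the input is otherwise well-formed UTF-8 and does not validate it.

```c
#include <stddef.h>

/* Hypothetical helper, not part of CPython.
 * Returns the number of bytes at the end of s[0..len) that form the
 * beginning of an incomplete UTF-8 sequence, or 0 if the buffer ends
 * on a character boundary. Assumes otherwise well-formed UTF-8. */
static size_t utf8_incomplete_tail(const unsigned char *s, size_t len)
{
    size_t i = len;
    size_t need, have;
    unsigned char lead;

    /* Walk back over at most 3 continuation bytes (10xxxxxx). */
    while (i > 0 && len - i < 3 && (s[i - 1] & 0xC0) == 0x80)
        i--;
    if (i == 0)
        return 0;               /* no lead byte found; not handled here */

    lead = s[i - 1];
    if ((lead & 0x80) == 0x00)
        need = 1;               /* ASCII */
    else if ((lead & 0xE0) == 0xC0)
        need = 2;
    else if ((lead & 0xF0) == 0xE0)
        need = 3;
    else if ((lead & 0xF8) == 0xF0)
        need = 4;
    else
        return 0;               /* invalid lead byte; not handled here */

    have = len - (i - 1);       /* lead byte plus continuation bytes seen */
    return have < need ? have : 0;
}
```

A non-zero result for a full buffer means the line was cut inside a character, which is the situation in which decoding the buffer as UTF-8 fails.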

It looks like BUFSIZ is much smaller than 8192 on Windows: it may be only 1024 bytes.

The attached patch detects when a line is truncated (that is, when the line is longer than the internal buffer).
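For illustration only (this is not the attached patch): one common way to detect truncation after fgets() is to check whether the buffer is full and does not end with a newline. The helper name read_line_checked and its return convention are assumptions made for this sketch.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not the code from the attached patch.
 * Reads one line into buf with fgets().
 * Returns  1 if the line was truncated (more of it remains in fp),
 *          0 if a complete line (or the final unterminated line) was read,
 *         -1 on EOF or read error with nothing read.
 * Limitation: a NUL byte inside the line confuses strlen(). */
static int read_line_checked(char *buf, size_t size, FILE *fp)
{
    size_t len;
    int c;

    if (size < 2 || fgets(buf, (int)size, fp) == NULL)
        return -1;

    len = strlen(buf);
    if (len == size - 1 && buf[len - 1] != '\n') {
        /* Buffer is full and has no newline: either the line is longer
         * than the buffer, or the file ends exactly here. Peek to tell. */
        c = getc(fp);
        if (c == EOF)
            return 0;
        ungetc(c, fp);
        return 1;
    }
    return 0;
}
```

A caller that gets 1 can report an error such as "line too long" instead of trying to decode a buffer that may end in the middle of a multibyte character.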

A better solution may be to reallocate the buffer when the line is longer than the buffer (that is, write a universal fgets which grows the buffer while the line is being read). Most functions that parse Python source code use a dynamic buffer. For example, "import module" now reads the whole file content before parsing it (see FileLoader.get_data() in Lib/importlib/_bootstrap.py).
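A minimal sketch of what such a growing fgets could look like, assuming plain malloc()/realloc() rather than CPython's own allocators. The name read_line_dynamic and the initial capacity of 128 bytes are hypothetical choices for this example, not existing tokenizer code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of CPython.
 * Reads one line of arbitrary length from fp. Returns a malloc()ed,
 * NUL-terminated string (including the trailing newline, if any) that
 * the caller must free(), or NULL on EOF with nothing read, on read
 * error, or on allocation failure.
 * Limitation: lines of 2 GiB or more are not handled (int cast below). */
static char *read_line_dynamic(FILE *fp)
{
    size_t cap = 128, len = 0;
    char *buf = malloc(cap);
    char *tmp;

    if (buf == NULL)
        return NULL;

    while (fgets(buf + len, (int)(cap - len), fp) != NULL) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n')
            return buf;                 /* complete line */
        if (len + 1 == cap) {
            /* Buffer is full and the line is not finished: double it. */
            tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }

    if (len == 0 || ferror(fp)) {
        free(buf);
        return NULL;
    }
    return buf;                         /* last line without a newline */
}
```

Because the buffer grows until a newline or EOF is seen, a line is never split, so it cannot be cut in the middle of a multibyte character.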

At the very least, we should use a longer buffer on Windows (e.g. use 8192 on all platforms?).

I only found two functions that parse a Python file line by line: PyRun_InteractiveOneFlags() and PyRun_FileExFlags(). There are many variants of these functions (e.g. PyRun_InteractiveOne and PyRun_File). These functions are part of the Python C API and are used by programs to execute Python code when Python is embedded in a program.

PS: As noticed by Serhiy Storchaka, the bug is not specific to Windows. It's just that the internal buffer is much smaller on Windows.

History
Date	User	Action	Args
2012-05-16 07:00:30	vstinner	set	recipients: + vstinner, pitrou, tim.golden, ezio.melotti, v+python, r.david.murray, brian.curtin, hynek, serhiy.storchaka
2012-05-16 07:00:30	vstinner	set	messageid: <1337151630.09.0.953377453553.issue14811@psf.upfronthosting.co.za>
2012-05-16 07:00:29	vstinner	link	issue14811 messages
2012-05-16 07:00:29	vstinner	create