This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Title: Patch for potential buffer overrun in tokenizer.c
Type: Stage:
Components: Interpreter Core Versions: Python 2.4
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: loewis Nosy List: doerwalter, glchapman, loewis
Priority: high Keywords: patch

Created on 2005-01-13 15:45 by glchapman, last changed 2022-04-11 14:56 by admin. This issue is now closed.

File name            Uploaded                        Description
tokenizer.c.3.diff   glchapman, 2005-02-27 15:22     patch without utf8 manipulation
                     glchapman, 2005-02-27 15:23     test files
                     glchapman, 2005-03-21 13:47
tokenizer.c.4.diff   doerwalter, 2005-05-16 14:11
Messages (9)
msg47529 - (view) Author: Greg Chapman (glchapman) Date: 2005-01-13 15:45
The fp_readl function in tokenizer.c has a potential
buffer overrun; see:

It is also triggered by trying to generate the pycom
file for the Excel 11.0 typelib, which is where I ran
into it.

The attached patch allows successful generation of the
Excel file; it also runs the script from the
above report without an access violation.  It also
doesn't break any of the tests in the regression suite
(probably a test along these lines should be added to
the suite).

It is not as efficient as it might be; with a function
for determining the number of Unicode characters in a
UTF-8 string, you could avoid some memory allocations.
Perhaps such a function should be added to unicodeobject.c?
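Such a helper is straightforward to sketch: in UTF-8, every code point contributes exactly one byte that is not a continuation byte (continuation bytes have the bit pattern 10xxxxxx), so counting code points means counting non-continuation bytes. A minimal sketch, assuming valid UTF-8 input (the function name is hypothetical; CPython's actual API differs):

```c
#include <stddef.h>

/* Count the code points in a NUL-terminated, valid UTF-8 string
 * by counting the bytes that are NOT continuation bytes
 * (continuation bytes match 10xxxxxx, i.e. (b & 0xC0) == 0x80). */
size_t utf8_strlen(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

With a helper like this, the character count of a decoded line can be computed in one pass, without allocating an intermediate Unicode object just to measure it.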

And, of course, the patch definitely needs review.  I'm
especially concerned that my use of
tok->decoding_buffer might be violating some kind of
assumption that I missed.
msg47530 - (view) Author: Greg Chapman (glchapman) Date: 2005-01-23 14:26
I'm attaching a new patch (tokenizer.c.2.diff), which should
be better since it avoids some unnecessary allocations and
decoding/encoding.  Hopefully, I haven't made any
unwarranted assumptions about UTF8.
msg47531 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-02-23 19:48
I think the patch looks good. Staring at it for a while,
I believe the reference counts are OK. Can you incorporate
it into the test suite?
msg47532 - (view) Author: Greg Chapman (glchapman) Date: 2005-02-27 15:22
After thinking about it some more, I realized that fp_readl
doesn’t have to do any utf8 manipulation.  If the string it
copies into the buffer does not include a \n, the buffer is
expanded and fp_readl is called again until a \n is returned
(or the end of the file is reached).  It doesn’t matter if,
during this loop, the buffer contains temporarily broken
utf8 characters.  So I’m attaching another new patch.  This
copies as many characters as fit into the buffer and saves
any overflow into a new string object (stored in
tok->decoding_buffer).  When it is called again by the loop,
it uses this overflow string to copy characters into the buffer.
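The copy-with-overflow idea described above can be sketched in C. This is a minimal, self-contained sketch, not the actual patch: `readl_sketch` and the `overflow` variable are hypothetical stand-ins for fp_readl and tok->decoding_buffer, and the real code works with Python string objects rather than malloc'd buffers:

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for tok->decoding_buffer: holds the bytes that did not
 * fit into the caller's buffer on the previous call. */
static char *overflow = NULL;

/* Copy up to buflen-1 bytes into buf (NUL-terminated; buflen >= 1).
 * Any excess is saved in `overflow` and delivered on the next call.
 * No UTF-8 inspection is needed: the caller keeps enlarging its
 * buffer and calling again until a '\n' (or EOF) appears, so a
 * multi-byte character split across calls is only temporary. */
static void readl_sketch(const char *line, char *buf, size_t buflen)
{
    const char *src = overflow ? overflow : line;
    size_t len = strlen(src);
    size_t ncopy = len < buflen - 1 ? len : buflen - 1;

    memcpy(buf, src, ncopy);
    buf[ncopy] = '\0';

    char *rest = NULL;
    if (ncopy < len) {
        /* Save the part that didn't fit, including the NUL. */
        rest = malloc(len - ncopy + 1);
        memcpy(rest, src + ncopy, len - ncopy + 1);
    }
    free(overflow);
    overflow = rest;
}
```

For example, with an 8-byte buffer and a 12-byte line, the first call delivers 7 bytes and stashes the rest; the second call drains the stashed overflow.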

I’m also attaching a zip file with a modified test script
and some ancillary files.  To test fp_readl, Python needs to
read from a file.  Perhaps it would be better to generate
temporary files rather than include these extra files.
Also, although the tests will trigger the assert using a
debug build of the 2.4 source, I’m not getting an access
violation running regrtest against a release build (of 2.4)
– perhaps different tests are in order?

msg47533 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-03-14 21:43
Maybe I've put the test files in the wrong spot, but what
I'm getting is:
> ./python Lib/test/
test_pep263 (__main__.PEP263Test) ... ok
test_evilencoding (__main__.EvilEncodingTest) ... ok
Segmentation fault

You should probably put the test files alongside the test
script and then find them via
os.path.join(__file__.rsplit(os.sep, 1)[0],
msg47534 - (view) Author: Greg Chapman (glchapman) Date: 2005-03-21 13:47
You're quite right: 2.4 AVs (access violations) reliably
using the new test_pep263.  The copy of python24.dll I was
testing against already had (the first version of) the patch applied.
Anyway, I'm attaching a diff for test_pep263 (against the
original 2.4 test_pep263) which incorporates your
suggestions about not assuming the current directory.
msg47535 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-05-16 14:11
Here is another version of the patch. This version doesn't
pass any arguments to readline(), because it's no longer
necessary to limit the line length. What do you think?
msg47536 - (view) Author: Greg Chapman (glchapman) Date: 2005-05-17 13:47
Looks good to me.
msg47537 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-07-12 21:59
Checked in as:
Parser/tokenizer.c 2.78
Date User Action Args
2022-04-11 14:56:09  admin      set     github: 41437
2005-01-13 15:45:20  glchapman  create