Issue 1076985
Created on 2004-12-01 18:51 by dark-storm, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files

File name | Uploaded | Description
---|---|---
test.py | dark-storm, 2004-12-01 18:51 | Test file
codecs.diff | dark-storm, 2004-12-01 20:36 | Diff file
gurk.py | doerwalter, 2004-12-02 20:58 |
fixread.diff | doerwalter, 2004-12-03 19:43 |
fix_readline_and_read.diff | doerwalter, 2004-12-15 23:55 |
fix_readline_and_read2.diff | doerwalter, 2004-12-17 18:12 |
fix_readline_and_read3.diff | doerwalter, 2004-12-21 15:08 |

Messages (21)

msg23481 - Author: Sebastian Hartte (dark-storm) | Date: 2004-12-01 18:51

StreamReader.readline in codecs.py misinterprets the size parameter (File: codecs.py, Line: 296).

The reported crash happens if a file processed by the Python tokenizer uses a non-built-in codec for the source file encoding (e.g. iso-8859-15) and has lines that are longer than the BUFSIZ define in stdio.h on the platform Python is compiled on. (For MSVC++ on Windows this define is 512, so a line longer than 511 characters should suffice to crash Python with the right encoding.) The crash happens because the Python core assumes that the StreamReader.readline method returns a string shorter than the platform's BUFSIZ macro (512 for MSVC).

The current StreamReader.readline() looks like this:

---------------------------------------
def readline(self, size=None, keepends=True):
    """ Read one line from the input stream and return the
        decoded data.

        size, if given, is passed as size argument to the
        read() method.
    """
    if size is None:
        size = 10
    line = u""
    while True:
        data = self.read(size)
        line += data
        pos = line.find("\n")
        if pos >= 0:
            self.charbuffer = line[pos+1:] + self.charbuffer
            if keepends:
                line = line[:pos+1]
            else:
                line = line[:pos]
            return line
        elif not data:
            return line
        if size < 8000:
            size *= 2
---------------------------------------

However, the core expects readline() to return a string of at most length size; readline() instead passes size on to the read() function. There are multiple ways of solving this issue:

a) Make the core reallocate the token memory if the string returned by the readline function exceeds the expected size (risky if the line is very long).

b) Fix, rename, remodel, or otherwise change StreamReader.readline. If no other part of the core or library expects size to do anything useful, the following readline() function behaves correctly with arbitrarily long lines:

---------------------------------------
def readline(self, size=None, keepends=True):
    """ Read one line from the input stream and return the
        decoded data.

        size, if given, is passed as size argument to the
        read() method.
    """
    if size is None:
        size = 10
    data = self.read(size)
    pos = data.find("\n")
    if pos >= 0:
        self.charbuffer = data[pos+1:] + self.charbuffer
        if keepends:
            data = data[:pos+1]
        else:
            data = data[:pos]
        return data
    else:
        return data  # Return the data directly, since otherwise
                     # we would exceed the given size.
---------------------------------------

Reproducing this bug: it can only be reproduced if your platform uses a small BUFSIZ for stdio operations (e.g. MSVC). I didn't check, but Linux might use more than 8000 bytes for the default buffer size, in which case you would need a line of more than 8000 characters to reproduce it. In addition, the bug only shows up if the Python library's StreamReader.readline() method is used; with internal codecs like Latin-1 there is no crash, since that method isn't used.

I've attached a file that crashes my Python 2.4 on Windows using the official binary released on python.org today.

Last but not least, here is the assertion that fails if Python is compiled in debug mode with MSVC++ 7.1:

Assertion failed: strlen(str) < (size_t)size, file \Python-2.4\Parser\tokenizer.c, line 367
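
The attached test.py is not reproduced here. A sketch of this kind of trigger file (the file name, line length, and use of import below are illustrative, not taken from the attachment) could look like this:

---------------------------------------
# Write a module that declares a non-built-in source encoding and
# contains a single line far longer than MSVC's 512-byte BUFSIZ.
f = open("longline.py", "w")
f.write("# -*- coding: iso-8859-15 -*-\n")
f.write("s = '" + "x" * 2000 + "'\n")   # one ~2000-character line
f.close()

# Importing the module sends it through the tokenizer; on an unpatched
# Python 2.4 build with a 512-byte BUFSIZ this trips the assertion
# above (debug build) or crashes (release build).
import longline
---------------------------------------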

msg23482 - Author: Walter Dörwald (doerwalter) * | Date: 2004-12-01 19:13

What I get when I execute your test.py on Windows is:

---
Fatal Python error: GC object already tracked

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
---

Is this the expected failure? Can you attach a proper diff for your changes to StreamReader.readline()?

msg23483 - Author: Sebastian Hartte (dark-storm) | Date: 2004-12-01 20:32

> Can you attach a proper diff for your changes to
> StreamReader.readline()?

I would like to, but I don't have access to a proper diff utility on Windows. I will try to get one and then attach the diff file.

msg23484 - Author: Sebastian Hartte (dark-storm) | Date: 2004-12-01 20:36

I attached the .diff file for my patch. I hope I got this right.

msg23485 - Author: Walter Dörwald (doerwalter) * | Date: 2004-12-02 20:58

I couldn't get Linux Python to crash or assert with the attached gurk.py; Windows Python crashed at i=70.

The problem we have is that Python 2.4 changed the meaning of the size parameter in StreamReader.readline(). There are at least four possible interpretations:

1) Never read more than size bytes from the underlying byte stream in one call to readline().
2) Never read more than size characters from the underlying byte stream in one call to readline().
3) If calling readline() requires reading from the underlying byte stream, do the reading in chunks of size bytes.
4) Never return more than size characters from readline(); if there's no linefeed within size characters, the result is a partial line.

In Python 2.3, interpretation 1) was used, with the implicit assumption that this guarantees that the result will never be longer than size. Your patch looks like it could restore the old behaviour, but we'd lose the ability to reliably read a line until a "\n" is available without passing size=-1 into read(), which would read the whole stream.
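
For illustration (an example added here, not from the thread, runnable on a modern Python): the reason a byte-level readline() cannot work for UTF-16 is that the b'\n' byte sits inside a two-byte code unit, so splitting the byte stream at b'\n' lands mid-character:

---------------------------------------
import io

# In UTF-16-LE every code unit is two bytes; the newline is b'\n\x00'.
raw = "hello\nworld".encode("utf-16-le")
stream = io.BytesIO(raw)

line = stream.readline()    # a byte-level readline stops right after b'\n'
print(len(line))            # 11 -- an odd byte count: we split mid-character
line.decode("utf-16-le")    # raises UnicodeDecodeError: truncated data
---------------------------------------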

msg23486 - Author: Marc-Andre Lemburg (lemburg) * | Date: 2004-12-02 21:21

Walter, your analysis is not quite right: the size parameter for .readline() was always just a hint and never intended to limit the number of bytes returned by the method (like the size parameter in .read()).

In Python 2.3, the size parameter was simply passed down to the stream's .readline() method, so semantics were defined by the stream rather than the codec. I think that we should restore this kind of behaviour for Python 2.4.1. Any code which makes assumptions about the length of the returned data is simply broken. If the Python parser makes such an assumption, it should be fixed.

Again, the size parameter is just a hint for the implementation; it might as well ignore it completely. The reason for having the parameter is to limit the number of bytes read in case the stream does not include line breaks - otherwise, .readline() could end up reading the whole file into memory.

What was the reason why you introduced the change in semantics? I would have complained about it had I known about this change (I only saw your change to .read(), which was required for the stateful codecs).
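
A sketch of the hint's purpose (mine, using the codecs API on a modern Python): with no line breaks anywhere in the stream, the size hint is what keeps readline() from reading the whole file into memory:

---------------------------------------
import codecs, io

# Ten megabytes of data without a single line break.
stream = io.BytesIO(b"x" * 10000000)
reader = codecs.getreader("ascii")(stream)

chunk = reader.readline(512)  # the hint bounds how much is read and returned
print(len(chunk))             # 512, not 10000000
---------------------------------------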

msg23487 - Author: Sebastian Hartte (dark-storm) | Date: 2004-12-02 21:35

The problem here is that the tokenizer tries to read at most 512 bytes into its token buffer and then reads another 512 bytes if the line isn't complete yet. At least that is what a quick look at the tokenizer's freadln (or whatever the function the assert is in is called) made me think. The problem is that the token buffer either has to be reallocated (easy) or another solution has to be found. If reallocating the token buffer is sufficient, I can easily provide a patch for that.

msg23488 - Author: Sebastian Hartte (dark-storm) | Date: 2004-12-02 21:38

I forgot to reply to doerwalter's comment:

> I couldn't get Linux Python to crash or assert with the
> attached gurk.py; Windows Python crashed at i=70.

Linux is using a larger BUFSIZ for me. MUCH larger, I have to add: MSVC used 512 bytes for me, while the C standard library on my Linux box defines BUFSIZ as 8192 bytes. That should suffice for looooong lines.

msg23489 - Author: Walter Dörwald (doerwalter) * | Date: 2004-12-03 11:58

> In Python 2.3, the size parameter was simply passed down
> to the stream's .readline() method, so semantics were
> defined by the stream rather than the codec.

In most cases this meant that at most size bytes were read from the byte stream. It seems that tokenizer.c:fp_readl() assumes that, so it broke with the change.

> I think that we should restore this kind of behaviour
> for Python 2.4.1.

That's not possible (switching back to calling readline() on the byte stream), because it breaks the UTF-16 decoder, but we can get something that is close.

> What was the reason why you introduced the change
> in semantics?

1) To get readline() to work with UTF-16 it's no longer possible to call readline() on the byte stream. This has to be replaced by one or more calls to read().

2) As you say, size was always just a hint. With line buffering (which is required for UTF-16 readline) this hint becomes even more meaningless.

So I'd say:

1) Fix tokenizer.c:fp_readl(). It looks to me like the code had a problem in this spot anyway: there's an

   assert(strlen(str) < (size_t)size); /* XXX */

in the code, and a string *can* get longer when it's encoded to UTF-8, which fp_readl() does. dark-storm, if you can provide a patch for this problem, go ahead.

2) Change readline() so that calling it with a size parameter results in only one call to read(). If read() is called with chars==-1 (which it will be in this case), this will in turn call read() on the byte stream only once (at most). If size isn't specified, the caller should be able to cope with any returned string length, so I think the current behaviour (calling read() multiple times until a "\n" shows up) can be kept.

BTW, the logic in read() looks rather convoluted to me now that I look at it a second time. Should we clean this up a bit?
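
A sketch of the semantics proposed in point 2 (a hypothetical helper added here, not the committed patch; it relies on StreamReader's charbuffer attribute): one read() call when a size hint is given, the old grow-and-retry loop otherwise:

---------------------------------------
def readline_once(reader, size=None, keepends=True):
    # Hypothetical sketch of the proposed semantics, not the real patch.
    if size is not None:
        data = reader.read(size)   # exactly one read(); may be a partial line
    else:
        data = ""
        while True:                # no hint: keep going until "\n" or EOF
            chunk = reader.read(72)
            data += chunk
            if "\n" in data or not chunk:
                break
    pos = data.find("\n")
    if pos >= 0:
        # Push everything after the line break back for the next call.
        reader.charbuffer = data[pos+1:] + reader.charbuffer
        data = data[:pos+1] if keepends else data[:pos]
    return data
---------------------------------------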

msg23490 - Author: Marc-Andre Lemburg (lemburg) * | Date: 2004-12-03 14:46

>> In Python 2.3, the size parameter was simply passed down
>> to the stream's .readline() method, so semantics were
>> defined by the stream rather than the codec.
>
> In most cases this meant that at most size bytes were read
> from the byte stream. It seems that tokenizer.c:fp_readl()
> assumes that, so it broke with the change.
>
>> I think that we should restore this kind of behaviour
>> for Python 2.4.1.
>
> That's not possible (switching back to calling readline() on the
> byte stream), because it breaks the UTF-16 decoder, but we
> can get something that is close.

The problem with the change is that it applies to *all* codecs. If only the UTF-16 codec has a problem with the standard logic, it should override the .readline() method as necessary, but this should not affect all the other codecs. Unless I'm missing something, the other codecs work just fine with the old implementation of the method.

>> What was the reason why you introduced the change
>> in semantics?
>
> 1) To get readline() to work with UTF-16 it's no longer
> possible to call readline() on the byte stream. This has to be
> replaced by one or more calls to read().
>
> 2) As you say, size was always just a hint. With line buffering
> (which is required for UTF-16 readline) this hint becomes even
> more meaningless.

That's OK for the UTF-16 codec, but why the generic change in the base class?

> 1) Fix tokenizer.c:fp_readl(). [...] dark-storm, if you can
> provide a patch for this problem, go ahead.

+1

> 2) Change readline() so that calling it with a size parameter
> results in only one call to read(). [...] If size isn't
> specified, the caller should be able to cope with any returned
> string length, so I think the current behaviour (calling read()
> multiple times until a "\n" shows up) can be kept.

+1 for UTF-16. I'm not sure whether the current implementation is needed for all the other codecs.

> BTW, the logic in read() looks rather convoluted to me now
> that I look at it a second time. Should we clean this up a bit?

If that's possible, yes :-)

msg23491 - Author: Sebastian Hartte (dark-storm) | Date: 2004-12-03 15:42

I checked the decoding_fgets function (and the enclosed call to fp_readl). The patch is more problematic than I thought, since decoding_fgets not only takes a pointer to the token state but also a pointer to a destination string buffer. Reallocating the buffer within fp_readl would be a very, very bad hack, since you'd have to reallocate "foreign" memory based on a pointer comparison (comparing the passed string buffer's pointer against tok->inp || tok->buf...). As it stands now, patching the tokenizer would mean changing the function signatures or otherwise changing the structure (more error prone).

Another possible solution would be to provide a specialized readline() function for which the assumption that at most size bytes are returned is correct.

Oh, and about that UTF-8 decoding: readline()'s size restriction works on the already decoded string (at least it should), so that shouldn't be an issue.

Maybe another optional parameter should be added to readline(), called limit=None, which doesn't limit the returned string by default but does so if the parameter is a positive number. Just my 0.2$

msg23492 - Author: Walter Dörwald (doerwalter) * | Date: 2004-12-03 19:43

> The problem with the change is that it applies to *all*
> codecs. If only the UTF-16 codec has a problem with the
> standard logic, it should override the .readline()
> method as necessary, but this should not affect all
> the other codecs.

readline() had to be rewritten anyway to take the bytebuffer into account. True, the bytebuffer is only needed for UTF-8, UTF-16, UTF-16-LE, UTF-16-BE and maybe a few others, but unless we'd introduced a ByteBufferStreamReader that those codecs could subclass, this would have meant code duplication. I'll try to come up with a readline() patch (for the base class readline()) next week.

>> BTW, the logic in read() looks rather convoluted to me
>> now that I look at it a second time. Should we clean
>> this up a bit?
>
> If that's possible, yes :-)

Attached is a patch (fixread.diff) that makes read() a *little* simpler. Making it much simpler breaks several tests.

msg23493 - Author: Walter Dörwald (doerwalter) * | Date: 2004-12-03 20:01

> I checked the decoding_fgets function (and the enclosed call
> to fp_readl). The patch is more problematic than I thought,
> since decoding_fgets not only takes a pointer to the token
> state but also a pointer to a destination string buffer.
> Reallocating the buffer within fp_readl would be a very,
> very bad hack, since you'd have to reallocate "foreign"
> memory based on a pointer comparison (comparing the passed
> string buffer's pointer against tok->inp || tok->buf...).

Maybe all pointers pointing into the buffer should be moved into a struct?

> As it stands now, patching the tokenizer would mean changing
> the function signatures or otherwise changing the structure
> (more error prone).

All the affected functions seem to be static, so at least in this regard there shouldn't be any problem.

> Another possible solution would be to provide a specialized
> readline() function for which the assumption that at most
> size bytes are returned is correct.

All the codecs would have to provide such a readline(). BTW, the more I look at your patch, the more I think that it gets us as close to the old behaviour as we can get.

> Oh, and about that UTF-8 decoding: readline()'s size
> restriction works on the already decoded string (at least it
> should), so that shouldn't be an issue.

I don't understand that. fp_readl() does the following two calls:

   buf = PyObject_Call(tok->decoding_readline, args, NULL);
   utf8 = PyUnicode_AsUTF8String(buf);

and puts the resulting byte string into the char * passed in, so even if we fix the readline call, the UTF-8 encoded string might still overflow the available space. How can tokenizer.c be sure how much the foo->utf8 transcoding shrinks or expands the string?

> Maybe another optional parameter should be added to
> readline(), called limit=None, which doesn't limit the
> returned string by default but does so if the parameter
> is a positive number.

But limit to what?
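
An illustration of the expansion Walter describes (example added here): a decoded line can grow well past the original byte limit once fp_readl() re-encodes it as UTF-8:

---------------------------------------
line = "\u20ac" * 200             # 200 characters after decoding
print(len(line))                  # 200
print(len(line.encode("utf-8")))  # 600 -- each euro sign becomes 3 bytes
---------------------------------------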

msg23494 - Author: Sebastian Hartte (dark-storm) | Date: 2004-12-03 21:45

> and puts the resulting byte string into the char * passed in,
> so even if we fix the readline call, the UTF-8 encoded string
> might still overflow the available space. How can tokenizer.c
> be sure how much the foo->utf8 transcoding shrinks or expands
> the string?

Woopsie. That's correct, and I missed it. I thought we were talking about the UTF-8 codec here. Then it's correct that fixing readline() alone probably won't fix the issue.

msg23495 - Author: Walter Dörwald (doerwalter) * | Date: 2004-12-15 23:55

OK, here is a patch (fix_readline_and_read.diff) that keeps the new readline() behaviour when size is None and calls read() only once (i.e. the 2.3 behaviour) otherwise. The read() cleanup patch is also included. Running dark-storm's test script on Windows with the patch applied confirms that it fixes this particular problem (of course the tokenizer should still be fixed). Marc-André, if there are no objections, can I check this in?

msg23496 - Author: Marc-Andre Lemburg (lemburg) * | Date: 2004-12-17 12:59

Walter, the fix to .read() looks OK, but the implementation of .readline() is not correct: there are far more line break characters in Unicode than just \n. The correct way to check would be by using .splitlines(), which does know about all the possible line break characters in Unicode; plus, it also handles line ends such as \r\n and just \r correctly. If .splitlines() returns a list with more than one entry, you know that you've hit a line end, and you can push the remaining entries back onto the .charbuffer.

Another nit: you should bump the default for readsize to 72 - the average line size used in text files.
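
An example of the difference (added for illustration): find("\n") sees only one of Unicode's line break characters, while .splitlines() knows them all and keeps \r\n pairs together:

---------------------------------------
s = "line one\u2028line two\rline three\r\nrest"

print(s.find("\n"))        # 29 -- only the literal \n is found
print(s.splitlines(True))  # ['line one\u2028', 'line two\r', 'line three\r\n', 'rest']

# More than one entry means a line end was hit; the tail entries can be
# pushed back onto the charbuffer.
print(len(s.splitlines(True)) > 1)   # True
---------------------------------------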

msg23497 - Author: Walter Dörwald (doerwalter) * | Date: 2004-12-17 18:12

Here is a second version of the patch (fix_readline_and_read2.diff) that implements all those changes. With this we get "universal newline" support for free.

Note that there is still one corner case: if the byte stream is temporarily exhausted and the last character read is a \r, and the next character read (once the stream contains new data) is a \n, this will result in two lines instead of one.
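
The corner case, illustrated (example added here): splitlines() applied per chunk cannot join a \r\n pair that straddles a chunk boundary:

---------------------------------------
print("abc\r\ndef".splitlines(True))  # ['abc\r\n', 'def'] -- one line break
print("abc\r".splitlines(True))       # ['abc\r'] -- looks like a finished line
print("\ndef".splitlines(True))       # ['\n', 'def'] -- the orphaned \n adds a line
---------------------------------------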

msg23498 - Author: Marc-Andre Lemburg (lemburg) * | Date: 2004-12-17 21:19

Looks good; please check it in, with one correction: "".join() should be u"".join(). Thanks, Walter!

msg23499 - Author: Walter Dörwald (doerwalter) * | Date: 2004-12-21 15:08

Here is a third version of the patch. Changes are:

1) The problem with the \r\n split between two read() calls is fixed.
2) The first read() from the byte stream in readline() no longer requires characters beyond the first line.
3) Tests for these cases have been added.
4) The join() call has been fixed.
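
A sketch of the case fixed in 1) (mine; ChunkedBytes is a hypothetical stand-in stream, and this assumes a modern Python, where this fix still lives in Lib/codecs.py): after reading a chunk that ends in \r, the fixed readline() peeks one extra character, so a \r\n pair split across two read() calls comes back as one line:

---------------------------------------
import codecs

class ChunkedBytes:
    """Hypothetical byte stream yielding preset chunks one read() at a
    time, simulating data that arrives in bursts."""
    def __init__(self, chunks):
        self.chunks = list(chunks)
    def read(self, size=-1):
        return self.chunks.pop(0) if self.chunks else b""

reader = codecs.getreader("ascii")(ChunkedBytes([b"abc\r", b"\ndef"]))
# The size hint stops the first read() right after the trailing \r.
print(repr(reader.readline(4)))  # 'abc\r\n' -- the pair stays together
print(repr(reader.readline()))   # 'def'
---------------------------------------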

msg23500 - Author: Marc-Andre Lemburg (lemburg) * | Date: 2004-12-21 15:32

Even better. Please check it in. Thanks, and Merry Christmas :-)

msg23501 - Author: Walter Dörwald (doerwalter) * | Date: 2004-12-22 12:56

Checked in as:

   Lib/codecs.py 1.36
   Lib/test/test_codecs.py 1.16
   Misc/NEWS 1.1213
   Lib/codecs.py 1.35.2.1
   Lib/test/test_codecs.py 1.15.2.1
   Misc/NEWS 1.1193.2.9

History

Date | User | Action | Args
---|---|---|---
2022-04-11 14:56:08 | admin | set | github: 41270
2004-12-01 18:51:13 | dark-storm | create |