This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author dark-storm
Recipients
Date 2004-12-01.18:51:13
SpamBayes Score
Marked as misclassified
Message-id
In-reply-to
Content
StreamReader.readline in codecs.py misinterprets the
size parameter. (File: codecs.py, Line: 296). 
The reported crash happens if a file processed by the
python tokenizer is using a not built-in codec for
source file encoding (i.e. iso-8859-15) and has lines
that are longer than the define BUFSIZ in stdio.h on
the platform python is compiled on. (On Windows for
MSVC++ this define is 512, thus a line that is longer
than 511 characters should suffice to crash python with
the correct encoding). 
The crash happens because the python core assumes that
the StreamReader.readline method returns a string
shorter than the platforms BUFSIZ macro (512 for MSVC). 

The current StreamReader.readline() looks like this:
---------------------------------------
    def readline(self, size=None, keepends=True):

        """ Read one line from the input stream and
return the
            decoded data.

            size, if given, is passed as size argument
to the
            read() method.

        """
        if size is None:
            size = 10
        line = u""
        while True:
            data = self.read(size)
            line += data
            pos = line.find("\n")
            if pos>=0:
                self.charbuffer = line[pos+1:] +
self.charbuffer
                if keepends:
                    line = line[:pos+1]
                else:
                    line = line[:pos]
                return line
            elif not data:
                return line
            if size<8000:
                size *= 2
---------------------------------------

However, the core expects readline() to return at most
a string of the length size. readline() instead passes
size to the read() function.

There are multiple ways of solving this issue. 

a) Make the core reallocate the token memory if the
string returned by the readline function exceeds the
expected size (risky if the line is very long). 

b) Fix, rename, remodel,  change StreamReader.readline.
If no other part of the core or code expects size to do
nothing useful, the following readline() function does
behave correctly with arbitrarily long lines:

---------------------------------------
    def readline(self, size=None, keepends=True):

        """ Read one line from the input stream and
return the
            decoded data.

            size, if given, is passed as size argument
to the
            read() method.

        """
        if size is None:
            size = 10
        data = self.read(size)
        pos = data.find("\n")
        if pos>=0:
            self.charbuffer = data[pos+1:] +
self.charbuffer
            if keepends:
                data = data[:pos+1]
            else:
                data = data[:pos]
            return data
        else:
       	    return data # Return the data directly
since otherwise 
                        # we would exceed the given size.
---------------------------------------

Reproducing this bug:
This bug can only be reproduced if your platform does
use a small BUFSIZ for stdio operations (i.e. MSVC), i
didn't check but Linux might use more than 8000 byte
for the default buffer size. That means you would have
to use a line with more than 8000 characters to
reproduce this.

In addition, this bug only shows if the python
libraries StreamReader.readline() method is used, if
internal codecs like Latin-1 are used, there is no
crash since the method isn't used.

I've attached a file that crashes my Python 2.4 on
Windows using the official binary released on
python.org today.

Last but not least here is the assertion that is broken
if python is compiled in debug mode with MSVC++ 7.1:

Assertion failed: strlen(str) < (size_t)size, file
\Python-2.4\Parser\tokenizer.
c, line 367
History
Date User Action Args
2007-08-23 14:28:02adminlinkissue1076985 messages
2007-08-23 14:28:02admincreate