Author doerwalter
Date 2005-05-19.19:06:30
Content

> Walter, as I've said before: I know that you need buffering
> for the UTF-x readline support, but I don't see a
> requirement for it in most of the other codecs

The *charbuffer* is required for readline support, but the
*bytebuffer* is required for any non-charmap codec.
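
To illustrate the difference, here is a minimal sketch in current Python 3
(not the 2.4 code under discussion). It uses codecs.getincrementaldecoder
to show a decoder holding back an incomplete UTF-8 sequence, whereas a
charmap codec such as latin-1 can decode every byte on its own. The
variable names are only illustrative.

```python
import codecs

# "ä" is U+00E4, which UTF-8 encodes as the two bytes C3 A4.
encoded = "\xe4".encode("utf-8")

# A charmap codec such as latin-1 maps each byte to exactly one
# character, so any chunk of bytes can be decoded in isolation.
latin1_chars = [bytes([b]).decode("latin-1") for b in encoded]

# UTF-8 cannot decode the first byte alone. The incremental decoder
# keeps it in an internal byte buffer until the sequence is complete.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(encoded[:1])   # nothing decodable yet
second = decoder.decode(encoded[1:])  # buffered byte plus the new byte
```

Without somewhere to keep the pending byte between calls, the first chunk
would either raise an error or be lost.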

To have different buffering modes we'd need either a flag in
the StreamReader or different classes, i.e. a class
hierarchy like the following:

StreamReader
   UnbufferedStreamReader
      CharmapStreamReader
         ascii.StreamReader
         iso_8859_1.StreamReader
   BufferedStreamReader
      utf_8.StreamReader

I don't think that we should introduce such a big change in
2.4.x. Furthermore there is another problem: The 2.4
buffering code automatically gives us universal newline
support. If you have a file foo.txt containing "a\rb", with
Python 2.4 you get:

>>> list(codecs.open("foo.txt", "rb", "latin-1"))
[u'a\r', u'b']

But with Python 2.3 you get:

>>> list(codecs.open("foo.txt", "rb", "latin-1"))
[u'a\rb']

If we switched to the old StreamReader for the charmap
codecs, suddenly the stream readers for e.g. latin-1 and
UTF-8 would behave differently. Of course we could change
the buffering stream reader to only split lines on "\n", but
this would change functionality again.
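
As a rough check of that consistency, the following sketch (current
Python 3, with io.BytesIO standing in for foo.txt; read_lines is a
hypothetical helper) runs the same bytes through the latin-1 and UTF-8
stream readers. Both should split on the bare "\r" in the same way.

```python
import codecs
import io


def read_lines(data, encoding):
    """Iterate over a codecs StreamReader wrapped around in-memory bytes."""
    reader = codecs.getreader(encoding)(io.BytesIO(data))
    return list(reader)


# Same content as the foo.txt example: "a", a bare carriage return, "b".
latin1_lines = read_lines(b"a\rb", "latin-1")
utf8_lines = read_lines(b"a\rb", "utf-8")
```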

> Your argument about applications making implications on the
> file position after having used .readline() is true, but
> still many applications rely on this behavior which is not
> as far fetched as it may seem given that they normally only
> expect 8-bit data.

If an application doesn't mix calls to read() with calls to
readline() (or use different size values in these calls), the
change in behaviour from 2.3 to 2.4 shouldn't be that big.
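
The file position issue can be seen in a small sketch (current Python 3,
with io.BytesIO as a stand-in for a real file): because the reader
buffers, the underlying stream has already moved past the line that
readline() returned.

```python
import codecs
import io

raw = io.BytesIO(b"line1\nline2\n")
reader = codecs.getreader("latin-1")(raw)

line = reader.readline()

# The reader pulled more bytes than the first line needs, so the
# position of the underlying stream is beyond the end of that line.
position = raw.tell()
```

Code that inspects or reuses the underlying stream after readline()
therefore cannot assume it sits right after the first line.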

No matter what we decide for the codecs, the tokenizer is
broken and should be fixed.

> Wouldn't it make things a lot safer if we only use buffering
> per default in the UTF-x codecs and revert back to the old
> non-buffered behavior for the other codecs which has worked
> well in the past ?!

Only if we dropped the additional functionality added in 2.4
(universal newline support, the chars argument for read() and
the keepends argument for readline()), which I think could
only be done for 2.5.

> About your patch:
>
> * Please explain what firstline is supposed to do
> (preferably in the doc-string).

OK, I've added an explanation in the docstring.

> * Why is firstline always set in .readline() ?

firstline is only supposed to be used by readline(). We could
rename the argument to _firstline to make it clear that this is
a private parameter, or introduce a new method _read() that
has a firstline parameter. Then read() calls _read() with
firstline==False and readline() calls _read() with
firstline==True.

The purpose of firstline is to make sure that if an input
stream has its first decoding error in line n, the
UnicodeDecodeError will only be raised by the n'th call to
readline().
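
The idea can be sketched roughly as follows (Python 3). decode_chunk is a
hypothetical helper, not the code from the patch, and the rule it uses
(defer the error whenever at least one complete line precedes the bad
byte) is a simplifying assumption made for illustration.

```python
def decode_chunk(data, encoding="utf-8", firstline=False):
    """Decode data, returning (text, number_of_bytes_consumed).

    Hypothetical illustration only. With firstline=True a decoding
    error is deferred if complete lines were decoded before it, so
    the caller can hand those lines out first.
    """
    try:
        return data.decode(encoding), len(data)
    except UnicodeDecodeError as exc:
        if not firstline:
            raise
        # Everything before exc.start is known to be decodable.
        good = data[:exc.start].decode(encoding)
        if "\n" not in good:
            # No complete line before the error: report it now.
            raise
        return good, exc.start


# The bad byte sits in line 3, so lines 1 and 2 are still returned.
text, consumed = decode_chunk(b"one\ntwo\n\xff", firstline=True)
```

The undecoded remainder (data[consumed:]) stays in the byte buffer, so
the error surfaces only when a later call reaches the offending line.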

> * Please remove the print repr()

OK, done.

> * You cannot always be sure that exc has a .start attribute,
> so you need to accommodate this situation as well

I don't understand that. A UnicodeDecodeError is created by
PyUnicodeDecodeError_Create() in exceptions.c, so any
UnicodeDecodeError instance without a start attribute would
be severely broken.

Thanks for reviewing the patch.
History
Date                 User   Action  Args
2007-08-23 14:30:50  admin  link    issue1178484 messages
2007-08-23 14:30:50  admin  create