Message24945
Logged In: YES
user_id=89016
> Walter, as I've said before: I know that you need buffering
> for the UTF-x readline support, but I don't see a
> requirement for it in most of the other codecs
The *charbuffer* is required for readline support, but the
*bytebuffer* is required for any non-charmap codec.
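
To illustrate why a bytebuffer is needed outside the charmap codecs, here is a minimal sketch in present-day Python 3 (not the 2.4 StreamReader code under discussion). The helper name decode_chunks is hypothetical. It feeds a two-byte UTF-8 sequence in two separate chunks and carries the undecoded byte over, whereas a charmap codec such as latin-1 maps every byte to exactly one character and never has leftovers.

```python
import codecs


def decode_chunks(chunks):
    """Decode UTF-8 chunks, carrying undecoded bytes over between calls.

    Hypothetical helper for illustration only.
    """
    bytebuffer = b""
    pieces = []
    for chunk in chunks:
        data = bytebuffer + chunk
        # final=False: an incomplete trailing sequence is not an error,
        # it is reported as "not consumed yet".
        chars, consumed = codecs.utf_8_decode(data, "strict", False)
        bytebuffer = data[consumed:]  # keep the undecoded tail
        pieces.append(chars)
    return pieces, bytebuffer


# "ä" is b"\xc3\xa4" in UTF-8; split it across two reads.
utf8_pieces, utf8_leftover = decode_chunks([b"\xc3", b"\xa4"])

# In latin-1 the same character is the single byte b"\xe4".
latin1_chars, latin1_consumed = codecs.latin_1_decode(b"\xe4")
```

The first chunk yields no characters at all; only once the second byte arrives can the character be produced.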
To have different buffering modes we'd either need a flag in
the StreamReader or use different classes, i.e. a class
hierarchy like the following:
StreamReader
   UnbufferedStreamReader
      CharmapStreamReader
         ascii.StreamReader
         iso_8859_1.StreamReader
   BufferedStreamReader
      utf_8.StreamReader
I don't think that we should introduce such a big change in
2.4.x. Furthermore there is another problem: The 2.4
buffering code automatically gives us universal newline
support. If you have a file foo.txt containing "a\rb", with
Python 2.4 you get:
>>> list(codecs.open("foo.txt", "rb", "latin-1"))
[u'a\r', u'b']
But with Python 2.3 you get:
>>> list(codecs.open("foo.txt", "rb", "latin-1"))
[u'a\rb']
If we switched back to the old StreamReader for the charmap
codecs, the stream readers for e.g. latin-1 and UTF-8 would
suddenly behave differently. Of course we could change
the buffering stream reader to split lines only on "\n", but
this would change functionality again.
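
The consistency argument can be checked without a file on disk. This is a sketch for present-day Python 3, assuming its codecs stream readers still split lines the way the 2.4 buffering code does. The helper name read_lines is hypothetical.

```python
import codecs
import io


def read_lines(encoding, raw=b"a\rb"):
    """Iterate a codecs StreamReader over an in-memory byte stream."""
    reader = codecs.getreader(encoding)(io.BytesIO(raw))
    return list(reader)


# Both readers go through the same buffering readline() code,
# so both split on the bare "\r".
latin1_lines = read_lines("latin-1")
utf8_lines = read_lines("utf-8")
```

Both calls should give the same two lines, 'a\r' and 'b'.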
> Your argument about applications making implications on the
> file position after having used .readline() is true, but
> still many applications rely on this behavior which is not
> as far fetched as it may seem given that they normally only
> expect 8-bit data.
If an application doesn't mix calls to read() with calls to
readline() (or use different size values in these calls), the
change in behaviour from 2.3 to 2.4 shouldn't be that big.
No matter what we decide for the codecs, the tokenizer is
broken and should be fixed.
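
The file position issue mentioned in the quoted paragraph can be seen in a small sketch (present-day Python 3, in-memory stream instead of a real file): after one readline() the underlying stream has been read past the end of the first line, because the reader fetches a larger block and buffers the rest.

```python
import codecs
import io

stream = io.BytesIO(b"line1\nline2\n")
reader = codecs.getreader("latin-1")(stream)

first = reader.readline()   # only the first line is returned ...
position = stream.tell()    # ... but more than that was consumed
second = reader.readline()  # served from the reader's buffer
```

Code that calls tell() or read() on the underlying stream after readline() therefore cannot assume it sits right behind the line just returned.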
> Wouldn't it make things a lot safer if we only use buffering
> per default in the UTF-x codecs and revert back to the old
> non-buffered behavior for the other codecs which has worked
> well in the past ?!
Only if we dropped the additional functionality added in 2.4
(universal newline support, the chars argument for read() and
the keepends argument for readline()), which I think could
only be done for 2.5.
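
For reference, a short sketch of two of the 2.4 additions, written for present-day Python 3 where the same arguments still exist on codecs.StreamReader: chars limits read() by decoded characters rather than bytes, and keepends=False strips the line ending in readline().

```python
import codecs
import io

raw = "äöü\nxyz\n".encode("utf-8")
reader = codecs.getreader("utf-8")(io.BytesIO(raw))

# Two characters, although they occupy four bytes in UTF-8.
two_chars = reader.read(chars=2)

# Rest of the first line, without its trailing "\n".
rest_of_line = reader.readline(keepends=False)
```

Neither call could be answered correctly without buffering decoded characters between calls.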
> About your patch:
>
> * Please explain what firstline is supposed to do
> (preferably in the doc-string).
OK, I've added an explanation in the docstring.
> * Why is firstline always set in .readline() ?
firstline is only supposed to be used by readline(). We could
rename the argument to _firstline to make it clear that this is
a private parameter, or introduce a new method _read() that
has a firstline parameter. Then read() calls _read() with
firstline==False and readline() calls _read() with
firstline==True.
The purpose of firstline is to make sure that if an input
stream has its first decoding error in line n, the
UnicodeDecodeError will only be raised by the n'th call to
readline().
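
The idea can be sketched roughly as follows. This is not the patch itself, only an illustration under my own assumptions, written for present-day Python 3. The function name decode_valid_prefix and the "at least one complete line" rule are hypothetical choices: on a decoding error, decode only the bytes before exc.start, and re-raise unless that prefix already contains a complete line that can be returned first.

```python
def decode_valid_prefix(data, encoding="utf-8"):
    """Return (text, bytes_consumed), deferring a decoding error if possible.

    Hypothetical helper for illustration only.
    """
    try:
        return data.decode(encoding), len(data)
    except UnicodeDecodeError as exc:
        # Decode only the part in front of the offending byte.
        prefix = data[:exc.start].decode(encoding)
        lines = prefix.splitlines(True)
        if len(lines) < 2:
            # No complete earlier line to hand out: the error belongs
            # to the line being read right now, so report it.
            raise
        return prefix, exc.start


# The error sits in the second line, so the first line is still usable.
prefix, consumed = decode_valid_prefix(b"ok\nfine\xff bad\n")
```

A reader built on this would return "ok\n" from its first readline() and only run into the bad byte when it tries to complete the second line.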
> * Please remove the print repr()
OK, done.
> * You cannot always be sure that exc has a .start attribute,
> so you need to accommodate this situation as well
I don't understand that. A UnicodeDecodeError is created by
PyUnicodeDecodeError_Create() in exceptions.c, so any
UnicodeDecodeError instance without a start attribute would
be severely broken.
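
A small check of this, as a sketch in present-day Python 3 (the C function named above is the 2.4-era detail; here I only look at the Python-level behaviour): an exception raised by a decoder carries start and end, and the constructor refuses to build an instance without them.

```python
try:
    b"abc\xff".decode("utf-8")
except UnicodeDecodeError as exc:
    # start/end delimit the offending bytes inside exc.object
    details = (exc.start, exc.end, exc.object)

try:
    # Omitting start, end and reason is rejected by the constructor.
    UnicodeDecodeError("utf-8", b"abc\xff")
except TypeError:
    needs_all_arguments = True
else:
    needs_all_arguments = False
```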
Thanks for reviewing the patch.

Date | User | Action | Args
2007-08-23 14:30:50 | admin | link | issue1178484 messages
2007-08-23 14:30:50 | admin | create |