Message23490
user_id=38388
>>In Python 2.3, the size parameter was simply passed down
>>to the stream's .readline() method, so semantics were
>>defined by the stream rather than the codec.
>
>
> In most cases this meant that there were at most size bytes
> read from the byte stream. It seems that tokenizer.c:fp_readl()
> assumes that, so it broke with the change.
>
>
>>I think that we should restore this kind of behaviour
>>for Python 2.4.1.
>
>
> That's not possible (switching back to calling readline()
> on the bytestream), because it breaks the UTF-16 decoder, but
> we can get something that is close.
The problem with the change is that it applies to *all*
codecs. If only the UTF-16 codec has a problem with the
standard logic, it should override the .readline()
method as necessary, but this should not affect all
the other codecs.
Unless I'm missing something, the other codecs
work just fine with the old implementation of the
method.
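
To illustrate the idea, here is a minimal sketch in modern Python. The classes `BaseReader` and `Utf16Reader` are hypothetical stand-ins, not the real `codecs.StreamReader` code: the base class keeps passing `size` down to the byte stream's `.readline()`, and only the UTF-16 reader overrides the method, because a `b"\n"` byte can be one half of a UTF-16 code unit.

```python
import codecs
import io


class BaseReader:
    """Hypothetical generic reader (not the real codecs.StreamReader)."""

    encoding = "latin-1"

    def __init__(self, stream):
        self.stream = stream

    def readline(self, size=-1):
        # Old-style behaviour: the byte stream finds the line end,
        # so `size` limits the number of bytes read.
        return self.stream.readline(size).decode(self.encoding)


class Utf16Reader(BaseReader):
    """Hypothetical UTF-16-LE reader that overrides readline() for itself only."""

    encoding = "utf-16-le"

    def __init__(self, stream):
        super().__init__(stream)
        self._decoder = codecs.getincrementaldecoder(self.encoding)()
        self._chars = ""

    def readline(self, size=-1):
        # Decode first, then look for "\n" in the decoded text.
        # `size` is ignored in this sketch.
        while "\n" not in self._chars:
            data = self.stream.read(64)
            if not data:
                self._chars += self._decoder.decode(b"", final=True)
                break
            self._chars += self._decoder.decode(data)
        line, sep, rest = self._chars.partition("\n")
        self._chars = rest
        return line + sep


# U+010A is encoded as b"\x0a\x01" in UTF-16-LE, so a byte-level
# readline() would cut the character in half.
data = "\u010a!\nxy\n".encode("utf-16-le")
reader = Utf16Reader(io.BytesIO(data))
first = reader.readline()
second = reader.readline()
```

With this split, the other codecs are untouched by the line-buffering logic that UTF-16 needs.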
>>What was the reason why you introduced the change
>>in semantics ?
>
>
> 1) To get readline() to work with UTF-16 it's no longer
> possible to call readline() for the byte stream. This has to be
> replaced by one or more calls to read().
>
> 2) As you say size was always just a hint. With line buffering
> (which is required for UTF-16 readline) this hint becomes even
> more meaningless.
That's OK for the UTF-16 codec, but why the generic change
in the base class ?
> So I'd say:
>
> 1) Fix tokenizer.c:fp_readl(). It looks to me like the code had
> a problem in this spot anyway: There's an
>
> assert(strlen(str) < (size_t)size); /* XXX */
>
> in the code, and a string *can* get longer when it's encoded
> to UTF-8 which fp_readl() does.
>
> dark-storm, if you can provide a patch for this problem, go
> ahead.
+1
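
The point about the assert can be checked with a small Python example. This is only an illustration of the length growth, not the tokenizer code; the sample line and the `size` value are assumptions chosen for the demonstration.

```python
# A Latin-1 source line: every character is exactly one byte.
line = "s = 'äöü'\n"
raw = line.encode("latin-1")

# Recoding to UTF-8 turns each of the three umlauts into two bytes.
recoded = raw.decode("latin-1").encode("utf-8")

# Hypothetical buffer size that the raw line just fits into.
size = len(raw) + 1
fits_raw = len(raw) < size
fits_recoded = len(recoded) < size
```

Here the raw line satisfies the `strlen(str) < size` condition while the UTF-8 version does not, so the assert can fail even when the byte stream honoured `size`.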
> 2) change readline(), so that calling it with a size parameter
> results in only one call to read(). If read() is called with
> chars==-1 (which it will in this case), this will in turn only call
> read() for the byte stream once (at most). If size isn't
> specified the caller should be able to cope with any returned
> string length, so I think the current behaviour (calling read()
> multiple times until a "\n" shows up) can be kept.
+1 for UTF-16
I'm not sure whether the current implementation is needed
for all other codecs.
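
A rough sketch of the proposed readline() behaviour, in modern Python. `SketchReader` and `CountingStream` are hypothetical names used only for illustration, and the chunk size is an assumption; this is not the actual codecs.py code.

```python
import codecs
import io


class CountingStream(io.BytesIO):
    """BytesIO that counts read() calls, for illustration only."""

    def __init__(self, data):
        super().__init__(data)
        self.reads = 0

    def read(self, size=-1):
        self.reads += 1
        return super().read(size)


class SketchReader:
    """Hypothetical reader implementing the proposal above."""

    def __init__(self, stream, encoding="utf-8", chunk=8):
        self.stream = stream
        self.decoder = codecs.getincrementaldecoder(encoding)()
        self.chunk = chunk
        self.pending = ""

    def _fill(self, nbytes):
        # Exactly one read() on the byte stream; returns False at EOF.
        data = self.stream.read(nbytes)
        self.pending += self.decoder.decode(data, final=not data)
        return bool(data)

    def readline(self, size=None):
        if size is not None:
            # Size given: read from the byte stream once at most.
            if "\n" not in self.pending:
                self._fill(size)
        else:
            # No size: keep reading until a "\n" shows up or EOF.
            while "\n" not in self.pending and self._fill(self.chunk):
                pass
        line, sep, rest = self.pending.partition("\n")
        self.pending = rest
        return line + sep


stream = CountingStream(b"a long first line\nsecond\n")
reader = SketchReader(stream)
partial = reader.readline(size=4)   # one read(), may return a partial line
rest_of_line = reader.readline()    # loops until "\n" is found
```

A caller that passes `size` has to accept a partial line; a caller that omits it gets a full line at the cost of an unbounded number of reads.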
> BTW, the logic in read() looks rather convoluted to me now
> that I look at it a second time. Should we clean this up a bit?
If that's possible, yes :-)
Date | User | Action | Args
2007-08-23 14:28:03 | admin | link | issue1076985 messages
2007-08-23 14:28:03 | admin | create |