Message 23489 (user_id=89016)
> In Python 2.3, the size parameter was simply passed down
> to the stream's .readline() method, so semantics were
> defined by the stream rather than the codec.
In most cases this meant that at most size bytes were
read from the byte stream. It seems that tokenizer.c:fp_readl()
assumes that, so it broke with the change.
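The old assumption is easy to see with a plain byte stream (a minimal illustration using io.BytesIO; the actual tokenizer code is C, this just shows the at-most-size-bytes contract):

```python
import io

raw = io.BytesIO(b"hello world\n")
# With the Python 2.3 semantics, size was handed straight down to the
# byte stream's .readline(), so the result was at most `size` bytes:
assert raw.readline(5) == b"hello"
```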
> I think that we should restore this kind of behaviour
> for Python 2.4.1.
That's not possible (switching back to calling readline() on the
byte stream), because it breaks the UTF-16 decoder, but we
can get something that is close.
> What was the reason why you introduced the change
> in semantics ?
1) To get readline() to work with UTF-16, it's no longer
possible to call readline() on the byte stream. This has to be
replaced by one or more calls to read().
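The reason is that a 0x0a byte in UTF-16 data can be half of a code unit rather than a newline, so readline() on the byte stream can cut a character in two. A small illustration:

```python
import io

# U+0A00 encodes in UTF-16-LE as b"\x00\x0a": here the 0x0a byte is
# half of a code unit, not a newline.
data = "\u0a00X\nY".encode("utf-16-le")  # b"\x00\x0aX\x00\n\x00Y\x00"

raw = io.BytesIO(data)
chunk = raw.readline()        # stops at the first 0x0a byte...
assert chunk == b"\x00\x0a"   # ...splitting the stream mid-character

# The truncated chunk decodes to a single character; the real line
# ("\u0a00X\n") cannot be recovered from it.
assert chunk.decode("utf-16-le") == "\u0a00"
```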
2) As you say, size was always just a hint. With line buffering
(which is required for UTF-16 readline) this hint becomes even
more meaningless.
So I'd say:
1) Fix tokenizer.c:fp_readl(). It looks to me like the code had
a problem in this spot anyway: there's an
assert(strlen(str) < (size_t)size); /* XXX */
in the code, and a string *can* get longer when it's encoded
to UTF-8, which fp_readl() does.
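That the UTF-8 encoding can be longer than the source string is easy to check (so a buffer sized from the character count can overflow):

```python
line = "\xe9" * 10 + "\n"       # 11 characters ("é" repeated, plus "\n")
encoded = line.encode("utf-8")
assert len(line) == 11
assert len(encoded) == 21       # each "é" becomes two bytes in UTF-8
```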
dark-storm, if you can provide a patch for this problem, go
ahead.
2) Change readline() so that calling it with a size parameter
results in only one call to read(). If read() is called with
chars==-1 (which it will be in this case), it will in turn call
read() on the byte stream only once (at most). If size isn't
specified, the caller should be able to cope with any returned
string length, so I think the current behaviour (calling read()
multiple times until a "\n" shows up) can be kept.
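A rough sketch of those two behaviours (a hypothetical, much-simplified reader over a decoded-text source, not the actual codecs.StreamReader code; LineReader and blocksize are made up for illustration):

```python
import io

class LineReader:
    """Sketch of the proposed semantics: with a size argument,
    read() is called at most once; without one, read() is retried
    until a "\\n" shows up or EOF."""

    def __init__(self, stream, blocksize=8):
        self.stream = stream      # decoded-text source with .read(n)
        self.blocksize = blocksize
        self.buffer = ""          # text read but not yet returned

    def _take_line(self):
        # Split one complete line off the front of the buffer, if any.
        nl = self.buffer.find("\n")
        if nl >= 0:
            line, self.buffer = self.buffer[:nl + 1], self.buffer[nl + 1:]
            return line
        return None

    def _flush(self):
        data, self.buffer = self.buffer, ""
        return data

    def readline(self, size=None):
        line = self._take_line()
        if line is not None:
            return line
        if size is not None:
            # size given: at most one call to read(); the caller must
            # cope with getting a partial line back.
            self.buffer += self.stream.read(size)
            return self._take_line() or self._flush()
        # no size: keep reading until a newline or EOF.
        while True:
            chunk = self.stream.read(self.blocksize)
            if not chunk:
                return self._flush()
            self.buffer += chunk
            line = self._take_line()
            if line is not None:
                return line

reader = LineReader(io.StringIO("abc\ndef\n"))
assert reader.readline(2) == "ab"    # one read(); partial line returned
assert reader.readline() == "c\n"    # loops until "\n"
assert reader.readline() == "def\n"
```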
BTW, the logic in read() looks rather convoluted to me now
that I look at it a second time. Should we clean this up a bit?
Date                | User  | Action | Args
2007-08-23 14:28:03 | admin | link   | issue1076985 messages
2007-08-23 14:28:03 | admin | create |