Message 23490 - Python tracker


Author	lemburg
Date	2004-12-03.14:46:29

Content

>>In Python 2.3, the size parameter was simply passed down
>>to the stream's .readline() method, so semantics were
>>defined by the stream rather than the codec.
> 
> 
> In most cases this meant that there were at most size bytes
> read from the byte stream. It seems that tokenizer.c:fp_readl()
> assumes that, so it broke with the change.
> 
> 
>>I think that we should restore this kind of behaviour
>>for Python 2.4.1.
> 
> 
> That's not possible (switching back to calling readline() on the
> bytestream), because it breaks the UTF-16 decoder, but we
> can get something that is close.
 
The problem with the change is that it applies to *all*
codecs. If only the UTF-16 codec has a problem with the
standard logic, it should override the .readline()
method as necessary, but this should not affect all
the other codecs.

Unless I'm missing something, the other codecs
work just fine with the old implementation of the
method.
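
As a rough illustration of that split, here is a sketch in current
Python 3 syntax. It uses a toy base class rather than the real
codecs.StreamReader, and the names BaseReader, UTF16LEReader and
chunk_size are made up for the example. The base class keeps the old
"pass size down to the stream" logic; only the UTF-16 reader replaces
readline().

```python
import codecs
import io


class BaseReader:
    """Toy stand-in for the shared base class (not codecs.StreamReader)."""

    def __init__(self, stream, encoding):
        self.stream = stream
        self.encoding = encoding

    def readline(self, size=-1):
        # Old behaviour: hand size to the byte stream, then decode the result.
        return self.stream.readline(size).decode(self.encoding)


class UTF16LEReader(BaseReader):
    """Hypothetical codec-specific reader that overrides readline() itself."""

    def __init__(self, stream, chunk_size=64):
        super().__init__(stream, "utf-16-le")
        self.chunk_size = chunk_size
        self._decoder = codecs.getincrementaldecoder("utf-16-le")()
        self._pending = ""  # decoded text not yet handed to the caller

    def readline(self, size=-1):
        # size is ignored in this sketch; line ends are found in decoded text.
        while "\n" not in self._pending:
            data = self.stream.read(self.chunk_size)
            self._pending += self._decoder.decode(data, final=not data)
            if not data:
                break
        line, sep, self._pending = self._pending.partition("\n")
        return line + sep
```

With this layout a Latin-1 stream still goes through the unchanged
base-class method, while the UTF-16 stream gets its own line buffering.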

>>What was the reason why you introduced the change
>>in semantics?
> 
> 
> 1) To get readline() to work with UTF-16 it's no longer
> possible to call readline() for the byte stream. This has to be
> replaced by one or more calls to read().
> 
> 2) As you say size was always just a hint. With line buffering
> (which is required for UTF-16 readline) this hint becomes even
> more meaningless.

That's OK for the UTF-16 codec, but why the generic change
in the base class?
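
For reference, the UTF-16 problem from point 1 can be reproduced in a
few lines. This is only a demonstration in current Python 3 syntax, not
code from the codecs module: a byte-level readline() splits on the first
0x0A byte, which in UTF-16 can be half of an unrelated character.

```python
import io

text = "\u010a\nend\n"            # U+010A is encoded as the bytes 0A 01
raw = text.encode("utf-16-le")

# A byte-level readline() stops right after the first 0x0A byte.
first_chunk = io.BytesIO(raw).readline()

try:
    first_chunk.decode("utf-16-le")
    decodable = True
except UnicodeDecodeError:
    decodable = False

# Decoding first and splitting afterwards finds the real line ends.
lines = raw.decode("utf-16-le").splitlines(keepends=True)
```

A single-byte encoding such as Latin-1 has no such ambiguity, because
there every 0x0A byte really is a line feed.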

> So I'd say:
> 
> 1) Fix tokenizer.c:fp_readl(). It looks to me like the code had
> a problem in this spot anyway: There's an
> 
> assert(strlen(str) < (size_t)size); /* XXX */
> 
> in the code, and a string *can* get longer when it's encoded
> to UTF-8, which fp_readl() does.
> 
> dark-storm, if you can provide a patch for this problem, go 
> ahead.

+1
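
The length growth is easy to check. The sketch below only mirrors the
quoted assert in Python; fits_in_buffer and the buffer size of 10 are
made up for the example and are not taken from tokenizer.c.

```python
def fits_in_buffer(encoded: bytes, size: int) -> bool:
    """Mirror of the quoted check: strlen(str) < size."""
    return len(encoded) < size


line = "caf\u00e9 = 1\n"          # 9 characters, one of them non-ASCII
as_latin1 = line.encode("latin-1")
as_utf8 = line.encode("utf-8")

size = 10                          # hypothetical buffer size for this line
latin1_ok = fits_in_buffer(as_latin1, size)
utf8_ok = fits_in_buffer(as_utf8, size)
```

The Latin-1 form is 9 bytes and passes the check, while the UTF-8 form
is 10 bytes and would trip the assert.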
 
> 2) change readline(), so that calling it with a size parameter
> results in only one call to read(). If read() is called with
> chars==-1 (which it will in this case), this will in turn only call
> read() for the byte stream once (at most). If size isn't
> specified the caller should be able to cope with any returned
> string length, so I think the current behaviour (calling read()
> multiple times until a "\n" shows up) can be kept.

+1 for UTF-16

I'm not sure whether the current implementation is needed
for all other codecs.
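
To make the proposed semantics concrete, here is a small sketch in
current Python 3 syntax. It is not the codecs.py implementation:
SketchReader, CountingStream and the chunk parameter are invented for
the example, and treating size as a byte count for the single read() is
an assumption taken from the description above.

```python
import codecs
import io


class CountingStream(io.BytesIO):
    """BytesIO that counts read() calls (illustration only)."""

    def __init__(self, data):
        super().__init__(data)
        self.reads = 0

    def read(self, n=-1):
        self.reads += 1
        return super().read(n)


class SketchReader:
    """Hypothetical reader showing the two readline() modes."""

    def __init__(self, stream, encoding, chunk=8):
        self.stream = stream
        self.decoder = codecs.getincrementaldecoder(encoding)()
        self.chunk = chunk
        self.pending = ""  # decoded text not yet returned

    def _fill(self, nbytes):
        # One read() on the byte stream; returns False at end of file.
        data = self.stream.read(nbytes)
        self.pending += self.decoder.decode(data, final=not data)
        return bool(data)

    def readline(self, size=None):
        if size is not None:
            # Size given: touch the byte stream at most once.
            if "\n" not in self.pending:
                self._fill(size)
        else:
            # No size: keep reading until a newline shows up or EOF.
            while "\n" not in self.pending and self._fill(self.chunk):
                pass
        line, sep, self.pending = self.pending.partition("\n")
        return line + sep
```

With a size, the caller may get back a partial line and has to call
readline() again; without one, the line is always complete unless the
stream ends first.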
 
> BTW, the logic in read() looks rather convoluted to me now
> that I look at it a second time. Should we clean this up a bit?

If that's possible, yes :-)

History
Date	User	Action	Args
2007-08-23 14:28:03	admin	link	issue1076985 messages
2007-08-23 14:28:03	admin	create