
Author: lemburg
Date: 2004-12-03.14:46:29

>>In Python 2.3, the size parameter was simply passed down
>>to the stream's .readline() method, so semantics were
>>defined by the stream rather than the codec.
> 
> 
> In most cases this meant that there were at most size bytes
> read from the byte stream. It seems that
> tokenizer.c:fp_readl() assumes that, so it broke with the change.
> 
> 
>>I think that we should restore this kind of behaviour
>>for Python 2.4.1.
> 
> 
> That's not possible (switching back to calling readline()
> on the bytestream), because it breaks the UTF-16 decoder, but we
> can get something that is close.
 
The problem with the change is that it applies to *all*
codecs. If only the UTF-16 codec has a problem with the
standard logic, it should override the .readline()
method as necessary, but this should not affect all
the other codecs.

Unless I'm missing something, the other codecs
work just fine with the old implementation of the
method.
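
To make the suggestion concrete, here is a minimal sketch in
present-day Python. The class names DelegatingLineReader and
UTF16LineReader are hypothetical and are not the real
codecs.StreamReader classes; the sketch only assumes that the
generic reader hands size to the byte stream's readline(), while
the UTF-16 reader alone buffers decoded characters.

```python
import codecs
import io


class DelegatingLineReader:
    """Hypothetical generic reader: size semantics come from the stream."""

    encoding = "latin-1"

    def __init__(self, stream):
        self.stream = stream

    def readline(self, size=-1):
        # Pass size straight down, so at most size bytes are read.
        return self.stream.readline(size).decode(self.encoding)


class UTF16LineReader(DelegatingLineReader):
    """Hypothetical override: only this codec gives up byte-level readline()."""

    encoding = "utf-16-le"

    def __init__(self, stream, chunk_size=64):
        super().__init__(stream)
        self._decoder = codecs.getincrementaldecoder(self.encoding)()
        self._pending = ""  # decoded characters not yet returned
        self._chunk_size = chunk_size

    def readline(self, size=-1):
        # A byte-level readline() could stop on a 0x0A byte that is only
        # half of a code unit, so read chunks and split on decoded "\n".
        # size is ignored in this sketch (it is only a hint).
        while "\n" not in self._pending:
            data = self.stream.read(self._chunk_size)
            if not data:
                break
            self._pending += self._decoder.decode(data)
        line, sep, self._pending = self._pending.partition("\n")
        return line + sep
```

With U+010A in the input, the byte stream's own readline() stops
in the middle of a code unit, which is why only the UTF-16 reader
needs the override here.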

>>What was the reason why you introduced the change
>>in semantics ?
> 
> 
> 1) To get readline() to work with UTF-16 it's no longer
> possible to call readline() for the byte stream. This has to be
> replaced by one or more calls to read().
>
> 2) As you say size was always just a hint. With line buffering
> (which is required for UTF-16 readline) this hint becomes even
> more meaningless.

That's OK for the UTF-16 codec, but why the generic change
in the base class ?

> So I'd say:
> 
> 1) Fix tokenizer.c:fp_readl(). It looks to me like the code had
> a problem in this spot anyway: There's an
> 
> assert(strlen(str) < (size_t)size); /* XXX */
> 
> in the code, and a string *can* get longer when it's encoded 
> to UTF-8 which fp_readl() does.
> 
> dark-storm, if you can provide a patch for this problem, go 
> ahead.

+1
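
The point about the assert can be checked with a few lines of
Python. This is only an illustration; the helper name
utf8_growth is made up for this example and Latin-1 is an
arbitrary choice of source encoding.

```python
def utf8_growth(text, source_encoding):
    """Return (bytes in source encoding, bytes after re-encoding to UTF-8)."""
    return len(text.encode(source_encoding)), len(text.encode("utf-8"))


# Two non-ASCII characters: one byte each in Latin-1, two bytes each in UTF-8.
before, after = utf8_growth("na\u00efve caf\u00e9\n", "latin-1")
print(before, after)
```

A buffer sized for the bytes read from the file can therefore be
too small for the UTF-8 result.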
 
> 2) change readline(), so that calling it with a size parameter
> results in only one call to read(). If read() is called with
> chars==-1 (which it will in this case), this will in turn only call
> read() for the byte stream once (at most). If size isn't
> specified the caller should be able to cope with any returned
> string length, so I think the current behaviour (calling read()
> multiple times until a "\n" shows up) can be kept.

+1 for UTF-16

I'm not sure whether the current implementation is needed
for all other codecs.
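
A rough sketch of the behaviour proposed in 2), written in
present-day Python. SketchReader, chunk_size and byte_reads are
hypothetical names used only for illustration, and the two
layers (readline() calling read(), which calls the byte stream's
read()) are collapsed into one helper; this is not the actual
codecs.py code.

```python
import codecs
import io


class SketchReader:
    """Hypothetical reader illustrating the proposed readline() rules."""

    def __init__(self, stream, encoding="utf-8", chunk_size=72):
        self.stream = stream
        self.decoder = codecs.getincrementaldecoder(encoding)()
        self.charbuffer = ""  # decoded characters not yet returned
        self.chunk_size = chunk_size
        self.byte_reads = 0  # counts read() calls on the byte stream

    def _fill(self, nbytes):
        # One read() on the byte stream; returns False at end of file.
        data = self.stream.read(nbytes)
        self.byte_reads += 1
        self.charbuffer += self.decoder.decode(data, final=not data)
        return bool(data)

    def readline(self, size=None):
        if size is not None:
            # With a size hint: at most one read() on the byte stream,
            # so the result may be a partial line.
            if "\n" not in self.charbuffer:
                self._fill(size)
        else:
            # Without a size: keep reading until a "\n" shows up or EOF.
            while "\n" not in self.charbuffer:
                if not self._fill(self.chunk_size):
                    break
        line, sep, self.charbuffer = self.charbuffer.partition("\n")
        return line + sep
```

A caller that passes size has to be prepared for a line without
a trailing "\n"; a caller that omits it always gets whole lines
until the end of the stream.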
 
> BTW, the logic in read() looks rather convoluted to me now
> that I look at it a second time. Should we clean this up a bit?

If that's possible, yes :-)
 
History
Date                 User   Action  Args
2007-08-23 14:28:03  admin  link    issue1076985 messages
2007-08-23 14:28:03  admin  create