Message 23489 (user_id=89016)
> In Python 2.3, the size parameter was simply passed down
> to the stream's .readline() method, so semantics were
> defined by the stream rather than the codec.
In most cases this meant that at most size bytes were
read from the byte stream. It seems that tokenizer.c:fp_readl()
assumes that, so it broke with the change.
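The old assumption is easy to see with a plain byte stream (a minimal illustration using io.BytesIO; the actual tokenizer code is C, this just shows the at-most-size-bytes contract):

```python
import io

raw = io.BytesIO(b"hello world\n")
# With the Python 2.3 semantics, size was handed straight down to the
# byte stream's .readline(), so the result was at most `size` bytes:
assert raw.readline(5) == b"hello"
```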
> I think that we should restore this kind of behaviour
> for Python 2.4.1.
That's not possible (switching back to calling readline() on the
byte stream), because it breaks the UTF-16 decoder, but we
can get something that is close.
> What was the reason why you introduced the change
> in semantics ?
1) To get readline() to work with UTF-16, it's no longer
possible to call readline() on the byte stream. This has to be
replaced by one or more calls to read().
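The reason is that a 0x0a byte in UTF-16 data can be half of a code unit rather than a newline, so readline() on the byte stream can cut a character in two. A small illustration:

```python
import io

# U+0A00 encodes in UTF-16-LE as b"\x00\x0a": here the 0x0a byte is
# half of a code unit, not a newline.
data = "\u0a00X\nY".encode("utf-16-le")  # b"\x00\x0aX\x00\n\x00Y\x00"

raw = io.BytesIO(data)
chunk = raw.readline()        # stops at the first 0x0a byte...
assert chunk == b"\x00\x0a"   # ...splitting the stream mid-character

# The truncated chunk decodes to a single character; the real line
# ("\u0a00X\n") cannot be recovered from it.
assert chunk.decode("utf-16-le") == "\u0a00"
```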
2) As you say, size was always just a hint. With line buffering
(which is required for UTF-16 readline) this hint becomes even
more meaningless.
So I'd say:
1) Fix tokenizer.c:fp_readl(). It looks to me like the code had
a problem in this spot anyway: there's an
assert(strlen(str) < (size_t)size); /* XXX */
in the code, and a string *can* get longer when it's encoded
to UTF-8, which fp_readl() does.
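That the UTF-8 encoding can be longer than the source string is easy to check (so a buffer sized from the character count can overflow):

```python
line = "\xe9" * 10 + "\n"       # 11 characters ("é" repeated, plus "\n")
encoded = line.encode("utf-8")
assert len(line) == 11
assert len(encoded) == 21       # each "é" becomes two bytes in UTF-8
```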
dark-storm, if you can provide a patch for this problem, go
ahead.
2) Change readline() so that calling it with a size parameter
results in only one call to read(). If read() is called with
chars==-1 (which it will be in this case), it will in turn call
read() on the byte stream only once (at most). If size isn't
specified, the caller should be able to cope with any returned
string length, so I think the current behaviour (calling read()
multiple times until a "\n" shows up) can be kept.
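A rough sketch of those two behaviours (a hypothetical, much-simplified reader over a decoded-text source, not the actual codecs.StreamReader code; LineReader and blocksize are made up for illustration):

```python
import io

class LineReader:
    """Sketch of the proposed semantics: with a size argument,
    read() is called at most once; without one, read() is retried
    until a "\\n" shows up or EOF."""

    def __init__(self, stream, blocksize=8):
        self.stream = stream      # decoded-text source with .read(n)
        self.blocksize = blocksize
        self.buffer = ""          # text read but not yet returned

    def _take_line(self):
        # Split one complete line off the front of the buffer, if any.
        nl = self.buffer.find("\n")
        if nl >= 0:
            line, self.buffer = self.buffer[:nl + 1], self.buffer[nl + 1:]
            return line
        return None

    def _flush(self):
        data, self.buffer = self.buffer, ""
        return data

    def readline(self, size=None):
        line = self._take_line()
        if line is not None:
            return line
        if size is not None:
            # size given: at most one call to read(); the caller must
            # cope with getting a partial line back.
            self.buffer += self.stream.read(size)
            return self._take_line() or self._flush()
        # no size: keep reading until a newline or EOF.
        while True:
            chunk = self.stream.read(self.blocksize)
            if not chunk:
                return self._flush()
            self.buffer += chunk
            line = self._take_line()
            if line is not None:
                return line

reader = LineReader(io.StringIO("abc\ndef\n"))
assert reader.readline(2) == "ab"    # one read(); partial line returned
assert reader.readline() == "c\n"    # loops until "\n"
assert reader.readline() == "def\n"
```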
BTW, the logic in read() looks rather convoluted to me now
that I look at it a second time. Should we clean this up a bit?
Date                | User  | Action | Args
2007-08-23 14:28:03 | admin | link   | issue1076985 messages
2007-08-23 14:28:03 | admin | create |