Message24863
user_id=86307
I think the foo2.py from 1163244 is probably the same bug;
at any rate, the reason for it is that a \r is at the
beginning of the last line when read in by decoding_fgets.
I have a simpler test file that shows the bug, which I'll
email to Walter (you basically just have to get a \r as the
last character in the block read by StreamReader, so that
atcr will be true).
The problem is caused by StreamReader.readline doing:

    if self.atcr and data.startswith(u"\n"):
        data = data[1:]

since the tokenizer relies on '\n' as the line break
character, but it will never see the '\n' removed by the above.
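To make the failure mode concrete, here is a toy model of the atcr handling, written in Python 3. It is only a sketch: toy_readline and its chunk list are hypothetical stand-ins, not the real codecs.StreamReader code, and the chunks are assumed to be already-decoded text. It shows that when a block ends in '\r', the '\n' that starts the next block is discarded, so the first line comes back ending in a bare '\r'.

```python
def toy_readline(chunks):
    """Yield lines from an iterable of already-decoded text chunks.

    Hypothetical model of the atcr logic quoted above, not the real
    StreamReader. For simplicity it yields whatever each chunk splits
    into, so chunks should end at (or just inside) a line ending.
    """
    atcr = False
    for data in chunks:
        if atcr and data.startswith("\n"):
            # The "\n" half of a "\r\n" pair split across chunks is dropped.
            data = data[1:]
        atcr = data.endswith("\r")
        for line in data.splitlines(True):
            yield line


# "\r\n" falls across the chunk boundary: the first line loses its "\n".
split_lines = list(toy_readline(["x = 1\r", "\ny = 2\r\n"]))

# Same text in a single chunk: both lines keep their full "\r\n".
whole_lines = list(toy_readline(["x = 1\r\ny = 2\r\n"]))
```

A consumer that looks only for '\n' to find the end of a line, as the tokenizer does, never finds one in the first element of split_lines.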
FWIW (not much), I think the 2.4 StreamReader.readline
actually made more sense than the current code, although a
few changes would seem useful (see below). I don't think it
is particularly useful to treat the size parameter as a
fixed maximum number of bytes to read, since the number of
bytes read has no fixed relationship to the number of
decoded unicode characters (and also, in the case of the
tokenizer, no fixed relationship to the number of bytes of
encoded utf8). Also, given the current code, the size
parameter is effectively ignored if there is a charbuffer:
if you have 5 characters sitting in the charbuffer and use a
size of 0x1FF, you only get back the 5 characters, even if
they do not end in a linebreak. For the tokenizer, this
means an unnecessary PyMem_RESIZE and an extra call to
decoding_readline roughly every BUFSIZ bytes in the file
(since the tokenizer assumes failure to fetch a complete
line means its buffer is too small, whereas in fact it was
caused by an incomplete line being stored in the
StreamReader's charbuffer).
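The point about size can be checked directly. The snippet below (Python 3, with a sample string chosen purely for illustration) encodes the same text three ways and compares lengths. The source encoding iso-8859-15 is an arbitrary example, not something taken from the report.

```python
# Sample text with non-ASCII characters: "naïve €5"
text = "na\u00efve \u20ac5"

n_chars = len(text)                          # decoded unicode characters
n_latin9 = len(text.encode("iso-8859-15"))   # bytes in a single-byte source encoding
n_utf8 = len(text.encode("utf-8"))           # bytes after re-encoding to utf8
n_utf16 = len(text.encode("utf-16-le"))      # bytes in a two-byte encoding

print(n_chars, n_latin9, n_utf8, n_utf16)
```

The same 8 characters occupy 8, 11, or 16 bytes depending on the encoding, so a byte limit on the read says little about how many characters come back, or about how many utf8 bytes the tokenizer will end up storing.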
As to changes from 2.4, if the unicode object were to add a
findlinebreak method which returns the index of the first
character for which Py_UNICODE_ISLINEBREAK is true, readline
could use that instead of find("\n"). If it used such a
method, readline would also need to explicitly handle a
"\r\n" sequence, including a potential read(1) if a '\r'
appears at the end of the data (in the case where size is
not None). Of course, one problem with that idea is it
requires a new method, which may not be allowed until 2.5,
and the 2.4.1 behavior definitely needs to be fixed some
way. (Interestingly, it looks to me like sre has everything
necessary for searching for unicode linebreaks except syntax
with which to express the idea in a pattern (maybe I'm
missing something, but I can't find a way to get a compiled
pattern to emit CATEGORY_UNI_LINEBREAK).)
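A rough sketch of the findlinebreak idea, in Python 3. Everything here is hypothetical: find_linebreak and split_first_line are illustrative names, not existing methods, and the LINEBREAKS set is an assumption that uses the characters str.splitlines() treats as line boundaries as a stand-in for Py_UNICODE_ISLINEBREAK. Instead of doing a read(1) when a '\r' ends the data, the sketch just reports that more data is needed, leaving the actual read to the caller.

```python
# Assumption: stand-in for Py_UNICODE_ISLINEBREAK, taken from the
# characters that str.splitlines() splits on in Python 3.
LINEBREAKS = frozenset("\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029")


def find_linebreak(data, start=0):
    """Return the index of the first linebreak character, or -1."""
    for i in range(start, len(data)):
        if data[i] in LINEBREAKS:
            return i
    return -1


def split_first_line(data):
    """Return (line, rest, need_more) for the first line in data.

    need_more is True when no complete line can be returned yet,
    including the case where data ends in "\r" and a "\n" might follow.
    """
    i = find_linebreak(data)
    if i < 0:
        return None, data, True
    if data[i] == "\r":
        if i + 1 == len(data):
            # Cannot tell yet whether this is "\r" or the start of "\r\n".
            return None, data, True
        if data[i + 1] == "\n":
            # Treat "\r\n" as a single line ending.
            return data[:i + 2], data[i + 2:], False
    return data[:i + 1], data[i + 1:], False
```

For example, split_first_line("a\r\nb") returns ("a\r\n", "b", False), while split_first_line("a\r") asks for more data rather than returning a line that might be missing its '\n'.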
Date                | User  | Action | Args
2007-08-23 14:30:41 | admin | link   | issue1175396 messages
2007-08-23 14:30:41 | admin | create |