Author: doerwalter
Date: 2005-05-16 14:20:19

> > 1) How do we handle the problem of a truncated line, if the
> > data comes from the charbuffer instead of being read from
> > the stream?
> > 
> > My suggestion is to make the top of the loop look like:
> > 
> >     while True:
> >         havecharbuffer = bool(self.charbuffer)
> > 
> > And then the break condition (when no line break found)
> > should be:
> > 
> >     # we didn't get anything or this was our only try
> >     if not data or (size is not None and not havecharbuffer):
> > 
> > (too many negatives in that).  Anyway, the idea is that, if
> > size specified, there will be at most one read of the
> > underlying stream (using the size).  So if you enter the
> > loop with a charbuffer, and that charbuffer does not have a
> > line break, then you redo the loop once (the second time it
> > will break, because havecharbuffer will be False).

This makes sense. However, with the current state of the
tokenizer this might be too dangerous, because the resulting
line might be up to twice as long as the requested size. So
fixing the tokenizer should be the first step. BTW, your
patch fixes the problems with the fix for #1178484, so I
think I'm going to apply the patch in the next few days, if
there are no objections.
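
To make the suggested control flow concrete, here is a minimal,
self-contained sketch; the Reader class, its read() helper and
the chunk size are simplifications standing in for the real
codecs.StreamReader machinery, not the actual patch:

    import io

    class Reader:
        def __init__(self, stream, chunksize=72):
            self.stream = stream
            self.charbuffer = ""
            self.chunksize = chunksize

        def read(self, size):
            # drain any pushed-back characters, then read from the stream
            data = self.charbuffer + self.stream.read(size)
            self.charbuffer = ""
            return data

        def readline(self, size=None):
            line = ""
            while True:
                havecharbuffer = bool(self.charbuffer)
                data = self.read(size if size is not None else self.chunksize)
                line += data
                pos = line.find("\n")
                if pos >= 0:
                    # push everything after the line break back into the buffer
                    self.charbuffer = line[pos + 1:]
                    return line[:pos + 1]
                # we didn't get anything or this was our only try
                if not data or (size is not None and not havecharbuffer):
                    return line

    r = Reader(io.StringIO("alpha\nbeta"))
    assert r.readline() == "alpha\n"
    assert r.readline() == "beta"

Note that when a size is given and the charbuffer is non-empty,
the returned line can approach len(charbuffer) + size characters,
which is exactly the "twice as long" risk mentioned above.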

> > Also, not sure about this, but should the size parameter
> > default to -1 (to keep it in sync with read)?

None seems to be a better default from an API viewpoint,
but -1 is better for "C compatibility".
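
If -1 were adopted, both spellings could be made equivalent with a
trivial normalization at the top of readline(); this helper is
hypothetical, purely to illustrate:

    def _normalize_size(size):
        # Treat the C-style sentinel -1 (or any negative value) like
        # None: both mean "no size limit".
        if size is not None and size < 0:
            return None
        return size

    assert _normalize_size(-1) is None
    assert _normalize_size(None) is None
    assert _normalize_size(72) == 72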

> > As to issue 2, it looks like it should be possible to get
> > the line number right, because the UnicodeDecodeError
> > exception object has all the necessary information in it
> > (including the original string).  I think this should be
> > done by fp_readl (in tokenizer.c).  

The patch for #1178484 fixes this. Combined with this patch
I think we're in good shape.
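
For illustration, here is a Python-level sketch of how the line
number can be recovered from the exception's attributes (the helper
name is hypothetical; the real fix lives in fp_readl in C):

    def decode_error_lineno(exc):
        # exc.object holds the original byte string and exc.start the
        # offset of the failing byte, so count the preceding newlines
        return exc.object[:exc.start].count(b"\n") + 1

    try:
        b"line one\nline two \xff\n".decode("ascii")
    except UnicodeDecodeError as e:
        assert decode_error_lineno(e) == 2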

> > By the way, using a findlinebreak function (using sre) turns
> > out to be slower than splitting/joining when no size is
> > specified (which I assume will be the overwhelmingly most
> > common case), so that turned out to be a bad idea on my
> > part.

Coding this at the C level and using Py_UNICODE_ISLINEBREAK()
should be the fastest version, but I don't know whether it's
worth it.
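
For reference, a regex-based findlinebreak would look roughly like
this; the character class is my approximation of what splitlines()
and Py_UNICODE_ISLINEBREAK() treat as a line break:

    import re

    # approximate set of Unicode line-break characters (an assumption)
    _linebreaks = re.compile("[\n\r\v\f\x1c\x1d\x1e\x85\u2028\u2029]")

    def findlinebreak(s):
        # return the index of the first line break in s, or -1
        m = _linebreaks.search(s)
        return m.start() if m else -1

    assert findlinebreak("abc\u2028def") == 3
    assert findlinebreak("no break here") == -1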