Message167831
Oh, set_encoding.patch is wrong:
+ offset = self._decoded_chars_used - len(next_input)
self._decoded_chars_used counts Unicode characters, whereas len(next_input) counts bytes. The subtraction only works with encodings that use exactly one byte per character, like ascii or latin1, but not with multibyte encodings like utf8 or ucs-4.
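To illustrate the mismatch, here is a minimal sketch that compares the two lengths for the same text under different encodings. The helper name `char_and_byte_counts` and the sample string are only illustrative; they are not part of either patch.

```python
def char_and_byte_counts(text, encoding):
    """Return (number of Unicode characters, number of encoded bytes)."""
    data = text.encode(encoding)
    return len(text), len(data)


sample = "caf\u00e9"  # four characters, the last one is non-ASCII

# One byte per character: both counts agree, so mixing them happens to work.
latin1_counts = char_and_byte_counts(sample, "latin-1")

# Multibyte encodings: the counts diverge, so subtracting a byte length
# from a character count no longer yields a meaningful offset.
utf8_counts = char_and_byte_counts(sample, "utf-8")
utf32_counts = char_and_byte_counts(sample, "utf-32-be")

print(latin1_counts, utf8_counts, utf32_counts)
```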
> peeking into the underlying buffer would be enough to
> handle encoding detection.
I wrote a new patch using this idea. It does not work (yet?) with non-seekable streams: the raw read buffer (a bytes string) is not stored in the _snapshot attribute if the stream is not seekable. This could be changed to solve the issue.
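For readers unfamiliar with the peeking idea, here is a rough sketch of how it can be done from user code, outside TextIOWrapper. It is not what set_encoding-2.patch does; `sniff_encoding` and `open_detected` are hypothetical helpers, and BOM sniffing is just one simple detection strategy chosen for the example.

```python
import codecs
import io

# UTF-32 BOMs are checked first because the UTF-32-LE BOM begins with
# the same two bytes as the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_encoding(buffered, default="utf-8"):
    """Guess an encoding from the BOM without consuming any bytes."""
    # peek() returns buffered bytes without advancing the position;
    # it may return more or fewer bytes than requested, hence the slice.
    head = buffered.peek(4)[:4]
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


def open_detected(raw):
    """Wrap a binary stream in a TextIOWrapper with the sniffed encoding."""
    buffered = io.BufferedReader(raw)
    return io.TextIOWrapper(buffered, encoding=sniff_encoding(buffered))
```

Because the encoding is chosen before the TextIOWrapper is created, this sketch never has to change the encoding of a stream that has already decoded some characters, which is the hard part the patch is dealing with.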
set_encoding-2.patch is still a work in progress. For example, it does not patch the _io module.

Date | User | Action | Args
2012-08-09 20:42:50 | vstinner | set | recipients: + vstinner, loewis, ishimoto, ncoghlan, pitrou, mrabarnett, Arfrever, methane, rurpy2, serhiy.storchaka
2012-08-09 20:42:50 | vstinner | set | messageid: <1344544970.59.0.677493089722.issue15216@psf.upfronthosting.co.za>
2012-08-09 20:42:49 | vstinner | link | issue15216 messages
2012-08-09 20:42:49 | vstinner | create |