Message243376
I have a UTF-8 encoded file containing single surrogates. Reading this file with errors="surrogatepass" works fine when I read the whole file with .read(), but raises a UnicodeDecodeError when I read the file line by line:
----- start of demo -----
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:44:40) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> with open("Demo.txt", encoding="utf-8", errors="surrogatepass") as f:
...     s = f.read()
...
>>> "\ud900" in s
True
>>> with open("Demo.txt", encoding="utf-8", errors="surrogatepass") as f:
...     for line in f:
...         pass
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python\34x64\lib\codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 8190: invalid continuation byte
>>>
----- end of demo -----
I attached the file used for the demo such that you can reproduce the problem.
If I change all 0xED bytes in the file to 0xEC (i.e. effectively change all surrogates to non-surrogates), the problem disappears.
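The effect of that byte substitution can be checked in isolation. This is a minimal sketch using a hand-picked three-byte sequence (ED A4 80, the surrogatepass encoding of U+D900), not data taken from the attached file:

```python
# Hypothetical sample bytes, chosen for illustration only.
surrogate_bytes = b"\xed\xa4\x80"  # U+D900 as written with errors="surrogatepass"
patched_bytes = b"\xec\xa4\x80"    # the same bytes with 0xED replaced by 0xEC

# The strict UTF-8 decoder rejects an encoded surrogate.
try:
    surrogate_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogatepass, the same bytes decode to the lone surrogate.
lenient = surrogate_bytes.decode("utf-8", "surrogatepass")

# After the substitution, the bytes are ordinary UTF-8 for a non-surrogate.
patched = patched_bytes.decode("utf-8")

print(strict_ok, hex(ord(lenient)), hex(ord(patched)))
```

After the substitution the bytes are valid strict UTF-8, so the surrogatepass error handler is never invoked for them.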
The original file in which I noticed the problem was 73 MB. The demo file was derived from the original by removing data around the critical section, keeping the alignment to 16 KiB blocks, and then replacing all printable ASCII characters with x.
If I change the file length by adding or removing 16 bytes to / from the beginning of the demo file, the problem disappears, so alignment seems to be an issue.
All this seems to indicate that the UTF-8 decoder has problems when used for incremental decoding and a single surrogate appears around the block boundary.
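This hypothesis could be tested without the large file by feeding an incremental decoder directly. The following is only a sketch: the sample data, the helper name decode_in_chunks and the two-piece split are my own choices for illustration, and they merely approximate the way the text layer reads the file in blocks.

```python
import codecs


def decode_in_chunks(data, split_at, errors="surrogatepass"):
    """Decode UTF-8 bytes in two pieces, roughly as a block-wise reader would."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors=errors)
    text = decoder.decode(data[:split_at])
    text += decoder.decode(data[split_at:], final=True)
    return text


# Hypothetical sample: a single surrogate U+D900 surrounded by ASCII filler.
sample = ("x" * 4 + "\ud900" + "x" * 4).encode("utf-8", "surrogatepass")
whole = sample.decode("utf-8", "surrogatepass")

# Try every split position, including those inside the 3-byte sequence ED A4 80.
results = {}
for split_at in range(len(sample) + 1):
    try:
        results[split_at] = decode_in_chunks(sample, split_at) == whole
    except UnicodeDecodeError as exc:
        # Keep the exception so the failing split positions can be inspected.
        results[split_at] = exc

for split_at, outcome in sorted(results.items()):
    print(split_at, outcome)
```

If the hypothesis is right, an affected interpreter should report a UnicodeDecodeError only for splits that fall inside the surrogate's byte sequence, while every other split should match the one-shot decode.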
Date | User | Action | Args
2015-05-16 22:56:34 | RalfM | set | recipients: + RalfM, vstinner, ezio.melotti
2015-05-16 22:56:34 | RalfM | set | messageid: <1431816994.33.0.797161516902.issue24214@psf.upfronthosting.co.za>
2015-05-16 22:56:34 | RalfM | link | issue24214 messages
2015-05-16 22:56:33 | RalfM | create |