
Author RalfM
Recipients RalfM, ezio.melotti, vstinner
Date 2015-05-16.22:56:33
Message-id <1431816994.33.0.797161516902.issue24214@psf.upfronthosting.co.za>
In-reply-to
Content
I have a UTF-8 encoded file containing lone surrogates. Reading this file with errors="surrogatepass" works fine when I read the whole file with .read(), but raises a UnicodeDecodeError when I read the file line by line:

----- start of demo -----
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:44:40) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> with open("Demo.txt", encoding="utf-8", errors="surrogatepass") as f:
...   s = f.read()
...
>>> "\ud900" in s
True
>>> with open("Demo.txt", encoding="utf-8", errors="surrogatepass") as f:
...   for line in f:
...     pass
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python\34x64\lib\codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 8190: invalid continuation byte
>>>
----- end of demo -----

I have attached the file used for the demo so that you can reproduce the problem.

If I change all 0xED bytes in the file to 0xEC (i.e. effectively change all surrogates to non-surrogates), the problem disappears.

The original file in which I noticed the problem was 73 MB. The demo file was derived from it by removing data around the critical section while keeping the alignment to 16 KB blocks, and then replacing all printable ASCII characters with x.

If I change the file length by adding 16 bytes to, or removing 16 bytes from, the beginning of the demo file, the problem disappears, so alignment seems to matter.
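
For anyone without the attachment, here is a minimal sketch of how a similar file could be generated. It rests on assumptions that are not confirmed by the report: that the text layer reads the file in 8192-byte chunks (which would match position 8190 in the traceback), and that padding of 8190 bytes is enough to trigger the error. The helper name read_lines and the constant CHUNK are made up for illustration. On interpreters that do not have the problem, both paddings should simply return a list.

```python
import os
import tempfile

CHUNK = 8192  # assumption: the text layer reads the file in 8192-byte chunks


def read_lines(padding):
    """Write `padding` ASCII bytes plus one lone surrogate, then read line by line.

    Returns the list of lines, or the UnicodeDecodeError if decoding fails.
    """
    surrogate = "\ud900".encode("utf-8", "surrogatepass")  # b"\xed\xa4\x80"
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"x" * padding + surrogate + b"\n")
        try:
            with open(path, encoding="utf-8", errors="surrogatepass") as f:
                return [line for line in f]
        except UnicodeDecodeError as exc:
            return exc
    finally:
        os.remove(path)


# First padding puts 0xED at offset 8190, so the 3-byte sequence straddles
# the assumed chunk boundary; the second shifts it by 16 bytes.
for padding in (CHUNK - 2, CHUNK - 2 + 16):
    print(padding, type(read_lines(padding)).__name__)
```

Printing "list" means the file was read without error; "UnicodeDecodeError" means the failure from the demo was reproduced for that padding.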

All this seems to indicate that the UTF-8 decoder has problems when it is used for incremental decoding and a lone surrogate lies at a block boundary.
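
The suspicion can probably be checked without any file, by driving the incremental decoder directly. This is only a sketch: the helper name decode_in_chunks and the sample bytes are made up for illustration, and which split positions fail (if any) depends on the interpreter version. My guess, based on position 8190 in the traceback, is that the split after the second byte of the surrogate (offset 4 here) is the critical one.

```python
import codecs


def decode_in_chunks(data, split_at):
    """Decode `data` as two chunks with an incremental UTF-8 decoder."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="surrogatepass")
    head = decoder.decode(data[:split_at])  # final=False: more data follows
    tail = decoder.decode(data[split_at:], final=True)
    return head + tail


# Bytes: x x ED A4 80 y y (the middle three bytes encode the lone surrogate U+D900)
data = b"xx" + "\ud900".encode("utf-8", "surrogatepass") + b"yy"
whole = data.decode("utf-8", "surrogatepass")  # one-shot decoding, as with .read()

# Try every split position and compare against the one-shot result.
for split_at in range(len(data) + 1):
    try:
        ok = decode_in_chunks(data, split_at) == whole
        print(split_at, "ok" if ok else "mismatch")
    except UnicodeDecodeError as exc:
        print(split_at, "failed:", exc.reason)
```

If the incremental decoder handles surrogatepass correctly, every split position should print "ok".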
History
Date                 User   Action  Args
2015-05-16 22:56:34  RalfM  set     recipients: + RalfM, vstinner, ezio.melotti
2015-05-16 22:56:34  RalfM  set     messageid: <1431816994.33.0.797161516902.issue24214@psf.upfronthosting.co.za>
2015-05-16 22:56:34  RalfM  link    issue24214 messages
2015-05-16 22:56:33  RalfM  create