classification
Title: UTF-8 incremental decoder doesn't support surrogatepass correctly
Type: behavior
Stage: resolved
Components: Interpreter Core, Unicode
Versions: Python 3.8, Python 3.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: serhiy.storchaka
Nosy List: RalfM, ezio.melotti, inada.naoki, serhiy.storchaka, vstinner
Priority: high
Keywords: patch

Created on 2015-05-16 22:56 by RalfM, last changed 2019-03-30 13:54 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description
Demo.txt RalfM, 2015-05-16 22:56 File to demonstrate the issue
surrogatepass.patch vstinner, 2016-07-27 16:33
Pull Requests
URL Status Linked
PR 12603 merged serhiy.storchaka, 2019-03-28 13:27
PR 12627 merged miss-islington, 2019-03-30 06:23
Messages (7)
msg243376 - Author: RalfM Date: 2015-05-16 22:56
I have a UTF-8 encoded file containing single surrogates. Reading this file with errors="surrogatepass" works fine when I read the whole file with .read(), but raises a UnicodeDecodeError when I read the file line by line:

----- start of demo -----
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:44:40) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> with open("Demo.txt", encoding="utf-8", errors="surrogatepass") as f:
...   s = f.read()
...
>>> "\ud900" in s
True
>>> with open("Demo.txt", encoding="utf-8", errors="surrogatepass") as f:
...   for line in f:
...     pass
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python\34x64\lib\codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 8190: invalid continuation byte
>>>
----- end of demo -----

I attached the file used for the demo such that you can reproduce the problem.

If I change all 0xED bytes in the file to 0xEC (i.e. effectively change all surrogates to non-surrogates), the problem disappears.
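To illustrate why that byte swap matters, here is a minimal sketch (the byte strings are illustrative and not taken from Demo.txt). The three-byte sequence ED A4 80 encodes the lone surrogate U+D900, which the strict UTF-8 decoder rejects, while EC A4 80 encodes the ordinary character U+C900:

```python
# Illustrative byte strings; they are not taken from Demo.txt.
surrogate_bytes = b"\xed\xa4\x80"   # encodes the lone surrogate U+D900
ordinary_bytes = b"\xec\xa4\x80"    # same bytes with 0xED replaced by 0xEC

# surrogatepass lets the lone surrogate through in one-shot decoding.
surrogate_char = surrogate_bytes.decode("utf-8", errors="surrogatepass")

# The default strict handler rejects the same bytes.
try:
    surrogate_bytes.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True

# After the 0xED -> 0xEC swap the sequence is a normal character.
ordinary_char = ordinary_bytes.decode("utf-8")

print(hex(ord(surrogate_char)), strict_rejects, hex(ord(ordinary_char)))
```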

The original file in which I noticed the problem was 73 MB. The demo file was derived from the original by removing data around the critical section, keeping the alignment to 16 KiB blocks, and then replacing all printable ASCII characters with x.

If I change the file length by adding or removing 16 bytes to / from the beginning of the demo file, the problem disappears, so alignment seems to be an issue.

All this seems to indicate that the utf-8 decoder has problems when used for incremental decoding and a single surrogate appears around the block boundary.
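The alignment dependence can be explored without a file. The sketch below is only an approximation of what a buffered text reader does: `decode_in_blocks` is a hypothetical helper, and the tiny block sizes stand in for the real read buffer. It feeds the same bytes to an incremental decoder in fixed-size blocks and compares each result with one-shot decoding:

```python
import codecs

def decode_in_blocks(data, block_size, errors="surrogatepass"):
    """Decode UTF-8 bytes block by block (hypothetical helper for illustration)."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors=errors)
    parts = []
    for start in range(0, len(data), block_size):
        parts.append(decoder.decode(data[start:start + block_size]))
    # Flush the decoder; this raises if an incomplete sequence is left over.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

# Stand-in for Demo.txt: six padding bytes, one lone surrogate, then "x\n".
# With block_size=8 the boundary falls after the first two bytes of ED A4 80;
# with block_size=7 it falls after the first byte.
data = b"x" * 6 + b"\xed\xa4\x80" + b"x\n"
expected = data.decode("utf-8", errors="surrogatepass")

results = {size: decode_in_blocks(data, size) for size in (1, 7, 8, 16)}
for size, text in results.items():
    print(size, text == expected)
```

On an interpreter that includes the fix, every block size should match the one-shot result. On the affected versions, the block size that splits the sequence after its second byte (8 here) is the one expected to raise UnicodeDecodeError.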
msg271412 - Author: RalfM Date: 2016-07-26 20:50
I just tested Python 3.6.0a3, and that (mis)behaves exactly like 3.4.3.
msg271461 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-07-27 16:33
The attached patch fixes the UTF-8 decoder so that incremental decoding works correctly with the surrogatepass error handler.

The bug occurs when b'\xed\xa4\x80' is decoded in two parts: the first two bytes (b'\xed\xa4'), and then the last byte (b'\x80').

It works as expected if we decode the first byte (b'\xed') and then the two last bytes (b'\xa4\x80').
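The two splits can be compared directly with the standard `codecs` incremental decoder. This is a minimal sketch that assumes an interpreter containing the fix; `getstate()` is used only to show which bytes the decoder is holding back between calls:

```python
import codecs

def new_decoder():
    # Incremental UTF-8 decoder using the surrogatepass error handler.
    return codecs.getincrementaldecoder("utf-8")(errors="surrogatepass")

# Split after two bytes: the case that failed in this report.
dec = new_decoder()
first = dec.decode(b"\xed\xa4")       # nothing can be emitted yet
pending, _flags = dec.getstate()      # bytes buffered for the next call
second = dec.decode(b"\x80", final=True)
split_after_two = first + second

# Split after one byte: the case described as already working.
dec = new_decoder()
split_after_one = dec.decode(b"\xed") + dec.decode(b"\xa4\x80", final=True)

print(repr(first), pending, repr(split_after_two), repr(split_after_one))
```

With the fix in place, the first call returns an empty string and keeps both bytes buffered, so the surrogate is produced once the final byte arrives. On the affected versions, the first call of the two-byte split is where the UnicodeDecodeError would be expected.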

My patch tries to preserve the performance of the UTF-8/strict decoder.

@Serhiy: Would you mind reviewing my patch, since you helped to design the fast UTF-8 decoder?
msg271839 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-08-02 18:53
The patch slows down decoding by up to 20%.

$ ./python -m timeit -s 'b = b"\xc4\x80"*10000' -- 'b.decode()'
Unpatched:  10000 loops, best of 3: 50.8 usec per loop
Patched:    10000 loops, best of 3: 63.3 usec per loop

And I'm not sure that fixing this only for the surrogatepass handler is enough. The other standard error handlers appear to work, but what if a user handler consumes more than one byte?
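For context, a user-defined error handler can resume decoding past the reported error range. The sketch below registers a hypothetical handler (`skip_two`, under the made-up name "skip_two_demo") that replaces the bad byte and also swallows the byte after it. It only demonstrates one-shot decoding; how such a handler interacts with block boundaries in incremental decoding is the open question raised here and is not shown:

```python
import codecs

def skip_two(exc):
    """Hypothetical handler: replace the bad byte and skip one extra byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # Resume one byte beyond the end of the reported error range,
    # without running past the end of the input.
    resume_at = min(exc.end + 1, len(exc.object))
    return ("?", resume_at)

# The handler name is made up for this example.
codecs.register_error("skip_two_demo", skip_two)

# 0xFF is never valid in UTF-8, so the error covers exactly that byte;
# the handler then also consumes the following b"b".
result = b"a\xffbc".decode("utf-8", errors="skip_two_demo")
print(result)
```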
msg339036 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-28 13:28
PR 12603 fixes this issue in a more general way and does not affect performance.
msg339177 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-30 06:23
New changeset 7a465cb5ee7e298cae626ace1fc3e7d97df79f2e by Serhiy Storchaka in branch 'master':
bpo-24214: Fixed the UTF-8 incremental decoder. (GH-12603)
https://github.com/python/cpython/commit/7a465cb5ee7e298cae626ace1fc3e7d97df79f2e
msg339199 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-30 13:52
New changeset bd48280cb66544827952ca91e326cbb178c8c461 by Serhiy Storchaka (Miss Islington (bot)) in branch '3.7':
bpo-24214: Fixed the UTF-8 incremental decoder. (GH-12603) (GH-12627)
https://github.com/python/cpython/commit/bd48280cb66544827952ca91e326cbb178c8c461
History
Date User Action Args
2019-03-30 13:54:46 serhiy.storchaka set status: open -> closed
resolution: fixed
stage: patch review -> resolved
2019-03-30 13:52:43 serhiy.storchaka set messages: + msg339199
2019-03-30 06:23:52 miss-islington set pull_requests: + pull_request12560
2019-03-30 06:23:42 serhiy.storchaka set messages: + msg339177
2019-03-28 13:28:40 serhiy.storchaka set messages: + msg339036
2019-03-28 13:27:43 serhiy.storchaka set versions: + Python 3.7, - Python 3.9
2019-03-28 13:27:09 serhiy.storchaka set pull_requests: + pull_request12543
2019-03-28 13:18:01 serhiy.storchaka set versions: + Python 3.8, Python 3.9, - Python 3.5, Python 3.6
2019-03-28 13:17:47 serhiy.storchaka set assignee: serhiy.storchaka
2019-03-28 11:40:57 inada.naoki set nosy: + inada.naoki
2016-08-02 18:53:51 serhiy.storchaka set priority: normal -> high
messages: + msg271839
components: + Interpreter Core
2016-07-27 17:41:02 serhiy.storchaka set nosy: + serhiy.storchaka
stage: patch review
versions: + Python 3.5, - Python 3.4
2016-07-27 16:33:32 vstinner set title: Exception with utf-8, surrogatepass and incremental decoding -> UTF-8 incremental decoder doesn't support surrogatepass correctly
2016-07-27 16:33:22 vstinner set files: + surrogatepass.patch
keywords: + patch
messages: + msg271461
2016-07-26 20:50:35 RalfM set messages: + msg271412
versions: + Python 3.6
2015-05-16 22:56:34 RalfM create