Issue 24214: UTF-8 incremental decoder doesn't support surrogatepass correctly

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/68402

classification

Title:	UTF-8 incremental decoder doesn't support surrogatepass correctly
Type:	behavior	Stage:	resolved
Components:	Interpreter Core, Unicode	Versions:	Python 3.9, Python 3.8, Python 3.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	serhiy.storchaka	Nosy List:	RalfM, ezio.melotti, methane, miss-islington, ned.deily, roufique7, serhiy.storchaka, vstinner, xtreak
Priority:	high	Keywords:	patch

Created on 2015-05-16 22:56 by RalfM, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
Demo.txt	RalfM, 2015-05-16 22:56	File to demonstrate the issue
surrogatepass.patch	vstinner, 2016-07-27 16:33		review

Pull Requests
URL	Status	Linked	Edit
PR 12603	merged	serhiy.storchaka, 2019-03-28 13:27
PR 12627	merged	miss-islington, 2019-03-30 06:23
PR 14304	merged	serhiy.storchaka, 2019-06-22 09:57
PR 14368	merged	miss-islington, 2019-06-25 08:54
PR 14369	merged	miss-islington, 2019-06-25 08:54

Messages (20)
msg243376 - (view)	Author: (RalfM)	Date: 2015-05-16 22:56
I have an utf-8 encoded file containing single surrogates. Reading this file, specifying surrgatepass, works fine when I read the whole file with .read(), but raises an UnicodeDecodeError when I read the file line by line: ----- start of demo ----- Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:44:40) [MSC v.1600 64 bit (AM D64)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> with open("Demo.txt", encoding="utf-8", errors="surrogatepass") as f: ... s = f.read() ... >>> "\ud900" in s True >>> with open("Demo.txt", encoding="utf-8", errors="surrogatepass") as f: ... for line in f: ... pass ... Traceback (most recent call last): File "<stdin>", line 2, in <module> File "C:\Python\34x64\lib\codecs.py", line 319, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 8190: inval id continuation byte >>> ----- end of demo ----- I attached the file used for the demo such that you can reproduce the problem. If I change all 0xED bytes in the file to 0xEC (i.e. effectively change all surrogates to non-surrogates), the problem disappears. The original file I noticed the problem with was 73 MB. The demo file was derived from the original by removing data around the critical section, keeping the alignment to 16-k-blocks, and then replacing all printable ASCII characters by x. If I change the file length by adding or removing 16 bytes to / from the beginning of the demo file, the problem disappears, so alignment seems to be an issue. All this seems to indicate that the utf-8 decoder has problems when used for incremental decoding and a single surrogate appears around the block boundary.
msg271412 - (view)	Author: (RalfM)	Date: 2016-07-26 20:50
I just tested Python 3.6.0a3, and that (mis)behaves exactly like 3.4.3.
msg271461 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-07-27 16:33
Attached patch fixes the UTF-8 decoder to support correctly incremental decoder using surrogatepass error handler. The bug occurs when b'\xed\xa4\x80' is decoded in two parts: the first two bytes (b'\xed\xa4'), and then the last byte (b'\x80'). It works as expected if we decode the first byte (b'\xed') and then the two last bytes (b'\xa4\x80'). My patch tries to keep best performances for the UTF-8/strict decoder. @Serhiy: Would you mind to review my patch since you helped to design the fast UTF-8 decoder?
msg271839 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-08-02 18:53
The patch slows down decoding up to 20%. $ ./python -m timeit -s 'b = b"\xc4\x80"*10000' -- 'b.decode()' Unpatched: 10000 loops, best of 3: 50.8 usec per loop Patched: 10000 loops, best of 3: 63.3 usec per loop And I'm not sure that fixing only for the surrogatepass handler is enough. Other standard error handlers look working, but what if a user handler consumes more then one byte?
msg339036 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-03-28 13:28
PR 12603 fixes this issue in more general way and does not affect performance.
msg339177 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-03-30 06:23
New changeset 7a465cb5ee7e298cae626ace1fc3e7d97df79f2e by Serhiy Storchaka in branch 'master': bpo-24214: Fixed the UTF-8 incremental decoder. (GH-12603) https://github.com/python/cpython/commit/7a465cb5ee7e298cae626ace1fc3e7d97df79f2e
msg339199 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-03-30 13:52
New changeset bd48280cb66544827952ca91e326cbb178c8c461 by Serhiy Storchaka (Miss Islington (bot)) in branch '3.7': bpo-24214: Fixed the UTF-8 incremental decoder. (GH-12603) (GH-12627) https://github.com/python/cpython/commit/bd48280cb66544827952ca91e326cbb178c8c461
msg346152 - (view)	Author: Karthikeyan Singaravelan (xtreak) *	Date: 2019-06-20 18:24
This change seems to have caused test failure reported in https://github.com/python-hyper/wsproto/issues/126 from codecs import getincrementaldecoder decoder = getincrementaldecoder("utf-8")() print(decoder.decode(b'f\xf1\xf6rd', False)) # With this commit 7a465cb5ee ➜ cpython git:(7a465cb5ee) ./python.exe /tmp/foo.py f Before 7a465cb5ee ➜ cpython git:(38f4e468d4) ./python.exe /tmp/foo.py Traceback (most recent call last): File "/tmp/foo.py", line 3, in <module> print(decoder.decode(b'f\xf1\xf6rd', False)) File "/Users/karthikeyansingaravelan/stuff/python/cpython/Lib/codecs.py", line 322, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 1: invalid continuation byte
msg346248 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-06-21 21:06
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 1: invalid continuation byte Python is right: b'f\xf1\xf6rd' is not a valid UTF-8 string: $ python3 Python 3.7.3 (default, May 11 2019, 00:38:04) >>> b'f\xf1\xf6rd'.decode() Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 1: invalid continuation byte This change is deliberate: it makes UTF-8 incremental decoder correct (respect the UTF-8 standard). I close the issue.
msg346273 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-06-22 10:11
Victor, I think you misunderstood the issue. The problem is not that a decoding error is raised. The problem is that the incremental decoder no longer raises where it raised before. I think that both behavior may be correct, and that it is better to not rely on ability of the incremental decoder with final=False to detect an invalid encoded data, but I see now that it is possible to fix for the original issue more carefully, without changing that behavior. PR 14304 does this. It also change the UTF-16 incremental decoder with the surrogatepass error handler to return a non-empty data when decode a low surrogate with final=False. It is not necessary, but it makes all UTF-* decoders consistent and makes tests simpler.
msg346494 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-06-25 08:54
New changeset 894263ba80af4b7733c2df95b527e96953922656 by Serhiy Storchaka in branch 'master': bpo-24214: Fixed the UTF-8 and UTF-16 incremental decoders. (GH-14304) https://github.com/python/cpython/commit/894263ba80af4b7733c2df95b527e96953922656
msg346496 - (view)	Author: miss-islington (miss-islington)	Date: 2019-06-25 09:12
New changeset d32594ad27f48a898d42a0ea30b9d007b1c57de9 by Miss Islington (bot) in branch '3.8': bpo-24214: Fixed the UTF-8 and UTF-16 incremental decoders. (GH-14304) https://github.com/python/cpython/commit/d32594ad27f48a898d42a0ea30b9d007b1c57de9
msg346503 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-06-25 10:29
New changeset c755ca89c75252a7aae9beae82fd47787a76b9e2 by Victor Stinner (Miss Islington (bot)) in branch '3.7': [3.7] bpo-24214: Fixed the UTF-8 and UTF-16 incremental decoders. (GH-14304) (GH-14369) https://github.com/python/cpython/commit/c755ca89c75252a7aae9beae82fd47787a76b9e2
msg346504 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-06-25 10:30
Thanks Karthikeyan Singaravelan for the bug report of the regression. You're right that I misunderstood it. Thanks Serhiy for the second fix. The regression should now be fixed as well, I close the issue again.
msg346822 - (view)	Author: Ned Deily (ned.deily) *	Date: 2019-06-28 14:47
Since the latest fix (GH-14369) addresses a regression between 3.7.3 and 3.7.4rc1, I am going to cherry-pick it into 3.7.4 final. (In retrospect, this regression should have been marked as a "release blocker".)
msg346838 - (view)	Author: Karthikeyan Singaravelan (xtreak) *	Date: 2019-06-28 17:18
Thanks Ned, is there a general policy on regressions to be marked as release blocker, like if a regression that was made in 3.7.3 then it acts as a release blocker for 3.7.4 or is it based on severity? ?
msg346839 - (view)	Author: Ned Deily (ned.deily) *	Date: 2019-06-28 17:31
"The only changes allowed to occur in a maintenance branch without debate are bug fixes. Also, a general rule for maintenance branches is that compatibility must not be broken at any point between sibling minor releases (3.5.1, 3.5.2, etc.). For both rules, only rare exceptions are accepted and must be discussed first." https://devguide.python.org/devcycle/#maintenance-branches The principle here is that we "promise" users that they can upgrade from any version of 3.n.x to the latest version of 3.n without needing to make any changes except in very rare cases when there is an overriding concern and which must be well-documented in the release materials. We're human so we sometimes slip up and inadvertently break that promise but that's the goal. And it's because of that promise that we can take the approach of immediately obsoleting previous older micro releases when a new micro release occurs, i.e. we don't provide fixes for 3.7.2 once 3.7.3 is released. So, from my perspective, pretty much any regression between micro releases is a release blocker but especially in a case like this where it can be addressed before a final release. That's basically why we do release candidates :)
msg346842 - (view)	Author: Karthikeyan Singaravelan (xtreak) *	Date: 2019-06-28 17:38
Thanks Ned for the details, in future reports I will also try to add the respective release manager to the issue in case of regressions.
msg346844 - (view)	Author: Ned Deily (ned.deily) *	Date: 2019-06-28 17:45
@xtreak, If you set the priority to "release blocker", the appropriate release managers will be nosied automatically. I'd rather see too many potential "release blocker"s than too few.
msg347163 - (view)	Author: Ned Deily (ned.deily) *	Date: 2019-07-02 22:34
New changeset 30c2ae4dcfd19acbdfb7045151c73d5700eec7b4 by Ned Deily (Miss Islington (bot)) in branch '3.7': [3.7] bpo-24214: Fixed the UTF-8 and UTF-16 incremental decoders. (GH-14304) (GH-14369) https://github.com/python/cpython/commit/30c2ae4dcfd19acbdfb7045151c73d5700eec7b4

History
Date	User	Action	Args
2022-04-11 14:58:16	admin	set	github: 68402
2019-07-02 22:34:03	ned.deily	set	messages: + msg347163
2019-06-28 17:45:53	ned.deily	set	messages: + msg346844
2019-06-28 17:38:33	xtreak	set	messages: + msg346842
2019-06-28 17:31:13	ned.deily	set	messages: + msg346839
2019-06-28 17:18:44	xtreak	set	messages: + msg346838
2019-06-28 14:47:04	ned.deily	set	nosy: + ned.deily messages: + msg346822
2019-06-25 10:30:35	vstinner	set	status: open -> closed resolution: fixed messages: + msg346504 stage: patch review -> resolved
2019-06-25 10:29:28	vstinner	set	messages: + msg346503
2019-06-25 09:12:21	miss-islington	set	nosy: + miss-islington messages: + msg346496
2019-06-25 08:54:44	miss-islington	set	pull_requests: + pull_request14186
2019-06-25 08:54:37	miss-islington	set	pull_requests: + pull_request14185
2019-06-25 08:54:21	serhiy.storchaka	set	messages: + msg346494
2019-06-22 10:11:05	serhiy.storchaka	set	status: closed -> open versions: + Python 3.9 messages: + msg346273 resolution: fixed -> (no value) stage: resolved -> patch review
2019-06-22 09:57:37	serhiy.storchaka	set	pull_requests: + pull_request14128
2019-06-21 21:06:40	vstinner	set	status: open -> closed resolution: fixed messages: + msg346248
2019-06-21 18:26:38	roufique7	set	nosy: + roufique7
2019-06-20 22:03:57	vstinner	set	status: closed -> open resolution: fixed -> (no value)
2019-06-20 18:24:03	xtreak	set	nosy: + xtreak messages: + msg346152
2019-03-30 13:54:46	serhiy.storchaka	set	status: open -> closed resolution: fixed stage: patch review -> resolved
2019-03-30 13:52:43	serhiy.storchaka	set	messages: + msg339199
2019-03-30 06:23:52	miss-islington	set	pull_requests: + pull_request12560
2019-03-30 06:23:42	serhiy.storchaka	set	messages: + msg339177
2019-03-28 13:28:40	serhiy.storchaka	set	messages: + msg339036
2019-03-28 13:27:43	serhiy.storchaka	set	versions: + Python 3.7, - Python 3.9
2019-03-28 13:27:09	serhiy.storchaka	set	pull_requests: + pull_request12543
2019-03-28 13:18:01	serhiy.storchaka	set	versions: + Python 3.8, Python 3.9, - Python 3.5, Python 3.6
2019-03-28 13:17:47	serhiy.storchaka	set	assignee: serhiy.storchaka
2019-03-28 11:40:57	methane	set	nosy: + methane
2016-08-02 18:53:51	serhiy.storchaka	set	priority: normal -> high messages: + msg271839 components: + Interpreter Core
2016-07-27 17:41:02	serhiy.storchaka	set	nosy: + serhiy.storchaka stage: patch review versions: + Python 3.5, - Python 3.4
2016-07-27 16:33:32	vstinner	set	title: Exception with utf-8, surrogatepass and incremental decoding -> UTF-8 incremental decoder doesn't support surrogatepass correctly
2016-07-27 16:33:22	vstinner	set	files: + surrogatepass.patch keywords: + patch messages: + msg271461
2016-07-26 20:50:35	RalfM	set	messages: + msg271412 versions: + Python 3.6
2015-05-16 22:56:34	RalfM	create