Issue 34935: Misleading error message in str.decode()

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/79116

classification

Title:	Misleading error message in str.decode()
Type:	behavior	Stage:	resolved
Components:	Unicode	Versions:	Python 3.7

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:	ezio.melotti	Nosy List:	doerwalter, ezio.melotti, serhiy.storchaka, vstinner
Priority:	normal	Keywords:

Created on 2018-10-08 15:38 by doerwalter, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (3)
msg327357 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2018-10-08 15:38
The following code issues a misleading exception message:

>>> b'\xed\xa0\xbd\xed\xb3\x9e'.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

The cause for the exception is not an invalid continuation byte, but UTF-8 encoded surrogates. In fact using the 'surrogatepass' error handler doesn't raise an exception:

>>> b'\xed\xa0\xbd\xed\xb3\x9e'.decode("utf-8", "surrogatepass")
'\ud83d\udcde'

I would have expected an exception message like:

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-2: surrogates not allowed

(Note that the input bytes are an improperly UTF-8 encoded version of U+1F4DE (telephone receiver))
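
The report can be reproduced and inspected with a short script. This is a minimal sketch: the variable names are illustrative, and the UTF-16 round trip used to recombine the surrogate pair is one common idiom, not something proposed in this issue.

```python
# Each half of a surrogate pair, encoded as its own 3-byte sequence.
data = b'\xed\xa0\xbd\xed\xb3\x9e'

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    # Keep the details: `exc` is unbound once the except block ends.
    err_span = (exc.start, exc.end, exc.reason)

# The strict decoder blames only the lead byte, not the whole sequence.
print(err_span)

# 'surrogatepass' lets the two lone surrogates through unchanged.
surrogates = data.decode("utf-8", "surrogatepass")
print(ascii(surrogates))

# Round-tripping through UTF-16 merges the pair into a single code point.
merged = surrogates.encode("utf-16", "surrogatepass").decode("utf-16")
print(hex(ord(merged)), merged.encode("utf-8"))
```

The last line shows the well-formed 4-byte UTF-8 encoding of U+1F4DE, for comparison with the 6-byte input.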
msg327358 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2018-10-08 15:52
This behavior is intentional, for conformance with the Unicode Standard recommendations. See issue8271.
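
As a hedged illustration of how the decoder splits up this input, the sketch below registers a recording error handler and decodes the same bytes. It assumes the conformance point concerns how much of an ill-formed sequence each error covers; the handler name "record" and the helper names are hypothetical.

```python
import codecs

data = b'\xed\xa0\xbd\xed\xb3\x9e'
errors = []

def record(exc):
    # Hypothetical handler: log the error span and reason,
    # then substitute U+FFFD and resume right after the span.
    errors.append((exc.start, exc.end, exc.reason))
    return ("\ufffd", exc.end)

codecs.register_error("record", record)

replaced = data.decode("utf-8", "record")
for entry in errors:
    print(entry)
print(len(replaced))
```

Each byte is reported as its own one-byte error: the lead byte 0xED fails because the byte that follows it is not acceptable, and the two remaining bytes are then rejected individually because neither can start a sequence.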
msg327362 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2018-10-08 16:48
OK, I see, http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (Table 3-7 on page 93) states that the only valid 3-byte UTF-8 sequences starting with the byte 0xED have a second byte in the range 0x80 to 0x9F. 0xA0 is just beyond that range (as that would result in an encoded surrogate). Python handles all invalid sequences according to that table with the same error message. I think this issue can be closed.
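
The second-byte rule from Table 3-7 can be written down as a small lookup. This is a sketch for illustration only: `second_byte_range` is a hypothetical helper, not part of Python's codec machinery, and the ranges are transcribed from the table of well-formed UTF-8 byte sequences.

```python
def second_byte_range(lead):
    """Return (low, high) for the byte that may follow `lead`,
    or None if `lead` cannot start a multi-byte sequence."""
    if 0xC2 <= lead <= 0xDF:
        return (0x80, 0xBF)
    if lead == 0xE0:
        return (0xA0, 0xBF)   # excludes overlong 3-byte forms
    if lead == 0xED:
        return (0x80, 0x9F)   # excludes surrogates U+D800..U+DFFF
    if 0xE1 <= lead <= 0xEF:
        return (0x80, 0xBF)
    if lead == 0xF0:
        return (0x90, 0xBF)   # excludes overlong 4-byte forms
    if 0xF1 <= lead <= 0xF3:
        return (0x80, 0xBF)
    if lead == 0xF4:
        return (0x80, 0x8F)   # excludes values above U+10FFFF
    return None

low, high = second_byte_range(0xED)
print(low <= 0x9F <= high)  # last second byte allowed after 0xED
print(low <= 0xA0 <= high)  # the byte from this report
```

With this table in hand, 0xA0 after 0xED is rejected for the same reason as any other out-of-range second byte, which is why the decoder reports it as an invalid continuation byte.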

History
Date	User	Action	Args
2022-04-11 14:59:06	admin	set	github: 79116
2018-10-10 08:40:35	ezio.melotti	set	status: open -> closed assignee: ezio.melotti type: behavior resolution: not a bug stage: resolved
2018-10-08 16:48:24	doerwalter	set	messages: + msg327362
2018-10-08 15:52:44	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg327358
2018-10-08 15:38:20	doerwalter	create