classification
Title: Misleading error message in str.decode()
Type: behavior
Stage: resolved
Components: Unicode
Versions: Python 3.7
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To: ezio.melotti
Nosy List: doerwalter, ezio.melotti, serhiy.storchaka, vstinner
Priority: normal
Keywords:

Created on 2018-10-08 15:38 by doerwalter, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (3)
msg327357 - Author: Walter Dörwald (doerwalter) (Python committer) Date: 2018-10-08 15:38
The following code issues a misleading exception message:

>>> b'\xed\xa0\xbd\xed\xb3\x9e'.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

The cause of the exception is *not* an invalid continuation byte, but UTF-8 encoded surrogates. In fact, decoding with the 'surrogatepass' error handler doesn't raise an exception:

>>> b'\xed\xa0\xbd\xed\xb3\x9e'.decode("utf-8", "surrogatepass")
'\ud83d\udcde'

I would have expected an exception message like:

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-2: surrogates not allowed

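(Note that the input bytes are an improperly UTF-8 encoded version of U+1F4DE (TELEPHONE RECEIVER): each half of its UTF-16 surrogate pair, U+D83D and U+DCDE, has been encoded as a separate three-byte sequence.)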
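To make the note above concrete, here is a minimal sketch of how such a byte string can come about. It is not taken from the report; the variable names are illustrative, and it assumes a Python 3 version in which the UTF-16 codecs accept the 'surrogatepass' handler (3.4 or later).

```python
# Sketch only: variable names are illustrative, not from the original report.
phone = "\U0001F4DE"  # U+1F4DE TELEPHONE RECEIVER

# Well-formed UTF-8 encodes the code point directly as four bytes.
well_formed = phone.encode("utf-8")

# Split the character into its UTF-16 surrogate pair.
utf16 = phone.encode("utf-16-be")
high = int.from_bytes(utf16[0:2], "big")
low = int.from_bytes(utf16[2:4], "big")

# Encode each surrogate as if it were a code point of its own.
# 'surrogatepass' is required because the strict encoder rejects surrogates.
ill_formed = (chr(high) + chr(low)).encode("utf-8", "surrogatepass")

print(well_formed)
print(hex(high), hex(low))
print(ill_formed)

# Recover the intended character: decode the surrogates leniently, then
# let a UTF-16 round trip combine the pair into one code point.
recovered = (
    ill_formed.decode("utf-8", "surrogatepass")
    .encode("utf-16-be", "surrogatepass")
    .decode("utf-16-be")
)
print(recovered == phone)
```

The last step is one possible way to repair data of this shape; whether it is appropriate depends on where the bytes came from.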
msg327358 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2018-10-08 15:52
This behavior is intentional, for conformance with the Unicode Standard
recommendations. See issue8271.
msg327362 - Author: Walter Dörwald (doerwalter) (Python committer) Date: 2018-10-08 16:48
OK, I see. http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (Table 3-7 on page 93) states that the only valid three-byte UTF-8 sequences starting with the byte 0xED have a second byte in the range 0x80 to 0x9F. 0xA0 is just beyond that range (a second byte of 0xA0 or above would yield an encoded surrogate). Python reports every sequence that is invalid according to that table with the same error message. I think this issue can be closed.
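The rule from Table 3-7 can be sketched as a small lookup for three-byte lead bytes and compared with what the decoder reports. The helper `second_byte_range` is hypothetical and written only for illustration; it is not CPython's decoder, which implements the check in C.

```python
def second_byte_range(lead):
    """Allowed range for the byte following a three-byte UTF-8 lead byte.

    Hypothetical helper that mirrors the three-byte rows of Table 3-7.
    """
    if lead == 0xE0:
        return (0xA0, 0xBF)   # excludes overlong encodings
    if lead == 0xED:
        return (0x80, 0x9F)   # excludes surrogates U+D800..U+DFFF
    if 0xE1 <= lead <= 0xEF:
        return (0x80, 0xBF)
    raise ValueError("not a three-byte lead byte: %#x" % lead)


data = b'\xed\xa0\xbd\xed\xb3\x9e'

# 0xA0 falls outside the range allowed after 0xED.
low_bound, high_bound = second_byte_range(data[0])
second_ok = low_bound <= data[1] <= high_bound
print(second_ok)

# The strict decoder reports this as an invalid continuation byte
# and blames only the lead byte at position 0.
try:
    data.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc
    print(exc.reason, exc.start, exc.end)
```

Seen this way, 0xA0 is a continuation byte by its bit pattern, but not a valid one after 0xED, which is why the message says "invalid continuation byte" rather than mentioning surrogates.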
History
Date                 User              Action  Args
2022-04-11 14:59:06  admin             set     github: 79116
2018-10-10 08:40:35  ezio.melotti      set     status: open -> closed; assignee: ezio.melotti; type: behavior; resolution: not a bug; stage: resolved
2018-10-08 16:48:24  doerwalter        set     messages: + msg327362
2018-10-08 15:52:44  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg327358
2018-10-08 15:38:20  doerwalter        create