Issue 26260: utf8 decoding inconsistency between P2 and P3

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/70448

classification

Title:	utf8 decoding inconsistency between P2 and P3
Type:	enhancement	Stage:
Components:	Unicode	Versions:	Python 2.7

process

Created on 2016-02-01 16:40 by jinz, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (3)
msg259329 - (view)	Author: Jim Jin (jinz)	Date: 2016-02-01 16:40
PAYLOAD1 = b'\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5'
PAYLOAD2 = b'\xed\xa0\x80'
PAYLOAD3 = b'\x65\x64\x69\x74\x65\x64'
PAYLOAD = PAYLOAD1 + PAYLOAD2 + PAYLOAD3

PAYLOAD.decode('utf8') passes in P2.7.* and fails in P3.4

Thank you for reading.
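The report can be reproduced on Python 3 with a small sketch like the one below. The helper name `locate_decode_error` is hypothetical and only there for illustration; it assumes a Python 3 interpreter and reports where the strict decoder gives up rather than letting the exception propagate.

```python
# Same payload as in the report above.
PAYLOAD1 = b'\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5'  # 11 bytes of valid UTF-8 (Greek text)
PAYLOAD2 = b'\xed\xa0\x80'  # UTF-8-style byte pattern for the lone surrogate U+D800
PAYLOAD3 = b'\x65\x64\x69\x74\x65\x64'  # ASCII "edited"
PAYLOAD = PAYLOAD1 + PAYLOAD2 + PAYLOAD3


def locate_decode_error(data):
    """Hypothetical helper: return (start, reason) of the first strict
    UTF-8 decoding error in `data`, or None if it decodes cleanly."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as exc:
        return exc.start, exc.reason
    return None


if __name__ == '__main__':
    # The error offset points at the first byte of PAYLOAD2.
    print(locate_decode_error(PAYLOAD))
    # Without the surrogate bytes the rest of the payload is valid UTF-8.
    print(locate_decode_error(PAYLOAD1 + PAYLOAD3))
```

The failure is confined to the three bytes of PAYLOAD2; the surrounding bytes decode without complaint.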
msg259330 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-02-01 16:54
> PAYLOAD.decode('utf8') passes in P2.7.* and fails in P3.4

Well, the Python 2 decoder didn't respect the Unicode standard. Please see:
http://unicodebook.readthedocs.org/issues.html#non-strict-utf-8-decoder-overlong-byte-sequences-and-surrogates

Python 3 is now stricter. You can still decode surrogate characters if you need them for a good reason using:

>>> b'\xed\xa0\x80'.decode('utf-8', 'surrogatepass')
'\ud800'

By the way, there is also:

>>> b'\xed\xa0\x80'.decode('utf-8', 'surrogateescape')
'\udced\udca0\udc80'

which is very different but may also help.

I suggest closing the issue as NOT A BUG.
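To make the difference between the two error handlers concrete, here is a minimal sketch, assuming Python 3. The names `LONE_SURROGATE`, `decode_with`, and `strict_encode_fails` are hypothetical and introduced only for this illustration. It shows that `surrogatepass` yields a single code point, `surrogateescape` yields one code point per undecodable byte, and that both results convert back to the original bytes only when the same handler is used for encoding.

```python
LONE_SURROGATE = b'\xed\xa0\x80'  # the bytes from PAYLOAD2 in the report


def decode_with(handler):
    """Decode the surrogate bytes as UTF-8 using the given error handler."""
    return LONE_SURROGATE.decode('utf-8', handler)


def strict_encode_fails(text):
    """Return True if `text` cannot be encoded as strict UTF-8."""
    try:
        text.encode('utf-8')
    except UnicodeEncodeError:
        return True
    return False


# 'surrogatepass' treats the three bytes as one encoded surrogate: '\ud800'.
passed = decode_with('surrogatepass')

# 'surrogateescape' maps each undecodable byte 0xNN to U+DCNN:
# '\udced\udca0\udc80'.
escaped = decode_with('surrogateescape')

# Each result round-trips to the original bytes with the matching handler.
assert passed.encode('utf-8', 'surrogatepass') == LONE_SURROGATE
assert escaped.encode('utf-8', 'surrogateescape') == LONE_SURROGATE

# Neither string can be encoded with the default strict handler.
assert strict_encode_fails(passed)
assert strict_encode_fails(escaped)
```

Either handler therefore has to be applied consistently on both the decoding and the encoding side; a string produced this way will raise if it later reaches a strict encoder.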
msg259331 - (view)	Author: Jim Jin (jinz)	Date: 2016-02-01 16:57
Thank you very much for your help!

History
Date	User	Action	Args
2022-04-11 14:58:27	admin	set	github: 70448
2016-02-01 17:02:21	vstinner	set	status: open -> closed resolution: not a bug
2016-02-01 16:57:38	jinz	set	messages: + msg259331
2016-02-01 16:54:25	vstinner	set	messages: + msg259330
2016-02-01 16:40:22	jinz	create