utf8 decoding inconsistency between P2 and P3 #70448

jinz · 2016-02-01T16:40:22Z

BPO	26260
Nosy	@vstinner, @ezio-melotti

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2016-02-01.17:02:21.878>
created_at = <Date 2016-02-01.16:40:22.437>
labels = ['type-feature', 'invalid', 'expert-unicode']
title = 'utf8 decoding inconsistency between P2 and P3'
updated_at = <Date 2016-02-01.17:02:21.878>
user = 'https://bugs.python.org/jinz'

bugs.python.org fields:

activity = <Date 2016-02-01.17:02:21.878>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2016-02-01.17:02:21.878>
closer = 'vstinner'
components = ['Unicode']
creation = <Date 2016-02-01.16:40:22.437>
creator = 'jinz'
dependencies = []
files = []
hgrepos = []
issue_num = 26260
keywords = []
message_count = 3.0
messages = ['259329', '259330', '259331']
nosy_count = 3.0
nosy_names = ['vstinner', 'ezio.melotti', 'jinz']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = None
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue26260'
versions = ['Python 2.7']

jinz · 2016-02-01T16:40:22Z

PAYLOAD1 = b'\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5'
PAYLOAD2 = b'\xed\xa0\x80'
PAYLOAD3 = b'\x65\x64\x69\x74\x65\x64'
PAYLOAD = PAYLOAD1 + PAYLOAD2 + PAYLOAD3

PAYLOAD.decode('utf8') succeeds in Python 2.7.* and fails in Python 3.4.

Thank you for reading.

vstinner · 2016-02-01T16:54:26Z

PAYLOAD.decode('utf8') passes in P2.7.* and fails in P3.4

Well, the Python 2 decoder didn't respect the Unicode standard. Please see:
http://unicodebook.readthedocs.org/issues.html#non-strict-utf-8-decoder-overlong-byte-sequences-and-surrogates
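For reference, here is a minimal sketch of what the strict decoder does with the reported payload, assuming Python 3. The variable name `error` is only for illustration. The exception's `start` attribute points at the first byte of the encoded surrogate, which follows the 11 bytes of valid Greek text:

```python
PAYLOAD = (
    b'\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5'  # valid UTF-8 (Greek text, 11 bytes)
    + b'\xed\xa0\x80'                                # UTF-8-style encoding of the surrogate U+D800
    + b'\x65\x64\x69\x74\x65\x64'                    # ASCII "edited"
)

try:
    PAYLOAD.decode('utf-8')
    error = None
except UnicodeDecodeError as exc:
    # Python 3's strict decoder rejects the encoded surrogate.
    error = exc

if error is not None:
    # start is the offset of the offending byte within PAYLOAD.
    print(error.start, error.reason)
```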

Python 3 is now stricter. You can still decode surrogate characters if you need them *for a good reason* using:

>>> b'\xed\xa0\x80'.decode('utf-8', 'surrogatepass')
'\ud800'

By the way, there is also:

>>> b'\xed\xa0\x80'.decode('utf-8', 'surrogateescape')
'\udced\udca0\udc80'

which is very different but may also help.
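To make the difference concrete, here is a small sketch, assuming Python 3 (the variable names are illustrative). 'surrogatepass' treats the three bytes as one encoded code point, while 'surrogateescape' treats each of them as an undecodable byte 0xNN and maps it to U+DCNN. Both round-trip back to the original bytes when the same handler is used for encoding, but a strict encode still refuses lone surrogates:

```python
raw = b'\xed\xa0\x80'

# surrogatepass: the three bytes decode to the single lone surrogate U+D800.
passed = raw.decode('utf-8', 'surrogatepass')

# surrogateescape: each undecodable byte 0xNN becomes the code point U+DCNN.
escaped = raw.decode('utf-8', 'surrogateescape')

print(len(passed), len(escaped))

# Encoding with the same error handler restores the original bytes.
roundtrip_pass = passed.encode('utf-8', 'surrogatepass')
roundtrip_escape = escaped.encode('utf-8', 'surrogateescape')

# A strict encode rejects the lone surrogate.
try:
    passed.encode('utf-8')
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
```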

I suggest closing the issue as NOT A BUG.

jinz · 2016-02-01T16:57:38Z

Thank you very much for your help!

jinz mannequin added the topic-unicode and type-feature labels Feb 1, 2016

vstinner closed this as completed Feb 1, 2016

vstinner added the invalid label Feb 1, 2016

ezio-melotti transferred this issue from another repository Apr 10, 2022
