utf8 decoding inconsistency between P2 and P3 #70448

Closed
jinz mannequin opened this issue Feb 1, 2016 · 3 comments
Labels
topic-unicode type-feature A feature request or enhancement

Comments

@jinz
Mannequin

jinz mannequin commented Feb 1, 2016

BPO 26260
Nosy @vstinner, @ezio-melotti

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2016-02-01.17:02:21.878>
created_at = <Date 2016-02-01.16:40:22.437>
labels = ['type-feature', 'invalid', 'expert-unicode']
title = 'utf8 decoding inconsistency between P2 and P3'
updated_at = <Date 2016-02-01.17:02:21.878>
user = 'https://bugs.python.org/jinz'

bugs.python.org fields:

activity = <Date 2016-02-01.17:02:21.878>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2016-02-01.17:02:21.878>
closer = 'vstinner'
components = ['Unicode']
creation = <Date 2016-02-01.16:40:22.437>
creator = 'jinz'
dependencies = []
files = []
hgrepos = []
issue_num = 26260
keywords = []
message_count = 3.0
messages = ['259329', '259330', '259331']
nosy_count = 3.0
nosy_names = ['vstinner', 'ezio.melotti', 'jinz']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = None
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue26260'
versions = ['Python 2.7']

@jinz
Mannequin Author

jinz mannequin commented Feb 1, 2016

PAYLOAD1 = b'\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5'
PAYLOAD2 = b'\xed\xa0\x80'
PAYLOAD3 = b'\x65\x64\x69\x74\x65\x64'
PAYLOAD = PAYLOAD1 + PAYLOAD2 + PAYLOAD3

PAYLOAD.decode('utf8') passes in P2.7.* and fails in P3.4

Thank you for reading.
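
For anyone reproducing the report on Python 3, here is a minimal sketch of the failure. The helper name `try_strict_decode` is hypothetical and only for illustration; the byte strings are the ones from the report. It shows that the strict decoder stops at the first byte of `PAYLOAD2`, while the rest of the payload decodes fine on its own.

```python
# Byte strings taken from the report above.
PAYLOAD1 = b'\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5'  # valid UTF-8 (Greek text)
PAYLOAD2 = b'\xed\xa0\x80'  # UTF-8-style encoding of the lone surrogate U+D800
PAYLOAD3 = b'\x65\x64\x69\x74\x65\x64'  # ASCII "edited"
PAYLOAD = PAYLOAD1 + PAYLOAD2 + PAYLOAD3


def try_strict_decode(data):
    """Hypothetical helper: return (text, None) on success, (None, exc) on failure."""
    try:
        return data.decode('utf-8'), None
    except UnicodeDecodeError as exc:
        return None, exc


text, error = try_strict_decode(PAYLOAD)
if error is not None:
    # error.start is the offset of the first byte the decoder rejected.
    print('strict decode failed at byte offset', error.start, '-', error.reason)

# Without the surrogate bytes, the same data decodes cleanly.
clean_text, clean_error = try_strict_decode(PAYLOAD1 + PAYLOAD3)
print(clean_text)
```

The reported offset equals `len(PAYLOAD1)`, which points at the `0xed` byte that starts the surrogate sequence.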

@jinz jinz mannequin added topic-unicode type-feature A feature request or enhancement labels Feb 1, 2016
@vstinner
Member

vstinner commented Feb 1, 2016

> PAYLOAD.decode('utf8') passes in P2.7.* and fails in P3.4

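Well, the Python 2 decoder didn't respect the Unicode standard: it accepted UTF-8-encoded surrogate code points (U+D800 to U+DFFF), which the standard forbids. Please see: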
http://unicodebook.readthedocs.org/issues.html#non-strict-utf-8-decoder-overlong-byte-sequences-and-surrogates

Python 3 is now stricter. You can still decode surrogate characters if you need them *for a good reason* using:

>>> b'\xed\xa0\x80'.decode('utf-8', 'surrogatepass')
'\ud800'

By the way, there is also:

>>> b'\xed\xa0\x80'.decode('utf-8', 'surrogateescape')
'\udced\udca0\udc80'

which is very different but may also help.
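
To make the difference between the two error handlers concrete, here is a small sketch for Python 3 (the variable names are only illustrative). `surrogatepass` treats the three bytes as one encoded surrogate code point, while `surrogateescape` maps each undecodable byte to its own low surrogate in the range U+DC80 to U+DCFF. Both round-trip back to the original bytes, but only when the same handler is used for encoding.

```python
SURROGATE_BYTES = b'\xed\xa0\x80'

# surrogatepass: the 3 bytes become the single code point U+D800.
passed = SURROGATE_BYTES.decode('utf-8', 'surrogatepass')
print(len(passed), [hex(ord(ch)) for ch in passed])

# surrogateescape: each undecodable byte 0xXX becomes U+DCXX, so 3 code points.
escaped = SURROGATE_BYTES.decode('utf-8', 'surrogateescape')
print(len(escaped), [hex(ord(ch)) for ch in escaped])

# Each result encodes back to the original bytes with its own handler.
roundtrip_pass = passed.encode('utf-8', 'surrogatepass')
roundtrip_escape = escaped.encode('utf-8', 'surrogateescape')

# The strict encoder refuses surrogates, just like the strict decoder.
try:
    passed.encode('utf-8')
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

Which handler fits depends on the goal: `surrogateescape` is meant for carrying arbitrary undecodable bytes through `str` and back, whereas `surrogatepass` is for data that intentionally contains encoded surrogates.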

I suggest closing the issue as NOT A BUG.

@jinz
Mannequin Author

jinz mannequin commented Feb 1, 2016

Thank you very much for your help!

@vstinner vstinner closed this as completed Feb 1, 2016
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022