Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
assignee = None
closed_at = <Date 2016-02-01.17:02:21.878>
created_at = <Date 2016-02-01.16:40:22.437>
labels = ['type-feature', 'invalid', 'expert-unicode']
title = 'utf8 decoding inconsistency between P2 and P3'
updated_at = <Date 2016-02-01.17:02:21.878>
user = 'https://bugs.python.org/jinz'
bugs.python.org fields:
activity = <Date 2016-02-01.17:02:21.878>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2016-02-01.17:02:21.878>
closer = 'vstinner'
components = ['Unicode']
creation = <Date 2016-02-01.16:40:22.437>
creator = 'jinz'
dependencies = []
files = []
hgrepos = []
issue_num = 26260
keywords = []
message_count = 3.0
messages = ['259329', '259330', '259331']
nosy_count = 3.0
nosy_names = ['vstinner', 'ezio.melotti', 'jinz']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = None
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue26260'
versions = ['Python 2.7']
PAYLOAD1 = b'\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5'
PAYLOAD2 = b'\xed\xa0\x80'
PAYLOAD3 = b'\x65\x64\x69\x74\x65\x64'
PAYLOAD = PAYLOAD1 + PAYLOAD2 + PAYLOAD3
PAYLOAD.decode('utf8') succeeds in Python 2.7.* and fails in Python 3.4.
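For anyone reproducing this on Python 3, here is a minimal sketch of the failure. The helper name try_strict_decode is hypothetical and only there for illustration; the payload bytes are the ones from the report.

```python
PAYLOAD1 = b'\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5'
PAYLOAD2 = b'\xed\xa0\x80'  # UTF-8-style encoding of the lone surrogate U+D800
PAYLOAD3 = b'\x65\x64\x69\x74\x65\x64'
PAYLOAD = PAYLOAD1 + PAYLOAD2 + PAYLOAD3


def try_strict_decode(data):
    """Hypothetical helper: strict UTF-8 decode that reports where it failed.

    Returns (text, None) on success, or (None, (start, end, reason)) when
    the default 'strict' error handler rejects the input.
    """
    try:
        return data.decode('utf-8'), None
    except UnicodeDecodeError as exc:
        return None, (exc.start, exc.end, exc.reason)


text, error = try_strict_decode(PAYLOAD)
print(text, error)

# Without the surrogate bytes the payload decodes cleanly.
text, error = try_strict_decode(PAYLOAD1 + PAYLOAD3)
print(text, error)
```

On Python 3 the first call reports an error starting at offset 11, which is the first byte of PAYLOAD2; PAYLOAD1 and PAYLOAD3 are valid UTF-8 on their own.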
Thank you for reading.
Well, the Python 2 decoder didn't respect the Unicode standard. Please see: http://unicodebook.readthedocs.org/issues.html#non-strict-utf-8-decoder-overlong-byte-sequences-and-surrogates
Python 3 is now stricter. You can still decode surrogate characters if you need them *for a good reason* using:
>>> b'\xed\xa0\x80'.decode('utf-8', 'surrogatepass')
'\ud800'
By the way, there is also:
>>> b'\xed\xa0\x80'.decode('utf-8', 'surrogateescape')
'\udced\udca0\udc80'
which is very different but may also help.
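To make the difference between the two error handlers concrete, here is a small sketch for Python 3. It uses only the built-in bytes.decode and str.encode methods; the variable names are just for illustration.

```python
raw = b'\xed\xa0\x80'

# 'surrogatepass' treats the three bytes as one code point: the lone
# surrogate U+D800.
passed = raw.decode('utf-8', 'surrogatepass')

# 'surrogateescape' treats each undecodable byte separately and maps
# byte 0xNN to the low surrogate U+DCNN.
escaped = raw.decode('utf-8', 'surrogateescape')

print(len(passed), [hex(ord(c)) for c in passed])
print(len(escaped), [hex(ord(c)) for c in escaped])

# Both results round-trip to the original bytes, but only when encoding
# with the same error handler that was used for decoding.
print(passed.encode('utf-8', 'surrogatepass') == raw)
print(escaped.encode('utf-8', 'surrogateescape') == raw)

# A strict encode refuses either string, because lone surrogates are not
# valid in UTF-8.
for s in (passed, escaped):
    try:
        s.encode('utf-8')
    except UnicodeEncodeError as exc:
        print('strict encode failed:', exc.reason)
```

So 'surrogatepass' is the one to use when the data really does contain encoded surrogates, while 'surrogateescape' is a general way to carry arbitrary undecodable bytes through a str and get them back unchanged.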
I suggest closing the issue as NOT A BUG.
Thank you very much for your help!