classification
Title: utf8 decoding inconsistency between P2 and P3
Type: enhancement Stage:
Components: Unicode Versions: Python 2.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, jinz, vstinner
Priority: normal Keywords:

Created on 2016-02-01 16:40 by jinz, last changed 2016-02-01 17:02 by vstinner. This issue is now closed.

Messages (3)
msg259329 - Author: Jim Jin (jinz) Date: 2016-02-01 16:40
PAYLOAD1 = b'\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5'
PAYLOAD2 = b'\xed\xa0\x80'
PAYLOAD3 = b'\x65\x64\x69\x74\x65\x64'
PAYLOAD = PAYLOAD1 + PAYLOAD2 + PAYLOAD3

PAYLOAD.decode('utf8')  passes in P2.7.* and fails in P3.4

Thank you for reading.
msg259330 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-01 16:54
> PAYLOAD.decode('utf8')  passes in P2.7.* and fails in P3.4

Well, the Python 2 decoder didn't respect the Unicode standard. Please see:
http://unicodebook.readthedocs.org/issues.html#non-strict-utf-8-decoder-overlong-byte-sequences-and-surrogates
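To make the strict behaviour concrete, here is a minimal Python 3 sketch using the payload from the report. The try_decode helper is a hypothetical name introduced only for this illustration; it is not part of any library.

```python
# Payload from the report: valid UTF-8, then an encoded lone surrogate, then ASCII.
PAYLOAD1 = b'\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5'
PAYLOAD2 = b'\xed\xa0\x80'  # UTF-8-style encoding of U+D800, a lone high surrogate
PAYLOAD3 = b'\x65\x64\x69\x74\x65\x64'
PAYLOAD = PAYLOAD1 + PAYLOAD2 + PAYLOAD3


def try_decode(data):
    """Hypothetical helper: return (text, None) on success, (None, error) on failure."""
    try:
        return data.decode('utf-8'), None
    except UnicodeDecodeError as exc:
        return None, exc


text, err = try_decode(PAYLOAD)
# Under Python 3 the strict decoder rejects the surrogate bytes.
print(text, err)

# The bytes on either side of the surrogate decode fine on their own.
before, _ = try_decode(PAYLOAD1)
after, _ = try_decode(PAYLOAD3)
print(len(before), after)
```

The error object carries the offset of the first rejected byte in its start attribute, which can help locate the offending sequence in a larger input.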

Python 3 is now stricter. You can still decode surrogate characters if you need them *for a good reason* using:

>>> b'\xed\xa0\x80'.decode('utf-8', 'surrogatepass')
'\ud800'

By the way, there is also:

>>> b'\xed\xa0\x80'.decode('utf-8', 'surrogateescape')
'\udced\udca0\udc80'

which is very different but may also help.
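As a rough sketch of how the two error handlers differ (Python 3 only; the variable names are chosen just for this example): surrogatepass decodes the three bytes as a single code point, while surrogateescape maps each undecodable byte to its own lone surrogate so that arbitrary bytes can be round-tripped.

```python
raw = b'\xed\xa0\x80'

# surrogatepass: the three bytes are read as one encoded code point, U+D800.
passed = raw.decode('utf-8', 'surrogatepass')
assert passed == '\ud800' and len(passed) == 1

# surrogateescape: each undecodable byte becomes its own lone surrogate
# in the range U+DC80..U+DCFF.
escaped = raw.decode('utf-8', 'surrogateescape')
assert escaped == '\udced\udca0\udc80' and len(escaped) == 3

# Each result encodes back to the original bytes with the same handler.
assert passed.encode('utf-8', 'surrogatepass') == raw
assert escaped.encode('utf-8', 'surrogateescape') == raw

# Neither string can be encoded with the default strict handler.
strict_failures = 0
for s in (passed, escaped):
    try:
        s.encode('utf-8')
    except UnicodeEncodeError:
        strict_failures += 1
print('strict encode failures:', strict_failures)

# surrogateescape also copes with bytes that are not surrogate encodings at all.
assert b'\xff'.decode('utf-8', 'surrogateescape') == '\udcff'
```

Which handler fits depends on the goal: surrogatepass when the data really contains encoded surrogates, surrogateescape when unknown bytes just need to survive a decode and re-encode cycle unchanged.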

I suggest closing the issue as NOT A BUG.
msg259331 - Author: Jim Jin (jinz) Date: 2016-02-01 16:57
Thank you very much for your help!
History
Date                 User      Action  Args
2016-02-01 17:02:21  vstinner  set     status: open -> closed; resolution: not a bug
2016-02-01 16:57:38  jinz      set     messages: + msg259331
2016-02-01 16:54:25  vstinner  set     messages: + msg259330
2016-02-01 16:40:22  jinz      create