classification
Title: Reversing an encoding with unicode-escape returns a different result
Type: behavior Stage:
Components: Unicode Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Sworddragon, ezio.melotti, lemburg, loewis, ncoghlan, r.david.murray, serhiy.storchaka, vstinner
Priority: normal Keywords:

Created on 2014-04-22 20:58 by Sworddragon, last changed 2014-06-24 10:08 by lemburg.

Messages (14)
msg217021 - Author: (Sworddragon) Date: 2014-04-22 20:58
I have made some tests with encoding/decoding in conjunction with unicode-escape and got some strange results:

>>> print('ä')
ä
>>> print('ä'.encode('utf-8'))
b'\xc3\xa4'
>>> print('ä'.encode('utf-8').decode('unicode-escape'))
Ã¤
>>> print('ä'.encode('utf-8').decode('unicode-escape').encode('unicode-escape'))
b'\\xc3\\xa4'
>>> print('ä'.encode('utf-8').decode('unicode-escape').encode('unicode-escape').decode('utf-8'))
\xc3\xa4


Shouldn't .decode('unicode-escape').encode('unicode-escape') nullify itself and so "'ä'.encode('utf-8').decode('unicode-escape').encode('unicode-escape')" return the same result as 'ä'.encode('utf-8')?
msg217024 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-04-22 21:13
No.  x.encode('unicode-escape').decode('unicode-escape') should return the same result, and it does.

The bug, I think, is that bytes.decode('unicode-escape') is not objecting to the non-ascii characters.  It appears to be treating them as latin1, and that strikes me as broken.
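
A minimal sketch of both points, assuming Python 3 and a decoder that still accepts non-ASCII bytes as described here; the variable names are illustrative only:

```python
text = 'ä'

# str -> bytes -> str: encoding first and decoding afterwards round-trips.
escaped = text.encode('unicode-escape')
assert escaped == b'\\xe4'
assert escaped.decode('unicode-escape') == text

# bytes -> str -> bytes: UTF-8 bytes are not unicode-escape output,
# so decoding them first does not give an inverse pair.
utf8_bytes = text.encode('utf-8')              # b'\xc3\xa4'
decoded = utf8_bytes.decode('unicode-escape')

# Each non-ASCII byte is read as a single Latin-1 code point.
assert decoded == utf8_bytes.decode('latin-1')
assert decoded == '\xc3\xa4'

# Re-encoding escapes those two code points, not the original bytes.
assert decoded.encode('unicode-escape') == b'\\xc3\\xa4'
```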
msg217033 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-04-22 21:56
unicode_escape codec is deprecated since Python 3.3. Please use UTF-8 or
something else.
msg217055 - Author: (Sworddragon) Date: 2014-04-23 06:42
The documentation says that unicode_internal is deprecated since Python 3.3, but not unicode_escape. Also, isn't unicode_escape different from utf-8? My original intention was to convert two-character escape sequences in a string to the control characters they represent. For example, the file test.txt contains the 17-byte UTF-8 raw content "---a---\n---ä---". Now I want to convert '\\n' to '\n':

>>> file = open('test.txt', 'r')
>>> content = file.read()
>>> file.close()
>>> content = content.encode('utf-8').decode('unicode-escape')
>>> print(content)
---a---
---Ã¤---


I now successfully get 2 lines, but I noticed that the ä is no longer there. After that I took a deeper look and opened this ticket.

If unicode_escape really does get deprecated, maybe I could simply replace the characters 0-31 and 127 to achieve practically the same behavior.
msg217094 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-04-23 22:07
Using unicode_escape to decode non-ascii is simply wrong.  It can't work.
msg217095 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-04-23 22:17
To understand why, understand that a byte string has no inherent encoding.  So when you call b'utf8string'.decode('unicode_escape'), Python has no way to know how to interpret the non-ascii characters in that bytestring.  If you want the unicode_escape representation of something, you want to do 'string'.encode('unicode_escape').  If you then want that as a Python string, you can do:

    'mystring'.encode('unicode_escape').decode('ascii')

In theory there ought to be a way to use the codecs module to go directly from unicode string to unicode-escaped string, but I don't know how to do it, since the proposal for the 'transform' method was rejected :)

Just to bend your brain a bit further, note that this does work:

>>> import codecs
>>> codecs.decode(codecs.encode('ä', 'unicode-escape').decode('ascii'), 'unicode-escape')
'ä'
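
The same idea as a small sketch, assuming Python 3. The helper names `to_escaped_text` and `from_escaped_text` are hypothetical and exist only for illustration:

```python
def to_escaped_text(s):
    """Return the unicode-escape form of s as a str (hypothetical helper)."""
    # unicode-escape output is pure ASCII, so decoding it as ASCII is lossless.
    return s.encode('unicode_escape').decode('ascii')


def from_escaped_text(escaped):
    """Reverse to_escaped_text (hypothetical helper)."""
    # Going through ASCII makes any non-ASCII character fail loudly
    # instead of being reinterpreted by the decoder.
    return escaped.encode('ascii').decode('unicode_escape')


escaped = to_escaped_text('ä\n€')
assert escaped == '\\xe4\\n\\u20ac'
assert from_escaped_text(escaped) == 'ä\n€'
```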
msg217096 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-04-23 22:19
Also, I'm not sure what this should do, but what it does do doesn't look right:

>>> codecs.decode('ä', 'unicode-escape')
'Ã¤'
msg218519 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-05-14 11:00
Sworddragon, try using content.encode('ascii', 'backslashreplace').decode('unicode-escape').

It is too late to change the unicode-escape encoding.
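
A short sketch of that suggestion applied to the test.txt content from msg217055, assuming Python 3. The function name `unescape` is hypothetical:

```python
def unescape(text):
    """Interpret backslash escapes in text while keeping non-ASCII characters."""
    # 'backslashreplace' turns every non-ASCII character into an ASCII
    # escape such as \xe4, so the decoder only ever sees ASCII bytes.
    ascii_bytes = text.encode('ascii', 'backslashreplace')
    return ascii_bytes.decode('unicode-escape')


# The literal two characters backslash + n, followed by a line containing ä.
content = '---a---\\n---ä---'
result = unescape(content)
assert result == '---a---\n---ä---'
```

Note that every backslash sequence already present in the input is interpreted as well, which is the intent in this use case.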
msg221191 - Author: (Sworddragon) Date: 2014-06-21 19:22
> It is too late to change the unicode-escape encoding.

So it will stay at ISO-8859-1? If yes, I think this ticket can be closed as won't fix.
msg221198 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2014-06-21 20:32
I disagree. The current decoder implementation is clearly incorrect: the unicode-escape encoding only uses bytes < 128. So decoding non-ascii bytes should fail. So the examples in msg217021 should all give UnicodeDecodeErrors.

As this is an incompatible change, we need to deprecate the current behavior for 3.5, and change it in 3.6.
msg221204 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2014-06-21 21:21
The unicode-escape codec was used in Python 2 to convert Unicode literals in source code to Unicode objects. Before PEP 263, Unicode literals in source code were interpreted as Latin-1. See http://legacy.python.org/dev/peps/pep-0263/ for details.

The implementation is correct, but doesn't necessarily match today's realities anymore.
msg221308 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2014-06-22 20:35
As you say, the unicode-escape codec is tied to the Python language definition. So if the language changes, the codec needs to change as well. 

A Unicode literal in source code might be using any encoding, so to be on the safe side, restricting it to ASCII is meaningful. Or else, if we want to use the default source encoding (as it did in 2.x), we should assume UTF-8 (per PEP 3120). Using ISO-8859-1 is clearly wrong for 3.x.
msg221442 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-06-24 09:44
Note that 'raw-unicode-escape' is used in pickle protocol 0. Changing it can break compatibility.
msg221447 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2014-06-24 10:08
On 24.06.2014 11:44, Serhiy Storchaka wrote:
> 
> Note that 'raw-unicode-escape' is used in pickle protocol 0. Changing it can break compatibility.

Indeed. unicode-escape was also designed to be able to read back
raw-unicode-escape encoded data, so changing the decoder to not
accept Latin-1 code points would break that as well.
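
A minimal sketch of that read-back property, assuming Python 3; the sample text is arbitrary:

```python
text = 'ä€'

# raw-unicode-escape writes code points below U+0100 as single Latin-1
# bytes and escapes everything else as \uXXXX or \UXXXXXXXX.
raw = text.encode('raw-unicode-escape')
assert raw == b'\xe4\\u20ac'

# Both decoders accept the raw 0xE4 byte, so the data reads back unchanged.
assert raw.decode('raw-unicode-escape') == text
assert raw.decode('unicode-escape') == text
```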

It may be better to simply create a new codec that rejects
non-ASCII encoded bytes when decoding and perhaps call
that 'unicode-repr'.
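
What such a strict decoder might do can be sketched as follows. This is only an illustration under stated assumptions: no 'unicode-repr' codec exists, and the function name `decode_unicode_repr` is hypothetical.

```python
def decode_unicode_repr(data):
    """Decode escaped bytes, rejecting any non-ASCII byte (hypothetical)."""
    try:
        # The ASCII check is the only difference from plain unicode-escape.
        data.decode('ascii')
    except UnicodeDecodeError as exc:
        raise UnicodeDecodeError(
            'unicode-repr', data, exc.start, exc.end,
            'non-ASCII byte in escaped input') from None
    return data.decode('unicode-escape')


# Pure-ASCII escaped input decodes as usual.
assert decode_unicode_repr(b'\\xe4\\n') == 'ä\n'

# UTF-8 encoded input is rejected instead of being read as Latin-1.
try:
    decode_unicode_repr('ä'.encode('utf-8'))
except UnicodeDecodeError as exc:
    rejected = exc.reason
assert rejected == 'non-ASCII byte in escaped input'
```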
History
Date  User  Action  Args
2014-06-24 10:08:35  lemburg  set  messages: + msg221447
2014-06-24 09:44:53  serhiy.storchaka  set  messages: + msg221442
2014-06-22 20:35:08  loewis  set  messages: + msg221308
2014-06-21 21:21:50  lemburg  set  messages: + msg221204
2014-06-21 20:32:23  loewis  set  nosy: + loewis; messages: + msg221198
2014-06-21 19:22:01  Sworddragon  set  status: pending -> open; messages: + msg221191
2014-05-25 08:02:34  serhiy.storchaka  set  status: open -> pending
2014-05-14 11:00:50  serhiy.storchaka  set  nosy: + serhiy.storchaka; messages: + msg218519
2014-04-23 22:19:16  r.david.murray  set  messages: + msg217096
2014-04-23 22:17:11  r.david.murray  set  messages: + msg217095
2014-04-23 22:07:41  r.david.murray  set  messages: + msg217094
2014-04-23 06:42:48  Sworddragon  set  messages: + msg217055
2014-04-22 21:56:14  vstinner  set  messages: + msg217033
2014-04-22 21:13:34  r.david.murray  set  nosy: + ncoghlan, r.david.murray, lemburg; messages: + msg217024
2014-04-22 20:58:23  Sworddragon  create