Issue 21331: Reversing an encoding with unicode-escape returns a different result


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/65530

classification

Title:	Reversing an encoding with unicode-escape returns a different result
Type:	behavior	Stage:
Components:	Unicode	Versions:	Python 3.4

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Sworddragon, ezio.melotti, lemburg, loewis, ncoghlan, r.david.murray, serhiy.storchaka, vstinner
Priority:	normal	Keywords:

Created on 2014-04-22 20:58 by Sworddragon, last changed 2022-04-11 14:58 by admin.

Messages (14)
msg217021 - (view)	Author: (Sworddragon)	Date: 2014-04-22 20:58
I have made some tests with encoding/decoding in conjunction with unicode-escape and got some strange results:
>>> print('ä')
ä
>>> print('ä'.encode('utf-8'))
b'\xc3\xa4'
>>> print('ä'.encode('utf-8').decode('unicode-escape'))
Ã¤
>>> print('ä'.encode('utf-8').decode('unicode-escape').encode('unicode-escape'))
b'\\xc3\\xa4'
>>> print('ä'.encode('utf-8').decode('unicode-escape').encode('unicode-escape').decode('utf-8'))
\xc3\xa4
Shouldn't .decode('unicode-escape').encode('unicode-escape') nullify itself and so "'ä'.encode('utf-8').decode('unicode-escape').encode('unicode-escape')" return the same result as 'ä'.encode('utf-8')?
msg217024 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-04-22 21:13
No. x.encode('unicode-escape').decode('unicode-escape') should return the same result, and it does. The bug, I think, is that bytes.decode('unicode-escape') is not objecting to the non-ascii characters. It appears to be treating them as latin1, and that strikes me as broken.
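The two directions described above can be compared side by side. This is only an illustrative sketch; the variable names are made up for the example, and the Latin-1 comparison reflects the behavior reported in this issue rather than anything the codec promises.

```python
text = 'ä'

# Encode first, then decode: the escaped form is pure ASCII,
# so the round trip returns the original string.
escaped = text.encode('unicode-escape')        # b'\\xe4'
round_trip = escaped.decode('unicode-escape')

# Decode first: the two UTF-8 bytes of 'ä' are not escape sequences,
# and the decoder maps each non-ASCII byte to the code point of the same value,
# which is exactly what a Latin-1 decode would do.
utf8_bytes = text.encode('utf-8')              # b'\xc3\xa4'
decoded = utf8_bytes.decode('unicode-escape')
as_latin1 = utf8_bytes.decode('latin-1')

print(round_trip == text)    # the encode/decode order is lossless
print(decoded == as_latin1)  # the decode-first order behaves like Latin-1
```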
msg217033 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-04-22 21:56
unicode_escape codec is deprecated since Python 3.3. Please use UTF-8 or something else.
msg217055 - (view)	Author: (Sworddragon)	Date: 2014-04-23 06:42
The documentation says that unicode_internal is deprecated since Python 3.3, but not unicode_escape. Also, isn't unicode_escape different from utf-8? My original intention was to convert two-character escape sequences in a string to the control characters they represent. For example, the file test.txt contains the 17-byte raw UTF-8 content "---a---\n---ä---". Now I want to convert '\\n' to '\n':
>>> file = open('test.txt', 'r')
>>> content = file.read()
>>> file.close()
>>> content = content.encode('utf-8').decode('unicode-escape')
>>> print(content)
---a---
---Ã¤---
I now successfully get 2 lines, but I noticed that the ä is lost. After that I took a deeper look and opened this ticket. If unicode_escape really gets deprecated, maybe I could simply replace the characters 0-31 and 127 to achieve practically the same behavior.
msg217094 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-04-23 22:07
Using unicode_escape to decode non-ascii is simply wrong. It can't work.
msg217095 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-04-23 22:17
To understand why, understand that a byte string has no inherent encoding. So when you call b'utf8string'.decode('unicode_escape'), python has no way to know how to interpret the non-ascii characters in that bytestring. If you want the unicode_escape representation of something, you want to do 'string'.encode('unicode_escape'). If you then want that as a python string, you can do:
'mystring'.encode('unicode_escape').decode('ascii')
In theory there ought to be a way to use the codecs module to go directly from unicode string to unicode-escaped string, but I don't know how to do it, since the proposal for the 'transform' method was rejected :)
Just to bend your brain a bit further, note that this does work:
>>> codecs.decode(codecs.encode('ä', 'unicode-escape').decode('ascii'), 'unicode-escape')
'ä'
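The str-to-escaped-str recipe from this message can be wrapped in a pair of small helpers. The names escape_text and unescape_text are hypothetical and exist only for this sketch; the reverse direction goes through ASCII bytes on purpose, so that a string still containing non-ASCII characters raises UnicodeEncodeError instead of being silently misread.

```python
def escape_text(text: str) -> str:
    """Return text with non-ASCII and control characters as backslash escapes."""
    # The unicode-escape encoder only emits ASCII bytes, so decoding
    # the result as ASCII cannot fail.
    return text.encode('unicode-escape').decode('ascii')


def unescape_text(escaped: str) -> str:
    """Reverse escape_text; raises UnicodeEncodeError if escaped is not pure ASCII."""
    return escaped.encode('ascii').decode('unicode-escape')


sample = 'ä\n€'
escaped_sample = escape_text(sample)
print(escaped_sample)
print(unescape_text(escaped_sample) == sample)
```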
msg217096 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-04-23 22:19
Also, I'm not sure what this should do, but what it does do doesn't look right:
>>> codecs.decode('ä', 'unicode-escape')
'Ã¤'
msg218519 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-05-14 11:00
Sworddragon, try using content.encode('ascii', 'backslashreplace').decode('unicode-escape'). It is too late to change the unicode-escape encoding.
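Applied to the test.txt content from msg217055, the suggested workaround might look like the sketch below. The helper name unescape_keeping_non_ascii is hypothetical, and the file content is written inline as a string literal rather than read from disk.

```python
def unescape_keeping_non_ascii(content: str) -> str:
    """Interpret backslash escapes in content without mangling non-ASCII text."""
    # backslashreplace rewrites every non-ASCII character as an escape
    # (for example 'ä' becomes \xe4), so the decoder only ever sees ASCII bytes.
    ascii_only = content.encode('ascii', 'backslashreplace')
    return ascii_only.decode('unicode-escape')


# A backslash followed by 'n', as stored in the file, not a real newline.
content = '---a---\\n---ä---'
result = unescape_keeping_non_ascii(content)
print(result)
```

The result has two lines and keeps the ä, because the character travels through the decoder as an escape sequence rather than as raw UTF-8 bytes.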
msg221191 - (view)	Author: (Sworddragon)	Date: 2014-06-21 19:22
> It is too late to change the unicode-escape encoding.
So it will stay at ISO-8859-1? If yes, I think this ticket can be closed as won't fix.
msg221198 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2014-06-21 20:32
I disagree. The current decoder implementation is clearly incorrect: the unicode-escape encoding only uses bytes < 128. So decoding non-ascii bytes should fail. So the examples in msg217021 should all give UnicodeDecodeErrors. As this is an incompatible change, we need to deprecate the current behavior for 3.5, and change it in 3.6.
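The stricter behavior argued for here can be approximated in user code today. This is a sketch of the idea only, not the change proposed for CPython; strict_unicode_escape_decode is a hypothetical helper name.

```python
def strict_unicode_escape_decode(data: bytes) -> str:
    """Decode unicode-escape data, rejecting any byte outside ASCII."""
    # Validation step: an ASCII decode raises UnicodeDecodeError
    # on any byte >= 0x80. The decoded value itself is not needed.
    data.decode('ascii')
    return data.decode('unicode-escape')


print(strict_unicode_escape_decode(b'\\xe4\\n') == 'ä\n')

try:
    strict_unicode_escape_decode('ä'.encode('utf-8'))
except UnicodeDecodeError as exc:
    print('rejected:', exc.reason)
```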
msg221204 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2014-06-21 21:21
The unicode-escape codec was used in Python 2 to convert Unicode literals in source code to Unicode objects. Before PEP 263, Unicode literals in source code were interpreted as Latin-1. See http://legacy.python.org/dev/peps/pep-0263/ for details. The implementation is correct, but doesn't necessarily match today's realities anymore.
msg221308 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2014-06-22 20:35
As you say, the unicode-escape codec is tied to the Python language definition. So if the language changes, the codec needs to change as well. A Unicode literal in source code might be using any encoding, so to be on the safe side, restricting it to ASCII is meaningful. Or else, if we want to use the default source encoding (as it did in 2.x), we should assume UTF-8 (per PEP 3120). Using ISO-8859-1 is clearly wrong for 3.x.
msg221442 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-06-24 09:44
Note that 'raw-unicode-escape' is used in pickle protocol 0. Changing it can break compatibility.
msg221447 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2014-06-24 10:08
On 24.06.2014 11:44, Serhiy Storchaka wrote:
> Note that 'raw-unicode-escape' is used in pickle protocol 0. Changing it can break compatibility.
Indeed. unicode-escape was also designed to be able to read back raw-unicode-escape encoded data, so changing the decoder to not accept Latin-1 code points would break that as well. It may be better to simply create a new codec that rejects non-ASCII encoded bytes when decoding and perhaps call that 'unicode-repr'.
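The compatibility concern can be seen in a short experiment. It is a sketch with illustrative variable names, and it only checks that the raw byte appears in the protocol 0 pickle, without asserting anything else about the pickle layout.

```python
import pickle

text = 'ä'

# raw-unicode-escape writes code points below 256 as raw bytes,
# so 'ä' becomes the single non-ASCII byte 0xE4.
raw = text.encode('raw-unicode-escape')

# The unicode-escape decoder accepts that byte as a Latin-1 code point,
# which is the read-back property mentioned above.
read_back = raw.decode('unicode-escape')

# Protocol 0 pickles store str objects using raw-unicode-escape,
# so the same raw byte shows up in the pickled data.
payload = pickle.dumps(text, protocol=0)
restored = pickle.loads(payload)

print(raw, read_back == text, raw in payload, restored == text)
```

A decoder that rejected non-ASCII bytes outright would no longer be able to read data like raw, which is why a separate, stricter codec is floated instead.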

History
Date	User	Action	Args
2022-04-11 14:58:02	admin	set	github: 65530
2014-06-24 10:08:35	lemburg	set	messages: + msg221447
2014-06-24 09:44:53	serhiy.storchaka	set	messages: + msg221442
2014-06-22 20:35:08	loewis	set	messages: + msg221308
2014-06-21 21:21:50	lemburg	set	messages: + msg221204
2014-06-21 20:32:23	loewis	set	nosy: + loewis messages: + msg221198
2014-06-21 19:22:01	Sworddragon	set	status: pending -> open messages: + msg221191
2014-05-25 08:02:34	serhiy.storchaka	set	status: open -> pending
2014-05-14 11:00:50	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg218519
2014-04-23 22:19:16	r.david.murray	set	messages: + msg217096
2014-04-23 22:17:11	r.david.murray	set	messages: + msg217095
2014-04-23 22:07:41	r.david.murray	set	messages: + msg217094
2014-04-23 06:42:48	Sworddragon	set	messages: + msg217055
2014-04-22 21:56:14	vstinner	set	messages: + msg217033
2014-04-22 21:13:34	r.david.murray	set	nosy: + ncoghlan, r.david.murray, lemburg messages: + msg217024
2014-04-22 20:58:23	Sworddragon	create