classification
Title: The 'raw_unicode_escape' codec buggy + not appropriate for Python 3.x
Type: enhancement Stage: resolved
Components: Documentation, Unicode Versions: Python 3.3, Python 3.4
process
Status: closed Resolution: duplicate
Dependencies: Superseder: 'codecs' module docs improvements
View: 19548
Assigned To: docs@python Nosy List: docs@python, ezio.melotti, lemburg, martin.panter, terry.reedy, vstinner, zuo
Priority: normal Keywords:

Created on 2013-11-10 02:51 by zuo, last changed 2014-12-28 08:48 by vstinner. This issue is now closed.

Messages (8)
msg202505 - (view) Author: Jan Kaliszewski (zuo) Date: 2013-11-10 02:51
It seems that the 'raw_unicode_escape' codec:

1) produces data that could be suitable for Python 2.x raw unicode string literals and not for Python 3.x raw unicode string literals (in Python 3.x \u... escapes are also treated literally);

2) seems to be buggy anyway: bytes in range 128-255 are encoded with the 'latin-1' encoding (in Python 3.x it is definitely a bug; and even in Python 2.x the feature is dubious, although at least the Py2's eval() and compile() functions officially accept 'latin-1'-encoded byte strings...).

Python 3.3:

>>> b = "zażółć".encode('raw_unicode_escape')
>>> literal = b'r"' + b + b'"'
>>> literal
b'r"za\\u017c\xf3\\u0142\\u0107"'
>>> eval(literal)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xf3 in position 8: invalid continuation byte
>>> b'\xf3'.decode('latin-1')
'ó'
>>> b = "zaż".encode('raw_unicode_escape')
>>> literal = b'r"' + b + b'"'
>>> literal
b'r"za\\u017c"'
>>> eval(literal)
'za\\u017c'
>>> print(eval(literal))
za\u017c

I believe that the 'raw_unicode_escape' codec should either be deprecated and later removed, or be modified to accept only printable ASCII characters.


PS. Also, as a side note: neither 'raw_unicode_escape' nor 'unicode_escape' escapes quotes (see issue #7615). Shouldn't this at least be documented explicitly?
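
The point about quotes can be checked with a short sketch (assuming Python 3; the sample string is arbitrary and only for illustration). Neither codec touches quote characters, so wrapping the encoded output in a string literal is not safe in general:

```python
sample = 'say "hi" and \'bye\''

# Quote characters pass through both codecs unchanged.
raw = sample.encode("raw_unicode_escape")
esc = sample.encode("unicode_escape")
assert raw == sample.encode("ascii")
assert esc == sample.encode("ascii")

# Wrapping the output in double quotes therefore yields an invalid literal.
literal = 'r"' + raw.decode("ascii") + '"'
try:
    eval(literal)
except SyntaxError:
    literal_is_valid = False
else:
    literal_is_valid = True
```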
msg202507 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-11-10 07:05
The 'raw_unicode_escape' codec can be neither removed nor changed, because it is used in the pickle protocol. Just don't use it if its behavior looks weird to you.

The right way to decode raw_unicode_escape-encoded data is to use the 'raw_unicode_escape' decoder.
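
As a minimal sketch of that round trip (assuming Python 3, and reusing the sample string from the original report):

```python
text = "zażółć"

# Code points above U+00FF become \uXXXX escapes; 'ó' (U+00F3) stays a
# single Latin-1 byte.
encoded = text.encode("raw_unicode_escape")

# The matching decoder reverses both the escapes and the Latin-1 bytes.
decoded = encoded.decode("raw_unicode_escape")
```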

If a string doesn't contain quotes, you can use eval(), but you should first decode the data from latin1 and encode it to UTF-8:

>>> literal = ('r"%s"' % "zażółć".encode('raw_unicode_escape').decode('latin1')).encode()
>>> literal
b'r"za\\u017c\xc3\xb3\\u0142\\u0107"'
>>> eval(literal)
'za\\u017có\\u0142\\u0107'
msg202591 - (view) Author: Jan Kaliszewski (zuo) Date: 2013-11-11 00:22
Which means that the description "Produce a string that is suitable as raw Unicode literal in Python source code" is (in Python 3.x) no longer true.

So, if change or removal is not possible because of the internal significance of the codec, I believe that the description should be changed to something like: "For internal use. This codec *does not* produce anything suitable as a raw string literal in Python 3.x source code."
msg202643 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-11-11 19:19
Jan, the codec implements an encoding which has certain characteristics just like any other codec. It works both in Python 2 and 3 without problems.

The documentation is no longer true, though. Ever since we added encoding markers to source files, the raw Unicode string literals depended on this encoding setting. Before this change the docs were fine, since Unicode literals were interpreted as Latin-1 encoded.

More correct would be: "Produce a string that uses Unicode escapes to encode non-Latin-1 code points. It is used in the Python pickle protocol."
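
A small sketch of what that wording describes, assuming CPython 3 (the pickle check only looks for the codec's output inside a protocol 0 payload and is illustrative, not a statement about the full pickle format):

```python
import pickle

# Code points up to U+00FF are written as single Latin-1 bytes ...
assert "\xff".encode("raw_unicode_escape") == b"\xff"
# ... and everything above is written as a \uXXXX or \UXXXXXXXX escape.
assert "\u0100".encode("raw_unicode_escape") == b"\\u0100"
assert "\U0001f600".encode("raw_unicode_escape") == b"\\U0001f600"

# The text-based pickle protocol 0 stores str objects in this form.
text = "zażółć"
payload = pickle.dumps(text, protocol=0)
uses_codec = text.encode("raw_unicode_escape") in payload
restored = pickle.loads(payload)
```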
msg232851 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2014-12-18 02:27
I included the proposed doc fix in my patch for Issue 19548
msg233010 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2014-12-22 06:01
Re-reading the suggested description, it struck me that for encoding, this is redundant with the “backslashreplace” error handler:

>>> import sys
>>> test = "".join(map(chr, range(sys.maxunicode + 1)))
>>> test.encode("raw-unicode-escape") == test.encode("latin-1", "backslashreplace")
True

However, decoding also seems similar to “unicode_escape”, except that only \uXXXX and \UXXXXXXXX seem to be supported.
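
The difference between the two decoders can be sketched as follows (assuming Python 3; the input bytes are an arbitrary example):

```python
# Three escape sequences, each written with a literal backslash.
data = b"\\u017c \\x41 \\n"

# raw_unicode_escape only interprets \uXXXX and \UXXXXXXXX;
# \x41 and \n are left as literal backslash sequences.
raw = data.decode("raw_unicode_escape")

# unicode_escape interprets all three.
full = data.decode("unicode_escape")
```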

Maybe there should be a warning that backslashes are not escaped:

>>> "\\u005C".encode("raw-unicode-escape").decode("raw-unicode-escape")
'\\'
msg233102 - (view) Author: Jan Kaliszewski (zuo) Date: 2014-12-26 00:33
My concerns are now being addressed in issue 19548.
msg233147 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-12-28 08:48
This issue is just a documentation issue. The doc must be more explicit and
explain that the codec is only used internally by the pickle module, and
that its output can no longer be used by eval().
History
Date User Action Args
2014-12-28 08:48:33 vstinner set messages: + msg233147
2014-12-27 21:04:02 berker.peksag set superseder: 'codecs' module docs improvements; stage: needs patch -> resolved
2014-12-26 00:33:00 zuo set messages: + msg233102
2014-12-26 00:31:41 zuo set status: open -> closed; resolution: duplicate
2014-12-22 06:01:56 martin.panter set messages: + msg233010
2014-12-18 02:27:23 martin.panter set nosy: + martin.panter; messages: + msg232851
2013-11-16 00:42:02 terry.reedy set nosy: + terry.reedy
2013-11-11 19:19:23 lemburg set nosy: + lemburg; messages: + msg202643; title: The 'raw_unicode_escape' codec buggy + not apropriate for Python 3.x -> The 'raw_unicode_escape' codec buggy + not appropriate for Python 3.x
2013-11-11 18:23:39 serhiy.storchaka set versions: - Python 3.2, Python 3.5; nosy: - serhiy.storchaka; components: - Library (Lib); type: enhancement; stage: needs patch
2013-11-11 00:22:07 zuo set versions: + Python 3.2, Python 3.3; nosy: + docs@python; messages: + msg202591; assignee: docs@python; components: + Documentation
2013-11-10 07:05:29 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg202507
2013-11-10 02:51:45 zuo create