The 'raw_unicode_escape' codec buggy + not appropriate for Python 3.x #63738

zuo · 2013-11-10T02:51:45Z

BPO	19539
Nosy	@malemburg, @terryjreedy, @vstinner, @ezio-melotti, @vadmium
Superseder	bpo-19548: 'codecs' module docs improvements

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2014-12-26.00:31:41.928>
created_at = <Date 2013-11-10.02:51:45.316>
labels = ['type-feature', 'expert-unicode', 'docs']
title = "The 'raw_unicode_escape' codec buggy + not appropriate for Python 3.x"
updated_at = <Date 2014-12-28.08:48:33.623>
user = 'https://bugs.python.org/zuo'

bugs.python.org fields:

activity = <Date 2014-12-28.08:48:33.623>
actor = 'vstinner'
assignee = 'docs@python'
closed = True
closed_date = <Date 2014-12-26.00:31:41.928>
closer = 'zuo'
components = ['Documentation', 'Unicode']
creation = <Date 2013-11-10.02:51:45.316>
creator = 'zuo'
dependencies = []
files = []
hgrepos = []
issue_num = 19539
keywords = []
message_count = 8.0
messages = ['202505', '202507', '202591', '202643', '232851', '233010', '233102', '233147']
nosy_count = 7.0
nosy_names = ['lemburg', 'terry.reedy', 'vstinner', 'ezio.melotti', 'zuo', 'docs@python', 'martin.panter']
pr_nums = []
priority = 'normal'
resolution = 'duplicate'
stage = 'resolved'
status = 'closed'
superseder = '19548'
type = 'enhancement'
url = 'https://bugs.python.org/issue19539'
versions = ['Python 3.3', 'Python 3.4']

zuo · 2013-11-10T02:51:43Z

It seems that the 'raw_unicode_escape' codec:

produces data that could be suitable for Python 2.x raw unicode string literals and not for Python 3.x raw unicode string literals (in Python 3.x \u... escapes are also treated literally);
seems to be buggy anyway: characters in the range 128-255 are encoded as single bytes using the 'latin-1' encoding (in Python 3.x this is definitely a bug; even in Python 2.x the feature is dubious, although at least Py2's eval() and compile() functions officially accept 'latin-1'-encoded byte strings...).

Python 3.3:

>>> b = "zażółć".encode('raw_unicode_escape')
>>> literal = b'r"' + b + b'"'
>>> literal
b'r"za\\u017c\xf3\\u0142\\u0107"'
>>> eval(literal)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xf3 in position 8: invalid continuation byte
>>> b'\xf3'.decode('latin-1')
'ó'
>>> b = "zaż".encode('raw_unicode_escape')
>>> literal = b'r"' + b + b'"'
>>> literal
b'r"za\\u017c"'
>>> eval(literal)
'za\\u017c'
>>> print(eval(literal))
za\u017c

I believe that the 'raw_unicode_escape' codec should either be deprecated and later removed, or be modified to emit only printable ASCII characters.

PS. Also, as a side note: neither 'raw_unicode_escape' nor 'unicode_escape' escapes quotes (see issue bpo-7615) -- shouldn't that at least be documented explicitly?
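The side note about quotes can be checked with a minimal sketch like the one below. The sample string and variable names are made up for illustration; only the two codec names come from the report.

```python
# Hypothetical sample text containing both kinds of quote characters.
text = 'say "hi" and \'bye\''

# Neither codec touches quote characters; both pass them through as-is,
# so the result cannot be blindly wrapped in quotes to form a literal.
escaped = text.encode("unicode_escape")
raw_escaped = text.encode("raw_unicode_escape")

print(escaped)
print(raw_escaped)
```

Anyone building a literal from this output would therefore have to escape the quotes separately.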

serhiy-storchaka · 2013-11-10T07:05:29Z

The 'raw_unicode_escape' codec can be neither removed nor changed, because it is used in the pickle protocol. Just don't use it if its behavior looks weird to you.

The right way to decode raw_unicode_escape-encoded data is to use the 'raw_unicode_escape' decoder.

If a string doesn't contain quotes, you can use eval(), but you should first decode the data from latin1 and encode it to UTF-8:

>>> literal = ('r"%s"' % "zażółć".encode('raw_unicode_escape').decode('latin1')).encode()
>>> literal
b'r"za\\u017c\xc3\xb3\\u0142\\u0107"'
>>> eval(literal)
'za\\u017có\\u0142\\u0107'
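For comparison, the decoder route recommended above can be sketched as a plain round trip, with no eval() involved. The variable names are illustrative only.

```python
original = "zażółć"

# Encode: Latin-1 characters become single bytes, everything else
# becomes a \uXXXX (or \UXXXXXXXX) escape sequence.
encoded = original.encode("raw_unicode_escape")

# Decode with the matching decoder instead of eval().
decoded = encoded.decode("raw_unicode_escape")

print(encoded)
print(decoded == original)
```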

zuo · 2013-11-11T00:22:08Z

Which means that the description "Produce a string that is suitable as raw Unicode literal in Python source code" is (in Python 3.x) no longer true.

So, if change or removal is not possible because of the codec's internal significance, I believe that the description should be changed to something like: "For internal use. This codec *does not* produce anything suitable as a raw string literal in Python 3.x source code."

malemburg · 2013-11-11T19:19:23Z

Jan, the codec implements an encoding which has certain characteristics just like any other codec. It works both in Python 2 and 3 without problems.

The documentation is no longer true, though. Ever since we added encoding markers to source files, the raw Unicode string literals depended on this encoding setting. Before this change the docs were fine, since Unicode literals were interpreted as Latin-1 encoded.

More correct would be: "Produce a string that uses Unicode escapes to encode non-Latin-1 code points. It is used in the Python pickle protocol."
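The pickle connection mentioned above can be observed with a short sketch. It assumes CPython 3, where the text-based protocol 0 stores str objects using this escaping scheme; the exact byte layout of the pickle is an implementation detail and is only printed for inspection here.

```python
import pickle

text = "zażółć"

# Protocol 0 is the original text-based pickle format.
payload = pickle.dumps(text, protocol=0)
print(payload)

# The raw_unicode_escape form of the string is expected to appear
# verbatim inside the protocol 0 payload.
escaped = text.encode("raw_unicode_escape")
print(escaped in payload)

# Unpickling restores the original string.
print(pickle.loads(payload) == text)
```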

vadmium · 2014-12-18T02:27:23Z

I included the proposed doc fix in my patch for bpo-19548.

vadmium · 2014-12-22T06:01:56Z

Re-reading the suggested description, it struck me that for encoding, this is redundant with the “backslashreplace” error handler:

>>> import sys
>>> test = "".join(map(chr, range(sys.maxunicode + 1)))
>>> test.encode("raw-unicode-escape") == test.encode("latin-1", "backslashreplace")
True

However, decoding also seems similar to “unicode_escape”, except that only \uXXXX and \UXXXXXXXX seem to be supported.

Maybe there should be a warning that backslashes are not escaped:

>>> "\\u005C".encode("raw-unicode-escape").decode("raw-unicode-escape")
'\\'
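The decoding difference between the two codecs noted above can be illustrated with a small sketch. The sample bytes are made up for the purpose.

```python
# Bytes containing a backslash-n sequence and a \u escape.
data = b"a\\nb\\u017c"

# raw_unicode_escape only interprets \uXXXX and \UXXXXXXXX;
# the backslash-n pair is left as two literal characters.
raw = data.decode("raw_unicode_escape")

# unicode_escape also interprets \n, \t, \xNN and similar escapes.
full = data.decode("unicode_escape")

print(repr(raw))
print(repr(full))
```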

zuo · 2014-12-26T00:33:00Z

My concerns are now being addressed in bpo-19548.

vstinner · 2014-12-28T08:48:33Z

This issue is just a documentation issue. The doc must be more explicit:
it should explain that the codec is only used internally by the pickle module, and
that its output can no longer be used by eval().

zuo mannequin added stdlib Python modules in the Lib dir topic-unicode labels Nov 10, 2013

zuo mannequin added the docs Documentation in the Doc dir label Nov 11, 2013

zuo mannequin assigned docspython Nov 11, 2013

serhiy-storchaka added type-feature A feature request or enhancement and removed stdlib Python modules in the Lib dir labels Nov 11, 2013

malemburg changed the title ~~The 'raw_unicode_escape' codec buggy + not apropriate for Python 3.x~~ The 'raw_unicode_escape' codec buggy + not appropriate for Python 3.x Nov 11, 2013

zuo mannequin closed this as completed Dec 26, 2014

ezio-melotti transferred this issue from another repository Apr 10, 2022