New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The 'raw_unicode_escape' codec buggy + not appropriate for Python 3.x #63738
Comments
It seems that the 'raw_unicode_escape' codec:
Python 3.3: >>> b = "zażółć".encode('raw_unicode_escape')
>>> literal = b'r"' + b + b'"'
>>> literal
b'r"za\\u017c\xf3\\u0142\\u0107"'
>>> eval(literal)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 1
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xf3 in position 8: invalid continuation byte
>>> b'\xf3'.decode('latin-1')
'ó'
>>> b = "zaż".encode('raw_unicode_escape')
>>> literal = b'r"' + b + b'"'
>>> literal
b'r"za\\u017c"'
>>> eval(literal)
'za\\u017c'
>>> print(eval(literal))
za\u017c It believe that the 'raw_unicode_escape' codes should either be deprecated and later removed or be modified to accept only printable ascii characters. PS. Also, as a side note: neither 'raw_unicode_escape' nor 'unicode_escape' does escape quotes (see issue bpo-7615) -- shouldn't it be at least documented explicitly? |
The 'raw_unicode_escape' codec can't be neither removed nor changed because it is used in pickle protocol. Just don't use it if its behavior looks weird for you. Right way to decode raw_unicode_escape-encoded data is use 'raw_unicode_escape' decoder. If a string don't contain quotes, you can use eval(), but you should first decode data from latin1 and encode to UTF-8: >>> literal = ('r"%s"' % "zażółć".encode('raw_unicode_escape').decode('latin1')).encode()
>>> literal
b'r"za\\u017c\xc3\xb3\\u0142\\u0107"'
>>> eval(literal)
'za\\u017có\\u0142\\u0107' |
Which means that the description "Produce a string that is suitable as raw Unicode literal in Python source code" is (in Python 3.x) no longer true. So, if change/removal is not possible because of internal significance of the codec, I believe that the description should be changed to something like: "For internal use. This codec *does not* produce anything suitable as a raw string literal in Python 3.x source code." |
Jan, the codec implements an encoding which has certain characteristics just like any other codec. It works both in Python 2 and 3 without problems. The documentation is no longer true, though. Ever since we added encoding markers to source files, the raw Unicode string literals depended on this encoding setting. Before this change the docs were fine, since Unicode literals were interpreted as Latin-1 encoded. More correct would be: "Produce a string that uses Unicode escapes to encode non-Latin-1 code points. It is used in the Python pickle protocol." |
I included the proposed doc fix in my patch for bpo-19548 |
[Edit Error: 'utf8' codec can't decode byte 0xe2 in position 212: invalid continuation byte] Re-reading the suggested description, it struck me that for encoding, this is redundant with the “backslashreplace” error handler: >>> test = "".join(map(chr, range(sys.maxunicode + 1)))
>>> test.encode("raw-unicode-escape") == test.encode("latin-1", "backslashreplace")
True However, decoding also seems similar to “unicode_escape”, except that only \uXXXX and \UXXXXXXXX seem to be supported. Maybe there should be a warning that backslashes are not escaped: >>> "\\u005C".encode("raw-unicode-escape").decode("raw-unicode-escape")
'\\' |
My concerns are now being addressed in the bpo-19548. |
This issue is just a documentation issue. The do must be more explicit, |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: