Message 202505 - Python tracker


Author	zuo
Recipients	ezio.melotti, vstinner, zuo
Date	2013-11-10.02:51:42
Message-id	<1384051905.56.0.969775270474.issue19539@psf.upfronthosting.co.za>

Content

It seems that the 'raw_unicode_escape' codec:

1) produces data suited to Python 2.x raw unicode string literals (ur"...") but not to Python 3.x raw string literals, where \u... escapes are not interpreted and are kept literally;

2) seems to be buggy anyway: characters in the range 128-255 are encoded as single 'latin-1' bytes (in Python 3.x this is definitely a bug; even in Python 2.x the behavior is dubious, although at least Python 2's eval() and compile() functions officially accept 'latin-1'-encoded byte strings...).

Python 3.3:

>>> b = "zażółć".encode('raw_unicode_escape')
>>> literal = b'r"' + b + b'"'
>>> literal
b'r"za\\u017c\xf3\\u0142\\u0107"'
>>> eval(literal)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xf3 in position 8: invalid continuation byte
>>> b'\xf3'.decode('latin-1')
'ó'
>>> b = "zaż".encode('raw_unicode_escape')
>>> literal = b'r"' + b + b'"'
>>> literal
b'r"za\\u017c"'
>>> eval(literal)
'za\\u017c'
>>> print(eval(literal))
za\u017c
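
The session above can also be condensed into a small script. This is only a sketch of the same observations, written as a standalone file; the variable names are illustrative, and ast.literal_eval() on a str is used instead of eval() on bytes so that the script does not depend on how the interpreter reports source-decoding errors.

```python
import ast

text = "zażółć"
encoded = text.encode("raw_unicode_escape")

# Code points above U+00FF become ASCII \uXXXX escapes, but "ó" (U+00F3)
# is written as the single Latin-1 byte 0xF3.
print(encoded)

# As a consequence the output is neither pure ASCII nor valid UTF-8.
try:
    encoded.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print("valid UTF-8:", valid_utf8)

# The codec does round-trip with itself.
round_trip = encoded.decode("raw_unicode_escape")
print("round trip ok:", round_trip == text)

# In a Python 3 raw string literal the \uXXXX escape is not interpreted,
# so evaluating the literal does not give the original text back.
raw_literal = 'r"' + "zaż".encode("raw_unicode_escape").decode("ascii") + '"'
evaluated = ast.literal_eval(raw_literal)
print("literal gives original text:", evaluated == "zaż")
```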

I believe that the 'raw_unicode_escape' codec should either be deprecated and later removed or be modified to accept only printable ASCII characters.
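
For comparison, here is a minimal sketch of how ASCII-only output can already be obtained with existing machinery, assuming the 'backslashreplace' error handler is an acceptable stand-in. This is not what 'raw_unicode_escape' does and not a proposed patch; it only illustrates what "ASCII only" could look like for the same input.

```python
text = "zażółć"

# Assumption: the 'ascii' codec with the 'backslashreplace' error handler is
# used here only as an illustration of ASCII-only output.
ascii_only = text.encode("ascii", "backslashreplace")
print(ascii_only)

# Every byte is now in the ASCII range, including the escape for "ó".
is_ascii = all(byte < 128 for byte in ascii_only)
print("ASCII only:", is_ascii)

# For this input, 'unicode_escape' decodes the escapes back to the original.
decoded = ascii_only.decode("unicode_escape")
print("round trip ok:", decoded == text)
```

Note that 'backslashreplace' does not escape backslashes already present in the input, so this approach is not a general round-trip encoding.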


PS. As a side note: neither 'raw_unicode_escape' nor 'unicode_escape' escapes quotes (see issue #7615). Shouldn't that at least be documented explicitly?
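
The quoting point can be checked with a short script. This is a sketch with illustrative names; repr() is shown only as a contrast, since it does produce a valid literal.

```python
import ast

text = 'say "hi"'

# Neither codec touches the double quotes.
escaped = text.encode("unicode_escape").decode("ascii")
raw_escaped = text.encode("raw_unicode_escape").decode("ascii")
print(escaped)
print(raw_escaped)

# Wrapping the escaped text in quotes therefore yields an invalid literal.
naive_literal = '"' + escaped + '"'
try:
    ast.literal_eval(naive_literal)
    parsed = True
except SyntaxError:
    parsed = False
print("naive literal parses:", parsed)

# repr() escapes or switches quotes as needed, so its output evaluates back.
safe = ast.literal_eval(repr(text)) == text
print("repr round trip ok:", safe)
```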

History
Date	User	Action	Args
2013-11-10 02:51:46	zuo	set	recipients: + zuo, vstinner, ezio.melotti
2013-11-10 02:51:45	zuo	set	messageid: <1384051905.56.0.969775270474.issue19539@psf.upfronthosting.co.za>
2013-11-10 02:51:45	zuo	link	issue19539 messages
2013-11-10 02:51:42	zuo	create