Unicode escape sequences not parsed in raw strings. #46793

jmillikin · 2008-04-03T04:19:03Z

BPO	2541
Nosy	@malemburg, @gvanrossum, @birkenfeld, @amauryfa, @benjaminp, @jmillikin
Files	py3k_raw_strings_unicode_escapes.patch py3k_raw_strings_unicode_escapes2.patch py3k_raw_strings_unicode_escapes3.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/benjaminp'
closed_at = <Date 2008-04-28.21:05:55.867>
created_at = <Date 2008-04-03.04:19:03.157>
labels = ['type-bug', 'docs']
title = 'Unicode escape sequences not parsed in raw strings.'
updated_at = <Date 2008-04-28.21:05:55.865>
user = 'https://github.com/jmillikin'

bugs.python.org fields:

activity = <Date 2008-04-28.21:05:55.865>
actor = 'benjamin.peterson'
assignee = 'benjamin.peterson'
closed = True
closed_date = <Date 2008-04-28.21:05:55.867>
closer = 'benjamin.peterson'
components = ['Documentation']
creation = <Date 2008-04-03.04:19:03.157>
creator = 'jmillikin'
dependencies = []
files = ['9947', '9948', '9952']
hgrepos = []
issue_num = 2541
keywords = ['patch']
message_count = 24.0
messages = ['64890', '64896', '64897', '64898', '64900', '64978', '64982', '64984', '64985', '64986', '64990', '64997', '65009', '65083', '65085', '65211', '65212', '65223', '65225', '65234', '65502', '65512', '65930', '65934']
nosy_count = 7.0
nosy_names = ['lemburg', 'gvanrossum', 'nnorwitz', 'georg.brandl', 'amaury.forgeotdarc', 'benjamin.peterson', 'jmillikin']
pr_nums = []
priority = 'critical'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue2541'
versions = ['Python 3.0']

jmillikin · 2008-04-03T04:19:03Z

According to
<http://docs.python.org/dev/3.0/reference/lexical_analysis.html#id9>,
raw strings with \u and \U escape sequences should have these sequences
parsed as usual. However, they are currently escaped.

>>> r'\u0020'
'\\u0020'

Expected:
>>> r'\u0020'
' '

benjaminp · 2008-04-03T12:54:46Z

You use the "ur" string mode.

>>> print ur"\u0020"
" "

amauryfa · 2008-04-03T13:15:00Z

No, it's about Python 3.0. I confirm the problem, and propose a patch:

--- Python/ast.c.original       2008-04-03 15:12:15.548389400 +0200
+++ Python/ast.c        2008-04-03 15:12:28.359475800 +0200
@@ -3232,7 +3232,7 @@
             return NULL;
         }
     }
-    if (!*bytesmode && !rawmode) {
+    if (!*bytesmode) {
         return decode_unicode(s, len, rawmode, encoding);
     }
     if (*bytesmode) {

benjaminp · 2008-04-03T13:22:52Z

Thanks for noticing, Amaury, and your patch works for me.

benjaminp · 2008-04-03T16:27:40Z

Fixed in r62128.

benjaminp · 2008-04-05T14:52:07Z

Sorry, Guido said this is not allowed:
http://mail.python.org/pipermail/python-3000/2008-April/012952.html. I
reverted it in r62165.
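
For readers checking the behavior after the revert, here is a minimal sketch, assuming a Python 3 interpreter. The variable names are illustrative and do not come from the patch or the mailing list post.

```python
# After the revert, a raw literal keeps the backslash and the hex digits as-is.
raw = r'\u0020'
cooked = '\u0020'

print(len(raw))     # the raw literal holds backslash, 'u', '0', '0', '2', '0'
print(len(cooked))  # the ordinary literal holds a single U+0020 SPACE

# If the escape is wanted, it has to be decoded explicitly.
decoded = raw.encode('ascii').decode('unicode_escape')
print(decoded == cooked)
```

The explicit decode step is one possible way to recover the old behavior for a single string; it is not something this issue prescribes.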

gvanrossum · 2008-04-05T15:26:00Z

The docs still need to be updated! An entry in what's new in 3.0 should
also be added.

benjaminp · 2008-04-05T15:35:18Z

How's this?

gvanrossum · 2008-04-05T16:29:16Z

Instead of "ignored" (which might be read ambiguously) how about "not
treated specially"?

You also still need to add some words to whatsnew.

benjaminp · 2008-04-05T16:49:25Z

"not treated specially" it is!

birkenfeld · 2008-04-05T17:03:02Z

The segment "use different rules for interpreting backslash escape
sequences." should be killed entirely, and the whole rule told here.

Also, a few paragraphs later there are more references to raw strings,
e.g. "When an 'r' or 'R' prefix is used in a string literal,"
which need to be fixed too.

benjaminp · 2008-04-05T18:52:21Z

I made the requested improvements and mentioned it in NEWS. Is it
worth putting this in the tutorial, since it mentions Unicode strings and raw
strings?

amauryfa · 2008-04-05T21:46:20Z

What about the "raw-unicode-escape" codec?
Can we leave it different from raw string literals?

gvanrossum · 2008-04-07T17:54:11Z

To be honest, I don't know what the uses are for that codec.

amauryfa · 2008-04-07T18:02:55Z

pickle still uses it when protocol=0 (and cPickle as well, but in trunk/
only of course)

malemburg · 2008-04-08T20:03:24Z

You can't change the codec - it's being used in other places as well,
e.g. for use cases where you need to have an 8-bit encoded readable
version of a Unicode object (which happens to be Latin-1 + Unicode
escapes for all non-Latin-1 characters, due to Unicode being a superset
of Latin-1).

Adding a new codec would be fine, though I don't know how this would map
raw Unicode strings with non-Latin-1 characters in them to an 8-bit
string. Perhaps this is not needed at all in Py3k.
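
The "Latin-1 + Unicode escapes" description above can be illustrated with a short sketch, assuming Python 3. The sample text is made up for the example; only the two codec names come from this thread.

```python
# One Latin-1 character (U+00E9) and one non-Latin-1 character (U+20AC).
text = 'caf\u00e9 \u20ac'

raw_encoded = text.encode('raw_unicode_escape')
full_encoded = text.encode('unicode_escape')

print(raw_encoded)   # U+00E9 stays a single Latin-1 byte, U+20AC becomes \u20ac
print(full_encoded)  # U+00E9 becomes \xe9, U+20AC becomes \u20ac

# Both codecs round-trip this particular text (it contains no backslashes).
raw_roundtrip = raw_encoded.decode('raw_unicode_escape')
full_roundtrip = full_encoded.decode('unicode_escape')
```

The raw variant is the more compact of the two for mostly Latin-1 text, because it only escapes the characters that do not fit in one byte.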

amauryfa · 2008-04-08T20:10:29Z

Isn't "unicode-escape" enough for this purpose?

malemburg · 2008-04-08T23:16:43Z

What do you mean by "enough"?

The "raw-unicode-escape" codec is used in Python 2.x to convert literal
strings of the form ur"" to Unicode objects. It's a variant of the
"unicode-escape" codec.

The codec is also being used in cPickle, pickle, variants of pickle,
Python code generators, etc.

It serves its purpose, just like "unicode-escape" and all the other
codecs in Python.

amauryfa · 2008-04-08T23:55:38Z

I mean: now that raw strings cannot represent all Unicode code points (or
more precisely, they need the file encoding to do so), is there a use
case for "raw-unicode-escape" that cannot be filled by the
"unicode-escape" codec?

Note that pickle does not use "raw-unicode-escape" as is: it replaces
backslashes by \u005c. This has the nice effect that pickled strings can
also be decoded by "unicode-escape".

That's why I propose to completely remove raw-unicode-escape, and use
unicode-escape instead.
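
The point about pickle can be checked with a small sketch, assuming Python 3 and its standard `pickle` module. The slicing of the payload relies on an assumption about how CPython frames a protocol 0 string (a one-byte opcode, then the escaped text, then a newline); it is not a documented interface.

```python
import pickle

# A string with a literal backslash and a non-Latin-1 character.
text = 'back\\slash \u20ac'

payload = pickle.dumps(text, protocol=0)
print(payload)

# The backslash is written as the escape \u005c rather than as a bare backslash.
has_escaped_backslash = b'\\u005c' in payload

# Assumption: the first line is the opcode byte followed by the escaped text.
body = payload.split(b'\n')[0][1:]
via_unicode_escape = body.decode('unicode_escape')
via_raw_unicode_escape = body.decode('raw_unicode_escape')

restored = pickle.loads(payload)
```

Because the backslash is pre-escaped, both codecs decode the body of this payload to the same text.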

malemburg · 2008-04-09T10:03:24Z

While that's true for cPickle, it is not for pickle. The pickle protocol
itself is defined in terms of the "raw-unicode-escape" codec (see
pickle.py).

Besides, you cannot assume that the Python interpreter itself is the
only use-case for these codecs. The "raw-unicode-escape" codec is well
usable for other purposes where you need a compact way of encoding
Unicode, especially if your strings are mostly Latin-1 and only
include non-UCS2 code points every now and then. That's also the reason
why pickle uses it.

nnorwitz · 2008-04-15T06:09:35Z

What is the status of this bug? AFAICT, the code is now correct. Have
the doc changes been applied? The resolution on this report should be
updated too. It's currently rejected.

benjaminp · 2008-04-15T11:51:25Z

It's rejected because the OP wanted Unicode escapes to be applied in
raw strings, and I haven't applied the docs because nobody has told
me I should.

birkenfeld · 2008-04-28T19:57:19Z

Please apply the patch, but rename "Unicode escapes" to "\u and \U
escapes" first.

benjaminp · 2008-04-28T21:05:56Z

Fixed in r62568.

jmillikin mannequin added topic-unicode type-bug An unexpected behavior, bug, or error labels Apr 3, 2008

benjaminp closed this as completed Apr 3, 2008

benjaminp added the invalid label Apr 3, 2008

amauryfa reopened this Apr 3, 2008

amauryfa removed the invalid label Apr 3, 2008

benjaminp closed this as completed Apr 3, 2008

gvanrossum added docs Documentation in the Doc dir and removed topic-unicode labels Apr 5, 2008

gvanrossum assigned birkenfeld Apr 5, 2008

gvanrossum reopened this Apr 5, 2008

birkenfeld assigned benjaminp and unassigned birkenfeld Apr 28, 2008

benjaminp closed this as completed Apr 28, 2008

ezio-melotti transferred this issue from another repository Apr 10, 2022
