classification
Title: Unicode escape sequences not parsed in raw strings.
Type: behavior Stage:
Components: Documentation Versions: Python 3.0
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: benjamin.peterson Nosy List: amaury.forgeotdarc, benjamin.peterson, georg.brandl, gvanrossum, jmillikin, lemburg, nnorwitz
Priority: critical Keywords: patch

Created on 2008-04-03 04:19 by jmillikin, last changed 2008-04-28 21:05 by benjamin.peterson. This issue is now closed.

Files
File name Uploaded Description
py3k_raw_strings_unicode_escapes.patch benjamin.peterson, 2008-04-05 15:35
py3k_raw_strings_unicode_escapes2.patch benjamin.peterson, 2008-04-05 16:49
py3k_raw_strings_unicode_escapes3.patch benjamin.peterson, 2008-04-05 18:52
Messages (24)
msg64890 - (view) Author: John Millikin (jmillikin) Date: 2008-04-03 04:19
According to
<http://docs.python.org/dev/3.0/reference/lexical_analysis.html#id9>,
raw strings with \u and \U escape sequences should have these sequences
parsed as usual. However, they are currently escaped.

>>> r'\u0020'
'\\u0020'

Expected:
>>> r'\u0020'
' '
msg64896 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-04-03 12:54
You use the "ur" string mode.

>>> print ur"\u0020"
" "
msg64897 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-04-03 13:15
No, it's about python 3.0. I confirm the problem, and propose a patch:

--- Python/ast.c.original       2008-04-03 15:12:15.548389400 +0200
+++ Python/ast.c        2008-04-03 15:12:28.359475800 +0200
@@ -3232,7 +3232,7 @@
             return NULL;
         }
     }
-    if (!*bytesmode && !rawmode) {
+    if (!*bytesmode) {
         return decode_unicode(s, len, rawmode, encoding);
     }
     if (*bytesmode) {
msg64898 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-04-03 13:22
Thanks for noticing, Amaury, and your patch works for me.
msg64900 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-04-03 16:27
Fixed in r62128.
msg64978 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-04-05 14:52
Sorry, Guido said this is not allowed:
http://mail.python.org/pipermail/python-3000/2008-April/012952.html. I
reverted it in r62165.
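
For reference, the behavior the thread settles on can be checked in a current Python 3 interpreter. This is a minimal sketch, not part of the original report; the variable names are illustrative only.

```python
# Raw string: the backslash and the characters after it are kept literally.
raw = r'\u0020'
# Regular string: \u0020 is decoded to a single space character.
cooked = '\u0020'

print(len(raw), repr(raw))
print(len(cooked), repr(cooked))

# To apply the escapes to raw text after the fact, decode it explicitly
# with the "unicode_escape" codec.
decoded = raw.encode('ascii').decode('unicode_escape')
print(repr(decoded))
```

The explicit decode step is one way to get the result the reporter expected without changing how raw string literals are tokenized.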
msg64982 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-04-05 15:26
The docs still need to be updated!  An entry in what's new in 3.0 should
also be added.
msg64984 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-04-05 15:35
How's this?
msg64985 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-04-05 16:29
Instead of "ignored" (which might be read ambiguously) how about "not
treated specially"?

You also still need to add some words to whatsnew.
msg64986 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-04-05 16:49
"not treated specially" it is!
msg64990 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-04-05 17:03
The segment "use different rules for interpreting backslash escape
sequences." should be killed entirely, and the whole rule told here.

Also, a few paragraphs later there are more references to raw strings,
e.g. "When an ``'r'`` or ``'R'`` prefix is used in a string literal,"
which need to be fixed too.
msg64997 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-04-05 18:52
I made the requested improvements and mentioned it in NEWS. Is it worth
putting something in the tutorial, since it mentions Unicode strings and
raw strings?
msg65009 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-04-05 21:46
What about the "raw-unicode-escape" codec?
Can we leave it different from raw string literals?
msg65083 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-04-07 17:54
To be honest, I don't know what the uses are for that codec.
msg65085 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-04-07 18:02
pickle still uses it when protocol=0 (and cPickle as well, but in trunk/
only of course)
msg65211 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-04-08 20:03
You can't change the codec - it's being used in other places as well,
e.g. for use cases where you need to have an 8-bit encoded readable
version of a Unicode object (which happens to be Latin-1 + Unicode
escapes for all non-Latin-1 characters, due to Unicode being a superset
of Latin-1).

Adding a new codec would be fine, though I don't know how this would map
raw Unicode strings with non-Latin-1 characters in them to an 8-bit
string. Perhaps this is not needed at all in Py3k.
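
The "Latin-1 + Unicode escapes" description above can be illustrated with a short sketch in Python 3. The sample text is an arbitrary choice, and the "unicode_escape" output is shown only for comparison.

```python
# "café €": one Latin-1 character (é) and one non-Latin-1 character (€).
text = 'caf\u00e9 \u20ac'

# raw_unicode_escape: Latin-1 characters become single bytes,
# everything else becomes a \uXXXX (or \UXXXXXXXX) escape.
raw_escaped = text.encode('raw_unicode_escape')

# unicode_escape: the output is pure ASCII, so é is escaped as well.
escaped = text.encode('unicode_escape')

print(raw_escaped)
print(escaped)

# Both encodings round-trip through their own codec.
print(raw_escaped.decode('raw_unicode_escape') == text)
print(escaped.decode('unicode_escape') == text)
```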
msg65212 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-04-08 20:10
Isn't "unicode-escape" enough for this purpose?
msg65223 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-04-08 23:16
What do you mean by "enough"?

The "raw-unicode-escape" codec is used in Python 2.x to convert literal
strings of the form ur"" to Unicode objects. It's a variant of the
"unicode-escape" codec.

The codec is also being used in cPickle, pickle, variants of pickle, 
Python code generators, etc.

It serves its purpose, just like "unicode-escape" and all the other
codecs in Python.
msg65225 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-04-08 23:55
I mean: now that raw strings cannot represent all Unicode code points (or
more precisely, they need the file encoding to do so), is there a use
case for "raw-unicode-escape" that cannot be filled by the
unicode-escape codec?

Note that pickle does not use "raw-unicode-escape" as is: it replaces
backslashes by \u005c. This has the nice effect that pickled strings can
also be decoded by "unicode-escape".

That's why I propose to completely remove raw-unicode-escape, and use
unicode-escape instead.
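
The backslash replacement described above can be sketched as follows. `escape_like_pickle` is a hypothetical helper written for illustration, not pickle's actual code, and it covers only the backslash case mentioned in this message.

```python
def escape_like_pickle(text):
    """Hypothetical helper: rewrite backslashes as \\u005c, then encode."""
    return text.replace('\\', '\\u005c').encode('raw_unicode_escape')


text = 'C:\\temp \u20ac'  # contains a literal backslash followed by "t"

data = escape_like_pickle(text)
print(data)

# With the backslash rewritten, both codecs decode to the original text.
print(data.decode('raw_unicode_escape') == text)
print(data.decode('unicode_escape') == text)

# Without the rewrite, unicode_escape reads "\t" as a tab character.
naive = text.encode('raw_unicode_escape')
print(naive.decode('raw_unicode_escape') == text)
print(naive.decode('unicode_escape') == text)
```

The last comparison shows why the two codecs are interchangeable for decoding only when no bare backslash survives in the encoded data.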
msg65234 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-04-09 10:03
While that's true for cPickle, it is not for pickle. The pickle protocol
itself is defined in terms of the "raw-unicode-escape" codec (see
pickle.py).

Besides, you cannot assume that the Python interpreter itself is the
only use-case for these codecs. The "raw-unicode-escape" codec is well
usable for other purposes where you need a compact way of encoding
Unicode, especially if your strings are mostly Latin-1 and only
include non-UCS2 code points every now and then. That's also the reason
why pickle uses it.
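
The protocol 0 text format can be inspected in a current Python 3 interpreter. This is a small sketch of present-day behavior, not a statement about the 2008 code base; the sample string is arbitrary.

```python
import pickle

text = 'C:\\temp \u20ac'  # a backslash plus a non-Latin-1 character

# Protocol 0 is the text-based protocol discussed in this thread.
data = pickle.dumps(text, protocol=0)
print(data)

# The euro sign and the backslash both appear as \uXXXX escapes.
print(b'\\u20ac' in data, b'\\u005c' in data)

# The string survives the round trip unchanged.
print(pickle.loads(data) == text)
```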
msg65502 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2008-04-15 06:09
What is the status of this bug?  AFAICT, the code is now correct.  Have
the doc changes been applied?  The resolution on this report should be
updated too.  It's currently rejected.
msg65512 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-04-15 11:51
It's rejected because the OP wanted \u and \U escapes to be applied in
raw strings, and I haven't applied the doc changes because nobody has
told me I should.
msg65930 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-04-28 19:57
Please apply the patch, but rename "Unicode escapes" to "\u and \U
escapes" first.
msg65934 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-04-28 21:05
Fixed in r62568.
History
Date User Action Args
2008-04-28 21:05:55 benjamin.peterson set status: open -> closed
messages: + msg65934
2008-04-28 19:57:35 georg.brandl set assignee: georg.brandl -> benjamin.peterson
resolution: rejected -> fixed
messages: + msg65930
2008-04-15 11:51:25 benjamin.peterson set messages: + msg65512
2008-04-15 06:09:35 nnorwitz set nosy: + nnorwitz
messages: + msg65502
2008-04-09 10:03:25 lemburg set messages: + msg65234
2008-04-08 23:55:38 amaury.forgeotdarc set messages: + msg65225
2008-04-08 23:16:44 lemburg set messages: + msg65223
2008-04-08 20:10:29 amaury.forgeotdarc set messages: + msg65212
2008-04-08 20:03:27 lemburg set messages: + msg65211
2008-04-08 20:01:17 lemburg set messages: - msg65189
2008-04-08 16:45:37 lemburg set nosy: + lemburg
messages: + msg65189
2008-04-07 18:02:54 amaury.forgeotdarc set messages: + msg65085
2008-04-07 17:54:10 gvanrossum set messages: + msg65083
2008-04-05 21:46:19 amaury.forgeotdarc set messages: + msg65009
2008-04-05 18:52:21 benjamin.peterson set files: + py3k_raw_strings_unicode_escapes3.patch
messages: + msg64997
2008-04-05 17:03:01 georg.brandl set messages: + msg64990
2008-04-05 16:49:25 benjamin.peterson set files: + py3k_raw_strings_unicode_escapes2.patch
messages: + msg64986
2008-04-05 16:29:15 gvanrossum set messages: + msg64985
2008-04-05 15:35:19 benjamin.peterson set files: + py3k_raw_strings_unicode_escapes.patch
keywords: + patch
messages: + msg64984
2008-04-05 15:26:00 gvanrossum set status: closed -> open
assignee: georg.brandl
messages: + msg64982
components: + Documentation, - Unicode
nosy: + georg.brandl, gvanrossum
2008-04-05 14:52:07 benjamin.peterson set resolution: fixed -> rejected
messages: + msg64978
2008-04-03 16:27:40 benjamin.peterson set status: open -> closed
resolution: fixed
messages: + msg64900
2008-04-03 13:22:52 benjamin.peterson set priority: critical
messages: + msg64898
2008-04-03 13:15:00 amaury.forgeotdarc set status: closed -> open
resolution: not a bug -> (no value)
messages: + msg64897
nosy: + amaury.forgeotdarc
2008-04-03 12:54:47 benjamin.peterson set status: open -> closed
resolution: not a bug
messages: + msg64896
nosy: + benjamin.peterson
2008-04-03 04:19:03 jmillikin create