Unicode escape sequences not parsed in raw strings. #46793

Closed
jmillikin mannequin opened this issue Apr 3, 2008 · 24 comments
Labels: docs (Documentation in the Doc dir), type-bug (An unexpected behavior, bug, or error)

Comments

@jmillikin
Mannequin

jmillikin mannequin commented Apr 3, 2008

BPO 2541
Nosy @malemburg, @gvanrossum, @birkenfeld, @amauryfa, @benjaminp, @jmillikin
Files
  • py3k_raw_strings_unicode_escapes.patch
  • py3k_raw_strings_unicode_escapes2.patch
  • py3k_raw_strings_unicode_escapes3.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/benjaminp'
    closed_at = <Date 2008-04-28.21:05:55.867>
    created_at = <Date 2008-04-03.04:19:03.157>
    labels = ['type-bug', 'docs']
    title = 'Unicode escape sequences not parsed in raw strings.'
    updated_at = <Date 2008-04-28.21:05:55.865>
    user = 'https://github.com/jmillikin'

    bugs.python.org fields:

    activity = <Date 2008-04-28.21:05:55.865>
    actor = 'benjamin.peterson'
    assignee = 'benjamin.peterson'
    closed = True
    closed_date = <Date 2008-04-28.21:05:55.867>
    closer = 'benjamin.peterson'
    components = ['Documentation']
    creation = <Date 2008-04-03.04:19:03.157>
    creator = 'jmillikin'
    dependencies = []
    files = ['9947', '9948', '9952']
    hgrepos = []
    issue_num = 2541
    keywords = ['patch']
    message_count = 24.0
    messages = ['64890', '64896', '64897', '64898', '64900', '64978', '64982', '64984', '64985', '64986', '64990', '64997', '65009', '65083', '65085', '65211', '65212', '65223', '65225', '65234', '65502', '65512', '65930', '65934']
    nosy_count = 7.0
    nosy_names = ['lemburg', 'gvanrossum', 'nnorwitz', 'georg.brandl', 'amaury.forgeotdarc', 'benjamin.peterson', 'jmillikin']
    pr_nums = []
    priority = 'critical'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue2541'
    versions = ['Python 3.0']

    @jmillikin
    Mannequin Author

    jmillikin mannequin commented Apr 3, 2008

    According to
    <http://docs.python.org/dev/3.0/reference/lexical_analysis.html#id9>,
    raw strings with \u and \U escape sequences should have these sequences
    parsed as usual. However, they are currently escaped.

    >>> r'\u0020'
    '\\u0020'
    
    Expected:
    >>> r'\u0020'
    ' '
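    For reference, a minimal sketch of the behaviour the thread eventually settles on (raw strings do not treat \u and \U specially), assuming a current Python 3 interpreter. The variable names are illustrative only.

```python
# In a raw string the backslash is kept literally, so \u0020 is six characters.
raw = r'\u0020'
# In a normal string the same escape is parsed into a single space.
cooked = '\u0020'

print(len(raw), repr(raw))
print(len(cooked), repr(cooked))

# If the escape should be interpreted after all, it can be decoded explicitly.
decoded = raw.encode('ascii').decode('unicode_escape')
print(repr(decoded))
```

    Decoding explicitly with the "unicode_escape" codec is one way to get the result the report expected without changing how raw literals are parsed.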

    @jmillikin jmillikin mannequin added topic-unicode type-bug An unexpected behavior, bug, or error labels Apr 3, 2008
    @benjaminp
    Contributor

    You use the "ur" string mode.

    >>> print ur"\u0020"
    " "

    @amauryfa
    Member

    amauryfa commented Apr 3, 2008

    No, it's about python 3.0. I confirm the problem, and propose a patch:

    --- Python/ast.c.original       2008-04-03 15:12:15.548389400 +0200
    +++ Python/ast.c        2008-04-03 15:12:28.359475800 +0200
    @@ -3232,7 +3232,7 @@
                 return NULL;
             }
         }
    -    if (!*bytesmode && !rawmode) {
    +    if (!*bytesmode) {
             return decode_unicode(s, len, rawmode, encoding);
         }
         if (*bytesmode) {

    @amauryfa amauryfa reopened this Apr 3, 2008
    @amauryfa amauryfa removed the invalid label Apr 3, 2008
    @benjaminp
    Contributor

    Thanks for noticing, Amaury, and your patch works for me.

    @benjaminp
    Contributor

    Fixed in r62128.

    @benjaminp
    Contributor

    Sorry, Guido said this is not allowed:
    http://mail.python.org/pipermail/python-3000/2008-April/012952.html. I
    reverted it in r62165.

    @gvanrossum
    Member

    The docs still need to be updated! An entry in what's new in 3.0 should
    also be added.

    @gvanrossum gvanrossum added docs Documentation in the Doc dir and removed topic-unicode labels Apr 5, 2008
    @gvanrossum gvanrossum reopened this Apr 5, 2008
    @benjaminp
    Contributor

    How's this?

    @gvanrossum
    Member

    Instead of "ignored" (which might be read ambiguously) how about "not
    treated specially"?

    You also still need to add some words to whatsnew.

    @benjaminp
    Contributor

    "not treated specially" it is!

    @birkenfeld
    Member

    The segment "use different rules for interpreting backslash escape
    sequences." should be killed entirely, and the whole rule told here.

    Also, a few paragraphs later there are more references to raw strings,
    e.g. "When an 'r' or 'R' prefix is used in a string literal,"
    which need to be fixed too.

    @benjaminp
    Contributor

    I made the requested improvements and mentioned it in NEWS. Is it
    worth putting something in the tutorial, since it mentions Unicode
    strings and raw strings?

    @amauryfa
    Member

    amauryfa commented Apr 5, 2008

    What about the "raw-unicode-escape" codec?
    Can we leave it different from raw string literals?

    @gvanrossum
    Member

    To be honest, I don't know what the uses are for that codec.

    @amauryfa
    Member

    amauryfa commented Apr 7, 2008

    pickle still uses it when protocol=0 (and cPickle as well, but in trunk/
    only of course)

    @malemburg
    Member

    You can't change the codec - it's being used in other places as well,
    e.g. for use cases where you need to have an 8-bit encoded readable
    version of a Unicode object (which happens to be Latin-1 + Unicode
    escapes for all non-Latin-1 characters, due to Unicode being a superset
    of Latin-1).

    Adding a new codec would be fine, though I don't know how this would map
    raw Unicode strings with non-Latin-1 characters in them to an 8-bit
    string. Perhaps this is not needed at all in Py3k.
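    The "Latin-1 + Unicode escapes" description above can be illustrated with a short sketch, assuming a current Python 3 interpreter. The sample text is made up for illustration.

```python
# 'café €': é (U+00E9) is in Latin-1, € (U+20AC) is not.
text = 'caf\u00e9 \u20ac'

# raw-unicode-escape keeps Latin-1 characters as single bytes and
# writes everything else as a \uXXXX escape.
raw_escaped = text.encode('raw-unicode-escape')

# unicode-escape produces pure ASCII, so é is escaped as well.
escaped = text.encode('unicode-escape')

print(raw_escaped)
print(escaped)

# Both forms round-trip through their own codec.
assert raw_escaped.decode('raw-unicode-escape') == text
assert escaped.decode('unicode-escape') == text
```

    The raw variant is the more compact of the two when the text is mostly Latin-1, at the cost of not being plain ASCII.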

    @amauryfa
    Member

    amauryfa commented Apr 8, 2008

    Isn't "unicode-escape" enough for this purpose?

    @malemburg
    Member

    What do you mean by "enough"?

    The "raw-unicode-escape" codec is used in Python 2.x to convert literal
    strings of the form ur"" to Unicode objects. It's a variant of the
    "unicode-escape" codec.

    The codec is also being used in cPickle, pickle, variants of pickle,
    Python code generators, etc.

    It serves its purpose, just like "unicode-escape" and all the other
    codecs in Python.
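    As a rough illustration of how the two codecs differ when decoding, here is a sketch assuming a current Python 3 interpreter. The input bytes are a made-up example.

```python
# Bytes containing a \u escape followed by a backslash-n sequence.
source = b'\\u0020\\n'

# raw-unicode-escape interprets only \uXXXX and \UXXXXXXXX;
# the backslash-n stays as two characters, as in a Python 2 ur"" literal.
raw_decoded = source.decode('raw-unicode-escape')

# unicode-escape interprets every escape a normal string literal would.
full_decoded = source.decode('unicode-escape')

print(repr(raw_decoded))
print(repr(full_decoded))
```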

    @amauryfa
    Member

    amauryfa commented Apr 8, 2008

    I mean: now that raw strings cannot represent all Unicode code points (or
    more precisely, they need the file encoding to do so), is there a use
    case for "raw-unicode-escape" that cannot be filled by the
    unicode-escape codec?

    Note that pickle does not use "raw-unicode-escape" as is: it replaces
    backslashes by \u005c. This has the nice effect that pickled strings can
    also be decoded by "unicode-escape".

    That's why I propose to completely remove raw-unicode-escape, and use
    unicode-escape instead.
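    The backslash replacement described above can be sketched as follows. This is an illustration of the idea, not pickle's actual code: the helper name escape_for_line is hypothetical, and only backslash and newline are handled here.

```python
def escape_for_line(text):
    """Hypothetical helper: pre-escape backslashes and newlines as \\uXXXX,
    then encode with raw-unicode-escape."""
    text = text.replace('\\', '\\u005c')
    text = text.replace('\n', '\\u000a')
    return text.encode('raw-unicode-escape')

original = 'C:\\temp\n\u20ac \u00e9'
data = escape_for_line(original)

# With backslashes pre-escaped, both codecs recover the same text.
via_raw = data.decode('raw-unicode-escape')
via_full = data.decode('unicode-escape')

# Without the pre-escaping, unicode-escape would read the backslash-t as a tab.
unescaped = 'C:\\temp'.encode('raw-unicode-escape')
misread = unescaped.decode('unicode-escape')

print(data)
print(repr(misread))
```

    The last two lines show why the replacement matters: plain raw-unicode-escape output is only safe to read back with unicode-escape if no literal backslashes remain in it.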

    @malemburg
    Member

    While that's true for cPickle, it is not for pickle. The pickle protocol
    itself is defined in terms of the "raw-unicode-escape" codec (see
    pickle.py).

    Besides, you cannot assume that the Python interpreter itself is the
    only use-case for these codecs. The "raw-unicode-escape" codec is well
    usable for other purposes where you need a compact way of encoding
    Unicode, especially if your strings are mostly Latin-1 and only
    include non-UCS2 code points every now and then. That's also the reason
    why pickle uses it.

    @nnorwitz
    Mannequin

    nnorwitz mannequin commented Apr 15, 2008

    What is the status of this bug? AFAICT, the code is now correct. Have
    the doc changes been applied? The resolution on this report should be
    updated too. It's currently rejected.

    @benjaminp
    Contributor

    It's rejected because the OP wanted \u and \U escapes to be applied in
    raw strings, and I haven't applied the doc changes because nobody has
    told me I should.

    @birkenfeld
    Member

    Please apply the patch, but rename "Unicode escapes" to "\u and \U
    escapes" first.

    @birkenfeld birkenfeld assigned benjaminp and unassigned birkenfeld Apr 28, 2008
    @benjaminp
    Contributor

    Fixed in r62568.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022