The 'raw_unicode_escape' codec buggy + not appropriate for Python 3.x #63738

Closed
zuo mannequin opened this issue Nov 10, 2013 · 8 comments
Labels: docs (Documentation in the Doc dir), topic-unicode, type-feature (A feature request or enhancement)

Comments

@zuo
Mannequin

zuo mannequin commented Nov 10, 2013

BPO 19539
Nosy @malemburg, @terryjreedy, @vstinner, @ezio-melotti, @vadmium
Superseder
  • bpo-19548: 'codecs' module docs improvements

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2014-12-26.00:31:41.928>
    created_at = <Date 2013-11-10.02:51:45.316>
    labels = ['type-feature', 'expert-unicode', 'docs']
    title = "The 'raw_unicode_escape' codec buggy + not appropriate for Python 3.x"
    updated_at = <Date 2014-12-28.08:48:33.623>
    user = 'https://bugs.python.org/zuo'

    bugs.python.org fields:

    activity = <Date 2014-12-28.08:48:33.623>
    actor = 'vstinner'
    assignee = 'docs@python'
    closed = True
    closed_date = <Date 2014-12-26.00:31:41.928>
    closer = 'zuo'
    components = ['Documentation', 'Unicode']
    creation = <Date 2013-11-10.02:51:45.316>
    creator = 'zuo'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 19539
    keywords = []
    message_count = 8.0
    messages = ['202505', '202507', '202591', '202643', '232851', '233010', '233102', '233147']
    nosy_count = 7.0
    nosy_names = ['lemburg', 'terry.reedy', 'vstinner', 'ezio.melotti', 'zuo', 'docs@python', 'martin.panter']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '19548'
    type = 'enhancement'
    url = 'https://bugs.python.org/issue19539'
    versions = ['Python 3.3', 'Python 3.4']

    It seems that the 'raw_unicode_escape' codec:

    1. produces data that could be suitable for Python 2.x raw unicode string literals and not for Python 3.x raw unicode string literals (in Python 3.x \u... escapes are also treated literally);

    2. seems to be buggy anyway: characters in the range 128-255 are encoded as single 'latin-1' bytes (in Python 3.x this is definitely a bug; even in Python 2.x the feature is dubious, although at least Python 2's eval() and compile() functions officially accept 'latin-1'-encoded byte strings...).

    Python 3.3:

    >>> b = "zażółć".encode('raw_unicode_escape')
    >>> literal = b'r"' + b + b'"'
    >>> literal
    b'r"za\\u017c\xf3\\u0142\\u0107"'
    >>> eval(literal)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 1
    SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xf3 in position 8: invalid continuation byte
    >>> b'\xf3'.decode('latin-1')
    'ó'
    >>> b = "zaż".encode('raw_unicode_escape')
    >>> literal = b'r"' + b + b'"'
    >>> literal
    b'r"za\\u017c"'
    >>> eval(literal)
    'za\\u017c'
    >>> print(eval(literal))
    za\u017c

    I believe that the 'raw_unicode_escape' codec should either be deprecated and later removed, or be modified to accept only printable ASCII characters.

    PS. Also, as a side note: neither 'raw_unicode_escape' nor 'unicode_escape' escapes quotes (see issue bpo-7615). Shouldn't that at least be documented explicitly?
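
A minimal sketch of the quoting problem mentioned in the PS, assuming a sample string that contains double quotes (the sample text and variable names are illustrative only). Both codecs leave the quotes untouched, so wrapping the output in quotes does not yield a valid literal:

```python
# Sample text containing double quotes (illustrative only).
text = 'say "hi"'

raw = text.encode("raw_unicode_escape")
esc = text.encode("unicode_escape")
print(raw, esc)  # neither codec escapes the quotes

# Wrapping the encoded bytes in double quotes gives a broken literal.
literal = b'"' + raw + b'"'
try:
    compile(literal, "<literal>", "eval")
    compiled_ok = True
except SyntaxError:
    compiled_ok = False
print(compiled_ok)
```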

    @zuo zuo mannequin added stdlib Python modules in the Lib dir topic-unicode labels Nov 10, 2013
    @serhiy-storchaka
    Member

    The 'raw_unicode_escape' codec can be neither removed nor changed because it is used in the pickle protocol. Just don't use it if its behavior looks weird to you.

    The right way to decode raw_unicode_escape-encoded data is to use the 'raw_unicode_escape' decoder.

    If a string doesn't contain quotes, you can use eval(), but you should first decode the data from latin1 and encode it to UTF-8:

    >>> literal = ('r"%s"' % "zażółć".encode('raw_unicode_escape').decode('latin1')).encode()
    >>> literal
    b'r"za\\u017c\xc3\xb3\\u0142\\u0107"'
    >>> eval(literal)
    'za\\u017có\\u0142\\u0107'
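
For comparison, a short sketch of the decoder-based round trip recommended above, reusing the sample string from this thread. It goes through the codec in both directions instead of through eval():

```python
text = "zażółć"

# Code points above U+00FF become \uXXXX escapes;
# 'ó' (U+00F3) stays a single Latin-1 byte.
encoded = text.encode("raw_unicode_escape")

# Decoding with the same codec restores the original string.
decoded = encoded.decode("raw_unicode_escape")

print(encoded)
print(decoded == text)
```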

    @zuo
    Mannequin Author

    zuo mannequin commented Nov 11, 2013

    Which means that the description "Produce a string that is suitable as raw Unicode literal in Python source code" is (in Python 3.x) no longer true.

    So, if change/removal is not possible because of internal significance of the codec, I believe that the description should be changed to something like: "For internal use. This codec *does not* produce anything suitable as a raw string literal in Python 3.x source code."

    @zuo zuo mannequin added the docs Documentation in the Doc dir label Nov 11, 2013
    @zuo zuo mannequin assigned docspython Nov 11, 2013
    @serhiy-storchaka serhiy-storchaka added type-feature A feature request or enhancement and removed stdlib Python modules in the Lib dir labels Nov 11, 2013
    @malemburg
    Member

    Jan, the codec implements an encoding which has certain characteristics just like any other codec. It works both in Python 2 and 3 without problems.

    The documentation is no longer true, though. Ever since we added encoding markers to source files, the raw Unicode string literals depended on this encoding setting. Before this change the docs were fine, since Unicode literals were interpreted as Latin-1 encoded.

    More correct would be: "Produce a string that uses Unicode escapes to encode non-Latin-1 code points. It is used in the Python pickle protocol."
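
The proposed wording can be illustrated with a small sketch. The sample characters and labels below are chosen for illustration only, and the pickle check assumes protocol 0 (the text protocol), where the escapes are visible in the output:

```python
import pickle

# One sample character per range (illustrative choices).
samples = {
    "A": "ASCII",
    "\u00f3": "Latin-1 range (U+00F3)",
    "\u017c": "BMP beyond Latin-1 (U+017C)",
    "\U0001f600": "outside the BMP (U+1F600)",
}

encoded = {ch: ch.encode("raw_unicode_escape") for ch in samples}
for ch, label in samples.items():
    print(f"{label}: {encoded[ch]!r}")

# Protocol 0 pickles of str objects contain the same kind of escapes.
pickled = pickle.dumps("\u017c", protocol=0)
print(pickled)
```

Only the last two samples come out as backslash escapes; the first two are emitted as single bytes.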

    @malemburg malemburg changed the title The 'raw_unicode_escape' codec buggy + not apropriate for Python 3.x The 'raw_unicode_escape' codec buggy + not appropriate for Python 3.x Nov 11, 2013
    @vadmium
    Member

    vadmium commented Dec 18, 2014

    I included the proposed doc fix in my patch for bpo-19548.

    @vadmium
    Member

    vadmium commented Dec 22, 2014

    Re-reading the suggested description, it struck me that for encoding, this is redundant with the “backslashreplace” error handler:

    >>> import sys
    >>> test = "".join(map(chr, range(sys.maxunicode + 1)))
    >>> test.encode("raw-unicode-escape") == test.encode("latin-1", "backslashreplace")
    True

    However, decoding also seems similar to “unicode_escape”, except that only \uXXXX and \UXXXXXXXX seem to be supported.

    Maybe there should be a warning that backslashes are not escaped:

    >>> "\\u005C".encode("raw-unicode-escape").decode("raw-unicode-escape")
    '\\'
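
One possible workaround, sketched under the assumption that the caller controls both the encoding and the decoding side: pre-escape literal backslashes as \u005c before encoding. The helper name `encode_roundtrip_safe` is hypothetical and not part of any library.

```python
def encode_roundtrip_safe(text):
    # Hypothetical helper: turn every literal backslash into the escape
    # \u005c first, so the decoder cannot mistake user text for an escape.
    return text.replace("\\", "\\u005c").encode("raw_unicode_escape")


text = "\\u0041"  # six characters: a backslash followed by "u0041"

# Plain round trip: the decoder interprets the text as an escape for "A".
naive = text.encode("raw_unicode_escape").decode("raw_unicode_escape")

# Pre-escaped round trip: the original six characters come back.
safe = encode_roundtrip_safe(text).decode("raw_unicode_escape")

print(naive == text)
print(safe == text)
```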

    @zuo zuo mannequin closed this as completed Dec 26, 2014
    @zuo
    Mannequin Author

    zuo mannequin commented Dec 26, 2014

    My concerns are now being addressed in bpo-19548.

    @vstinner
    Member

    This issue is just a documentation issue. The doc must be more explicit:
    it should explain that the codec is only used internally by the pickle
    module, and that its output can no longer be used by eval().

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022