Encoded surrogate characters on command line not escaped in sys.argv #50347

Closed
baikie mannequin opened this issue May 24, 2009 · 2 comments
Labels
type-bug An unexpected behavior, bug, or error

Comments

baikie mannequin commented May 24, 2009

BPO 6097
Nosy @loewis, @ezio-melotti
Files
  • escape-surrogates.diff: Escape surrogates using surrogateescape
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2009-05-29.16:22:46.178>
    created_at = <Date 2009-05-24.18:03:31.100>
    labels = ['type-bug']
    title = 'Encoded surrogate characters on command line not escaped in sys.argv'
    updated_at = <Date 2009-05-29.16:22:46.142>
    user = 'https://bugs.python.org/baikie'

    bugs.python.org fields:

    activity = <Date 2009-05-29.16:22:46.142>
    actor = 'loewis'
    assignee = 'none'
    closed = True
    closed_date = <Date 2009-05-29.16:22:46.178>
    closer = 'loewis'
    components = []
    creation = <Date 2009-05-24.18:03:31.100>
    creator = 'baikie'
    dependencies = []
    files = ['14054']
    hgrepos = []
    issue_num = 6097
    keywords = ['patch']
    message_count = 2.0
    messages = ['88272', '88514']
    nosy_count = 3.0
    nosy_names = ['loewis', 'baikie', 'ezio.melotti']
    pr_nums = []
    priority = 'normal'
    resolution = 'accepted'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue6097'
    versions = ['Python 3.1', 'Python 3.2']

    baikie mannequin commented May 24, 2009

    The mbstowcs and mbrtowc functions, which are used for the initial
    conversion of command-line arguments on Unix, can return lone or
    paired surrogates (e.g. \udcff for \xed\xb3\xbf in non-strict
    UTF-8), and these surrogates are currently placed into sys.argv
    unescaped. This creates various problems, such as strings that
    cannot be re-encoded into bytes and strings that could represent
    more than one byte sequence. Examples follow, using this
    script in a UTF-8 locale on Linux:

    import sys
    print(repr(sys.argv[1]))
    print(repr(sys.argv[1].encode(sys.getfilesystemencoding(),
    "surrogateescape")))

    Strings that cannot be re-encoded:

    $ ./python argtest.py $'\xed\xa0\x80'
    '\ud800'
    Traceback (most recent call last):
      File "argtest.py", line 6, in <module>
        print(repr(sys.argv[1].encode(sys.getfilesystemencoding(),
    "surrogateescape")))
    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
    position 0: surrogates not allowed
    
    $ ./python argtest.py $'\xed\xb0\x80'
    '\udc00'
    Traceback (most recent call last):
      File "argtest.py", line 6, in <module>
        print(repr(sys.argv[1].encode(sys.getfilesystemencoding(),
    "surrogateescape")))
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in
    position 0: surrogates not allowed
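
The failures above can be reproduced without a command line. The following is a minimal sketch, assuming a current CPython 3 whose UTF-8 codec is strict; the `surrogatepass` error handler stands in for the non-strict mbstowcs behaviour, and `can_reencode` is a hypothetical helper name, not part of the attached patch.

```python
def can_reencode(raw: bytes) -> bool:
    """Return True if the permissively decoded text encodes back to bytes."""
    # Assumption: "surrogatepass" imitates a non-strict decoder that lets
    # UTF-8-encoded surrogates through as lone surrogate code points.
    text = raw.decode("utf-8", "surrogatepass")
    try:
        # "surrogateescape" only maps U+DC80..U+DCFF back to single bytes;
        # any other surrogate still raises UnicodeEncodeError.
        text.encode("utf-8", "surrogateescape")
    except UnicodeEncodeError:
        return False
    return True


high_surrogate_ok = can_reencode(b"\xed\xa0\x80")  # decodes to '\ud800'
low_surrogate_ok = can_reencode(b"\xed\xb0\x80")   # decodes to '\udc00'
escaped_range_ok = can_reencode(b"\xed\xb3\xbf")   # decodes to '\udcff'
```

Only surrogates in the range that `surrogateescape` reserves for escaped bytes survive re-encoding, which is why the first two arguments fail while the third does not.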

    Aliasing between non-decodable bytes and encoded lone surrogates:

    $ ./python argtest.py $'\xff'
    '\udcff'
    b'\xff'
    
    $ ./python argtest.py $'\xed\xb3\xbf'
    '\udcff'
    b'\xff'
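
The aliasing can be shown directly with the codecs. This is a sketch under the same assumption as before: `surrogatepass` models the non-strict conversion, while `surrogateescape` models how an undecodable byte is escaped.

```python
# An undecodable byte is escaped to a lone surrogate in U+DC80..U+DCFF.
from_bad_byte = b"\xff".decode("utf-8", "surrogateescape")

# Assumption: a non-strict decoder accepts the UTF-8 encoding of that same
# surrogate and returns it unescaped ("surrogatepass" imitates this).
from_encoded_surrogate = b"\xed\xb3\xbf".decode("utf-8", "surrogatepass")

# Both inputs collapse to the same string, so the original bytes are lost.
aliased = from_bad_byte == from_encoded_surrogate
round_trip = from_encoded_surrogate.encode("utf-8", "surrogateescape")
```

Here `round_trip` is `b'\xff'` for both inputs, so a program cannot tell which byte sequence the user actually passed.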

    Aliasing between encoding of a non-BMP character and encoding of
    its UTF-16 representation (on narrow Unicode builds):

    $ ./python argtest.py $'\xf0\x90\x80\x80'
    '\U00010000'
    b'\xf0\x90\x80\x80'
    
    $ ./python argtest.py $'\xed\xa0\x80\xed\xb0\x80'
    '\U00010000'
    b'\xf0\x90\x80\x80'

    Attached is a patch to fix these problems by replacing any
    decoded characters in the range 0xd800...0xdfff with the
    surrogateescape encodings of their source bytes.
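
The idea behind the patch can be sketched in Python as follows. This is an illustration only, not the patch itself (which changes the C conversion code): `escape_decoded_surrogates` is a hypothetical name, and the input is assumed to be the raw output of a non-strict decoder, before any surrogateescape escaping has been applied.

```python
def escape_decoded_surrogates(decoded: str) -> str:
    """Replace each surrogate with the escaped form of its UTF-8 source bytes."""
    out = []
    for ch in decoded:
        if 0xD800 <= ord(ch) <= 0xDFFF:
            # Recover the three bytes that encoded this surrogate ...
            source = ch.encode("utf-8", "surrogatepass")
            # ... and escape each one as U+DC00 + byte, as surrogateescape does.
            out.extend(chr(0xDC00 + byte) for byte in source)
        else:
            out.append(ch)
    return "".join(out)


# Assumption: "surrogatepass" imitates the non-strict conversion.
lone = escape_decoded_surrogates(b"\xed\xb3\xbf".decode("utf-8", "surrogatepass"))
pair_bytes = b"\xed\xa0\x80\xed\xb0\x80"
pair = escape_decoded_surrogates(pair_bytes.decode("utf-8", "surrogatepass"))

# After escaping, every argument round-trips to exactly its source bytes.
lone_bytes = lone.encode("utf-8", "surrogateescape")
pair_round_trip = pair.encode("utf-8", "surrogateescape")
```

With this escaping, `\xed\xb3\xbf` no longer collides with `\xff`, and the encoded surrogate pair no longer collides with the four-byte encoding of U+10000.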

    @baikie baikie mannequin added the type-bug An unexpected behavior, bug, or error label May 24, 2009

    loewis mannequin commented May 29, 2009

    Thanks for the patch. Committed as r73020.

    @loewis loewis mannequin closed this as completed May 29, 2009
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022