classification
Title: Encoded surrogate characters on command line not escaped in sys.argv
Type: behavior
Stage:
Components:
Versions: Python 3.1, Python 3.2
process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To:
Nosy List: baikie, ezio.melotti, loewis
Priority: normal
Keywords: patch

Created on 2009-05-24 18:03 by baikie, last changed 2009-05-29 16:22 by loewis. This issue is now closed.

Files
File name Uploaded Description
escape-surrogates.diff baikie, 2009-05-24 18:03 Escape surrogates using surrogateescape
Messages (2)
msg88272 - Author: David Watson (baikie) Date: 2009-05-24 18:03
The mbstowcs and mbrtowc functions which are used for the initial
conversion of command-line arguments on Unix can return lone or
paired surrogates (e.g. \udcff for \xed\xb3\xbf in non-strict
UTF-8), and these surrogates are currently placed into sys.argv
unescaped.  This creates various problems such as strings that
cannot be re-encoded into bytes and strings that could represent
more than one byte sequence.  The examples below use this
script in a UTF-8 locale on Linux:

import sys
print(repr(sys.argv[1]))
print(repr(sys.argv[1].encode(sys.getfilesystemencoding(),
"surrogateescape")))


Strings that cannot be re-encoded:

$ ./python argtest.py $'\xed\xa0\x80'
'\ud800'
Traceback (most recent call last):
  File "argtest.py", line 6, in <module>
    print(repr(sys.argv[1].encode(sys.getfilesystemencoding(),
"surrogateescape")))
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed

$ ./python argtest.py $'\xed\xb0\x80'
'\udc00'
Traceback (most recent call last):
  File "argtest.py", line 6, in <module>
    print(repr(sys.argv[1].encode(sys.getfilesystemencoding(),
"surrogateescape")))
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in
position 0: surrogates not allowed
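
The two failures above follow from the surrogateescape handler only
encoding code points in the range U+DC80..U+DCFF, which are the escapes
it produces for undecodable bytes.  A minimal sketch of that behaviour,
assuming a current CPython 3; the helper name can_reencode is made up
for illustration and is not part of the patch:

```python
def can_reencode(text, encoding="utf-8"):
    """Return True if text can be encoded with the surrogateescape handler."""
    try:
        text.encode(encoding, "surrogateescape")
        return True
    except UnicodeEncodeError:
        return False


# U+DCFF is the escape for byte 0xff, so it encodes back to b'\xff'.
print(can_reencode("\udcff"))
# A high surrogate is outside the escape range and cannot be encoded.
print(can_reencode("\ud800"))
# A low surrogate below U+DC80 is also outside the escape range.
print(can_reencode("\udc00"))
```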


Aliasing between non-decodable bytes and encoded lone surrogates:

$ ./python argtest.py $'\xff'
'\udcff'
b'\xff'

$ ./python argtest.py $'\xed\xb3\xbf'
'\udcff'
b'\xff'
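
The same aliasing can be reproduced without the command line.  This is
only a sketch: it assumes a current CPython 3 and uses the surrogatepass
error handler to stand in for the non-strict UTF-8 decoding done by
mbstowcs, which is not how sys.argv is actually built:

```python
raw_byte = b"\xff"                   # not valid UTF-8 at all
encoded_surrogate = b"\xed\xb3\xbf"  # UTF-8-style encoding of U+DCFF

# surrogateescape maps the undecodable byte 0xff to U+DCFF.
from_raw = raw_byte.decode("utf-8", "surrogateescape")
# surrogatepass accepts the encoded surrogate and also yields U+DCFF.
from_lenient = encoded_surrogate.decode("utf-8", "surrogatepass")

# Two different byte sequences have become the same string.
print(from_raw == from_lenient)

# Re-encoding can only recover one of them.
print(from_lenient.encode("utf-8", "surrogateescape"))
```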


Aliasing between encoding of a non-BMP character and encoding of
its UTF-16 representation (on narrow Unicode builds):

$ ./python argtest.py $'\xf0\x90\x80\x80'
'\U00010000'
b'\xf0\x90\x80\x80'

$ ./python argtest.py $'\xed\xa0\x80\xed\xb0\x80'
'\U00010000'
b'\xf0\x90\x80\x80'
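
Narrow builds no longer exist in current CPython, so the case above
cannot be reproduced directly.  The following sketch, which again uses
surrogatepass as a stand-in for non-strict decoding, shows why the two
byte sequences collide once the string is treated as UTF-16 code units:

```python
# Decode the two encoded surrogates leniently: U+D800 followed by U+DC00.
pair = b"\xed\xa0\x80\xed\xb0\x80".decode("utf-8", "surrogatepass")

# Viewed as UTF-16 code units, the pair is one non-BMP character.
code_units = pair.encode("utf-16-le", "surrogatepass")
combined = code_units.decode("utf-16-le")

# The well-formed UTF-8 encoding of the same character.
direct = b"\xf0\x90\x80\x80".decode("utf-8")

print(len(pair), len(combined))
print(combined == direct)
```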


Attached is a patch to fix these problems by replacing any
decoded characters in the range 0xd800...0xdfff with the
surrogateescape encodings of their source bytes.
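
The intended result can be illustrated in Python.  This is a sketch of
the round-trip property, not the patch itself (which works in C on the
output of mbstowcs).  It assumes a current CPython 3, whose UTF-8 decoder
rejects encoded surrogates, so each of their source bytes is escaped
individually; decode_arg is a hypothetical helper name:

```python
def decode_arg(raw, encoding="utf-8"):
    """Decode raw argument bytes, escaping anything that is not strictly valid."""
    return raw.decode(encoding, "surrogateescape")


samples = [
    b"\xff",                      # undecodable byte
    b"\xed\xb3\xbf",              # encoded lone low surrogate
    b"\xed\xa0\x80",              # encoded lone high surrogate
    b"\xed\xa0\x80\xed\xb0\x80",  # encoded surrogate pair
    b"\xf0\x90\x80\x80",          # well-formed non-BMP character
]

for raw in samples:
    text = decode_arg(raw)
    # Every argument maps to a distinct string and encodes back unchanged.
    assert text.encode("utf-8", "surrogateescape") == raw
    print(repr(raw), "->", ascii(text))
```

With this behaviour b'\xed\xb3\xbf' becomes '\udced\udcb3\udcbf' rather
than '\udcff', so it no longer aliases with b'\xff'.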
msg88514 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-05-29 16:22
Thanks for the patch. Committed as r73020.
History
Date User Action Args
2009-05-29 16:22:46 loewis set status: open -> closed; nosy: + loewis; messages: + msg88514; resolution: accepted
2009-05-29 07:06:39 ezio.melotti set nosy: + ezio.melotti
2009-05-24 18:03:31 baikie create