The mbstowcs and mbrtowc functions, which are used for the initial
conversion of command-line arguments on Unix, can return lone or
paired surrogates (e.g. \udcff for \xed\xb3\xbf in non-strict
UTF-8), and these surrogates are currently placed into sys.argv
unescaped. This creates problems such as strings that cannot be
re-encoded into bytes and strings that could represent more than
one byte sequence. The examples below use the following script in a
UTF-8 locale on Linux:
import sys
print(repr(sys.argv[1]))
print(repr(sys.argv[1].encode(sys.getfilesystemencoding(),
"surrogateescape")))
Strings that cannot be re-encoded:
$ ./python argtest.py $'\xed\xa0\x80'
'\ud800'
Traceback (most recent call last):
File "argtest.py", line 6, in <module>
print(repr(sys.argv[1].encode(sys.getfilesystemencoding(),
"surrogateescape")))
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed
$ ./python argtest.py $'\xed\xb0\x80'
'\udc00'
Traceback (most recent call last):
File "argtest.py", line 6, in <module>
print(repr(sys.argv[1].encode(sys.getfilesystemencoding(),
"surrogateescape")))
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in
position 0: surrogates not allowed
Aliasing between non-decodable bytes and encoded lone surrogates:
$ ./python argtest.py $'\xff'
'\udcff'
b'\xff'
$ ./python argtest.py $'\xed\xb3\xbf'
'\udcff'
b'\xff'
Aliasing between encoding of a non-BMP character and encoding of
its UTF-16 representation (on narrow Unicode builds):
$ ./python argtest.py $'\xf0\x90\x80\x80'
'\U00010000'
b'\xf0\x90\x80\x80'
$ ./python argtest.py $'\xed\xa0\x80\xed\xb0\x80'
'\U00010000'
b'\xf0\x90\x80\x80'
Attached is a patch to fix these problems by replacing any
decoded characters in the range 0xd800...0xdfff with the
surrogateescape encodings of their source bytes.
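As a rough illustration of the replacement described above (not the
attached patch itself, which works at the C level), the idea can be
sketched in Python. The helper name escape_decoded_surrogates is
hypothetical, and the "surrogatepass" error handler is only an
assumed stand-in for the lenient mbstowcs/mbrtowc decoding: it
accepts UTF-8-encoded surrogates but still rejects bytes such as
\xff, so this sketch covers only the surrogate cases.

```python
def escape_decoded_surrogates(raw: bytes) -> str:
    """Hypothetical helper illustrating the idea; not the attached patch."""
    # Assumption: "surrogatepass" emulates the non-strict C decoder for
    # UTF-8-encoded surrogates (it does not handle bytes like \xff).
    lenient = raw.decode("utf-8", "surrogatepass")
    out = []
    for ch in lenient:
        if 0xD800 <= ord(ch) <= 0xDFFF:
            # Recover the three source bytes of this decoded surrogate ...
            source = ch.encode("utf-8", "surrogatepass")
            # ... and map each byte 0xNN to U+DCNN, as surrogateescape does.
            out.append(source.decode("ascii", "surrogateescape"))
        else:
            out.append(ch)
    return "".join(out)


for raw in (b"\xed\xa0\x80", b"\xed\xb3\xbf", b"\xed\xa0\x80\xed\xb0\x80"):
    fixed = escape_decoded_surrogates(raw)
    # Re-encoding with surrogateescape now restores the original bytes.
    assert fixed.encode("utf-8", "surrogateescape") == raw
    print(ascii(fixed))
```

Under this scheme \xed\xb3\xbf becomes three escaped characters
rather than '\udcff', so it no longer aliases the escaped form of the
single byte \xff, and each of the byte sequences from the examples
re-encodes to exactly the bytes it came from.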