
Author baikie
Recipients baikie
Date 2009-05-24.18:03:27
Message-id <1243188212.8.0.809518325398.issue6097@psf.upfronthosting.co.za>
In-reply-to
Content
The mbstowcs and mbrtowc functions, which are used for the initial
conversion of command-line arguments on Unix, can return lone or
paired surrogates (e.g. \udcff for \xed\xb3\xbf in non-strict
UTF-8), and these surrogates are currently placed into sys.argv
unescaped.  This creates various problems, such as strings that
cannot be re-encoded into bytes and strings that could represent
more than one byte sequence.  The examples below use this
script in a UTF-8 locale on Linux:

import sys
print(repr(sys.argv[1]))
print(repr(sys.argv[1].encode(sys.getfilesystemencoding(),
"surrogateescape")))


Strings that cannot be re-encoded:

$ ./python argtest.py $'\xed\xa0\x80'
'\ud800'
Traceback (most recent call last):
  File "argtest.py", line 6, in <module>
    print(repr(sys.argv[1].encode(sys.getfilesystemencoding(),
"surrogateescape")))
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed

$ ./python argtest.py $'\xed\xb0\x80'
'\udc00'
Traceback (most recent call last):
  File "argtest.py", line 6, in <module>
    print(repr(sys.argv[1].encode(sys.getfilesystemencoding(),
"surrogateescape")))
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in
position 0: surrogates not allowed
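
The two failures above come down to the range that the surrogateescape
handler covers: it only turns U+DC80..U+DCFF back into bytes, so any other
surrogate in the string makes the encode fail. A minimal sketch of that
check, where can_reencode is a hypothetical helper name and the encoding is
assumed to be UTF-8:

```python
def can_reencode(arg: str) -> bool:
    """Report whether arg survives encoding with surrogateescape.

    Hypothetical helper for illustration; assumes a UTF-8 locale.
    """
    try:
        arg.encode("utf-8", "surrogateescape")
    except UnicodeEncodeError:
        return False
    return True


# U+D800 and U+DC00 fall outside U+DC80..U+DCFF, so they cannot be encoded.
for ch in ("\ud800", "\udc00", "\udc80", "\udcff"):
    print(ascii(ch), can_reencode(ch))
```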


Aliasing between non-decodable bytes and encoded lone surrogates:

$ ./python argtest.py $'\xff'
'\udcff'
b'\xff'

$ ./python argtest.py $'\xed\xb3\xbf'
'\udcff'
b'\xff'
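
The same collision can be reproduced without the command line. In this
sketch the "surrogatepass" handler is only a stand-in for a lenient
mbstowcs that accepts encoded surrogates; it is an assumption for
illustration, not what the interpreter startup code calls:

```python
# Two different byte strings that an argument could have arrived as.
undecodable = b"\xff"              # not valid UTF-8 at all
lone_surrogate = b"\xed\xb3\xbf"   # non-strict UTF-8 for U+DCFF

# surrogateescape maps the undecodable byte 0xFF to U+DC00 + 0xFF.
from_escape = undecodable.decode("utf-8", "surrogateescape")
# A lenient decoder yields the very same code point from three bytes.
from_lenient = lone_surrogate.decode("utf-8", "surrogatepass")

collide = from_escape == from_lenient
# Encoding back can only ever produce one of the two originals.
round_trip = from_lenient.encode("utf-8", "surrogateescape")
print(collide, round_trip)
```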


Aliasing between encoding of a non-BMP character and encoding of
its UTF-16 representation (on narrow Unicode builds):

$ ./python argtest.py $'\xf0\x90\x80\x80'
'\U00010000'
b'\xf0\x90\x80\x80'

$ ./python argtest.py $'\xed\xa0\x80\xed\xb0\x80'
'\U00010000'
b'\xf0\x90\x80\x80'
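
On a narrow build the two decoded surrogates are indistinguishable from the
UTF-16 form of the non-BMP character, which is where this aliasing comes
from. The arithmetic can be sketched as follows; combine_surrogates is a
hypothetical helper, and "surrogatepass" again stands in for a lenient
decoder:

```python
def combine_surrogates(high: int, low: int) -> int:
    """Return the code point that a UTF-16 surrogate pair stands for."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# \xed\xa0\x80 and \xed\xb0\x80 are non-strict UTF-8 for U+D800 and U+DC00.
pair = b"\xed\xa0\x80\xed\xb0\x80".decode("utf-8", "surrogatepass")
code_point = combine_surrogates(ord(pair[0]), ord(pair[1]))
print(hex(code_point))
```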


Attached is a patch to fix these problems by replacing any
decoded characters in the range 0xd800...0xdfff with the
surrogateescape encodings of their source bytes.
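
The idea can be modelled in pure Python. This is only a rough sketch of the
approach described above, not the code in the attached patch: decode_arg is
a hypothetical name, UTF-8 is assumed, and "surrogatepass" stands in for a
lenient mbstowcs:

```python
def decode_arg(raw: bytes) -> str:
    """Decode UTF-8 leniently, then escape any decoded surrogate.

    Hypothetical helper; not the code from the attached patch.
    """
    out = []
    i = 0
    while i < len(raw):
        for n in (1, 2, 3, 4):  # a UTF-8 sequence is 1 to 4 bytes long
            chunk = raw[i:i + n]
            try:
                text = chunk.decode("utf-8", "surrogatepass")
            except UnicodeDecodeError:
                continue
            if len(text) == 1:
                break
        else:
            # Nothing decodable starts here: escape this one byte.
            chunk, text = raw[i:i + 1], None
        if text is None or 0xD800 <= ord(text) <= 0xDFFF:
            # Replace the character with escapes of its source bytes.
            out.extend(chr(0xDC00 + b) for b in chunk)
        else:
            out.append(text)
        i += len(chunk)
    return "".join(out)


for raw in (b"\xff", b"\xed\xb3\xbf", b"\xed\xa0\x80\xed\xb0\x80"):
    arg = decode_arg(raw)
    print(ascii(arg), arg.encode("utf-8", "surrogateescape") == raw)
```

With every source byte of a decoded surrogate escaped separately, each of
the byte strings from the examples maps to a distinct string and encodes
back to the bytes it came from.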