Author eryksun
Recipients Neui, SilentGhost, eryksun, ncoghlan
Date 2019-02-01.23:49:22
Message-id <1549064962.38.0.0212201261109.issue35883@roundup.psfhosted.org>
Content
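On Unix, Python 3.6 decodes the char * command-line arguments via mbstowcs. On Linux, I see the following misbehavior of mbstowcs when it decodes an invalid six-byte UTF-8 sequence (strictly speaking not overlong, but encoding a value beyond U+10FFFF). Instead of failing, it returns a single out-of-range wide character: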

    >>> import ctypes
    >>> mbstowcs = ctypes.CDLL(None, use_errno=True).mbstowcs
    >>> arg = bytes(x + 128 for x in [1 + 124, 63, 63, 59, 58, 58])
    >>> mbstowcs(None, arg, 0)
    1
    >>> buf = (ctypes.c_int * 2)()
    >>> mbstowcs(buf, arg, 2)
    1
    >>> hex(buf[0])
    '0x7fffbeba'

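This shouldn't be an issue in 3.7, at least not with the default UTF-8 mode configuration. With this mode, Py_DecodeLocale calls _Py_DecodeUTF8Ex using the surrogateescape error handler [1], so an undecodable byte is mapped to a lone surrogate in the range U+DC80 to U+DCFF instead of being passed through mbstowcs.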
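
As a rough illustration of what the surrogateescape handler does with these bytes, the following sketch uses the Python-level UTF-8 codec. It assumes the C-level decoder treats invalid bytes the same way, which I have not verified against _Py_DecodeUTF8Ex itself:

```python
# Same six bytes as in the mbstowcs session: fd bf bf bb ba ba.
arg = bytes(x + 128 for x in [1 + 124, 63, 63, 59, 58, 58])

# Strict UTF-8 decoding rejects the sequence outright.
try:
    arg.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte 0xNN to the lone
# surrogate U+DCNN, so no out-of-range code point is produced.
decoded = arg.decode("utf-8", "surrogateescape")
code_points = [hex(ord(c)) for c in decoded]

print(strict_ok)    # False
print(code_points)  # ['0xdcfd', '0xdcbf', '0xdcbf', '0xdcbb', '0xdcba', '0xdcba']

# The mapping is reversible, so the original bytes can be recovered.
assert decoded.encode("utf-8", "surrogateescape") == arg
```

Every resulting code point stays within the valid range, and the round trip back to bytes is lossless.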

[1]: https://github.com/python/cpython/blob/v3.7.2/Python/fileutils.c#L456