
Author eryksun
Recipients BTaskaya, eryksun, steve.dower
Date 2019-10-12.19:22:19
Message-id <1570908140.02.0.402337981702.issue38454@roundup.psfhosted.org>
In-reply-to
Content
The test assumes that Unix filesystems store names as arbitrary sequences of bytes, with only ASCII slash and null reserved. Windows NTFS stores names as arbitrary sequences of 16-bit words, with many reserved ASCII characters including \/:*?<>"| and control characters 0x00-0x1F. WSL implements a UTF-8 filesystem encoding over this by transcoding bytes from UTF-8 to UTF-16LE and escaping reserved characters (excepting slash and null) as sequences that begin with "#" (e.g. "<#" -> "#003C#0023"). The escaped names are only visible from Windows, in the distro's "LocalState\rootfs" tree.
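
To make the "#" escaping concrete, here is a minimal sketch in Python. It is only an illustration of the scheme as described above, not WSL's actual implementation: the names RESERVED_CHARS, escape_name and unescape_name are hypothetical, and the reserved set is assumed to be the characters listed above plus "#" itself, with slash and null left out.

```python
# Illustrative sketch only; not WSL's real code. The reserved set is an
# assumption based on the description above: \ : * ? < > " | and the control
# characters 0x01-0x1F, plus "#" itself (slash and null are excepted).
RESERVED_CHARS = set('\\:*?<>"|#') | {chr(c) for c in range(0x01, 0x20)}


def escape_name(name: bytes) -> str:
    """Transcode a UTF-8 name and escape reserved characters as #XXXX."""
    text = name.decode('utf-8')  # raises UnicodeDecodeError on invalid UTF-8
    return ''.join(
        '#{:04X}'.format(ord(ch)) if ch in RESERVED_CHARS else ch
        for ch in text
    )


def unescape_name(stored: str) -> bytes:
    """Reverse escape_name: turn each #XXXX back into its character."""
    out = []
    i = 0
    while i < len(stored):
        if stored[i] == '#':
            # Four hex digits follow every "#" produced by escape_name.
            out.append(chr(int(stored[i + 1:i + 5], 16)))
            i += 5
        else:
            out.append(stored[i])
            i += 1
    return ''.join(out).encode('utf-8')


print(escape_name(b'<#'))  # #003C#0023
```

Because "#" is itself escaped, every "#" in a stored name starts an escape sequence, which is what lets the mapping be reversed for any name that is valid UTF-8.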

This scheme fails for TESTFN_UNDECODABLE. Bytes that are not valid UTF-8, and so can't be transcoded to UTF-16LE, are replaced by the replacement character U+FFFD, which means the original name is lost. For example:

    >>> import os
    >>> n = b'\xff'
    >>> open(n, 'w').close()
    >>> os.listdir(b'.')
    [b'\xef\xbf\xbd']
    >>> hex(ord(os.listdir('.')[0]))
    '0xfffd'

WSL could address this by abandoning its current "#" escaping approach to instead translate all reserved and undecodable bytes to the U+DC00-U+DCFF surrogate range, like Python's "surrogateescape" error handler. The Windows API could even support this with a new flag for MultiByteToWideChar and WideCharToMultiByte.
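
A rough sketch of what such a mapping could look like, written in Python. Python's "surrogateescape" handler only covers undecodable bytes 0x80-0xFF (as U+DC80-U+DCFF), so the extension to reserved ASCII bytes (as U+DC00-U+DC7F) is added by hand here. The names RESERVED_BYTES, to_stored and from_stored are hypothetical, and the reserved set is assumed from the characters listed earlier:

```python
# Hypothetical sketch of the proposed scheme; not an existing WSL or
# Windows API behavior. Reserved set assumed: \ : * ? < > " | and 0x01-0x1F.
RESERVED_BYTES = frozenset(b'\\:*?<>"|') | frozenset(range(0x01, 0x20))


def to_stored(name: bytes) -> str:
    """Map a raw byte name to a string that avoids reserved characters."""
    # Undecodable bytes 0x80-0xFF become lone surrogates U+DC80-U+DCFF.
    text = name.decode('utf-8', errors='surrogateescape')
    # Reserved ASCII characters become lone surrogates U+DC00-U+DC7F.
    return ''.join(
        chr(0xDC00 + ord(ch)) if ord(ch) in RESERVED_BYTES else ch
        for ch in text
    )


def from_stored(stored: str) -> bytes:
    """Reverse to_stored and recover the original bytes."""
    text = ''.join(
        chr(ord(ch) - 0xDC00) if 0xDC00 <= ord(ch) < 0xDC80 else ch
        for ch in stored
    )
    return text.encode('utf-8', errors='surrogateescape')


print(ascii(to_stored(b'\xff<')))                # '\udcff\udc3c'
print(from_stored(to_stored(b'\xff<')) == b'\xff<')  # True
```

Unlike the U+FFFD replacement, this mapping round-trips: each undecodable or reserved byte keeps its own code point, so the original name can always be restored.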