Message 354550 - Python tracker



Author	eryksun
Recipients	BTaskaya, eryksun, steve.dower
Date	2019-10-12.19:22:19
Message-id	<1570908140.02.0.402337981702.issue38454@roundup.psfhosted.org>

Content

The test assumes that Unix filesystems store names as arbitrary sequences of bytes, with only ASCII slash and null reserved. Windows NTFS stores names as arbitrary sequences of 16-bit words, with many reserved ASCII characters including \/:*?<>"| and control characters 0x00-0x1F. WSL implements a UTF-8 filesystem encoding over this by transcoding bytes from UTF-8 to UTF-16LE and escaping reserved characters (excepting slash and null) as sequences that begin with "#" (e.g. "<#" -> "#003C#0023"). The escaped names are only visible from Windows, in the distro's "LocalState\rootfs" tree.
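
The escaping can be illustrated with a rough sketch. This is not WSL's code: the names `RESERVED`, `escape_name` and `unescape_name` are hypothetical, and the "#" plus four uppercase hex digits format is only inferred from the single example above.

```python
# Hypothetical sketch of the "#" escaping; not WSL's actual implementation.
# Assumption: a reserved character becomes "#" followed by its code point as
# four uppercase hex digits, as in the example "<#" -> "#003C#0023".
# Slash and null are excluded because they cannot occur in a Unix name.
RESERVED = set('\\:*?<>"|#') | {chr(c) for c in range(0x01, 0x20)}


def escape_name(name: str) -> str:
    """Return the name as it would be stored on the NTFS side."""
    return ''.join(f'#{ord(ch):04X}' if ch in RESERVED else ch for ch in name)


def unescape_name(escaped: str) -> str:
    """Invert escape_name(); every "#" starts a five-character escape."""
    out = []
    i = 0
    while i < len(escaped):
        if escaped[i] == '#':
            out.append(chr(int(escaped[i + 1:i + 5], 16)))
            i += 5
        else:
            out.append(escaped[i])
            i += 1
    return ''.join(out)
```

Because "#" is itself escaped, the mapping is reversible for any name that is already valid text. It has no way to represent a byte that does not decode as UTF-8.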

This scheme fails for TESTFN_UNDECODABLE. Bytes that can't be transcoded to UTF-16LE are replaced by the replacement character U+FFFD. For example:

    >>> n = b'\xff'
    >>> open(n, 'w').close()
    >>> os.listdir(b'.')
    [b'\xef\xbf\xbd']
    >>> hex(ord(os.listdir('.')[0]))
    '0xfffd'
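
The same result can be reproduced without WSL by decoding with Python's "replace" error handler. This is only a model of the observed behavior; `lossy_roundtrip` is a hypothetical helper, not something WSL or the test suite defines.

```python
def lossy_roundtrip(name: bytes) -> bytes:
    """Model the observed transcoding: undecodable bytes become U+FFFD."""
    text = name.decode('utf-8', errors='replace')
    return text.encode('utf-8')


# The undecodable byte comes back as the UTF-8 encoding of U+FFFD.
print(lossy_roundtrip(b'\xff'))
# Distinct names collapse to the same stored name, so the mapping is lossy.
print(lossy_roundtrip(b'\xff') == lossy_roundtrip(b'\xfe'))
```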

WSL could address this by abandoning its current "#" escaping approach and instead translating all reserved and undecodable bytes to the U+DC00-U+DCFF surrogate range, like Python's "surrogateescape" error handler. The Windows API could even support this with a new flag for MultiByteToWideChar and WideCharToMultiByte.
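
A minimal sketch of that idea in Python follows. Python's own "surrogateescape" handler only maps undecodable bytes 0x80-0xFF to U+DC80-U+DCFF; mapping reserved ASCII characters to U+DC00-U+DC7F is the extension suggested here. The names `RESERVED`, `bytes_to_name` and `name_to_bytes` are hypothetical, and nothing like this exists in WSL or the Windows API.

```python
# Hypothetical sketch of the proposed mapping; not an existing WSL or Windows API.
# Reserved NTFS characters, excluding slash and null as in the text above.
RESERVED = set('\\:*?<>"|') | {chr(c) for c in range(0x01, 0x20)}


def bytes_to_name(raw: bytes) -> str:
    """Map a Unix name to text that is storable as 16-bit words."""
    # Undecodable bytes 0x80-0xFF become lone surrogates U+DC80-U+DCFF.
    text = raw.decode('utf-8', errors='surrogateescape')
    # Reserved ASCII characters are shifted into U+DC00-U+DC7F.
    return ''.join(chr(0xDC00 + ord(ch)) if ch in RESERVED else ch
                   for ch in text)


def name_to_bytes(name: str) -> bytes:
    """Invert bytes_to_name() and recover the original bytes."""
    text = ''.join(chr(ord(ch) - 0xDC00) if 0xDC00 <= ord(ch) <= 0xDC7F else ch
                   for ch in name)
    return text.encode('utf-8', errors='surrogateescape')
```

Unlike the replacement-character behavior, this mapping is reversible: `name_to_bytes(bytes_to_name(b'\xff'))` gives back `b'\xff'`.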

History
Date	User	Action	Args
2019-10-12 19:22:20	eryksun	set	recipients: + eryksun, steve.dower, BTaskaya
2019-10-12 19:22:20	eryksun	set	messageid: <1570908140.02.0.402337981702.issue38454@roundup.psfhosted.org>
2019-10-12 19:22:20	eryksun	link	issue38454 messages
2019-10-12 19:22:19	eryksun	create