classification
Title: os.path states that bytes can't represent all MBCS paths under Windows
Type: enhancement
Stage:
Components: Documentation, Unicode, Windows
Versions: Python 3.10, Python 3.9, Python 3.8
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: docs@python, ericzolf, eryksun, ezio.melotti, paul.moore, steve.dower, tim.golden, zach.ware
Priority: normal
Keywords:

Created on 2021-03-04 06:31 by ericzolf, last changed 2021-03-18 01:43 by vstinner.

Messages (4)
msg388077 - Author: Eric L. (ericzolf) - Date: 2021-03-04 06:31
The os.path documentation at https://docs.python.org/3/library/os.path.html states that:

> Vice versa, using bytes objects cannot represent all file names on Windows (in the standard mbcs encoding), hence Windows applications should use string objects to access all files.

This doesn't sound right, or is at least misleading: anything can be represented as bytes, since everything in a computer is bytes at the end of the day. The only exception would be if mbcs really used something like half-bytes, and I couldn't find any sign of that (skimming through the documentation, Microsoft seems to treat it as DBCS, i.e. one or two bytes per character).

I could imagine that the intended meaning is that some byte combinations can't be used as a path under Windows, but I just don't know, and in any case that wouldn't be a valid reason not to use bytes under Windows (IMHO).
msg388090 - Author: Eryk Sun (eryksun) (Python triager) - Date: 2021-03-04 14:27
> Vice versa, using bytes objects cannot represent all file names 
> on Windows (in the standard mbcs encoding), hence Windows 
> applications should use string objects to access all files.

This is outdated advice that should be removed, or at least reworded to emphasize that the 'mbcs' encoding is only used in legacy mode, with a link to the documentation of sys._enablelegacywindowsfsencoding [1].
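One plausible reading of the quoted sentence is that a legacy ANSI code page cannot encode every Unicode filename, so a bytes path in that encoding cannot name every file. The following is only a minimal sketch of that idea. It assumes cp1252 as a stand-in for 'mbcs' (the 'mbcs' codec itself exists only on Windows, and the real code page depends on the system locale), and `representable` is a hypothetical helper, not part of any library.

```python
# Assumption: cp1252 stands in for the legacy 'mbcs' (ANSI) code page.
LEGACY_CODEC = "cp1252"


def representable(name, codec=LEGACY_CODEC):
    """Return True if `name` can be strictly encoded with `codec`."""
    try:
        name.encode(codec, "strict")
    except UnicodeEncodeError:
        return False
    return True


# Western European letters exist in cp1252.
print(representable("r\u00e9sum\u00e9.txt"))
# CJK characters do not, so no cp1252 byte string can name this file.
print(representable("\u65e5\u672c\u8a9e.txt"))
```

With UTF-8 as the filesystem encoding (the default since 3.6), both names encode without loss, which is why the old advice no longer applies outside legacy mode.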

Starting in Python 3.6, the default filesystem encoding in Windows is UTF-8. Internally, what happens is that a UTF-8 byte string gets translated to UTF-16 (2 or 4 bytes per character), the native Unicode encoding of the Windows API. 
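The translation described above can be imitated in pure Python to see the "2 or 4 bytes per character" claim. This is a sketch only: the real conversion happens in C inside the interpreter, and `utf8_to_utf16_units` is a hypothetical helper name.

```python
def utf8_to_utf16_units(raw):
    """Decode UTF-8 bytes and return the 16-bit code units of the UTF-16 form."""
    wide = raw.decode("utf-8").encode("utf-16-le")
    return [int.from_bytes(wide[i:i + 2], "little") for i in range(0, len(wide), 2)]


# One UTF-8 byte, one 16-bit unit (2 bytes in UTF-16).
print([hex(u) for u in utf8_to_utf16_units(b"a")])
# Three UTF-8 bytes (EURO SIGN), still one 16-bit unit.
print([hex(u) for u in utf8_to_utf16_units("\u20ac".encode("utf-8"))])
# Four UTF-8 bytes (U+1D11E), a surrogate pair: two units, 4 bytes in UTF-16.
print([hex(u) for u in utf8_to_utf16_units("\U0001D11E".encode("utf-8"))])
```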

A caveat is that Windows filesystems use 16-bit characters that are not restricted to valid Unicode. In particular, ordinals U+D800-U+DFFF are not reserved for use in surrogate pairs, so a name may contain a lone surrogate. This is "Wobbly" Unicode, and the filesystem encoding thus needs to be "Wobbly Transformation Format, 8-bit" (WTF-8). This is implemented in Python by setting the filesystem error handler to "surrogatepass", in contrast to using "surrogateescape" in POSIX. For example, os.fsencode('\ud800') succeeds in Windows but fails in POSIX, while os.fsdecode(b'\x80') fails in Windows but succeeds in POSIX. The latter case is not a practical problem since filesystem functions will never return an invalid WTF-8 byte string.
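The two behaviours can be reproduced on any platform by passing the error handler to the codec explicitly instead of going through os.fsencode() and os.fsdecode(). This sketch assumes UTF-8 as the filesystem encoding on both sides. `attempt` is a hypothetical helper that reports the exception type instead of raising.

```python
def attempt(func):
    """Return func()'s result, or the name of the Unicode error it raises."""
    try:
        return func()
    except UnicodeError as exc:
        return type(exc).__name__


lone = "\ud800"   # unpaired high surrogate
stray = b"\x80"   # a byte that is not valid UTF-8 on its own

# "surrogatepass" (Windows): lone surrogates are encoded as 3-byte sequences,
# but arbitrary invalid bytes are still rejected when decoding.
win_encode = attempt(lambda: lone.encode("utf-8", "surrogatepass"))
win_decode = attempt(lambda: stray.decode("utf-8", "surrogatepass"))

# "surrogateescape" (POSIX): undecodable bytes map to U+DC80..U+DCFF,
# and only those escape codes can be encoded back.
posix_encode = attempt(lambda: lone.encode("utf-8", "surrogateescape"))
posix_decode = attempt(lambda: stray.decode("utf-8", "surrogateescape"))

# ascii() keeps lone surrogates printable on any console.
for label, value in [("win encode", win_encode), ("win decode", win_decode),
                     ("posix encode", posix_encode), ("posix decode", posix_decode)]:
    print(label, ascii(value))
```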

---
[1] https://docs.python.org/3/library/sys.html#sys._enablelegacywindowsfsencoding
msg388147 - Author: Eric L. (ericzolf) - Date: 2021-03-05 06:03
Very confusing but very interesting. I'm trying to follow because I'm the main maintainer of the rdiff-backup software, which is cross-platform, so these small differences might become important.

Now, looking into the docs and following your explanations, I noticed that https://docs.python.org/3/library/os.html#os.fsencode and https://docs.python.org/3/library/os.html#os.fsdecode state that the 'strict' error handler is used under Windows, not the 'surrogatepass' you mention. Again an issue with the documentation?

Also, the 2nd paragraph of https://docs.python.org/3.8/library/os.html#file-names-command-line-arguments-and-environment-variables speaks only of surrogateescape and doesn't distinguish between POSIX and Windows.

Very interesting but very confusing...
msg388151 - Author: Eryk Sun (eryksun) (Python triager) - Date: 2021-03-05 10:49
>  instead of the stated 'surrogatepass'

In Python 3.6 and above, you can check this as follows:

    >>> import sys
    >>> sys.getfilesystemencoding()
    'utf-8'
    >>> sys.getfilesystemencodeerrors()
    'surrogatepass'

In Python 3.5 and previous:

    >>> sys.getfilesystemencoding()
    'mbcs'

In 3.5, the error handler used by fsencode() and fsdecode() was hard coded as 'strict' for the 'mbcs' encoding, and otherwise 'surrogateescape'.
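As a rough illustration of that hard-coded rule (a sketch of the behaviour described here, not the actual source of os.py), the selection amounts to the following. `legacy_fs_errors` is a hypothetical name.

```python
def legacy_fs_errors(encoding):
    """Pick the error handler the way 3.5 is described as doing it."""
    if encoding == "mbcs":
        return "strict"
    return "surrogateescape"


print(legacy_fs_errors("mbcs"))
print(legacy_fs_errors("utf-8"))
```

From 3.6 on, the handler is no longer derived from the encoding name like this; it is reported by sys.getfilesystemencodeerrors().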

> https://docs.python.org/3/library/os.html#os.fsencode
> https://docs.python.org/3/library/os.html#os.fsdecode

The above documentation needs to be updated to reference sys.getfilesystemencodeerrors(), as do the doc strings:

    >>> import os, textwrap
    >>> print(textwrap.dedent(os.fsencode.__doc__))

    Encode filename to the filesystem encoding with 'surrogateescape' error
    handler, return bytes unchanged. On Windows, use 'strict' error handler if
    the file system encoding is 'mbcs' (which is the default encoding).

    >>> print(textwrap.dedent(os.fsdecode.__doc__))

    Decode filename from the filesystem encoding with 'surrogateescape' error
    handler, return str unchanged. On Windows, use 'strict' error handler if
    the file system encoding is 'mbcs' (which is the default encoding).
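What the updated wording would need to convey is that, for str and bytes arguments in 3.6 and above, the two functions behave as if they used the pair reported by sys. A minimal sketch, with `fsencode_by_hand` and `fsdecode_by_hand` as hypothetical names and an ASCII-only filename so that it runs under any locale:

```python
import os
import sys

encoding = sys.getfilesystemencoding()
errors = sys.getfilesystemencodeerrors()


def fsencode_by_hand(name):
    """Encode a str filename with the reported encoding and error handler."""
    return name.encode(encoding, errors)


def fsdecode_by_hand(raw):
    """Decode a bytes filename with the reported encoding and error handler."""
    return raw.decode(encoding, errors)


name = "example.txt"
print(encoding, errors)
print(fsencode_by_hand(name) == os.fsencode(name))
print(fsdecode_by_hand(b"example.txt") == os.fsdecode(b"example.txt"))
```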

> https://docs.python.org/3/library/os.html#file-names-command-line-arguments-and-environment-variables

This should be rewritten to link to sys.getfilesystemencodeerrors(). I'm fine with only discussing the use of "surrogateescape", which is a significant concern in POSIX systems, where it is easy and common for filenames to be created with an arbitrary encoding.

I don't know if the use of "surrogatepass" in Windows warrants discussion. It is uncommon to need the error handler because the filesystem is Unicode. A user is unlikely to create a filename with an unpaired surrogate code. 

That said, before Windows 10, the legacy console allowed copying half of a surrogate pair to the clipboard, and a program could have a bug that nulls the second surrogate code in the pair (e.g. when limiting the length of a filename). Anyway, it's technically possible, so we support it. For example, "😈" (U+0001F608) is encoded in UTF-16 as the pair (U+D83D, U+DE08). A filename could end up with only the first of the two codes:

    >>> import os
    >>> open('devil\ud83d', 'w').close()
    >>> print(ascii(os.listdir('.')[0]))
    'devil\ud83d'
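The truncation scenario can be simulated without touching the filesystem. This sketch splits the name into 16-bit code units, drops the last unit, and shows that the result only encodes to bytes with "surrogatepass". It assumes UTF-8 as the filesystem encoding, and the variable names are illustrative.

```python
import struct

name = "devil\U0001F608"
raw = name.encode("utf-16-le")

# View the name as the 16-bit code units a Windows filesystem would store.
units = struct.unpack("<%dH" % (len(raw) // 2), raw)
print([hex(u) for u in units[-2:]])   # the surrogate pair for U+1F608

# Simulate a buggy length limit that cuts the pair in half.
truncated = "".join(chr(u) for u in units[:-1])
print(ascii(truncated))

# The lone surrogate is rejected by the strict handler ...
try:
    truncated.encode("utf-8", "strict")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ... but "surrogatepass" encodes it as a 3-byte sequence and round-trips it.
wtf8 = truncated.encode("utf-8", "surrogatepass")
print(strict_ok, wtf8)
```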
History
Date                 User      Action  Args
2021-03-18 01:43:40  vstinner  set     nosy: - vstinner
2021-03-17 19:50:33  eryksun   set     versions: - Python 3.6, Python 3.7
2021-03-05 10:49:06  eryksun   set     messages: + msg388151
2021-03-05 06:03:36  ericzolf  set     messages: + msg388147
2021-03-04 19:07:25  eryksun   link    issue43403 superseder
2021-03-04 14:27:28  eryksun   set     nosy: + paul.moore, tim.golden, ezio.melotti, eryksun, vstinner, zach.ware, steve.dower
                                       messages: + msg388090
                                       components: + Unicode, Windows
2021-03-04 06:31:25  ericzolf  create