Author eryksun
Recipients docs@python, ericzolf, eryksun, ezio.melotti, paul.moore, steve.dower, tim.golden, vstinner, zach.ware
Date 2021-03-05.10:49:04
Message-id <1614941346.23.0.754010393276.issue43395@roundup.psfhosted.org>
Content
>  instead of the stated 'surrogatepass'

In Python 3.6 and above on Windows, you can check this as follows:

    >>> import sys
    >>> sys.getfilesystemencoding()
    'utf-8'
    >>> sys.getfilesystemencodeerrors()
    'surrogatepass'

In Python 3.5 and earlier:

    >>> sys.getfilesystemencoding()
    'mbcs'

In 3.5, the error handler used by fsencode() and fsdecode() was hard-coded as 'strict' for the 'mbcs' encoding and as 'surrogateescape' otherwise.
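
The 3.5 behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual os module source; `legacy_fs_errors` is a hypothetical helper name introduced only for this example.

```python
import sys

def legacy_fs_errors(encoding):
    """Hypothetical helper: the error handler 3.5 paired with an encoding."""
    # 'mbcs' was handled strictly; every other encoding used 'surrogateescape'.
    return 'strict' if encoding == 'mbcs' else 'surrogateescape'

# In 3.6+, the handler is queried directly instead of inferred from the encoding.
current_errors = sys.getfilesystemencodeerrors()

print(legacy_fs_errors('mbcs'))
print(legacy_fs_errors('utf-8'))
print(current_errors)
```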

> https://docs.python.org/3/library/os.html#os.fsencode
> https://docs.python.org/3/library/os.html#os.fsdecode

The above documentation needs to be updated to reference sys.getfilesystemencodeerrors(), as do the docstrings:

    >>> import os, textwrap
    >>> print(textwrap.dedent(os.fsencode.__doc__))

    Encode filename to the filesystem encoding with 'surrogateescape' error
    handler, return bytes unchanged. On Windows, use 'strict' error handler if
    the file system encoding is 'mbcs' (which is the default encoding).

    >>> print(textwrap.dedent(os.fsdecode.__doc__))

    Decode filename from the filesystem encoding with 'surrogateescape' error
    handler, return str unchanged. On Windows, use 'strict' error handler if
    the file system encoding is 'mbcs' (which is the default encoding).

> https://docs.python.org/3/library/os.html#file-names-command-line-arguments-and-environment-variables

This should be rewritten to link to sys.getfilesystemencodeerrors(). I'm fine with only discussing the use of "surrogateescape", which is a significant concern on POSIX systems, where it is easy and common for filenames to be created with an arbitrary encoding.
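
As a rough illustration of why "surrogateescape" matters on POSIX, the sketch below round-trips a filename whose bytes are not valid UTF-8. It calls the codec directly with an assumed 'utf-8' filesystem encoding so that it behaves the same on any platform, and the byte string is a made-up example.

```python
# A Latin-1 encoded name ("café") whose last byte is not valid UTF-8.
raw_name = b'caf\xe9'

# 'surrogateescape' maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
decoded = raw_name.decode('utf-8', 'surrogateescape')
print(ascii(decoded))

# Encoding with the same handler restores the original bytes exactly.
restored = decoded.encode('utf-8', 'surrogateescape')
print(restored == raw_name)
```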

I don't know if the use of "surrogatepass" on Windows warrants discussion. The error handler is rarely needed because the filesystem is Unicode, and a user is unlikely to create a filename with an unpaired surrogate code.

That said, before Windows 10, the legacy console allowed copying half of a surrogate pair to the clipboard, and a program could have a bug that nulls the second surrogate code in the pair (e.g. when limiting the length of a filename). Anyway, it's technically possible, so we support it. For example, "😈" (U+0001F608) is encoded in UTF-16 as the pair (U+D83D, U+DE08). A filename could end up with only the first of the two codes:

    >>> open('devil\ud83d', 'w').close()
    >>> print(ascii(os.listdir('.')[0]))
    'devil\ud83d'
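
For illustration, the surrogate pair arithmetic and the effect of "surrogatepass" can be sketched as below. The 'utf-8' encoding is passed explicitly, as an assumption, so the snippet does not depend on the platform's filesystem encoding, and `to_surrogate_pair` is a hypothetical helper name.

```python
def to_surrogate_pair(code_point):
    """Hypothetical helper: split a non-BMP code point into UTF-16 surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F608)
print(hex(high), hex(low))

name = 'devil\ud83d'  # lone high surrogate, as in the filename above

# The strict handler refuses to encode an unpaired surrogate.
try:
    name.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'surrogatepass' encodes it anyway and decodes it back unchanged.
encoded = name.encode('utf-8', 'surrogatepass')
round_trip = encoded.decode('utf-8', 'surrogatepass')
print(strict_ok, encoded, round_trip == name)
```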