classification
Title: Misleading statement about bytes not being able to represent windows filenames in documentation
Type: Stage: resolved
Components: Documentation Versions: Python 3.10, Python 3.9, Python 3.8, Python 3.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder: os.path states that bytes can't represent all MBCS paths under Windows
View: 43395
Assigned To: docs@python Nosy List: docs@python, eryksun, gregory.p.smith, steve.dower
Priority: normal Keywords:

Created on 2021-03-04 18:43 by gregory.p.smith, last changed 2021-03-04 19:26 by eryksun. This issue is now closed.

Messages (2)
msg388122 - Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2021-03-04 18:43
As noted in the comment on https://github.com/rdiff-backup/rdiff-backup/issues/540#issuecomment-789485896

The Python documentation in https://docs.python.org/3/library/os.path.html makes an odd claim that bytes cannot represent all file names on Windows.  That doesn't make sense.  bytes can by definition represent everything.

"""Vice versa, using bytes objects cannot represent all file names on Windows (in the standard mbcs encoding), hence Windows applications should use string objects to access all files."""

Could we get this clarified and corrected to cover what any actual technical limitation is?

Every OS is going to reject some bytes objects as a pathname for containing byte sequences that are invalid for its filesystem (e.g. I doubt any OS allows null b'\0' characters).  But let's not claim that bytes cannot represent everything on a filesystem with an encoding.
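
A minimal sketch of the null-byte point, assuming a current CPython 3 release: the interpreter refuses such a path with ValueError before it reaches the OS, for bytes and str alike. The helper name `rejects_null` and the sample paths are made up for illustration.

```python
import os

def rejects_null(path):
    """Return True if os.stat() refuses the path because of an embedded null."""
    try:
        os.stat(path)
    except ValueError:
        # Raised by CPython's path conversion for an embedded null character.
        return True
    except OSError:
        # The path was accepted as a path, but the file does not exist
        # or cannot be accessed.
        return False
    return False

print(rejects_null(b"a\0b"))  # bytes path with an embedded null
print(rejects_null("a\0b"))   # str path with an embedded null
```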
msg388124 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-03-04 19:26
> let's not claim that bytes cannot represent everything on a filesystem
> with an encoding.

Gregory, before the filesystem encoding was changed to UTF-8 in Python 3.6, the [A]NSI file API (e.g. CreateFileA) was used for bytes paths and the [W]ide-character file API (e.g. CreateFileW) was used for str paths. The ANSI API is a set of wrapper functions that automatically translate strings between the ANSI code page of the current process and the system's native UTF-16 encoding, before and after calling the wide-character function (or a common internal function). Starting with Windows 10, the ANSI and OEM code pages of a process are finally allowed to be UTF-8 (code page 65001), but it's still considered beta and barely used. Usually the ANSI code page is a legacy single-byte or double-byte code page such as 1252 (Western Europe) or 932 (Japanese). A bytes path passed through the ANSI API can therefore only name files whose characters exist in that code page.
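
To illustrate the code-page limitation, here is a rough sketch that simulates the translation with Python's own codecs rather than calling any Windows API. The helper `representable` and the sample filenames are hypothetical; the real ANSI API may also apply best-fit mappings that this sketch does not model.

```python
def representable(filename: str, code_page: str) -> bool:
    """Return True if filename survives an encode/decode round trip
    through the given code page (a stand-in for the ANSI API translation)."""
    try:
        return filename.encode(code_page).decode(code_page) == filename
    except UnicodeError:
        return False

# Sample names chosen for illustration only.
names = ["naïve.txt", "日本語.txt"]
for name in names:
    for code_page in ("cp1252", "cp932", "utf-8"):
        print(name, code_page, representable(name, code_page))
```

Under these assumptions, a name is reachable through a bytes path only when the process code page can encode every character in it, which is why the choice of code page decides which files are accessible.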

Natively, Windows is UTF-16, and native Windows filesystems store filenames on disk using 16-bit characters. The system doesn't check for valid Unicode, so lone surrogate codes are allowed. This is sometimes called a "Wobbly" format. In Python it requires the "surrogatepass" error handler.
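
A small sketch of the lone-surrogate case, using only Python's built-in codecs. The filename is made up, and nothing here touches a real filesystem.

```python
# A hypothetical filename containing a lone high surrogate (U+D800).
name = "bad" + "\ud800" + ".txt"

# The default "strict" handler refuses to encode a lone surrogate.
try:
    name.encode("utf-16-le")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" lets the code through in both directions.
raw_utf16 = name.encode("utf-16-le", "surrogatepass")
round_trip = raw_utf16.decode("utf-16-le", "surrogatepass")

# The same handler works with UTF-8, producing a 3-byte sequence
# for the surrogate that strict UTF-8 would reject.
raw_utf8 = name.encode("utf-8", "surrogatepass")

print(strict_ok, round_trip == name, raw_utf8)
```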
History
Date User Action Args
2021-03-04 19:26:58  eryksun  set  messages: + msg388124
2021-03-04 19:07:25  eryksun  set  status: open -> closed
                                   superseder: os.path states that bytes can't represent all MBCS paths under Windows
                                   resolution: duplicate
                                   stage: needs patch -> resolved
2021-03-04 18:44:56  ammar2  set  nosy: + eryksun
2021-03-04 18:43:56  gregory.p.smith  create