Issue 43403: Misleading statement about bytes not being able to represent windows filenames in documentation

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/87569

classification

Title:	Misleading statement about bytes not being able to represent windows filenames in documentation
Type:
Stage:	resolved
Components:	Documentation
Versions:	Python 3.10, Python 3.9, Python 3.8, Python 3.7

process

Status:	closed
Resolution:	duplicate
Dependencies:
Superseder:	os.path states that bytes can't represent all MBCS paths under Windows (issue 43395)
Assigned To:	docs@python
Nosy List:	docs@python, eryksun, gregory.p.smith, steve.dower
Priority:	normal
Keywords:

Created on 2021-03-04 18:43 by gregory.p.smith, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (2)
msg388122	Author: Gregory P. Smith (gregory.p.smith)	Date: 2021-03-04 18:43
As noted in the comment on https://github.com/rdiff-backup/rdiff-backup/issues/540#issuecomment-789485896, the Python documentation in https://docs.python.org/3/library/os.path.html makes an odd claim that bytes cannot represent all file names on Windows. That doesn't make sense. bytes can by definition represent everything.

"""Vice versa, using bytes objects cannot represent all file names on Windows (in the standard mbcs encoding), hence Windows applications should use string objects to access all files."""

Could we get this clarified and corrected to state what the actual technical limitation is? Every OS is going to reject some bytes objects as a pathname for containing byte sequences that are invalid for its filesystem (e.g. I doubt any OS allows a null byte, b'\0'). But let's not claim that bytes cannot represent everything on a filesystem with an encoding.
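
For illustration, the limitation the quoted sentence appears to describe can be reproduced on any platform by encoding a name with a fixed legacy code page. This is only a sketch: `encodable` is a hypothetical helper, and `cp1252` and `cp932` are used as stand-ins for whatever code page `mbcs` would resolve to on a given Windows system (the `mbcs` codec itself exists only on Windows).

```python
def encodable(name: str, codepage: str = "cp1252") -> bool:
    """Return True if `name` has a byte representation in `codepage`.

    Hypothetical helper for illustration; it only checks the codec,
    it does not touch the filesystem.
    """
    try:
        name.encode(codepage)  # strict error handler by default
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for name in ("report.txt", "日本語.txt"):
        for codepage in ("cp1252", "cp932", "utf-8"):
            print(f"{name!r} in {codepage}: {encodable(name, codepage)}")
```

The bytes type itself is not the constraint here; the constraint is the encoding chosen to map between bytes and the names the filesystem stores.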
msg388124	Author: Eryk Sun (eryksun)	Date: 2021-03-04 19:26
> let's not claim that bytes cannot represent everything on a filesystem
> with an encoding.

Gregory, before the filesystem encoding was changed to UTF-8 in Python 3.6, the [A]NSI file API (e.g. CreateFileA) was used for bytes paths and the [W]ide-character file API (e.g. CreateFileW) was used for str paths. The ANSI API is a set of wrapper functions that automatically translate strings between the ANSI code page of the current process and the system's native UTF-16 encoding, before and after calling the wide-character function (or a common internal function).

Starting with Windows 10, the ANSI and OEM code pages of a process are finally allowed to be UTF-8 (code page 65001), but it's still considered beta and barely used. Usually the ANSI code page is a legacy single-byte or double-byte code page such as 1252 (Western Europe) or 932 (Japanese).

Natively, Windows is UTF-16, and native Windows filesystems store filenames on disk using 16-bit characters. The system doesn't check for valid Unicode, so lone surrogate codes are allowed. This is sometimes called a "Wobbly" format. In Python, encoding such names requires the "surrogatepass" error handler.
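
A minimal sketch of the "surrogatepass" point, runnable on any platform. The file name and the helpers `to_bytes` and `from_bytes` are made up for illustration; only the codec names and the error handler are standard Python.

```python
# A made-up file name containing a lone (unpaired) high surrogate,
# which Windows filesystems accept but which is not valid Unicode text.
NAME = "bad\ud800name.txt"


def to_bytes(name: str) -> bytes:
    """Encode a name to UTF-8, letting lone surrogates through."""
    return name.encode("utf-8", "surrogatepass")


def from_bytes(raw: bytes) -> str:
    """Inverse of to_bytes()."""
    return raw.decode("utf-8", "surrogatepass")


if __name__ == "__main__":
    try:
        NAME.encode("utf-8")  # strict handler rejects the lone surrogate
    except UnicodeEncodeError as exc:
        print("strict encode failed:", exc.reason)

    raw = to_bytes(NAME)
    print(raw)
    print(from_bytes(raw) == NAME)

    # The same name as 16-bit code units, as stored natively.
    print(NAME.encode("utf-16-le", "surrogatepass"))
```

With this handler the str-to-bytes mapping round-trips even for names that are not valid Unicode, which a strict UTF-8 or legacy code page encoding cannot do.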

History
Date	User	Action	Args
2022-04-11 14:59:42	admin	set	github: 87569
2021-03-04 19:26:58	eryksun	set	messages: + msg388124
2021-03-04 19:07:25	eryksun	set	status: open -> closed; superseder: os.path states that bytes can't represent all MBCS paths under Windows; resolution: duplicate; stage: needs patch -> resolved
2021-03-04 18:44:56	ammar2	set	nosy: + eryksun
2021-03-04 18:43:56	gregory.p.smith	create