Message 71991 - Python tracker



Author	dlitz
Recipients	HWJ, amaury.forgeotdarc, benjamin.peterson, dlitz, gvanrossum, pitrou, vstinner
Date	2008-08-26.18:15:14
SpamBayes Score	4.828587e-07
Marked as misclassified	No
Message-id	<1219774521.19.0.793476495037.issue3187@psf.upfronthosting.co.za>
In-reply-to

Content

I think Guido already understands this, but I haven't seen it stated
very clearly here:

** Different systems use different "things" to identify files. **

On Linux/ext3, all filenames are *octet strings* (i.e. bytes), and
*only* the following caveats apply:
- a filename/pathname cannot contain the zero-octet (b"\x00").
- a filename/pathname cannot be empty.
- a filename cannot contain the slash (b"/"); in a pathname, the slash
is used to separate filenames.
- the filenames b"." and b".." have special meanings; they cannot be
created, deleted, or renamed.
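
The rules above can be sketched as a small check on a single filename. This is only an illustration: the function names `is_valid_component` and `can_create` are made up for this example, and real filesystems add further limits (such as a maximum name length) that are not modelled here.

```python
def is_valid_component(name: bytes) -> bool:
    """Return True if `name` satisfies the octet-level rules listed above."""
    if not name:             # a filename cannot be empty
        return False
    if b"\x00" in name:      # the zero octet is never allowed
        return False
    if b"/" in name:         # the slash separates filenames in a pathname
        return False
    return True


def can_create(name: bytes) -> bool:
    """Valid names other than b"." and b".." may be created by a user."""
    return is_valid_component(name) and name not in (b".", b"..")


# Octets that are not valid UTF-8 are still a perfectly valid filename.
odd_name = b"\xff\xfe report caf\xe9"
odd_name_ok = is_valid_component(odd_name)
```

Note that nothing in the check refers to a character encoding; the rules are stated purely in terms of octets.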

All filenames that meet these criteria are valid, and calling them
"invalid" amounts to plugging one's ears and shouting "LA LA LA" while
imagining Unicode having pre-dated Unix.

It is sometimes convenient to imagine filenames on Linux/ext3 as
sequences of Unicode code points (where the encoding is specified by
LC_CTYPE---it's not necessarily UTF-8), but other times (e.g. in backup
tools that need to be robust in the face of mischievous users) it is an
unnecessary abstraction that introduces bugs.
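
To make the encoding problem concrete, here is a minimal sketch that decodes the same octets under two different encodings. The sample name and the helper `try_decode` are assumptions made up for this example; the two encodings stand in for whatever LC_CTYPE happens to select.

```python
def try_decode(name: bytes, encoding: str):
    """Decode `name`, returning None when the octets are not valid text."""
    try:
        return name.decode(encoding)
    except UnicodeDecodeError:
        return None


# A name as it might have been written under a Latin-1 locale.
raw = b"caf\xe9.txt"

latin1_view = try_decode(raw, "latin-1")  # decodes to text
utf8_view = try_decode(raw, "utf-8")      # the same octets are not valid UTF-8
```

The file exists either way. Only the attempt to view its name as Unicode text fails, which is exactly where a backup tool that insists on decoding would skip or mangle the file.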

On Windows/NTFS, the situation is entirely different: Filenames are
actually sequences of Unicode code points, and if you pretend they are
octet strings, Windows will happily invent phantom filenames for you
that will show up in the output of os.listdir(), but that will return
"File not found" if you try to open them for reading (if you open them
for writing, you risk clobbering other files that happen to have the
same names).
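
The collision risk can be simulated without touching Windows at all. The sketch below assumes that the octet-string view of a name comes from a lossy conversion to a legacy code page; `to_ansi` is a hypothetical helper that imitates this with Python's own codec and is not a Windows API.

```python
def to_ansi(name: str, codepage: str = "cp1252") -> bytes:
    """Imitate a lossy conversion: unrepresentable characters become b"?"."""
    return name.encode(codepage, errors="replace")


# Two distinct Unicode filenames that the code page cannot represent.
names = ["\u65e5\u672c.txt", "\u4e2d\u6587.txt"]
ansi_names = [to_ansi(n) for n in names]

# Both collapse to the same octet string, so the octet view can no longer
# tell the two files apart.
collided = ansi_names[0] == ansi_names[1]
```

Neither converted name identifies a real file, which matches the "phantom filename" behaviour described above, and a write through such a name could land on whichever file really is called that.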

To avoid bugs, it should be possible to work exclusively with filenames
in the platform's native representation.  It was possible in Python 2
(though you had to be very careful).  Ideally, Python 3 would recognize
and enforce the difference instead of trying to guess the translations;
"Explicit is better than implicit" and all that.
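
As an illustration of staying in the native representation on a POSIX system: in current Python 3 releases, os.listdir() returns bytes names when it is given a bytes path, and os.path.join() accepts bytes throughout. The helper `total_size` below is a hypothetical example, not something from the issue.

```python
import os


def total_size(directory: bytes) -> int:
    """Sum the sizes of regular files in `directory` without decoding names."""
    total = 0
    for name in os.listdir(directory):        # bytes in, bytes out
        path = os.path.join(directory, name)  # bytes joined with bytes
        if os.path.isfile(path):
            total += os.path.getsize(path)
    return total
```

Because no name is ever decoded, a file whose name is not valid in the current locale is counted like any other.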

History
Date	User	Action	Args
2008-08-26 18:15:21	dlitz	set	recipients: + dlitz, gvanrossum, amaury.forgeotdarc, pitrou, vstinner, benjamin.peterson, HWJ
2008-08-26 18:15:21	dlitz	set	messageid: <1219774521.19.0.793476495037.issue3187@psf.upfronthosting.co.za>
2008-08-26 18:15:20	dlitz	link	issue3187 messages
2008-08-26 18:15:14	dlitz	create