Message 116230 - Python tracker

Author	vstinner
Recipients	loewis, vstinner
Date	2010-09-12.21:14:53
SpamBayes Score	2.0972113e-13
Marked as misclassified	No
Message-id	<201009122314.46191.victor.stinner@haypocalc.com>
In-reply-to	<4C8D09F0.7020901@v.loewis.de>

Content

This reminds me of the discussion on issue #3187. About unencodable filenames, 
Guido proposed to ignore them or to use errors="replace", and wrote "Failing 
the entire os.listdir() call is not acceptable". (... long discussion ...) And 
finally, os.listdir() ignored undecodable filenames on UNIX/BSD.

Then you introduced the ingenious PEP 383 (utf8b, later renamed surrogateescape), 
and os.listdir() now raises an error if PyUnicode_FromEncodedObject(v, 
Py_FileSystemDefaultEncoding, "surrogateescape") fails... which doesn't happen 
because of an undecodable byte sequence, but for other reasons like a memory 
error.
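
To illustrate why surrogateescape cannot fail on the byte sequence itself, here is a minimal sketch at the Python level. It uses UTF-8 as the filesystem encoding and a made-up filename; both are assumptions for the example, not taken from the patch.

```python
# Hypothetical filename: 0xff can never appear in valid UTF-8.
raw = b"caf\xff.txt"

# Strict decoding rejects the invalid byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate
# (0xff -> U+DCFF), so no byte sequence makes the decoding fail.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = name.encode("utf-8", "surrogateescape")
```

The decoded name is lossless, so it stays usable to open, rename or delete the file.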

On Windows, os.listdir(str) never fails, but my question is about 
os.listdir(bytes). Should os.listdir(bytes) return invalid filenames (encoded 
with "mbcs+replace", filenames not usable to open, rename or delete the file) or 
just ignore them?
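
A rough sketch of what "mbcs+replace" does to such a name. The "mbcs" codec only exists on Windows, so cp1252 is used here as a stand-in ANSI code page; that choice, the helper name to_ansi() and the sample filenames are assumptions for illustration only.

```python
# Stand-in for the Windows ANSI code page ("mbcs" is Windows-only).
ANSI = "cp1252"

def to_ansi(name, encoding=ANSI):
    """Encode a filename with errors="replace" and report whether the
    result still identifies the original file (hypothetical helper)."""
    encoded = name.encode(encoding, "replace")
    lossless = encoded.decode(encoding) == name
    return encoded, lossless

# An ASCII name survives the conversion unchanged.
plain, plain_ok = to_ansi("report.txt")

# A Japanese name has no cp1252 representation: every character
# becomes "?", and the resulting bytes no longer name the file.
lossy, lossy_ok = to_ansi("\u65e5\u672c\u8a9e.txt")
```

The lossy name looks valid to the caller, which is why the failure only shows up later, when the application tries to use it.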

> Ok. Then I'm -1 on the patch: you can't know whether the application
> actually wants to open the file. Perhaps it only wants to display the
> file names, or perhaps it only wants to open some of the files, or
> only traverse into subdirectories.
>
> For backwards compatibility, I recommend to leave things as they are.
> FindFirst/NextFileA will also do some other interesting conversions,
> such as the best-fit conversion (which the "mbcs" code doesn't do
> (anymore?)).

"it only wants to open some of the files" is the typical reason why I 
hate Python 2 and its implicit conversion between bytes and characters: it 
works in most cases, but it fails "sometimes". The problem is to define (and 
explain) "sometimes".

The typical use case of listing a directory is a file chooser. On Windows using 
the bytes API, it works in most cases, but it fails if the user picks the 
"wrong" file (a name containing "?" in place of unencodable characters). That's 
the problem I would like to address.

--

The "ignore unencodable filenames" solution is compatible with the "traverse 
into subdirectories" case. It also keeps backward compatibility (except 
that unencodable files are hidden, which I think is a lesser problem).

--

I proposed raising an error on unencodable filenames. I changed my mind after 
reading your answer and the discussion on #3187. My patch breaks compatibility, 
and users don't care about unencodable filenames. E.g. glob("*.mp3") should not 
fail if the directory contains a temporary unencodable filename ("xxx.tmp").
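
The difference between the two policies can be sketched in pure Python. This is not the patch itself: list_bytes(), its on_error parameter, the cp1252 stand-in for "mbcs" and the directory contents are all hypothetical, chosen only to mimic the glob("*.mp3") case.

```python
import fnmatch

def list_bytes(names, encoding="cp1252", on_error="skip"):
    """Simulate os.listdir(bytes) over a list of str names
    (hypothetical helper, not the real implementation)."""
    result = []
    for name in names:
        try:
            result.append(name.encode(encoding))  # strict encoding
        except UnicodeEncodeError:
            if on_error == "raise":
                raise  # one bad name fails the whole listing
            # on_error == "skip": hide the unencodable name
    return result

# Two ordinary files plus a temporary file with an unencodable name.
directory = ["song.mp3", "\u4e34\u65f6.tmp", "other.mp3"]

# With "skip", the glob-like filter is unaffected by the .tmp file.
matches = fnmatch.filter(list_bytes(directory), b"*.mp3")

# With "raise", the same query fails although no .mp3 is involved.
try:
    list_bytes(directory, on_error="raise")
    raised = False
except UnicodeEncodeError:
    raised = True
```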

History
Date	User	Action	Args
2010-09-12 21:14:56	vstinner	set	recipients: + vstinner, loewis
2010-09-12 21:14:54	vstinner	link	issue9820 messages
2010-09-12 21:14:53	vstinner	create