classification
Title: Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames
Type: Stage:
Components: Library (Lib), Unicode, Windows Versions: Python 3.2
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: loewis, r.david.murray, vstinner
Priority: normal Keywords: patch

Created on 2010-09-10 10:42 by vstinner, last changed 2010-09-13 19:11 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
listdir_windows_bytes.patch vstinner, 2010-09-10 21:30
Messages (17)
msg115995 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-10 10:42
In Python 3.2, mbcs encoding (default filesystem encoding on Windows) is now strict: raise an error on unencodable/undecodable characters/bytes. But os.listdir(b'.') encodes unencodable bytes as b'?'.

Example:

>>> os.mkdir('listdir')
>>> open('listdir\\xxx-\u0363', 'w').close()
>>> filename = os.listdir(b'listdir')[0]
>>> filename
b'xxx-?'
>>> open(filename, 'r').close()
IOError: [Errno 22] Invalid argument: 'xxx-?'

os.listdir(b'listdir') should raise an error (and not ignore the filename or replaces unencodable characters by b'?').

I think that we should list the directory using the wide character API (FindFirstFileW) but encode the filename using PyUnicode_EncodeFSDefault() if the directory name type is bytes, instead of using the ANSI API (FindFirstFileA).
msg115996 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-10 10:45
I found this bug while trying to find an unencodable filename for #9819 (TESTFN_UNDECODABLE).

Anyway, the bytes API should be avoided on Windows since Windows native filename type is unicode.
msg116002 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-10 11:06
> os.listdir(b'listdir') should raise an error (and not ignore 
> the filename or replaces unencodable characters by b'?').

To avoid the error, a solution is to support the PEP 383 on Windows (for the mbcs encoding). I opened a separated issue for that: #9821.

But support PEP 383 will not fix this issue because the current implementation of listdir(b'.') doesn't use the Python codec, but use raw bytes filenames (use the ANSI API).
msg116043 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-10 21:30
Patch:
 - Remove the bytes version of listdir(): reuse the unicode version but converts the filename to bytes using PyUnicode_EncodeFSDefault() if the directory name is not unicode
 - use Py_XDECREF(d) instead of Py_DECREF(d) at the end (because d=NULL on error)
 - use Py_CLEAR(d) instead of "Py_DECREF(d); d=NULL;"
 - remove "char namebuf[MAX_PATH+5]" buffer (use less stack memory)
msg116209 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-09-12 16:02
What do you gain with this patch? (i.e. what is its advantage?)
msg116210 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-12 16:18
> What do you gain with this patch? (i.e. what is its advantage?)

You know directly that os.listdir(bytes) is unable to encode the filename, instead of manipulate an invalid filename (b'?') and get the error later (when you use the filename: open, copy, delete, ... the file).

It's the same idea than str+bytes raises an error on Python3: get the error earlier instead of store invalid data and get the error to late.

Anywy, on Windows, it's not a good idea to manipulate bytes filenames. So it's also a way to encourage people to migrate their applications to unicode on Windows.
msg116220 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-09-12 17:12
> You know directly that os.listdir(bytes) is unable to encode the
> filename, instead of manipulate an invalid filename (b'?') and get
> the error later (when you use the filename: open, copy, delete, ...
> the file).

Ok. Then I'm -1 on the patch: you can't know whether the application
actually wants to open the file. Perhaps it only wants to display the
file names, or perhaps it only wants to open some of the files, or
only traverse into subdirectories.

For backwards compatibility, I recommend to leave things as they are.
FindFirst/NextFileA will also do some other interesting conversions,
such as the best-fit conversion (which the "mbcs" code doesn't do
(anymore?)).

Windows has explicit A and W versions, and Python has explicit A
and W types, so it's IMO best to pair them in the natural way
(even if that means code duplication).

> Anywy, on Windows, it's not a good idea to manipulate bytes
> filenames. So it's also a way to encourage people to migrate their
> applications to unicode on Windows.

Only if people run into the issue (which few people will). People
which *do* run into the issue will likely get an error either
way, which will teach them their lesson :-)
msg116228 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-12 20:35
> FindFirst/NextFileA will also do some other interesting conversions,
> such as the best-fit conversion (which the "mbcs" code doesn't do
> (anymore?)).

About mbcs, mbcs codec of Python 3.1 is like .encode('mbcs', 'replace') and 
.decode('mbcs', 'ignore') of Python 3.2 (see issue #850997). By default 
(strict error handler), it now raises errors on undecodable byte sequence and 
unencodable character, whereas Python 3.1 just ignores the error handler.

PyUnicode_EncodeFSDefault / PyUnicode_DecodeFSDefault uses the strict error 
handler.

I just added a note about mbcs in Doc/whatsnew/3.2.rst: r84750.
msg116230 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-12 21:14
It remembers me the discussion of the issue #3187. About unencodable filenames, 
Guido proposed to ignore them or to use errors="replace", and wrote "Failing 
the entire os.listdir() call is not acceptable". (... long discussion ...) And 
finally, os.listdir() ignored undecodable filenames on UNIX/BSD.

Then you introduced the genious PEP 383 (utf8b then renamed surrogateescape) 
and os.listdir() now raises an error if the PyUnicode_FromEncodedObject(v, 
Py_FileSystemDefaultEncoding, "surrogateescape") fails... which doesn't occur 
because of undecodable byte sequence, but for other reasons like a memory 
error.

About Windows, os.listdir(str) never fails, but my question is about 
os.listdir(bytes). Should os.listdir(bytes) returns invalid filenames (encoded 
with "mbcs+replace", filenames not usable to open, rename or delete the file) or 
just ignore them?

> Ok. Then I'm -1 on the patch: you can't know whether the application
> actually wants to open the file. Perhaps it only wants to display the
> file names, or perhaps it only wants to open some of the files, or
> only traverse into subdirectories.
>
> For backwards compatibility, I recommend to leave things as they are.
> FindFirst/NextFileA will also do some other interesting conversions,
> such as the best-fit conversion (which the "mbcs" code doesn't do
> (anymore?)).

"it only wants to open some of the files" is the typical reason for which I 
hate Python2 and its implicit conversion between bytes and characters: it 
works in most cases, but it fails "sometimes". The problem is to define (and 
explain) "sometimes".

The typical use case of listing a directory is a file chooser. On Windows using 
the bytes API, it works in most cases, but it fails if the user picks the 
"wrong" file (name with "?"). That's the problem I would like to address.

--

Ignore unencodable filenames solution is compatible with the "traverse into 
subdirectories" case. And it does also keep backward compatibility (except 
that unencodable files are hidden, which is a least problem I think).

--

I proposed to raise an error on unencodable filename. I changed my mind after 
reading your answer and the discussion on #3187. My patch breaks compatibility 
and users don't bother to unencodable filenames. Eg. glob("*.mp3") should not 
fail if the directory contains a temporary unencodable filename ("xxx.tmp").
msg116232 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-12 21:20
> FindFirst/NextFileA will also do some other interesting conversions,
> such as the best-fit conversion (which the "mbcs" code doesn't do
> (anymore?)).

If we choose to keep this behaviour, I will have to revert my commit on mbcs 
codec to be consistent with os.listdir(). Or at least patch 
PyUnicode_EncodeFSDefault and os.fsencode() (use replace error handler) and 
PyUnicode_DecodeFSDefault and os.fsdecode() (use igrore error handler).
msg116233 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-09-12 21:30
> About Windows, os.listdir(str) never fails, but my question is about 
> os.listdir(bytes). Should os.listdir(bytes) returns invalid filenames (encoded 
> with "mbcs+replace", filenames not usable to open, rename or delete the file) or 
> just ignore them?

I see nothing wrong with returning incorrect file names.

> "it only wants to open some of the files" is the typical reason for which I 
> hate Python2 and its implicit conversion between bytes and characters: it 
> works in most cases, but it fails "sometimes". The problem is to define (and 
> explain) "sometimes".

Notice that this doesn't change with the patch. It will *still* work
sometimes, and fail sometimes. In fact, for most users and most
applications, it will never fail - *even with your patch applied*.

> Ignore unencodable filenames solution is compatible with the "traverse into 
> subdirectories" case. And it does also keep backward compatibility (except 
> that unencodable files are hidden, which is a least problem I think).

I fail to see why removing incorrect file names from the result list is
any better than keeping them. The result list will be incorrect either way.

In one case (files skipped), the user will not see the file in the
selection dialog, even though he knows its there and explorer shows it
just fine. So he thinks there must be a bug.

In the other case, it displays a non-sensical file name. Again, the user
thinks there is a bug - plus if you click on the file, you get some
error message (hopefully, the application will catch the exception -
the directory may also have changed in-between, so a missing file
error must be recovered from).

So it's a user-visible bug in either case, but if the incorrect file
name is included, it's slightly more obvious that something is wrong.
msg116235 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-09-12 21:44
> If we choose to keep this behaviour, I will have to revert my commit on mbcs 
> codec to be consistent with os.listdir(). Or at least patch 
> PyUnicode_EncodeFSDefault and os.fsencode() (use replace error handler) and 
> PyUnicode_DecodeFSDefault and os.fsdecode() (use igrore error handler).

I think trying to emulate, in Python, what the *A functions do is
futile. IIUC, disables WC_NO_BEST_FIT_CHARS, and may do other stuff
which apparently is undocumented.

However, I fail to see the relationship to this issue. Having the MBCS
codec support strict mode is a good thing.
msg116295 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-13 11:50
> I fail to see why removing incorrect file names from the result
> list is any better than keeping them. The result list will 
> be incorrect either way.

It depends if you focus on displaying the content of the directory, or on processing files and directories. If you focus on display, yes, missing files an be seen as a bug. But if you walk into directories (use cases: os.walk(), replace a text pattern in all files (~os.glob), ...), and the function raises an error (because a directory or a file name is invalid) is worse. I mean the user have to rename all unencodable names, or the devfeloper have to patch its application to catch IOError and ignore the specific IOError(22).

If listdir() ignores unencodable names, os.walk() doesn't fail, but it misses some subdirectories and files.

--

Another (worse?) idea: deny bytes path for os.listdir(), but I suppose that we will not like the idea ;-)
msg116296 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-13 12:09
> I think trying to emulate, in Python, what the *A functions 
> do is futile.

My problem is that some functions will use mbcs in strict mode (functions using PyUnicode_EncodeFSDefault): raise UnicodeEncodeError, and other will use mbcs in replace mode (functions using Windows functions in ANSI mode): raise IOError (or other error depending on the function). It's inconsistent. We should try to keep the same behaviour for all functions.

Examples of functions using (indirectly) PyUnicode_EncodeFSDefault to encode unicode filenames: bz2.BZ2File() and _ssl.SSLContext.load_cert_chain().
msg116299 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-09-13 12:27
After the decision to ignore undecodable file names in os.listdir but before PEP 383 there was a long discussion on python-dev (in which I was a participant) about how horrible just ignoring the undecodable filenames was.  This applies *especially* to the os.walk case, where some files would be mysteriously skipped and it wouldn't be obvious why.
Or even obvious that they'd been skipped, in some cases.  The biggest issue was that the developer would likely never see the problem since the bulk of developers don't encounter encoding issues, so it would be the poor end user who would be confronted with the mystery, with no clues as to the cause or solution.

The conclusion of that particular thread was that Guido approved adding warning messages for filenames that were undecodable, but otherwise leaving os.listdir unchanged.  Fortunately Martin came up with PEP 383, which solved the underlying problem in a better way.

So, I don't think that skipping the undecodable names is good, unless you generate a warning.  In that thread I started out advocating raising an error, but in this case as Martin points out that would be a backward compatibility issue.  Returning the munged filenames and having the error show up when the broken filename is used seems OK to me, even if imperfect.  When the user sees the problem, they report it to the developer as a bug, who hopefully changes his code to use strings.

Adding warning messages would probably be useless at best and annoying at worst on Windows.  Maybe we could add a pseudo deprecation warning (ie: aimed at developers, silent by default) that says "don't use listdir with bytes on windows"?
msg116300 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-09-13 12:36
But in the case of BZ2File and ssl.SSLContext.load_cert_chain(), isn't it the case that they are trying to open the files?  So producing an early error about the decoding problem makes sense.  Are there any functions other than listdir where the decoded filenames are not necessarily immediately used to manipulate the files?
msg116340 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-13 19:11
- ignore unencodable filenames is not a good idea
- raise an error on unencodable filenames breaks backward compatibility
- I don't think that emit a warning will change anything

Even if I don't like mbcs+replace (current behaviour of os.listdir(bytes)), I now agree that it is the less worst solution. I close this issue as invalid.
History
Date User Action Args
2010-09-13 19:11:21vstinnersetstatus: open -> closed
resolution: not a bug
messages: + msg116340
2010-09-13 12:36:08r.david.murraysetmessages: + msg116300
2010-09-13 12:27:49r.david.murraysetnosy: + r.david.murray
messages: + msg116299
2010-09-13 12:09:43vstinnersetmessages: + msg116296
2010-09-13 11:50:45vstinnersetmessages: + msg116295
2010-09-12 21:45:00loewissetmessages: + msg116235
2010-09-12 21:30:18loewissetmessages: + msg116233
title: Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames -> Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames
2010-09-12 21:20:08vstinnersetmessages: + msg116232
2010-09-12 21:14:54vstinnersetmessages: + msg116230
2010-09-12 20:35:24vstinnersetmessages: + msg116228
title: Windows : os.listdir(b'.') doesn't raise an error for unencodable filenames -> Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames
2010-09-12 17:12:18loewissetmessages: + msg116220
title: Windows : os.listdir(b'.') doesn't raise an error for unencodable filenames -> Windows : os.listdir(b'.') doesn't raise an error for unencodable filenames
2010-09-12 16:18:47vstinnersetmessages: + msg116210
2010-09-12 16:02:10loewissetmessages: + msg116209
2010-09-10 21:30:21vstinnersetfiles: + listdir_windows_bytes.patch
keywords: + patch
messages: + msg116043
2010-09-10 11:06:24vstinnersetmessages: + msg116002
2010-09-10 10:45:13vstinnersetmessages: + msg115996
2010-09-10 10:42:58vstinnercreate