Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames #54029

vstinner · 2010-09-10T10:42:58Z

BPO	9820
Nosy	@loewis, @vstinner, @bitdancer
Files	listdir_windows_bytes.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2010-09-13.19:11:21.703>
created_at = <Date 2010-09-10.10:42:58.310>
labels = ['invalid', 'library', 'expert-unicode', 'OS-windows']
title = "Windows : os.listdir(b'.') doesn't raise an\terrorfor unencodablefilenames"
updated_at = <Date 2010-09-13.19:11:21.702>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2010-09-13.19:11:21.702>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2010-09-13.19:11:21.703>
closer = 'vstinner'
components = ['Library (Lib)', 'Unicode', 'Windows']
creation = <Date 2010-09-10.10:42:58.310>
creator = 'vstinner'
dependencies = []
files = ['18836']
hgrepos = []
issue_num = 9820
keywords = ['patch']
message_count = 17.0
messages = ['115995', '115996', '116002', '116043', '116209', '116210', '116220', '116228', '116230', '116232', '116233', '116235', '116295', '116296', '116299', '116300', '116340']
nosy_count = 3.0
nosy_names = ['loewis', 'vstinner', 'r.david.murray']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue9820'
versions = ['Python 3.2']

vstinner · 2010-09-10T10:42:58Z

In Python 3.2, mbcs encoding (default filesystem encoding on Windows) is now strict: raise an error on unencodable/undecodable characters/bytes. But os.listdir(b'.') encodes unencodable bytes as b'?'.

Example:

>>> os.mkdir('listdir')
>>> open('listdir\\xxx-\u0363', 'w').close()
>>> filename = os.listdir(b'listdir')[0]
>>> filename
b'xxx-?'
>>> open(filename, 'r').close()
IOError: [Errno 22] Invalid argument: 'xxx-?'

os.listdir(b'listdir') should raise an error (and not ignore the filename or replaces unencodable characters by b'?').

I think that we should list the directory using the wide character API (FindFirstFileW) but encode the filename using PyUnicode_EncodeFSDefault() if the directory name type is bytes, instead of using the ANSI API (FindFirstFileA).

vstinner · 2010-09-10T10:45:13Z

I found this bug while trying to find an unencodable filename for bpo-9819 (TESTFN_UNDECODABLE).

Anyway, the bytes API should be avoided on Windows since Windows native filename type is unicode.

vstinner · 2010-09-10T11:06:25Z

os.listdir(b'listdir') should raise an error (and not ignore
the filename or replaces unencodable characters by b'?').

To avoid the error, a solution is to support the PEP-383 on Windows (for the mbcs encoding). I opened a separated issue for that: bpo-9821.

But support PEP-383 will not fix this issue because the current implementation of listdir(b'.') doesn't use the Python codec, but use raw bytes filenames (use the ANSI API).

vstinner · 2010-09-10T21:30:20Z

Patch:

Remove the bytes version of listdir(): reuse the unicode version but converts the filename to bytes using PyUnicode_EncodeFSDefault() if the directory name is not unicode
use Py_XDECREF(d) instead of Py_DECREF(d) at the end (because d=NULL on error)
use Py_CLEAR(d) instead of "Py_DECREF(d); d=NULL;"
remove "char namebuf[MAX_PATH+5]" buffer (use less stack memory)

loewis · 2010-09-12T16:02:11Z

What do you gain with this patch? (i.e. what is its advantage?)

vstinner · 2010-09-12T16:18:47Z

What do you gain with this patch? (i.e. what is its advantage?)

You know directly that os.listdir(bytes) is unable to encode the filename, instead of manipulate an invalid filename (b'?') and get the error later (when you use the filename: open, copy, delete, ... the file).

It's the same idea than str+bytes raises an error on Python3: get the error earlier instead of store invalid data and get the error to late.

Anywy, on Windows, it's not a good idea to manipulate bytes filenames. So it's also a way to encourage people to migrate their applications to unicode on Windows.

loewis · 2010-09-12T17:12:18Z

You know directly that os.listdir(bytes) is unable to encode the
filename, instead of manipulate an invalid filename (b'?') and get
the error later (when you use the filename: open, copy, delete, ...
the file).

Ok. Then I'm -1 on the patch: you can't know whether the application
actually wants to open the file. Perhaps it only wants to display the
file names, or perhaps it only wants to open some of the files, or
only traverse into subdirectories.

For backwards compatibility, I recommend to leave things as they are.
FindFirst/NextFileA will also do some other interesting conversions,
such as the best-fit conversion (which the "mbcs" code doesn't do
(anymore?)).

Windows has explicit A and W versions, and Python has explicit A
and W types, so it's IMO best to pair them in the natural way
(even if that means code duplication).

Anywy, on Windows, it's not a good idea to manipulate bytes
filenames. So it's also a way to encourage people to migrate their
applications to unicode on Windows.

Only if people run into the issue (which few people will). People
which *do* run into the issue will likely get an error either
way, which will teach them their lesson :-)

vstinner · 2010-09-12T20:35:24Z

FindFirst/NextFileA will also do some other interesting conversions,
such as the best-fit conversion (which the "mbcs" code doesn't do
(anymore?)).

About mbcs, mbcs codec of Python 3.1 is like .encode('mbcs', 'replace') and
.decode('mbcs', 'ignore') of Python 3.2 (see issue bpo-850997). By default
(strict error handler), it now raises errors on undecodable byte sequence and
unencodable character, whereas Python 3.1 just ignores the error handler.

PyUnicode_EncodeFSDefault / PyUnicode_DecodeFSDefault uses the strict error
handler.

I just added a note about mbcs in Doc/whatsnew/3.2.rst: r84750.

vstinner · 2010-09-12T21:14:53Z

It remembers me the discussion of the issue bpo-3187. About unencodable filenames,
Guido proposed to ignore them or to use errors="replace", and wrote "Failing
the entire os.listdir() call is not acceptable". (... long discussion ...) And
finally, os.listdir() ignored undecodable filenames on UNIX/BSD.

Then you introduced the genious PEP-383 (utf8b then renamed surrogateescape)
and os.listdir() now raises an error if the PyUnicode_FromEncodedObject(v,
Py_FileSystemDefaultEncoding, "surrogateescape") fails... which doesn't occur
because of undecodable byte sequence, but for other reasons like a memory
error.

About Windows, os.listdir(str) never fails, but my question is about
os.listdir(bytes). Should os.listdir(bytes) returns invalid filenames (encoded
with "mbcs+replace", filenames not usable to open, rename or delete the file) or
just ignore them?

Ok. Then I'm -1 on the patch: you can't know whether the application
actually wants to open the file. Perhaps it only wants to display the
file names, or perhaps it only wants to open some of the files, or
only traverse into subdirectories.

For backwards compatibility, I recommend to leave things as they are.
FindFirst/NextFileA will also do some other interesting conversions,
such as the best-fit conversion (which the "mbcs" code doesn't do
(anymore?)).

"it only wants to open some of the files" is the typical reason for which I
hate Python2 and its implicit conversion between bytes and characters: it
works in most cases, but it fails "sometimes". The problem is to define (and
explain) "sometimes".

The typical use case of listing a directory is a file chooser. On Windows using
the bytes API, it works in most cases, but it fails if the user picks the
"wrong" file (name with "?"). That's the problem I would like to address.

--

Ignore unencodable filenames solution is compatible with the "traverse into
subdirectories" case. And it does also keep backward compatibility (except
that unencodable files are hidden, which is a least problem I think).

--

I proposed to raise an error on unencodable filename. I changed my mind after
reading your answer and the discussion on bpo-3187. My patch breaks compatibility
and users don't bother to unencodable filenames. Eg. glob("*.mp3") should not
fail if the directory contains a temporary unencodable filename ("xxx.tmp").

vstinner · 2010-09-12T21:20:09Z

FindFirst/NextFileA will also do some other interesting conversions,
such as the best-fit conversion (which the "mbcs" code doesn't do
(anymore?)).

If we choose to keep this behaviour, I will have to revert my commit on mbcs
codec to be consistent with os.listdir(). Or at least patch
PyUnicode_EncodeFSDefault and os.fsencode() (use replace error handler) and
PyUnicode_DecodeFSDefault and os.fsdecode() (use igrore error handler).

loewis · 2010-09-12T21:30:18Z

About Windows, os.listdir(str) never fails, but my question is about
os.listdir(bytes). Should os.listdir(bytes) returns invalid filenames (encoded
with "mbcs+replace", filenames not usable to open, rename or delete the file) or
just ignore them?

I see nothing wrong with returning incorrect file names.

"it only wants to open some of the files" is the typical reason for which I
hate Python2 and its implicit conversion between bytes and characters: it
works in most cases, but it fails "sometimes". The problem is to define (and
explain) "sometimes".

Notice that this doesn't change with the patch. It will *still* work
sometimes, and fail sometimes. In fact, for most users and most
applications, it will never fail - *even with your patch applied*.

Ignore unencodable filenames solution is compatible with the "traverse into
subdirectories" case. And it does also keep backward compatibility (except
that unencodable files are hidden, which is a least problem I think).

I fail to see why removing incorrect file names from the result list is
any better than keeping them. The result list will be incorrect either way.

In one case (files skipped), the user will not see the file in the
selection dialog, even though he knows its there and explorer shows it
just fine. So he thinks there must be a bug.

In the other case, it displays a non-sensical file name. Again, the user
thinks there is a bug - plus if you click on the file, you get some
error message (hopefully, the application will catch the exception -
the directory may also have changed in-between, so a missing file
error must be recovered from).

So it's a user-visible bug in either case, but if the incorrect file
name is included, it's slightly more obvious that something is wrong.

loewis · 2010-09-12T21:45:00Z

If we choose to keep this behaviour, I will have to revert my commit on mbcs
codec to be consistent with os.listdir(). Or at least patch
PyUnicode_EncodeFSDefault and os.fsencode() (use replace error handler) and
PyUnicode_DecodeFSDefault and os.fsdecode() (use igrore error handler).

I think trying to emulate, in Python, what the *A functions do is
futile. IIUC, disables WC_NO_BEST_FIT_CHARS, and may do other stuff
which apparently is undocumented.

However, I fail to see the relationship to this issue. Having the MBCS
codec support strict mode is a good thing.

vstinner · 2010-09-13T11:50:44Z

I fail to see why removing incorrect file names from the result
list is any better than keeping them. The result list will
be incorrect either way.

It depends if you focus on displaying the content of the directory, or on processing files and directories. If you focus on display, yes, missing files an be seen as a bug. But if you walk into directories (use cases: os.walk(), replace a text pattern in all files (~os.glob), ...), and the function raises an error (because a directory or a file name is invalid) is worse. I mean the user have to rename all unencodable names, or the devfeloper have to patch its application to catch IOError and ignore the specific IOError(22).

If listdir() ignores unencodable names, os.walk() doesn't fail, but it misses some subdirectories and files.

--

Another (worse?) idea: deny bytes path for os.listdir(), but I suppose that we will not like the idea ;-)

vstinner · 2010-09-13T12:09:42Z

I think trying to emulate, in Python, what the *A functions
do is futile.

My problem is that some functions will use mbcs in strict mode (functions using PyUnicode_EncodeFSDefault): raise UnicodeEncodeError, and other will use mbcs in replace mode (functions using Windows functions in ANSI mode): raise IOError (or other error depending on the function). It's inconsistent. We should try to keep the same behaviour for all functions.

Examples of functions using (indirectly) PyUnicode_EncodeFSDefault to encode unicode filenames: bz2.BZ2File() and _ssl.SSLContext.load_cert_chain().

bitdancer · 2010-09-13T12:27:48Z

After the decision to ignore undecodable file names in os.listdir but before PEP-383 there was a long discussion on python-dev (in which I was a participant) about how horrible just ignoring the undecodable filenames was. This applies *especially* to the os.walk case, where some files would be mysteriously skipped and it wouldn't be obvious why.
Or even obvious that they'd been skipped, in some cases. The biggest issue was that the developer would likely never see the problem since the bulk of developers don't encounter encoding issues, so it would be the poor end user who would be confronted with the mystery, with no clues as to the cause or solution.

The conclusion of that particular thread was that Guido approved adding warning messages for filenames that were undecodable, but otherwise leaving os.listdir unchanged. Fortunately Martin came up with PEP-383, which solved the underlying problem in a better way.

So, I don't think that skipping the undecodable names is good, unless you generate a warning. In that thread I started out advocating raising an error, but in this case as Martin points out that would be a backward compatibility issue. Returning the munged filenames and having the error show up when the broken filename is used seems OK to me, even if imperfect. When the user sees the problem, they report it to the developer as a bug, who hopefully changes his code to use strings.

Adding warning messages would probably be useless at best and annoying at worst on Windows. Maybe we could add a pseudo deprecation warning (ie: aimed at developers, silent by default) that says "don't use listdir with bytes on windows"?

bitdancer · 2010-09-13T12:36:08Z

But in the case of BZ2File and ssl.SSLContext.load_cert_chain(), isn't it the case that they are trying to open the files? So producing an early error about the decoding problem makes sense. Are there any functions other than listdir where the decoded filenames are not necessarily immediately used to manipulate the files?

vstinner · 2010-09-13T19:11:22Z

ignore unencodable filenames is not a good idea
raise an error on unencodable filenames breaks backward compatibility
I don't think that emit a warning will change anything

Even if I don't like mbcs+replace (current behaviour of os.listdir(bytes)), I now agree that it is the less worst solution. I close this issue as invalid.

vstinner added stdlib Python modules in the Lib dir topic-unicode OS-windows labels Sep 10, 2010

loewis mannequin changed the title ~~Windows : os.listdir(b'.') doesn't raise an error for unencodable filenames~~ Windows : os.listdir(b'.') doesn't raise an error for unencodable filenames Sep 12, 2010

vstinner changed the title ~~Windows : os.listdir(b'.') doesn't raise an error for unencodable filenames~~ Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames Sep 12, 2010

loewis mannequin changed the title ~~Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames~~ Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames Sep 12, 2010

vstinner closed this as completed Sep 13, 2010

vstinner added the invalid label Sep 13, 2010

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames #54029

Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames #54029

vstinner commented Sep 10, 2010

vstinner commented Sep 10, 2010

vstinner commented Sep 10, 2010

vstinner commented Sep 10, 2010

vstinner commented Sep 10, 2010

loewis mannequin commented Sep 12, 2010

vstinner commented Sep 12, 2010

loewis mannequin commented Sep 12, 2010

vstinner commented Sep 12, 2010

vstinner commented Sep 12, 2010

vstinner commented Sep 12, 2010

loewis mannequin commented Sep 12, 2010

loewis mannequin commented Sep 12, 2010

vstinner commented Sep 13, 2010

vstinner commented Sep 13, 2010

bitdancer commented Sep 13, 2010

bitdancer commented Sep 13, 2010

vstinner commented Sep 13, 2010

Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames #54029

Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames #54029

Comments

vstinner commented Sep 10, 2010

vstinner commented Sep 10, 2010

vstinner commented Sep 10, 2010

vstinner commented Sep 10, 2010

vstinner commented Sep 10, 2010

loewis mannequin commented Sep 12, 2010

vstinner commented Sep 12, 2010

loewis mannequin commented Sep 12, 2010

vstinner commented Sep 12, 2010

vstinner commented Sep 12, 2010

vstinner commented Sep 12, 2010

loewis mannequin commented Sep 12, 2010

loewis mannequin commented Sep 12, 2010

vstinner commented Sep 13, 2010

vstinner commented Sep 13, 2010

bitdancer commented Sep 13, 2010

bitdancer commented Sep 13, 2010

vstinner commented Sep 13, 2010