Issue 9820: Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/54029

classification

Title:	Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames
Type:		Stage:
Components:	Library (Lib), Unicode, Windows	Versions:	Python 3.2

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	loewis, r.david.murray, vstinner
Priority:	normal	Keywords:	patch

Created on 2010-09-10 10:42 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
listdir_windows_bytes.patch	vstinner, 2010-09-10 21:30

Messages (17)
msg115995 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-10 10:42
In Python 3.2, mbcs encoding (default filesystem encoding on Windows) is now strict: raise an error on unencodable/undecodable characters/bytes. But os.listdir(b'.') encodes unencodable bytes as b'?'. Example: >>> os.mkdir('listdir') >>> open('listdir\\xxx-\u0363', 'w').close() >>> filename = os.listdir(b'listdir')[0] >>> filename b'xxx-?' >>> open(filename, 'r').close() IOError: [Errno 22] Invalid argument: 'xxx-?' os.listdir(b'listdir') should raise an error (and not ignore the filename or replaces unencodable characters by b'?'). I think that we should list the directory using the wide character API (FindFirstFileW) but encode the filename using PyUnicode_EncodeFSDefault() if the directory name type is bytes, instead of using the ANSI API (FindFirstFileA).
msg115996 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-10 10:45
I found this bug while trying to find an unencodable filename for #9819 (TESTFN_UNDECODABLE). Anyway, the bytes API should be avoided on Windows since Windows native filename type is unicode.
msg116002 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-10 11:06
> os.listdir(b'listdir') should raise an error (and not ignore > the filename or replaces unencodable characters by b'?'). To avoid the error, a solution is to support the PEP 383 on Windows (for the mbcs encoding). I opened a separated issue for that: #9821. But support PEP 383 will not fix this issue because the current implementation of listdir(b'.') doesn't use the Python codec, but use raw bytes filenames (use the ANSI API).
msg116043 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-10 21:30
Patch: - Remove the bytes version of listdir(): reuse the unicode version but converts the filename to bytes using PyUnicode_EncodeFSDefault() if the directory name is not unicode - use Py_XDECREF(d) instead of Py_DECREF(d) at the end (because d=NULL on error) - use Py_CLEAR(d) instead of "Py_DECREF(d); d=NULL;" - remove "char namebuf[MAX_PATH+5]" buffer (use less stack memory)
msg116209 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2010-09-12 16:02
What do you gain with this patch? (i.e. what is its advantage?)
msg116210 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-12 16:18
> What do you gain with this patch? (i.e. what is its advantage?) You know directly that os.listdir(bytes) is unable to encode the filename, instead of manipulate an invalid filename (b'?') and get the error later (when you use the filename: open, copy, delete, ... the file). It's the same idea than str+bytes raises an error on Python3: get the error earlier instead of store invalid data and get the error to late. Anywy, on Windows, it's not a good idea to manipulate bytes filenames. So it's also a way to encourage people to migrate their applications to unicode on Windows.
msg116220 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2010-09-12 17:12
> You know directly that os.listdir(bytes) is unable to encode the > filename, instead of manipulate an invalid filename (b'?') and get > the error later (when you use the filename: open, copy, delete, ... > the file). Ok. Then I'm -1 on the patch: you can't know whether the application actually wants to open the file. Perhaps it only wants to display the file names, or perhaps it only wants to open some of the files, or only traverse into subdirectories. For backwards compatibility, I recommend to leave things as they are. FindFirst/NextFileA will also do some other interesting conversions, such as the best-fit conversion (which the "mbcs" code doesn't do (anymore?)). Windows has explicit A and W versions, and Python has explicit A and W types, so it's IMO best to pair them in the natural way (even if that means code duplication). > Anywy, on Windows, it's not a good idea to manipulate bytes > filenames. So it's also a way to encourage people to migrate their > applications to unicode on Windows. Only if people run into the issue (which few people will). People which do run into the issue will likely get an error either way, which will teach them their lesson :-)
msg116228 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-12 20:35
> FindFirst/NextFileA will also do some other interesting conversions, > such as the best-fit conversion (which the "mbcs" code doesn't do > (anymore?)). About mbcs, mbcs codec of Python 3.1 is like .encode('mbcs', 'replace') and .decode('mbcs', 'ignore') of Python 3.2 (see issue #850997). By default (strict error handler), it now raises errors on undecodable byte sequence and unencodable character, whereas Python 3.1 just ignores the error handler. PyUnicode_EncodeFSDefault / PyUnicode_DecodeFSDefault uses the strict error handler. I just added a note about mbcs in Doc/whatsnew/3.2.rst: r84750.
msg116230 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-12 21:14
It remembers me the discussion of the issue #3187. About unencodable filenames, Guido proposed to ignore them or to use errors="replace", and wrote "Failing the entire os.listdir() call is not acceptable". (... long discussion ...) And finally, os.listdir() ignored undecodable filenames on UNIX/BSD. Then you introduced the genious PEP 383 (utf8b then renamed surrogateescape) and os.listdir() now raises an error if the PyUnicode_FromEncodedObject(v, Py_FileSystemDefaultEncoding, "surrogateescape") fails... which doesn't occur because of undecodable byte sequence, but for other reasons like a memory error. About Windows, os.listdir(str) never fails, but my question is about os.listdir(bytes). Should os.listdir(bytes) returns invalid filenames (encoded with "mbcs+replace", filenames not usable to open, rename or delete the file) or just ignore them? > Ok. Then I'm -1 on the patch: you can't know whether the application > actually wants to open the file. Perhaps it only wants to display the > file names, or perhaps it only wants to open some of the files, or > only traverse into subdirectories. > > For backwards compatibility, I recommend to leave things as they are. > FindFirst/NextFileA will also do some other interesting conversions, > such as the best-fit conversion (which the "mbcs" code doesn't do > (anymore?)). "it only wants to open some of the files" is the typical reason for which I hate Python2 and its implicit conversion between bytes and characters: it works in most cases, but it fails "sometimes". The problem is to define (and explain) "sometimes". The typical use case of listing a directory is a file chooser. On Windows using the bytes API, it works in most cases, but it fails if the user picks the "wrong" file (name with "?"). That's the problem I would like to address. -- Ignore unencodable filenames solution is compatible with the "traverse into subdirectories" case. And it does also keep backward compatibility (except that unencodable files are hidden, which is a least problem I think). -- I proposed to raise an error on unencodable filename. I changed my mind after reading your answer and the discussion on #3187. My patch breaks compatibility and users don't bother to unencodable filenames. Eg. glob("*.mp3") should not fail if the directory contains a temporary unencodable filename ("xxx.tmp").
msg116232 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-12 21:20
> FindFirst/NextFileA will also do some other interesting conversions, > such as the best-fit conversion (which the "mbcs" code doesn't do > (anymore?)). If we choose to keep this behaviour, I will have to revert my commit on mbcs codec to be consistent with os.listdir(). Or at least patch PyUnicode_EncodeFSDefault and os.fsencode() (use replace error handler) and PyUnicode_DecodeFSDefault and os.fsdecode() (use igrore error handler).
msg116233 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2010-09-12 21:30
> About Windows, os.listdir(str) never fails, but my question is about > os.listdir(bytes). Should os.listdir(bytes) returns invalid filenames (encoded > with "mbcs+replace", filenames not usable to open, rename or delete the file) or > just ignore them? I see nothing wrong with returning incorrect file names. > "it only wants to open some of the files" is the typical reason for which I > hate Python2 and its implicit conversion between bytes and characters: it > works in most cases, but it fails "sometimes". The problem is to define (and > explain) "sometimes". Notice that this doesn't change with the patch. It will still work sometimes, and fail sometimes. In fact, for most users and most applications, it will never fail - even with your patch applied. > Ignore unencodable filenames solution is compatible with the "traverse into > subdirectories" case. And it does also keep backward compatibility (except > that unencodable files are hidden, which is a least problem I think). I fail to see why removing incorrect file names from the result list is any better than keeping them. The result list will be incorrect either way. In one case (files skipped), the user will not see the file in the selection dialog, even though he knows its there and explorer shows it just fine. So he thinks there must be a bug. In the other case, it displays a non-sensical file name. Again, the user thinks there is a bug - plus if you click on the file, you get some error message (hopefully, the application will catch the exception - the directory may also have changed in-between, so a missing file error must be recovered from). So it's a user-visible bug in either case, but if the incorrect file name is included, it's slightly more obvious that something is wrong.
msg116235 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2010-09-12 21:44
> If we choose to keep this behaviour, I will have to revert my commit on mbcs > codec to be consistent with os.listdir(). Or at least patch > PyUnicode_EncodeFSDefault and os.fsencode() (use replace error handler) and > PyUnicode_DecodeFSDefault and os.fsdecode() (use igrore error handler). I think trying to emulate, in Python, what the *A functions do is futile. IIUC, disables WC_NO_BEST_FIT_CHARS, and may do other stuff which apparently is undocumented. However, I fail to see the relationship to this issue. Having the MBCS codec support strict mode is a good thing.
msg116295 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-13 11:50
> I fail to see why removing incorrect file names from the result > list is any better than keeping them. The result list will > be incorrect either way. It depends if you focus on displaying the content of the directory, or on processing files and directories. If you focus on display, yes, missing files an be seen as a bug. But if you walk into directories (use cases: os.walk(), replace a text pattern in all files (~os.glob), ...), and the function raises an error (because a directory or a file name is invalid) is worse. I mean the user have to rename all unencodable names, or the devfeloper have to patch its application to catch IOError and ignore the specific IOError(22). If listdir() ignores unencodable names, os.walk() doesn't fail, but it misses some subdirectories and files. -- Another (worse?) idea: deny bytes path for os.listdir(), but I suppose that we will not like the idea ;-)
msg116296 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-13 12:09
> I think trying to emulate, in Python, what the *A functions > do is futile. My problem is that some functions will use mbcs in strict mode (functions using PyUnicode_EncodeFSDefault): raise UnicodeEncodeError, and other will use mbcs in replace mode (functions using Windows functions in ANSI mode): raise IOError (or other error depending on the function). It's inconsistent. We should try to keep the same behaviour for all functions. Examples of functions using (indirectly) PyUnicode_EncodeFSDefault to encode unicode filenames: bz2.BZ2File() and _ssl.SSLContext.load_cert_chain().
msg116299 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-09-13 12:27
After the decision to ignore undecodable file names in os.listdir but before PEP 383 there was a long discussion on python-dev (in which I was a participant) about how horrible just ignoring the undecodable filenames was. This applies especially to the os.walk case, where some files would be mysteriously skipped and it wouldn't be obvious why. Or even obvious that they'd been skipped, in some cases. The biggest issue was that the developer would likely never see the problem since the bulk of developers don't encounter encoding issues, so it would be the poor end user who would be confronted with the mystery, with no clues as to the cause or solution. The conclusion of that particular thread was that Guido approved adding warning messages for filenames that were undecodable, but otherwise leaving os.listdir unchanged. Fortunately Martin came up with PEP 383, which solved the underlying problem in a better way. So, I don't think that skipping the undecodable names is good, unless you generate a warning. In that thread I started out advocating raising an error, but in this case as Martin points out that would be a backward compatibility issue. Returning the munged filenames and having the error show up when the broken filename is used seems OK to me, even if imperfect. When the user sees the problem, they report it to the developer as a bug, who hopefully changes his code to use strings. Adding warning messages would probably be useless at best and annoying at worst on Windows. Maybe we could add a pseudo deprecation warning (ie: aimed at developers, silent by default) that says "don't use listdir with bytes on windows"?
msg116300 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-09-13 12:36
But in the case of BZ2File and ssl.SSLContext.load_cert_chain(), isn't it the case that they are trying to open the files? So producing an early error about the decoding problem makes sense. Are there any functions other than listdir where the decoded filenames are not necessarily immediately used to manipulate the files?
msg116340 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-13 19:11
- ignore unencodable filenames is not a good idea - raise an error on unencodable filenames breaks backward compatibility - I don't think that emit a warning will change anything Even if I don't like mbcs+replace (current behaviour of os.listdir(bytes)), I now agree that it is the less worst solution. I close this issue as invalid.

History
Date	User	Action	Args
2022-04-11 14:57:06	admin	set	github: 54029
2010-09-13 19:11:21	vstinner	set	status: open -> closed resolution: not a bug messages: + msg116340
2010-09-13 12:36:08	r.david.murray	set	messages: + msg116300
2010-09-13 12:27:49	r.david.murray	set	nosy: + r.david.murray messages: + msg116299
2010-09-13 12:09:43	vstinner	set	messages: + msg116296
2010-09-13 11:50:45	vstinner	set	messages: + msg116295
2010-09-12 21:45:00	loewis	set	messages: + msg116235
2010-09-12 21:30:18	loewis	set	messages: + msg116233 title: Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames -> Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames
2010-09-12 21:20:08	vstinner	set	messages: + msg116232
2010-09-12 21:14:54	vstinner	set	messages: + msg116230
2010-09-12 20:35:24	vstinner	set	messages: + msg116228 title: Windows : os.listdir(b'.') doesn't raise an error for unencodable filenames -> Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames
2010-09-12 17:12:18	loewis	set	messages: + msg116220 title: Windows : os.listdir(b'.') doesn't raise an error for unencodable filenames -> Windows : os.listdir(b'.') doesn't raise an error for unencodable filenames
2010-09-12 16:18:47	vstinner	set	messages: + msg116210
2010-09-12 16:02:10	loewis	set	messages: + msg116209
2010-09-10 21:30:21	vstinner	set	files: + listdir_windows_bytes.patch keywords: + patch messages: + msg116043
2010-09-10 11:06:24	vstinner	set	messages: + msg116002
2010-09-10 10:45:13	vstinner	set	messages: + msg115996
2010-09-10 10:42:58	vstinner	create