New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Windows : os.listdir(b'.') doesn't raise an errorfor unencodablefilenames #54029
Comments
In Python 3.2, mbcs encoding (default filesystem encoding on Windows) is now strict: raise an error on unencodable/undecodable characters/bytes. But os.listdir(b'.') encodes unencodable bytes as b'?'. Example: >>> os.mkdir('listdir')
>>> open('listdir\\xxx-\u0363', 'w').close()
>>> filename = os.listdir(b'listdir')[0]
>>> filename
b'xxx-?'
>>> open(filename, 'r').close()
IOError: [Errno 22] Invalid argument: 'xxx-?' os.listdir(b'listdir') should raise an error (and not ignore the filename or replaces unencodable characters by b'?'). I think that we should list the directory using the wide character API (FindFirstFileW) but encode the filename using PyUnicode_EncodeFSDefault() if the directory name type is bytes, instead of using the ANSI API (FindFirstFileA). |
I found this bug while trying to find an unencodable filename for bpo-9819 (TESTFN_UNDECODABLE). Anyway, the bytes API should be avoided on Windows since Windows native filename type is unicode. |
To avoid the error, a solution is to support the PEP-383 on Windows (for the mbcs encoding). I opened a separated issue for that: bpo-9821. But support PEP-383 will not fix this issue because the current implementation of listdir(b'.') doesn't use the Python codec, but use raw bytes filenames (use the ANSI API). |
Patch:
|
What do you gain with this patch? (i.e. what is its advantage?) |
You know directly that os.listdir(bytes) is unable to encode the filename, instead of manipulate an invalid filename (b'?') and get the error later (when you use the filename: open, copy, delete, ... the file). It's the same idea than str+bytes raises an error on Python3: get the error earlier instead of store invalid data and get the error to late. Anywy, on Windows, it's not a good idea to manipulate bytes filenames. So it's also a way to encourage people to migrate their applications to unicode on Windows. |
Ok. Then I'm -1 on the patch: you can't know whether the application For backwards compatibility, I recommend to leave things as they are. Windows has explicit A and W versions, and Python has explicit A
Only if people run into the issue (which few people will). People |
About mbcs, mbcs codec of Python 3.1 is like .encode('mbcs', 'replace') and PyUnicode_EncodeFSDefault / PyUnicode_DecodeFSDefault uses the strict error I just added a note about mbcs in Doc/whatsnew/3.2.rst: r84750. |
It remembers me the discussion of the issue bpo-3187. About unencodable filenames, Then you introduced the genious PEP-383 (utf8b then renamed surrogateescape) About Windows, os.listdir(str) never fails, but my question is about
"it only wants to open some of the files" is the typical reason for which I The typical use case of listing a directory is a file chooser. On Windows using -- Ignore unencodable filenames solution is compatible with the "traverse into -- I proposed to raise an error on unencodable filename. I changed my mind after |
If we choose to keep this behaviour, I will have to revert my commit on mbcs |
I see nothing wrong with returning incorrect file names.
Notice that this doesn't change with the patch. It will *still* work
I fail to see why removing incorrect file names from the result list is In one case (files skipped), the user will not see the file in the In the other case, it displays a non-sensical file name. Again, the user So it's a user-visible bug in either case, but if the incorrect file |
I think trying to emulate, in Python, what the *A functions do is However, I fail to see the relationship to this issue. Having the MBCS |
It depends if you focus on displaying the content of the directory, or on processing files and directories. If you focus on display, yes, missing files an be seen as a bug. But if you walk into directories (use cases: os.walk(), replace a text pattern in all files (~os.glob), ...), and the function raises an error (because a directory or a file name is invalid) is worse. I mean the user have to rename all unencodable names, or the devfeloper have to patch its application to catch IOError and ignore the specific IOError(22). If listdir() ignores unencodable names, os.walk() doesn't fail, but it misses some subdirectories and files. -- Another (worse?) idea: deny bytes path for os.listdir(), but I suppose that we will not like the idea ;-) |
My problem is that some functions will use mbcs in strict mode (functions using PyUnicode_EncodeFSDefault): raise UnicodeEncodeError, and other will use mbcs in replace mode (functions using Windows functions in ANSI mode): raise IOError (or other error depending on the function). It's inconsistent. We should try to keep the same behaviour for all functions. Examples of functions using (indirectly) PyUnicode_EncodeFSDefault to encode unicode filenames: bz2.BZ2File() and _ssl.SSLContext.load_cert_chain(). |
After the decision to ignore undecodable file names in os.listdir but before PEP-383 there was a long discussion on python-dev (in which I was a participant) about how horrible just ignoring the undecodable filenames was. This applies *especially* to the os.walk case, where some files would be mysteriously skipped and it wouldn't be obvious why. The conclusion of that particular thread was that Guido approved adding warning messages for filenames that were undecodable, but otherwise leaving os.listdir unchanged. Fortunately Martin came up with PEP-383, which solved the underlying problem in a better way. So, I don't think that skipping the undecodable names is good, unless you generate a warning. In that thread I started out advocating raising an error, but in this case as Martin points out that would be a backward compatibility issue. Returning the munged filenames and having the error show up when the broken filename is used seems OK to me, even if imperfect. When the user sees the problem, they report it to the developer as a bug, who hopefully changes his code to use strings. Adding warning messages would probably be useless at best and annoying at worst on Windows. Maybe we could add a pseudo deprecation warning (ie: aimed at developers, silent by default) that says "don't use listdir with bytes on windows"? |
But in the case of BZ2File and ssl.SSLContext.load_cert_chain(), isn't it the case that they are trying to open the files? So producing an early error about the decoding problem makes sense. Are there any functions other than listdir where the decoded filenames are not necessarily immediately used to manipulate the files? |
Even if I don't like mbcs+replace (current behaviour of os.listdir(bytes)), I now agree that it is the less worst solution. I close this issue as invalid. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: