New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regression: os.walk now using os.scandir() breaks bytes filenames on windows #70099
Comments
In py3.4 and below we used to be able to use bytes filenames with os.walk(), it’s now impossible under windows due to the limitation of os.scandir(). This issue was reported to Blender tracker (https://developer.blender.org/T47018). I assume this should be considered as a regression? At least, this should be mentioned in the documentation… Thanks, |
Perhaps, I'm looking in the wrong place, but there doesn't seem be any test for bytes filenames. |
Regressions are not cool, but it was a deliberate choice to not support Bytes filename are deprecated since python 3.2 if i recall correctly. It's I suggest to document the regression rather than adding bytes support to |
For softer transition we can return old os.walk() implementation for bytes (may be with a deprecation warning). Using os.scandir() is just an optimization, and this shouldn't break existing program. |
@Haypo, I checked available info online and couldn't find any reference to byte file-paths were deprecated since Python3.2. On the contrary, the 3.2 release notes 0 state: "countless fixes regarding bytes/string issues; among them full support for a bytes environment (filenames, environment variables)" ---- Since this is already working properly in 3.5 for other systems besides ms-windows, and worked in 3.4x. ---- |
The ANSI API is problematic because it returns a best-fit encoding to the system codepage. For example: >>> os.listdir('.')
['ƠƨưƸǀLjǐǘǠǨǰǸ']
>>> os.listdir(b'.')
[b'O?u?|?iu?Kj?'] To somewhat work around this problem, listdir and scandir could return the cAlternateFilename of the WIN32_FIND_DATA struct if present. This is the classic 8.3 short name that Microsoft file systems create for MS-DOS compatibility. For NTFS it can be disabled in the registry, or per volume, but I assume whoever does that knows what to expect. Also, since Python wouldn't need the short name for a wide-character path, there's no point in asking for it. (For NTFS it's a separate entry in the MFT. If it exists, which is known ahead of time, finding the entry requires a second lookup.) In this case it's better to call FindFirstFileExW and request only FindExInfoBasic. Generally the difference is inconsequential, but in a contrived example with 10000 similarly-named files from "ĀāĂă0000" to "ĀāĂă9999" and short names from "0000~1" to "9999~1", skipping the short name lookup shaved about 10% off the total time. For this test, I replaced the FindFirstFileW call in posix_scandir with the following call:
|
I like Serhiy's suggestion to use the old walk implementation on Windows if the argument is bytes, with a DeprecationWarning (like all other bytes operations in os on Windows). There was no warning that bytes paths would stop working with os.walk, so we should fix it and I'm bumping up the priority. bytes filenames have been deprecated since Python 3.3 (bpo-13374); I don't think scandir (new in 3.5) should support bytes. |
Python 3.5.0 and 3.5.1 are out. I don't think that it's worth to add back It's really a bad idea to use bytes on Windows. With Python 3, it becomes |
Just to clarify what Victor and Zachary mentioned: bytes filenames have been deprecated only *on Windows* since Python 3.3; bytes filenames are still fine on POSIX. |
Here is a patch that restores the ability of os.walk() to work with bytes filenames on Windows. |
Did we want a deprecation warning too? I don't see it in the patch. Otherwise, LGTM. |
Considering there's no plan to implement bytes paths for scandir on Windows, then the following line in the os docs needs to be modified: "All functions accepting path or file names accept both bytes and string objects, and result in an object of the same type, if a path or file name is returned." It should be changed to acknowledge that some functions on Windows do not support bytes paths, and that existing support is deprecated and may be removed in a future release. Also, the docs for listdir should warn that using bytes returns invalid results on Windows when filenames contain characters that aren't mapped in the system locale codepage. |
There is open bpo-16700 for the documentation. |
Fixed a typo found by Eryk Sun. |
I didn't tested on Windows. If bytes paths considered deprecated on Windows, os.listdir() should emit a warning. Otherwise I don't think we should special case os.walk(). |
A related question, (realize this isn't a support forum), but what is the equivalent API calls in Python3.5 so we can update scripts. Should we be doing: os.walk(os.fsdecode(path)) ? From Python's documentation, os.fsdecode is using 'strict' handler which may raise exceptions. If this is not the best place to ask about this, then can post elsewhere. |
New changeset 5310f94772f4 by Serhiy Storchaka in branch '3.5': New changeset b060af2a58b6 by Serhiy Storchaka in branch 'default': |
While this change is ok for Python 3.5, I would prefer to *drop* support for bytes filenames on Windows. I started a thread on python-dev for that: |
Bastien:
Again, why do you use bytes? Blender only supports Python 3. You must use Unicode on all platforms. On Windows, it gives you support of the full Unicode character set for free. |
New tests emit deprecation warnings on Windows and failed. http://buildbot.python.org/all/builders/AMD64%20Windows8%203.x/builds/1753/steps/test/logs/stdio ====================================================================== Traceback (most recent call last):
File "D:\buildarea\3.x.bolen-windows8\build\lib\test\test_os.py", line 894, in test_walk_bottom_up
self.sub2_tree)
AssertionError: Tuples differ: ('@test_4312_tmp\\TEST1\\SUB2', ['broken_link', 'link'], ['tmp3']) != ('@test_4312_tmp\\TEST1\\SUB2', ['link'], ['broken_link', 'tmp3']) First differing element 1:
+ ('@test_4312_tmp\\TEST1\\SUB2', ['link'], ['broken_link', 'tmp3']) ====================================================================== Traceback (most recent call last):
File "D:\buildarea\3.x.bolen-windows8\build\lib\test\test_os.py", line 874, in test_walk_prune
self.assertEqual(all[1], self.sub2_tree)
AssertionError: Tuples differ: ('@test_4312_tmp\\TEST1\\SUB2', ['broken_link', 'link'], ['tmp3']) != ('@test_4312_tmp\\TEST1\\SUB2', ['link'], ['broken_link', 'tmp3']) First differing element 1:
+ ('@test_4312_tmp\\TEST1\\SUB2', ['link'], ['broken_link', 'tmp3']) Proposed patch should silence warnings. Could anyone please test it on Windows? |
New changeset 9bffe39e8273 by Serhiy Storchaka in branch '3.5': |
New changeset 19a3e0e664af by Serhiy Storchaka in branch '3.4': New changeset f9e22717722d by Serhiy Storchaka in branch '3.5': New changeset da020e408c7f by Serhiy Storchaka in branch 'default': |
New changeset 757159fb4847 by Serhiy Storchaka in branch '3.5': New changeset ecf4e51c222f by Serhiy Storchaka in branch 'default': |
Deprecation warnings are suppressed, but my attempt to fix test failures failed. http://buildbot.python.org/all/builders/AMD64%20Windows8%203.x/builds/1832/steps/test/logs/stdio Traceback (most recent call last):
File "D:\buildarea\3.x.bolen-windows8\build\lib\test\test_os.py", line 894, in test_walk_bottom_up
self.sub2_tree)
AssertionError: Tuples differ: ('@test_4060_tmp\\TEST1\\SUB2', ['broken_link', 'link'], ['tmp3']) != ('@test_4060_tmp\\TEST1\\SUB2', ['link'], ['broken_link', 'tmp3']) First differing element 1:
+ ('@test_4060_tmp\\TEST1\\SUB2', ['link'], ['broken_link', 'tmp3']) ====================================================================== Traceback (most recent call last):
File "D:\buildarea\3.x.bolen-windows8\build\lib\test\test_os.py", line 874, in test_walk_prune
self.assertEqual(all[1], self.sub2_tree)
AssertionError: Tuples differ: ('@test_4060_tmp\\TEST1\\SUB2', ['broken_link', 'link'], ['tmp3']) != ('@test_4060_tmp\\TEST1\\SUB2', ['link'], ['broken_link', 'tmp3']) First differing element 1:
+ ('@test_4060_tmp\\TEST1\\SUB2', ['link'], ['broken_link', 'tmp3']) ---------------------------------------------------------------------- In some cases a broken link is considered as a link to regular file, but in other cases is considered as a link to directory. It is hard to fix this Windows issue without Windows. Needed a help of Windows experts. |
Windows buildbots still fail (sometimes?) because of this issue :-/ http://buildbot.python.org/all/builders/AMD64%20Windows8%203.x/builds/1874/steps/test/logs/stdio |
New changeset d54ee39b061f by Victor Stinner in branch 'default': |
New changeset eb91d0387d59 by Victor Stinner in branch 'default': |
I faile to reproduce the bug by running test_os alone on Windows 8. I wrote a script to try to isolate a sequential list of tests which reproduce the bug on Windows, but I failed to find such list after 2 hours: So I'm unable to reproduce the bug, I tried a change to enhance os._DummyDirEntry, it should now mimick better the C implementation of os.DirEntry. Let's see if the test still fails randomly... |
Update: test_os pass on all Windows buildbots except one! I don't understand why the test fails on this buildbot. http://buildbot.python.org/all/builders/AMD64%20Windows8%203.x/builds/1906/steps/test/logs/stdio ====================================================================== Traceback (most recent call last):
File "D:\buildarea\3.x.bolen-windows8\build\lib\test\test_os.py", line 895, in test_walk_bottom_up
self.assertEqual(len(all), 4)
AssertionError: 5 != 4 |
New changeset 82da02b5bf22 by Victor Stinner in branch 'default': |
Some tests were failed on Windows 8 even in 3.4 (bpo-23808). |
New changeset 2b25fa7e3b7a by Victor Stinner in branch 'default': |
I'm including some comments here from an email thread I had with Victor about this issue on the Win8 buildbot, which led to his recent changeset 2b25fa7e3b7a. The Win8 3.x failure (the 5 != 4 length error) was due to the revision of os._DummyDirEntry (introduced I believe in eb91d0387d59) using stat() rather than lstat() to check for a symbolic link, which couldn't work, which in turn failed to exclude the symbolic directory link from the walk in the test. The 3.5 branch failure about the mis-matched tuple values from the walk is a separate issue, where the broken directory link is still counted as a directory (rather than a file). It seems due to os.path.isdir() - in the older os._DummyDirEntry implementation - returning True for the broken link in the test, rather than False as on Unix. That's because on Unix the underlying os.stat call fails, but on Windows isdir is replaced with nt._isdir which avoids os.stat but is just a shim for. The success on Windows sort of makes sense since the target_is_directory parameter to os.symlink() was used to create the broken link in the test, so Windows knows it's a directory, even if the link target is missing. If the bytes walk implementation needs to treat broken links like files on Windows (matching the non-bytes version), then it can't depend on isdir() failing. The non-bytes version (using scandir()) works even on Windows since internally scandir appears to build up mode bits out of os.stat/os.lstat calls and never uses isdir. Back-porting the 3.x implementation of DummyDirEntry would be one quick way to fix the 3.5 issue as it also avoids isdir(). BTW, with respect to changeset 2b25fa7e3b7a, I'm not sure it has quite the right semantics for is_dir, at least not if it's supposed to parallel os.path.isdir(). I believe that should return True for a symbolic link to a directory, which it doesn't look like this change would, if is_symlink happened to have been called first. It's possible the semantics of how _DummyDirEntry is used precludes that scenario, but it seems a little fragile. I'd probably just use lstat() for is_symlink, but otherwise not touch is_dir. Finally, as to why this only showed up on the Win8 buildbot - it turns out that's the only machine where the tests could create a symbolic link and thus encounter the scenario - they were prevented on Win7 and Win10 under the normal buildbot user due to lack of privileges (which was a surprise to me). That's also what was happening on Victor's local Win8 VM. So the failing conditions were just being skipped. The Win8 buildbot was itself sort of pure luck, as early on I must have set it up to run under an elevated (administrative) command prompt, probably due to trying to handle some other issue. I've now done the same thing for Win10 so it began seeing the same failures in the last few tests. |
New changeset c38ac7ab8d9a by Victor Stinner in branch '3.5': |
I think that you misunderstood the code. The "use cache lstat" path is only taken if the file is *not* symbolic link. I tested manually _DummyDirEntry on a symbolic link to a directory: _lstat is filled by the constructor, is_dir() *and* is_symlink() returns True. Both are not exclusive, is_dir() works as follow_symlinks=True, whereas is_symlink() works as follow_symlinks=False. |
Since all test_os now pass again on 3.x buildbots (after a few months of red buildbots), I backported the _DummyDirEntry fixes to 3.5. David:
I'm sorry, I don't understand you. The test_os failure was the same on Python 3.5 and 3.6:
This error should now be fixed on Python 3.5 and 3.6. Is there a remaining open issue related to os.scandir() (bytes or Unicode)? |
The issue looks to be fixed in Python 3.5 (future version 3.5.2) and 3.6, I close the issue. I repeat myself: On Python 3, don't store filenames as bytes, it's a bad practice. Use unicode, it works on all platforms and gives access to the full Unicode Character Set! |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: