readdir() in os.listdir not threadsafe on OSX 10.6.8 #57726
On my system (OSX 10.6.8) using the python.org 32/64-bit build of 2.7.2, I see incorrect results from os.listdir() in a threaded program. The error is that the result of os.listdir() is missing a few files from its list.

First, my use case. I work with large image-based datasets, often with hundreds of thousands of images. The first step in processing is to locate all of these images and extract some basic information (size, channels, etc.). To do this more efficiently on network filesystems, where listing directories and stat()ing files is often slow, I wrote a multithreaded analog to os.walk(). While validating its results against Unix 'find', I saw discrepancies in the number of files found.

My guess is that OSX's readdir() is not reentrant when dealing with SMB shares, even on different DIR pointers. It's also possible that readdir() is not reentrant with lstat(), as some of my tests seemed to indicate this, but I need to run more tests to be sure that's what I was actually seeing.

In any case, there are three possible ways to fix this, I think.
I would prefer the second or last approach, as they preserve the ability to do other work while listing large directories. By my reading of the Python 3.0 to 3.4 sources, this problem exists in those versions as well.
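For readers following along, a multithreaded analog to os.walk() of the kind described above might look roughly like the sketch below. This is not the reporter's script: the function name `threaded_walk`, the `num_threads` parameter, and the queue-based design are all assumptions made for illustration. Each worker calls os.listdir() on a different directory, so every thread ends up with its own DIR pointer underneath.

```python
import os
import queue
import threading


def threaded_walk(root, num_threads=8):
    """Return a sorted list of every non-directory path found under root.

    Hypothetical helper for illustration only; not the script from this report.
    """
    pending = queue.Queue()      # directories waiting to be listed
    pending.put(root)
    found = []
    found_lock = threading.Lock()

    def worker():
        while True:
            directory = pending.get()
            if directory is None:            # sentinel: no more work
                pending.task_done()
                return
            try:
                for name in os.listdir(directory):
                    path = os.path.join(directory, name)
                    if os.path.isdir(path) and not os.path.islink(path):
                        pending.put(path)    # queued before task_done() below
                    else:
                        with found_lock:
                            found.append(path)
            except OSError:
                pass                         # unreadable directory: skip it
            finally:
                pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    pending.join()                           # every queued directory was listed
    for _ in threads:
        pending.put(None)                    # tell each worker to exit
    for t in threads:
        t.join()
    return sorted(found)
```

Comparing `len(threaded_walk(root))` against the count reported by `find` on the same tree is the kind of validation described above. On a healthy local filesystem the two should agree.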
Here is the script I use to detect the failure. (Note that I was working over Samba with a directory tree roughly 8 levels deep containing around 250,000 files.) Compare its final output in the FOUND column with
The link mentioned in the patch is really interesting:
Is there any reason to believe that the problem is confined to OS X?
I should add the caveat that I am not completely confident that I have stress-tested the patch enough to be sure that it actually addresses the problem. It is still possible that this is an error in OSX or the remote fileserver in which a large amount of concurrent traffic is causing it to actually return invalid data. This is somewhat belied by the fact that I was running 'find' at the same time, and did not see it give variable answers, ever. I will continue testing.
It's a bit of a grey area. http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir.html """ So it seems safe as long as the threads are using distinct DIR *. In theory, however, readdir() could use some static/global state which may make it not thread-safe. I just had a look at glibc's implementation, and it is indeed safe: Every "sane" implementation should be safe in practice. Now, it wouldn't be the first time we encounter such a stupid bug on OS X, but it would be nice to have a short reproducer in C to make sure.
This doesn't make much sense to me.
Reading through many pages discussing readdir vs. readdir_r (many on security mailing lists, a large number referring to the page linked in the patch), I get the impression that most implementations are thread-safe as long as separate threads do not call readdir() using the same DIR pointer. I believe there is some ambiguity in the POSIX specification as to whether this is the only way in which readdir() might be thread-unsafe.
Me either. I think what I was actually seeing was multiple calls to readdir() still occurring even after I placed a mutex on os.listdir, because I had wrapped os.listdir in a timeout via a subthread and put the mutex around the timeout-wrapped version. When the timeout expired, the mutex was released while the subthread's os.listdir call was still running. I will test this more carefully tomorrow. I will also look into creating some C code to demonstrate the bug.
And here's a post by Ulrich Drepper: """ So I'm even more confident that we should keep the current code.
Further testing indicates the problem is in the filesystem itself (either the server or client, but not in Python). Serializing the loops calling readdir / readdir_r fixes the problem on my system, but using either function in a large number of parallel threads causes some directory entries to be missed (usually 2 entries in a row, oddly enough). I was also able to cause 'find' to fail in the same way by placing the filesystem under sufficient stress, which I hadn't managed to do before (leading me to trust the filesystem more than I should have). I apologize for the noise. I've closed this bug report.