This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Title: readdir() in os.listdir not threadsafe on OSX 10.6.8
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.1, Python 3.2, Python 3.3, Python 3.4, Python 2.7
Status: closed Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, ned.deily, neologix, rosslagerwall, thouis
Priority: normal Keywords: patch

Created on 2011-12-01 20:53 by thouis, last changed 2022-04-11 14:57 by admin. This issue is now closed.

File name Uploaded Description Edit
py272_readdir_r.patch thouis, 2011-12-01 20:53 thouis, 2011-12-01 20:56 script for demonstrating bug
py272_readdir_r.v2.patch thouis, 2011-12-01 22:28 updated patch (relative to hg 2.7 tip) review
Messages (10)
msg148737 - (view) Author: Thouis (Ray) Jones (thouis) Date: 2011-12-01 20:53
On my system (OSX 10.6.8) using the 32/64-bit build of 2.7.2, I see incorrect results from os.listdir() in a threaded program.  The error is that the result of os.listdir() is missing a few files from its list.

First, my use case.  I work with large image-based datasets, often with hundreds of thousands of images.  The first step in processing is to locate all of these images and extract some basic information (size, channels, etc.).  To do this more efficiently on network filesystems, where listing directories and stat()ing files is often slow, I wrote a multithreaded analog to os.walk().  While validating its results against unix 'find', I saw discrepancies in the number of files found.

My guess is that OSX's readdir() is not reentrant when dealing with SMB shares, even on different DIR pointers.  It's also possible that readdir() is not reentrant with lstat(), as some of my tests seemed to indicate this, but I need to run some more tests to be sure that's what I was actually seeing.

In any case, there are three possible ways to fix this, I think.

- Remove the Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS around readdir() in posixmodule.c

- Put a mutex on readdir()

- Use readdir_r().  I've attached a potential patch for 2.7.2 for this solution.

I would prefer the second or last approach, as they preserve the ability to do other work while listing large directories.

By my reading of the python 3.0 to 3.4 sources, this problem exists in those versions, as well.
msg148738 - (view) Author: Thouis (Ray) Jones (thouis) Date: 2011-12-01 20:56
Here is the script I use to detect the failure.

(note that I was working over samba with an 8ish-level deep directory with around 250000 files).

Compare its final output in the FOUND column with 
msg148739 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2011-12-01 21:35
The link mentioned in the patch is really interesting:
msg148740 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2011-12-01 21:54
Is there any reason to believe that the problem is confined to OS X?
msg148741 - (view) Author: Thouis (Ray) Jones (thouis) Date: 2011-12-01 22:03
I should add the caveat that I am not completely confident that I have stress-tested the patch enough to be sure that it actually addresses the problem.  It is still possible that this is an error in OSX or the remote fileserver in which a large amount of concurrent traffic is causing it to actually return invalid data.  This is somewhat belied by the fact that I was running 'find' at the same time, and did not see it give variable answers, ever. I will continue testing.
msg148742 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-12-01 22:31
> Is there any reason to believe that the problem is confined to OS X?

It's a bit of a grey area.
Here's what POSIX says:

The pointer returned by readdir() points to data which may be overwritten by another call to readdir() on the same directory stream. This data is not overwritten by another call to readdir() on a different directory stream.

So it seems safe as long as the threads are using distinct DIR *.
However, the documentation also says this:
The readdir() function need not be reentrant. A function that is not required to be reentrant is not required to be thread-safe.

So in theory, readddir() could use some static/global state which may make it not thread-safe.

I just had a look at glibc's implementation, and it is indeed safe:;a=blob;f=sysdeps/unix/readdir.c;h=13e5e9a0213fcf37d5f289483439bff701a9708a;hb=HEAD

Every "sane" implementation should be safe in practice.

Now, it wouldn't be the first time we encounter such a stupid bug on OS X, but it would be nice to have a a short reproducer code in C to make sure.

> It's also possible that readdir() is not reentrant with lstat()

This doesn't make much sense to me.
msg148743 - (view) Author: Thouis (Ray) Jones (thouis) Date: 2011-12-01 22:32
Reading through many pages discussing readdir vs. readdir_r (many on security mailing lists, a large number referring to the page linked in the patch), I get the impression that most implementations are thread-safe as long as separate threads do not call readdir() using the same DIR pointer.

I believe there is some ambiguity in the POSIX specification as to whether this is the only way in which readdir() might be thread-unsafe.
msg148744 - (view) Author: Thouis (Ray) Jones (thouis) Date: 2011-12-01 22:38
> > It's also possible that readdir() is not reentrant with lstat()
> This doesn't make much sense to me.

Me either.  I think what I was actually seeing was multiple calls to readdir() still occurring even after placing a mutex on os.listdir due to my wrapping of os.listdir in a timeout via a subthread, and mutexing the timeout-wrapped version.  I will test this more carefully tomorrow.

I will also look into creating some C code to demonstrate the bug.
msg148745 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-12-01 22:41
And here's a post by Ulrich Drepper:

readdir_r is only needed if multiple threads are using the same directory stream. I have yet to see a program where this really is the case. In this toy example the stream (variable dir) is definitely not shared between different threads. Therefore the use of readdir is just fine. Should this matter? Yes, it should, since readdir_r has to copy the data in into the buffer provided by the user while readdir has the possibility to avoid that.

So I'm even more confident that we should keep the current code.
At least, before going any further (and complicating the code), I'd like to know whether this can be reproduced with a small C snippet.
msg148753 - (view) Author: Thouis (Ray) Jones (thouis) Date: 2011-12-02 11:53
Further testing indicates the problem is in the filesystem itself (either the server or client, but not in python).

Serializing the loops calling readdir / readdir_r fixes the problem on my system, but using either function in a large number of parallel threads causes some directory entries to be missed (usually 2 entries in a row, oddly enough).  I was also able to cause 'find' to fail in the same way by placing the filesystem under sufficient stress, which I hadn't managed to do before (leading me to trust the filesystem more than I should have).

I apologize for the noise.  I've closed this bug report.
Date User Action Args
2022-04-11 14:57:24adminsetgithub: 57726
2011-12-02 11:53:27thouissetstatus: open -> closed

messages: + msg148753
2011-12-02 03:29:09rosslagerwallsetnosy: + rosslagerwall
2011-12-01 22:41:54neologixsetmessages: + msg148745
2011-12-01 22:38:05thouissetmessages: + msg148744
2011-12-01 22:32:30thouissetmessages: + msg148743
2011-12-01 22:31:20neologixsetmessages: + msg148742
2011-12-01 22:28:58thouissetfiles: + py272_readdir_r.v2.patch
2011-12-01 22:03:25thouissetmessages: + msg148741
2011-12-01 21:54:01ned.deilysetnosy: + neologix, ned.deily
messages: + msg148740
2011-12-01 21:35:11amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg148739
2011-12-01 20:56:50thouissetfiles: +

messages: + msg148738
2011-12-01 20:53:51thouiscreate