readdir() in os.listdir not threadsafe on OSX 10.6.8 #57726

thouis · 2011-12-01T20:53:52Z

BPO	13517
Nosy	@amauryfa, @ned-deily
Files	py272_readdir_r.patch filefinder.py: script for demonstrating bug py272_readdir_r.v2.patch: updated patch (relative to hg 2.7 tip)

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2011-12-02.11:53:27.158>
created_at = <Date 2011-12-01.20:53:51.783>
labels = ['type-bug', 'library']
title = 'readdir() in os.listdir not threadsafe on OSX 10.6.8'
updated_at = <Date 2011-12-02.11:53:27.157>
user = 'https://bugs.python.org/thouis'

bugs.python.org fields:

activity = <Date 2011-12-02.11:53:27.157>
actor = 'thouis'
assignee = 'none'
closed = True
closed_date = <Date 2011-12-02.11:53:27.158>
closer = 'thouis'
components = ['Library (Lib)']
creation = <Date 2011-12-01.20:53:51.783>
creator = 'thouis'
dependencies = []
files = ['23832', '23833', '23834']
hgrepos = []
issue_num = 13517
keywords = ['patch']
message_count = 10.0
messages = ['148737', '148738', '148739', '148740', '148741', '148742', '148743', '148744', '148745', '148753']
nosy_count = 5.0
nosy_names = ['amaury.forgeotdarc', 'ned.deily', 'neologix', 'thouis', 'rosslagerwall']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue13517'
versions = ['Python 3.1', 'Python 2.7', 'Python 3.2', 'Python 3.3', 'Python 3.4']

thouis · 2011-12-01T20:53:50Z

On my system (OSX 10.6.8) using the python.org 32/64-bit build of 2.7.2, I see incorrect results from os.listdir() in a threaded program. The error is that the result of os.listdir() is missing a few files from its list.

First, my use case. I work with large image-based datasets, often with hundreds of thousands of images. The first step in processing is to locate all of these images and extract some basic information (size, channels, etc.). To do this more efficiently on network filesystems, where listing directories and stat()ing files is often slow, I wrote a multithreaded analog to os.walk(). While validating its results against unix 'find', I saw discrepancies in the number of files found.

My guess is that OSX's readdir() is not reentrant when dealing with SMB shares, even on different DIR pointers. It's also possible that readdir() is not reentrant with lstat(), as some of my tests seemed to indicate this, but I need to run some more tests to be sure that's what I was actually seeing.

In any case, there are three possible ways to fix this, I think.

Remove the Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS around readdir() in posixmodule.c
Put a mutex on readdir()
Use readdir_r(). I've attached a potential patch for 2.7.2 for this solution.

I would prefer the second or last approach, as they preserve the ability to do other work while listing large directories.

By my reading of the python 3.0 to 3.4 sources, this problem exists in those versions, as well.

thouis · 2011-12-01T20:56:49Z

Here is the script I use to detect the failure.
% python filefinder.py /PATH/TO/LARGE/DIRECTORY/TREE

(note that I was working over samba with an 8ish-level deep directory with around 250000 files).

Compare its final output in the FOUND column with
% find /PATH/TO/LARGE/DIRECTORY/TREE | wc -l

amauryfa · 2011-12-01T21:35:11Z

The link mentioned in the patch is really interesting:
http://womble.decadent.org.uk/readdir_r-advisory.html

ned-deily · 2011-12-01T21:54:02Z

Is there any reason to believe that the problem is confined to OS X?

thouis · 2011-12-01T22:03:25Z

I should add the caveat that I am not completely confident that I have stress-tested the patch enough to be sure that it actually addresses the problem. It is still possible that this is an error in OSX or the remote fileserver in which a large amount of concurrent traffic is causing it to actually return invalid data. This is somewhat belied by the fact that I was running 'find' at the same time, and did not see it give variable answers, ever. I will continue testing.

neologix · 2011-12-01T22:31:21Z

Is there any reason to believe that the problem is confined to OS X?

It's a bit of a grey area.
Here's what POSIX says:

http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir.html

"""
The pointer returned by readdir() points to data which may be overwritten by another call to readdir() on the same directory stream. This data is not overwritten by another call to readdir() on a different directory stream.
"""

So it seems safe as long as the threads are using distinct DIR *.
However, the documentation also says this:
"""
The readdir() function need not be reentrant. A function that is not required to be reentrant is not required to be thread-safe.
"""

So in theory, readddir() could use some static/global state which may make it not thread-safe.

I just had a look at glibc's implementation, and it is indeed safe:
http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/readdir.c;h=13e5e9a0213fcf37d5f289483439bff701a9708a;hb=HEAD

Every "sane" implementation should be safe in practice.

Now, it wouldn't be the first time we encounter such a stupid bug on OS X, but it would be nice to have a a short reproducer code in C to make sure.

It's also possible that readdir() is not reentrant with lstat()

This doesn't make much sense to me.

thouis · 2011-12-01T22:32:30Z

Reading through many pages discussing readdir vs. readdir_r (many on security mailing lists, a large number referring to the page linked in the patch), I get the impression that most implementations are thread-safe as long as separate threads do not call readdir() using the same DIR pointer.

I believe there is some ambiguity in the POSIX specification as to whether this is the only way in which readdir() might be thread-unsafe.

thouis · 2011-12-01T22:38:06Z

> It's also possible that readdir() is not reentrant with lstat()
This doesn't make much sense to me.

Me either. I think what I was actually seeing was multiple calls to readdir() still occurring even after placing a mutex on os.listdir due to my wrapping of os.listdir in a timeout via a subthread, and mutexing the timeout-wrapped version. I will test this more carefully tomorrow.

I will also look into creating some C code to demonstrate the bug.

neologix · 2011-12-01T22:41:54Z

And here's a post by Ulrich Drepper:
http://udrepper.livejournal.com/18555.html

"""
readdir_r is only needed if multiple threads are using the same directory stream. I have yet to see a program where this really is the case. In this toy example the stream (variable dir) is definitely not shared between different threads. Therefore the use of readdir is just fine. Should this matter? Yes, it should, since readdir_r has to copy the data in into the buffer provided by the user while readdir has the possibility to avoid that.
"""

So I'm even more confident that we should keep the current code.
At least, before going any further (and complicating the code), I'd like to know whether this can be reproduced with a small C snippet.

thouis · 2011-12-02T11:53:26Z

Further testing indicates the problem is in the filesystem itself (either the server or client, but not in python).

Serializing the loops calling readdir / readdir_r fixes the problem on my system, but using either function in a large number of parallel threads causes some directory entries to be missed (usually 2 entries in a row, oddly enough). I was also able to cause 'find' to fail in the same way by placing the filesystem under sufficient stress, which I hadn't managed to do before (leading me to trust the filesystem more than I should have).

I apologize for the noise. I've closed this bug report.

thouis mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Dec 1, 2011

thouis mannequin closed this as completed Dec 2, 2011

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

readdir() in os.listdir not threadsafe on OSX 10.6.8 #57726

readdir() in os.listdir not threadsafe on OSX 10.6.8 #57726

thouis mannequin commented Dec 1, 2011

thouis mannequin commented Dec 1, 2011

thouis mannequin commented Dec 1, 2011

amauryfa commented Dec 1, 2011

ned-deily commented Dec 1, 2011

thouis mannequin commented Dec 1, 2011

neologix mannequin commented Dec 1, 2011

thouis mannequin commented Dec 1, 2011

thouis mannequin commented Dec 1, 2011

neologix mannequin commented Dec 1, 2011

thouis mannequin commented Dec 2, 2011

readdir() in os.listdir not threadsafe on OSX 10.6.8 #57726

readdir() in os.listdir not threadsafe on OSX 10.6.8 #57726

Comments

thouis mannequin commented Dec 1, 2011

thouis mannequin commented Dec 1, 2011

thouis mannequin commented Dec 1, 2011

amauryfa commented Dec 1, 2011

ned-deily commented Dec 1, 2011

thouis mannequin commented Dec 1, 2011

neologix mannequin commented Dec 1, 2011

thouis mannequin commented Dec 1, 2011

thouis mannequin commented Dec 1, 2011

neologix mannequin commented Dec 1, 2011

thouis mannequin commented Dec 2, 2011