Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

readdir() in os.listdir not threadsafe on OSX 10.6.8 #57726

Closed
thouis mannequin opened this issue Dec 1, 2011 · 10 comments
Closed

readdir() in os.listdir not threadsafe on OSX 10.6.8 #57726

thouis mannequin opened this issue Dec 1, 2011 · 10 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@thouis
Copy link
Mannequin

thouis mannequin commented Dec 1, 2011

BPO 13517
Nosy @amauryfa, @ned-deily
Files
  • py272_readdir_r.patch
  • filefinder.py: script for demonstrating bug
  • py272_readdir_r.v2.patch: updated patch (relative to hg 2.7 tip)
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2011-12-02.11:53:27.158>
    created_at = <Date 2011-12-01.20:53:51.783>
    labels = ['type-bug', 'library']
    title = 'readdir() in os.listdir not threadsafe on OSX 10.6.8'
    updated_at = <Date 2011-12-02.11:53:27.157>
    user = 'https://bugs.python.org/thouis'

    bugs.python.org fields:

    activity = <Date 2011-12-02.11:53:27.157>
    actor = 'thouis'
    assignee = 'none'
    closed = True
    closed_date = <Date 2011-12-02.11:53:27.158>
    closer = 'thouis'
    components = ['Library (Lib)']
    creation = <Date 2011-12-01.20:53:51.783>
    creator = 'thouis'
    dependencies = []
    files = ['23832', '23833', '23834']
    hgrepos = []
    issue_num = 13517
    keywords = ['patch']
    message_count = 10.0
    messages = ['148737', '148738', '148739', '148740', '148741', '148742', '148743', '148744', '148745', '148753']
    nosy_count = 5.0
    nosy_names = ['amaury.forgeotdarc', 'ned.deily', 'neologix', 'thouis', 'rosslagerwall']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue13517'
    versions = ['Python 3.1', 'Python 2.7', 'Python 3.2', 'Python 3.3', 'Python 3.4']

    @thouis
    Copy link
    Mannequin Author

    thouis mannequin commented Dec 1, 2011

    On my system (OSX 10.6.8) using the python.org 32/64-bit build of 2.7.2, I see incorrect results from os.listdir() in a threaded program. The error is that the result of os.listdir() is missing a few files from its list.

    First, my use case. I work with large image-based datasets, often with hundreds of thousands of images. The first step in processing is to locate all of these images and extract some basic information (size, channels, etc.). To do this more efficiently on network filesystems, where listing directories and stat()ing files is often slow, I wrote a multithreaded analog to os.walk(). While validating its results against unix 'find', I saw discrepancies in the number of files found.

    My guess is that OSX's readdir() is not reentrant when dealing with SMB shares, even on different DIR pointers. It's also possible that readdir() is not reentrant with lstat(), as some of my tests seemed to indicate this, but I need to run some more tests to be sure that's what I was actually seeing.

    In any case, there are three possible ways to fix this, I think.

    • Remove the Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS around readdir() in posixmodule.c

    • Put a mutex on readdir()

    • Use readdir_r(). I've attached a potential patch for 2.7.2 for this solution.

    I would prefer the second or last approach, as they preserve the ability to do other work while listing large directories.

    By my reading of the python 3.0 to 3.4 sources, this problem exists in those versions, as well.

    @thouis thouis mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Dec 1, 2011
    @thouis
    Copy link
    Mannequin Author

    thouis mannequin commented Dec 1, 2011

    Here is the script I use to detect the failure.
    % python filefinder.py /PATH/TO/LARGE/DIRECTORY/TREE

    (note that I was working over samba with an 8ish-level deep directory with around 250000 files).

    Compare its final output in the FOUND column with
    % find /PATH/TO/LARGE/DIRECTORY/TREE | wc -l

    @amauryfa
    Copy link
    Member

    amauryfa commented Dec 1, 2011

    The link mentioned in the patch is really interesting:
    http://womble.decadent.org.uk/readdir_r-advisory.html

    @ned-deily
    Copy link
    Member

    Is there any reason to believe that the problem is confined to OS X?

    @thouis
    Copy link
    Mannequin Author

    thouis mannequin commented Dec 1, 2011

    I should add the caveat that I am not completely confident that I have stress-tested the patch enough to be sure that it actually addresses the problem. It is still possible that this is an error in OSX or the remote fileserver in which a large amount of concurrent traffic is causing it to actually return invalid data. This is somewhat belied by the fact that I was running 'find' at the same time, and did not see it give variable answers, ever. I will continue testing.

    @neologix
    Copy link
    Mannequin

    neologix mannequin commented Dec 1, 2011

    Is there any reason to believe that the problem is confined to OS X?

    It's a bit of a grey area.
    Here's what POSIX says:

    http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir.html

    """
    The pointer returned by readdir() points to data which may be overwritten by another call to readdir() on the same directory stream. This data is not overwritten by another call to readdir() on a different directory stream.
    """

    So it seems safe as long as the threads are using distinct DIR *.
    However, the documentation also says this:
    """
    The readdir() function need not be reentrant. A function that is not required to be reentrant is not required to be thread-safe.
    """

    So in theory, readddir() could use some static/global state which may make it not thread-safe.

    I just had a look at glibc's implementation, and it is indeed safe:
    http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/readdir.c;h=13e5e9a0213fcf37d5f289483439bff701a9708a;hb=HEAD

    Every "sane" implementation should be safe in practice.

    Now, it wouldn't be the first time we encounter such a stupid bug on OS X, but it would be nice to have a a short reproducer code in C to make sure.

    It's also possible that readdir() is not reentrant with lstat()

    This doesn't make much sense to me.

    @thouis
    Copy link
    Mannequin Author

    thouis mannequin commented Dec 1, 2011

    Reading through many pages discussing readdir vs. readdir_r (many on security mailing lists, a large number referring to the page linked in the patch), I get the impression that most implementations are thread-safe as long as separate threads do not call readdir() using the same DIR pointer.

    I believe there is some ambiguity in the POSIX specification as to whether this is the only way in which readdir() might be thread-unsafe.

    @thouis
    Copy link
    Mannequin Author

    thouis mannequin commented Dec 1, 2011

    > It's also possible that readdir() is not reentrant with lstat()
    This doesn't make much sense to me.

    Me either. I think what I was actually seeing was multiple calls to readdir() still occurring even after placing a mutex on os.listdir due to my wrapping of os.listdir in a timeout via a subthread, and mutexing the timeout-wrapped version. I will test this more carefully tomorrow.

    I will also look into creating some C code to demonstrate the bug.

    @neologix
    Copy link
    Mannequin

    neologix mannequin commented Dec 1, 2011

    And here's a post by Ulrich Drepper:
    http://udrepper.livejournal.com/18555.html

    """
    readdir_r is only needed if multiple threads are using the same directory stream. I have yet to see a program where this really is the case. In this toy example the stream (variable dir) is definitely not shared between different threads. Therefore the use of readdir is just fine. Should this matter? Yes, it should, since readdir_r has to copy the data in into the buffer provided by the user while readdir has the possibility to avoid that.
    """

    So I'm even more confident that we should keep the current code.
    At least, before going any further (and complicating the code), I'd like to know whether this can be reproduced with a small C snippet.

    @thouis
    Copy link
    Mannequin Author

    thouis mannequin commented Dec 2, 2011

    Further testing indicates the problem is in the filesystem itself (either the server or client, but not in python).

    Serializing the loops calling readdir / readdir_r fixes the problem on my system, but using either function in a large number of parallel threads causes some directory entries to be missed (usually 2 entries in a row, oddly enough). I was also able to cause 'find' to fail in the same way by placing the filesystem under sufficient stress, which I hadn't managed to do before (leading me to trust the filesystem more than I should have).

    I apologize for the noise. I've closed this bug report.

    @thouis thouis mannequin closed this as completed Dec 2, 2011
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants