Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use the new os.scandir() function in os.walk() #67793

Closed
vstinner opened this issue Mar 8, 2015 · 34 comments
Closed

Use the new os.scandir() function in os.walk() #67793

vstinner opened this issue Mar 8, 2015 · 34 comments
Labels
performance Performance or resource usage

Comments

@vstinner
Copy link
Member

vstinner commented Mar 8, 2015

BPO 23605
Nosy @vstinner, @benhoyt, @serhiy-storchaka
Files
  • os_walk_1.patch: First cut at os.walk() using scandir
  • walk_added_symlink_to_dir.patch
  • walk_added_symlink_to_dir-2.patch
  • fast_bottom-up.patch
  • fast_bottom-up-2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2015-03-30.12:48:56.877>
    created_at = <Date 2015-03-08.02:36:13.909>
    labels = ['performance']
    title = 'Use the new os.scandir() function in os.walk()'
    updated_at = <Date 2015-03-30.12:48:56.876>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2015-03-30.12:48:56.876>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2015-03-30.12:48:56.877>
    closer = 'vstinner'
    components = []
    creation = <Date 2015-03-08.02:36:13.909>
    creator = 'vstinner'
    dependencies = []
    files = ['38385', '38450', '38451', '38452', '38471']
    hgrepos = []
    issue_num = 23605
    keywords = ['patch']
    message_count = 34.0
    messages = ['237504', '237507', '237625', '237636', '237726', '237756', '237757', '237759', '237775', '237777', '237779', '237911', '237912', '237913', '237916', '237917', '237921', '237922', '237923', '237924', '238002', '238023', '238024', '238026', '238027', '238399', '238400', '239373', '239377', '239591', '239600', '239603', '239604', '239605']
    nosy_count = 5.0
    nosy_names = ['scott.dial', 'vstinner', 'benhoyt', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue23605'
    versions = ['Python 3.5']

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 8, 2015

    The PEP-471 announces a huge speed up of the os.walk() function, but os.walk() was not modified yet. I just merged the implementation of os.scandir() (issue bpo-22524), it's now time to work on os.walk().

    We need a patch and benchmarks on Linux and Windows.

    On NFS shares (on Linux), d_type is not filled (set to DT_UNKNOWN, so a syscall is required to get the file type). Benchmarks of os.walk() on a NFS share is also required.

    @vstinner vstinner added the performance Performance or resource usage label Mar 8, 2015
    @benhoyt
    Copy link
    Mannequin

    benhoyt mannequin commented Mar 8, 2015

    Attaching a first cut at this -- basically the implementation I use for walk() in scandir.py on GitHub.

    One thing that's really weird to me: are the os.walk() unit tests actually being run? In test_os.py, I notice everything's in WalkTest.setUp, which is kinda weird -- and it's not actually running. At least when I add a "1/0" inside setUp() to force an error and run "python -m test test_os" I don't get any errors...

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 9, 2015

    I reviewed your patch.

    @vstinner
    Copy link
    Member Author

    vstinner commented Mar 9, 2015

    Note: glob.glob() might be faster with os.scandir() on very large directories. But on my benchmarks, listdir() was always faster than scandir() when only the name of directory entries i used. Maybe we need an option glob.glob(pattern, scandir=True) :-p

    @scottdial
    Copy link
    Mannequin

    scottdial mannequin commented Mar 10, 2015

    I cloned https://github.com/benhoyt/scandir @ 494f34d784 and ran benchmark.py on a couple systems that are Linux backed by a couple different NFS servers of various quality.

    First, a Linux VM backed by a Mac OS X NFS server backed by a SSD:

    $ python benchmark.py
    Using slower ctypes version of scandir
    Comparing against builtin version of os.walk()
    Priming the system's cache...
    Benchmarking walks on benchtree, repeat 1/3...
    Benchmarking walks on benchtree, repeat 2/3...
    Benchmarking walks on benchtree, repeat 3/3...
    os.walk took 0.088s, scandir.walk took 0.084s -- 1.1x as fast
    $ python benchmark.py -s
    Using slower ctypes version of scandir
    Comparing against builtin version of os.walk()
    Priming the system's cache...
    Benchmarking walks on benchtree, repeat 1/3...
    Benchmarking walks on benchtree, repeat 2/3...
    Benchmarking walks on benchtree, repeat 3/3...
    os.walk size 23400, scandir.walk size 23400 -- equal
    os.walk took 0.142s, scandir.walk took 0.145s -- 1.0x as fast

    Second, a Linux VM backed by a Linux NFS server backed by a NAS with big, slow drives:

    $ python benchmark.py
    Using slower ctypes version of scandir
    Comparing against builtin version of os.walk()
    Priming the system's cache...
    Benchmarking walks on benchtree, repeat 1/3...
    Benchmarking walks on benchtree, repeat 2/3...
    Benchmarking walks on benchtree, repeat 3/3...
    os.walk took 0.071s, scandir.walk took 0.063s -- 1.1x as fast
    $ python benchmark.py -s
    Using slower ctypes version of scandir
    Comparing against builtin version of os.walk()
    Priming the system's cache...
    Benchmarking walks on benchtree, repeat 1/3...
    Benchmarking walks on benchtree, repeat 2/3...
    Benchmarking walks on benchtree, repeat 3/3...
    os.walk size 23400, scandir.walk size 23400 -- equal
    os.walk took 0.118s, scandir.walk took 0.141s -- 0.8x as fast

    Finally, a linux VM backed by a Linux NFS server backed by a NAS with small, fast SAS drives:

    $ python benchmark.py
    Using slower ctypes version of scandir
    Comparing against builtin version of os.walk()
    Priming the system's cache...
    Benchmarking walks on benchtree, repeat 1/3...
    Benchmarking walks on benchtree, repeat 2/3...
    Benchmarking walks on benchtree, repeat 3/3...
    os.walk took 0.159s, scandir.walk took 0.119s -- 1.3x as fast
    $ python benchmark.py -s
    Using slower ctypes version of scandir
    Comparing against builtin version of os.walk()
    Priming the system's cache...
    Benchmarking walks on benchtree, repeat 1/3...
    Benchmarking walks on benchtree, repeat 2/3...
    Benchmarking walks on benchtree, repeat 3/3...
    os.walk size 23400, scandir.walk size 23400 -- equal
    os.walk took 0.229s, scandir.walk took 0.232s -- 1.0x as fast

    A major factor that is not addressed above is that the performance is dramatically different if the metadata cache for the NFS mount is disabled, which is not the default. In the above data, the first system is normally configured in such a manner in order to ensure that the filesystem is coherent. The results of that test is much more dramatic:

    $ python benchmark.py
    Using slower ctypes version of scandir
    Comparing against builtin version of os.walk()
    Priming the system's cache...
    Benchmarking walks on benchtree, repeat 1/3...
    Benchmarking walks on benchtree, repeat 2/3...
    Benchmarking walks on benchtree, repeat 3/3...
    os.walk took 4.835s, scandir.walk took 0.097s -- 49.9x as fast
    $ python benchmark.py -s
    Using slower ctypes version of scandir
    Comparing against builtin version of os.walk()
    Priming the system's cache...
    Benchmarking walks on benchtree, repeat 1/3...
    Benchmarking walks on benchtree, repeat 2/3...
    Benchmarking walks on benchtree, repeat 3/3...
    os.walk size 23400, scandir.walk size 23400 -- equal
    os.walk took 9.945s, scandir.walk took 5.373s -- 1.9x as fast

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 10, 2015

    New changeset f805fdacdfe0 by Victor Stinner in branch 'default':
    Issue bpo-23605: os.walk() now calls os.scandir() instead of os.listdir().
    https://hg.python.org/cpython/rev/f805fdacdfe0

    @vstinner
    Copy link
    Member Author

    My suggestion to add a new walk_dirs list is wrong: os.walk() documentation explicitly says that the dirs list can be modified to delete some directories:
    https://docs.python.org/dev/library/os.html#os.walk
    """
    When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again.
    """

    os_walk_1.patch is inefficient: it also calls entry.is_symlink() for file entries.

    I reworked your patch to only call is_symlink() for directories.

    Thanks for the patch Ben. I think that we are now done with the PEP-471 no? Maybe some doc changes (I'm now reviewing your doc change in issue bpo-22524).

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 10, 2015

    New changeset 50e32059a404 by Victor Stinner in branch '3.4':
    Issue bpo-23605: os.walk() doc now mentions shutil.rmtree() in the last example
    https://hg.python.org/cpython/rev/50e32059a404

    @serhiy-storchaka
    Copy link
    Member

    When followlinks is true, symlinks is not needed.

    But I this commit breaks a code like following:

    def unsymlink(top):
        for root, dirs, files in os.walk(top):
            for name in dirs:
                path = os.path.join(root, name)
                if os.path.islink(path):
                    target = os.path.join(root, os.readlink(path))
                    os.unlink(path)
                    shutil.copytree(target, path)
            for name in files:
                path = os.path.join(root, name)
                if os.path.islink(path):
                    target = os.path.join(root, os.readlink(path))
                    os.unlink(path)
                    shutil.copy2(target, path)

    @vstinner
    Copy link
    Member Author

    Serhiy Storchaka added the comment:

    When followlinks is true, symlinks is not needed.

    Hum, it's not easy to understand your code. I guess that the problem
    is that a symlink to a directory can become something else (not a
    directory or a symlink to a directory).

    I noticed this race condition in the new implementation of os.walk(),
    but I don't think that the issue is really new. The old implementation
    of os.walk() already uses a list of directories. The caller can remove
    a directory or replace a directory with something else.

    Since the bug is not documented in os.walk(), I chose to not document
    it neither. But it would be better to warn users to not do that :-)

    @serhiy: I agree that the new implementation changes the behaviour,
    but I don't consider it as as bug. Do you think that it's a bug?

    @serhiy-storchaka
    Copy link
    Member

    My first note was about efficiency of the implementation. When followlinks is
    true, you can avoid testing entry.is_link() and creating the symlinks set.

    The new implementation of os.walk() changes a behavior. The problem is that a
    symlink to a directory can become a directory. It is documented that the
    caller can remove or add a directory to directory list and for sure this
    feature is used in third-party code. In my example a directory list is even
    not changed, but file system is changed.

    I think that majority of uses of os.walk() will be not affected by this change,
    but for sure there is a code that will be broken and we can receive bug
    reports after releasing 3.6.

    New behavior also is not consistent with os.fwalk().

    @benhoyt
    Copy link
    Mannequin

    benhoyt mannequin commented Mar 12, 2015

    @scott Dial: just a response about this benchmark: note that this benchmark isn't really valid, as it says "Using slower ctypes version of scandir", which is the slow all-Python version. You want it to be saying "Using Python 3.5's builtin os.scandir()".

    @benhoyt
    Copy link
    Mannequin

    benhoyt mannequin commented Mar 12, 2015

    To Victor and Serhiy:

    1. Serhiy's point about not needing to build the symlinks set when followlinks is True is a good one, because it'll never get used. Just a "if not followlinks: ..." around that try/except would do it. Though this is a small optimization, as I expect 95% of people use os.walk() with followlinks=False, which is the default.

    2. Much as I don't want to admit it :-), I think Serhiy has a point regarding the change in behaviour. Can we accept this tiny change? This happens because the previous implementation of os.walk() calls path.islink(), which does a real syscall, after the yield returns.

    So if the caller modifies "dirnames" and adds a symlink to a directory it won't be in the symlinks set. Or if (as Serhiy's example shows) they change a symlink-to-a-directory to a directory the new implementation doesn't do another syscall to check, so differs from the old one -- in Serhiy's example, the old os.walk() will recurse into the changed symlink-to-a-directory-that's-now-a-directory, whereas the new os.walk() won't recurse.

    Arguably the new behaviour is just as good here, but it is different. And Serhiy's function unsymlink() is a reasonable example of how this might play out.

    The os.walk() docs explicitly allow modifying "dirnames" -- not just removing and reordering, but also adding entries, which could surface the difference in behaviour: "the caller can modify the dirnames list in-place ... even to inform walk() about directories the caller creates or renames before it resumes walk() again". See here:

    https://docs.python.org/3.4/library/os.html#os.walk

    So I do think it's a real issue. Will this really be an issue in practice? I don't know; I'm tempted to think not.

    Obviously if we have to call stat/islink again, it defeats half of the purpose of scandir. But I don't see a way around it if we want to keep the exact old os.walk() behaviour. We could fix the behaviour if a directory/symlink was added to "dirnames" fairly easily by testing whether it changed, but I don't see a way to fix the unsymlinks() example without a syscall.

    Thoughts?

    @benhoyt
    Copy link
    Mannequin

    benhoyt mannequin commented Mar 12, 2015

    Note specifically in the unsymlink() example Serhiy gave, you probably don't *want* os.walk() to recurse into a symlink-to-a-directory-that's-changed-to-a-directory, and you probably haven't thought about it doing so, so maybe the new behaviour is fine?

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 12, 2015

    New changeset 1a972140ab62 by Victor Stinner in branch 'default':
    Issue bpo-23605: os.walk() doesn't need to call entry.is_symlink() if followlinks
    https://hg.python.org/cpython/rev/1a972140ab62

    @vstinner
    Copy link
    Member Author

    1. Serhiy's point about not needing to build the symlinks set when followlinks is True is a good one, because it'll never get used. Just a "if not followlinks: ..." around that try/except would do it. Though this is a small optimization, as I expect 95% of people use os.walk() with followlinks=False, which is the default.

    No, it's not a "small optimization". I/O are expensive, especially disk I/O. Try os.walk() on a NFS share with the cache disabled ;-)

    My first note was about efficiency of the implementation. When followlinks is
    true, you can avoid testing entry.is_link() and creating the symlinks set.

    Oh, I misunderstood your comment. The changeset 1a972140ab62 avoids calling entry.is_symlink() when it's not needed. Thanks for the report.

    --

    walk_added_symlink_to_dir.patch: Small modification to os.walk() to keep the performances of os.scandir() *and* still supported adding symlinks to dirs (no backward compatibility changes).

    Happy? :-)

    I started to work on an unit test, but I don't understand how WalkTests and FwalkTests are written.

    The setUp() method contais unit tests?! For example, setUp() has a unit test removing a directory from the dirs list to check os.walk() behaviour.

    I will try harder to understand how these tests are written and post a patch for tests later.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 12, 2015

    New changeset c06ebb57d4ed by Victor Stinner in branch '3.4':
    Issue bpo-23605: Refactor os.walk() tests to also run them on os.fwalk()
    https://hg.python.org/cpython/rev/c06ebb57d4ed

    @vstinner
    Copy link
    Member Author

    walk_added_symlink_to_dir-2.patch: Ok, here is a new patch, now with tests.

    Without the fix on os.walk(), WalkTests.test_add_dir_symlink() fails, whereas FwalkTests.test_add_dir_symlink() pass.

    @serhiy-storchaka
    Copy link
    Member

    Happy? :-)

    No, it doesn't fix my example. :-(

    I see following possibilities:

    1. Partially revert the patch and always call path.islink(). I will not kill all the benefit of using os.scandir(), because there is a benefit from avoiding path.isdir() and in any case in large directories most content is usually files, not dirs.

    2. Accept and document behavior change. This will break someone's code. :-(

    3. Left os.walk as was (or only partially optimized as in 1), and add a new function with new behavior.

    4. Add new boolean parameter that control behavior to os.walk().

    5. Try to detect when dir list or filesystem were changed. Victor's patch does the first. For the second we can check if top directory inode and mtime were changed. But I afraid this wouldn't decrease a number of system calls.

    @vstinner
    Copy link
    Member Author

    And now something completly differenet: fast_bottom-up.patch, fast bottom-up, but "slow" top-down.

    In bottom-up mode (topdown=False), use entry.path and entry.is_symlink() => optimization compared to Python 3.4 (avoid one call to os.stat()).

    In top-down mode (topdown=True, default), use os.path.join() and os.path.islink() => no change compared to Python 3.4 on this part. More globally, os.walk() should still be much faster in Python 3.5 than in Python 3.5 thanks to entry.is_dir() which avoids a call to os.path.isdir() (and so os.stat()).

    Correctness matters more than performances!

    1. Accept and document behavior change. This will break someone's code. :-(
    2. Left os.walk as was (or only partially optimized as in 1), and add a new function with new behavior.
    3. Add new boolean parameter that control behavior to os.walk().

    Sorry but all these options sucks :-p

    1. Try to detect when dir list or filesystem were changed. Victor's patch does the first. For the second we can check if top directory inode and mtime were changed. But I afraid this wouldn't decrease a number of system calls.

    By experience, it's very hard to write portable code. Even if we find a way to detect FS change on some platforms, it will not work on some other platforms.

    @benhoyt
    Copy link
    Mannequin

    benhoyt mannequin commented Mar 13, 2015

    Thanks, Victor.

    I haven't quite grokked all the changes here -- it's gotten somewhat more complicated with the scandir_it and manual next() call -- but I ran some benchmarks (via a hacked version of my scandir project's benchmark.py). The results were surprising, and in a good way:

    Dev version in hg (no extra islink syscall):
    --------------------------------------------
    Windows: 13.1x as fast (68.8x as fast in funky caching mode)
    Linux: 7.8x as fast

    With Victor's fast_bottom_up patch (100% correct behaviour):
    --------------------------------------------
    Windows: 9.4x as fast (50.2x as fast in funky caching mode)
    Linux: 6.5x as fast

    So os.walk() will still be 10x as fast on Windows if you apply this patch, and 6x as fast on my Linux VM. I haven't dug too deeply to know quite why the numbers are this good, especially on Linux, but that's what I'm seeing, which is great!

    @vstinner
    Copy link
    Member Author

    I don't understand your benchmark. Do you mean that os.walk() is slower with fast_bottom-up.patch because islink() is called or because I replaced "for entry in scandir(top):" with "entry = next(scandir_it)"?

    Are you testing the top-bottom or bottom-up?

    Here is a variant of my patch with "for entry in scandir(top):". I would prefer to avoid this variant with a boolean to not catch OSError on the recursive call to walk() if next() has similar performances.

    @serhiy-storchaka
    Copy link
    Member

    You fast_bottom-up looks awesome, but much more correct. This is what I meant when warned that correct implementations with scandir() will be complex.

    Could you please add a test based on my example (i.e. converting symlinks to a directory during walking) and may be other (creating new directory and adding it to the dirs list)?

    @benhoyt
    Copy link
    Mannequin

    benhoyt mannequin commented Mar 13, 2015

    I don't understand your benchmark. Do you mean that os.walk() is slower
    with fast_bottom-up.patch because islink() is called or because I replaced
    "for entry in scandir(top):" with "entry = next(scandir_it)"?

    No, sorry, I was making two separate comments: 1) the code's gotten quite a bit more complex (and if it needs to be that way for correctness, I'm okay with that), and 2) I'm surprised at how fast it still is.

    Are you testing the top-bottom or bottom-up?

    My benchmark.py calls os.walk() with topdown=True, which is the default. I was testing the Python 3.4 version of os.walk() via listdir against your fast_bottom-up.patch.

    I'm keen to look into this a bit further, but it won't be today.

    @vstinner
    Copy link
    Member Author

    > Are you testing the top-bottom or bottom-up?
    My benchmark.py calls os.walk() with topdown=True, which is the default.

    Is it worth to mention in the os.walk() doc that topdown=False can be faster?

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 18, 2015

    New changeset 4fb829f8c04d by Victor Stinner in branch 'default':
    Issue bpo-23605: Fix os.walk(topdown=True), don't cache entry.is_symlink() because
    https://hg.python.org/cpython/rev/4fb829f8c04d

    @vstinner
    Copy link
    Member Author

    I commited fast_bottom-up.patch to fix the regression of os.walk().

    Could you please add a test based on my example (i.e. converting symlinks to a directory during walking) and may be other (creating new directory and adding it to the dirs list)?

    Sorry, I'm not interested to write new tests.

    Is it worth to mention in the os.walk() doc that topdown=False can be faster?

    I didn't touch the doc yet.

    I keep the issue open a few weeks to see if someone is interested to optimize os.walk() further, write new tests, enhance the doc, etc.

    @benhoyt
    Copy link
    Mannequin

    benhoyt mannequin commented Mar 27, 2015

    Victor, great work on pushing this out, especially with the "modifying the directories" fix. (And thanks Serhiy for pushing on correctness here.)

    Couple of comments/questions about your new os.walk() implementation.

    1. The new implementation is more complex. Of course, most of this is necessary due to the topdown directory issue. However, one thing I'm not sure about is why you create scandir_it manually and use a while True loop, adding complexity and making it require two versions of the error handling. Why not a simple "for entry in scandir(top): ..." with a try/except OSError around the whole loop? I could well be missing something here though.

    2. In this commit http://bugs.python.org/review/23605/diff/14181/Lib/os.py -- which is not the final one, I don't quite understand the catch_oserror thing. Presumably this turned out to be unnecessary, as it's not in the final version?

    3. Really minor thing: in one of the comments, you misspell "symbolik". Should be "symbolic".

    @vstinner
    Copy link
    Member Author

    Ben Hoyt added the comment:

    1. The new implementation is more complex. Of course, most of this is
      necessary due to the topdown directory issue.

    Sure, correctness matters more than performances.

    However, one thing I'm not sure about is why you create scandir_it
    manually and use a while True loop,

    The idea is to control lines where OSError is catched to call onerrror(),
    without breaking backward compatibility. Especially, onerror() should only
    be called once, even for recursive calls.

    1. In this commit http://bugs.python.org/review/23605/diff/14181/Lib/os.py
      -- which is not the final one, I don't quite understand the catch_oserror
      thing.
      It's just a try to write differently the same thing. It didn't convince
      myself that it's more readable, so I picked the first version.

    What's your point about complexity? Would you like to drop os.scandir
    changes in os.walk(), or do you have a simpler version to propose?

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 30, 2015

    New changeset 1ad3d8d82b18 by Victor Stinner in branch 'default':
    Issue bpo-23605: Fix typo in an os.walk() comment
    https://hg.python.org/cpython/rev/1ad3d8d82b18

    @benhoyt
    Copy link
    Mannequin

    benhoyt mannequin commented Mar 30, 2015

    Thanks for the explanation (and the comment fix).

    What's your point about complexity? Would you like to drop os.scandir
    changes in os.walk(), or do you have a simpler version to propose?

    No, not at all! I was just noting it and trying to brainstorm any ways to simplify it (while keeping the current behavior). But I'm not sure there is a good simplification, so that's fine.

    @vstinner
    Copy link
    Member Author

    Ben: do you think that we are done with this issue? can I close it?

    @benhoyt
    Copy link
    Mannequin

    benhoyt mannequin commented Mar 30, 2015

    Yep, I'm good -- go ahead and close!

    @vstinner
    Copy link
    Member Author

    Thanks for your great work Ben! I close the issue. The PEP-471 has already the Final status.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants