classification
Title: Improve tools for iterating over filesystem directories
Type: enhancement Stage: test needed
Components: Library (Lib) Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: eric.araujo, giampaolo.rodola, ncoghlan, neologix, pitrou, vstinner
Priority: normal Keywords:

Created on 2011-10-20 02:05 by ncoghlan, last changed 2013-03-06 20:01 by gregory.p.smith.

Messages (19)
msg145999 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-10-20 02:05
I needed a depth-limited, filtered search of a directory tree recently and came up with the following wrapper around os.walk that brings in a few niceties like glob-style filtering, depth limiting and symlink traversal that is safe from infinite loops. It also emits a named tuple rather than the bare tuple emitted by os.walk:

http://code.activestate.com/recipes/577913-selective-directory-walking/

I think this would make a nice addition to 3.3 as shutil.filter_walk, but it need tests, documentation and reformatting as a patch before it can go anywhere.
msg147093 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-11-05 16:07
During the discussion about adding a chowntree function to shutil (http://mail.python.org/pipermail/python-dev/2011-May/111661.html and ), Victor suggested this:

> I don't like the idea of a recursive flag. I would prefer a "map-like" function to "apply" a
> function on all files of a directory. Something like shutil.apply_recursive(shutil.chown)...
> ... maybe with options to choose between deep-first search and breadth-first search, filter
> (filenames, file size, files only, directories only, other attributes?), directory before
> files (may be need for chmod(0o000)), etc.

Such a function removes the need for copytee, rmtree, dosomethingelsetree, etc.  When I first read this feature request I thought it was the same idea, but after reading it again I see that you propose a function that walks and returns, not a function that walks and applies a callback.  Both use cases are important, so an apply_tree function could be implemented on top of your filter_walk (I’ll open another report if we reach agreement on the idea of these two functions working together).
msg147094 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-11-05 16:08
s/and /and #13033/
msg147128 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-11-06 00:43
I should probably update that posted recipe to my latest version (which adds "excluded_files" and "excluded_dirs" parameters).

However, since I've been dealing with remote filesystems where os.listdir() and os.stat() calls from the local machine aren't possible lately, I also think we may need to reconsider how this is structured and look at the idea of building a more effective pipeline model that permits more efficient modes of interaction.

Let's take 'os.walk' as the base primitive - the basis of the pipeline will always be an iterator that produces 3-tuples of a base name, a list of subdirectories and a list of files. The filtering pipeline elements will require that the underlying walk include "topdown=True" and pay attention to changes in the subdirectory list.


Then consider the following possible pipeline elements:

def filter_dirs(walk_iter, *include_filters, exclude_filters=()):
    def should_include(dirname):
       return any(fnmatch(dirname, include) for include in include_filters)
    def should_exclude(dirname):
       return any(fnmatch(dirname, include) for exclude in exclude_filters)
    for dirpath, subdirs, files in walk_iter:
        subdirs[:] = [subdir for subdir in subdirs
                         if should_include(subdir) and not should_exclude(subdir)]
        yield dirpath, subdirs, files

def filter_files(walk_iter, *include_filters, exclude_filters=()):
    def should_include(dirname):
       return any(fnmatch(dirname, include) for include in include_filters)
    def should_exclude(dirname):
       return any(fnmatch(dirname, include) for exclude in exclude_filters)
    for dirpath, subdirs, files in walk_iter:
        files[:] = [fname for fname in files
                         if should_include(fname) and not should_exclude(fname)]
        yield dirpath, subdirs, files

def limit_depth(walk_iter, depth):
    if depth < 0:
        msg = "Depth limit greater than 0 ({!r} provided)"
        raise ValueError(msg.format(depth))
    sep = os.sep
    for top, subdirs, files in walk_iter:
        yield top, subdirs, files
        initial_depth = top.count(sep)
        if depth == 0:
            subdirs[:] = []
        break
    for dirpath, subdirs, files in walk_iter:
        yield dirpath, subdirs, files
        current_depth = dirpath.count(sep) - initial_depth
        if current_depth >= depth:
            subdirs[:] = []

def detect_symlink_loops(walk_iter, onloop=None):
    if onloop is None:
        def onloop(path):
            msg = "Symlink {!r} refers to a parent directory, skipping\n"
            sys.stderr.write(msg.format(path))
            sys.stderr.flush()
    for top, subdirs, files in walk_iter:
        yield top, subdirs, files
        real_top = os.path.abspath(os.path.realpath(top))
        break
    for dirpath, subdirs, files in walk_iter:
        if os.path.islink(dirpath):
            # We just descended into a directory via a symbolic link
            # Check if we're referring to a directory that is
            # a parent of our nominal directory
            relative = os.path.relpath(dirpath, top)
            nominal_path = os.path.join(real_top, relative)
            real_path = os.path.abspath(os.path.realpath(dirpath))
            path_fragments = zip(nominal_path.split(sep), real_path.split(sep))
            for nominal, real in path_fragments:
                if nominal != real:
                    break
            else:
                if not onloop(dirpath):
                    subdirs[:] = []
                    continue
        yield dirpath, subdirs, files

And pipeline terminators:

def walk_dirs(walk_iter):
    for dirpath, subdirs, files in walk_iter:
        yield dirpath

def walk_files(walk_iter):
    for dirpath, subdirs, files in walk_iter:
        for fname in files:
            yield os.path.join(dirpath, fname)

def walk_all(walk_iter):
    for dirpath, subdirs, files in walk_iter:
        yield dirpath
        for fname in files:
            yield os.path.join(dirpath, fname)

The pipeline terminators could then be combined with ordinary iterable consumers like comprehensions:

    base_walk = detect_symlink_loops(os.walk(os.path.abspath(base_dir, followlinks=True)))
    depth_limited_walk = limit_depth(base_walk, 2)
    filtered_walk = filter_dirs(filter_files(depth_limited_walk, "*.py"), "*.pyp")
    tree_info = {path, os.stat(path) for path in walk_all(filtered_walk)}
msg147129 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-11-06 00:46
This needs more thought - pypi package coming soon :)
msg147131 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-06 01:11
Nick, perhaps you want to have a look at http://hg.python.org/features/pathlib/
(it doesn't have a filter_walk equivalent but it could grow one :-))
msg147132 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-11-06 01:25
That's one of the nicer attempts I've seen at an object-oriented path library, but I have a core problem with OOP path APIs, and it relates to the Unicode encoding/decoding problem: the ultimate purpose of path objects is almost always to either pass them to the OS, or else to another application. That exchange will almost *always* happen as a plain string.

So when your approach is based on a more sophisticated type internally, you end up having to be very careful about all of your system boundaries, making sure that "paths" are correctly being turned into "Paths".

However, one of my hopes for iterwalk will be that it *won't* care if the underlying walk iterator produces Path objects instead of ordinary strings, so long as those objects can be passed to fnmatch, os, etc and work correctly.
msg147135 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-11-06 03:01
Initial version available at: https://bitbucket.org/ncoghlan/iterwalk/src

I'll get it published to PyPI once the test suite is in a slightly better state (no filesystem based tests as yet).
msg147335 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-11-09 07:24
A related idea is updating os.walk tuples to be a custom object offering a namedtuple style API (similar to sys.float_info).

I created a separate issue for that: #13375

I've also updated the title of this issue to describe the problem it aims to address rather than the specific solution that first occurred to me.
msg147389 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-11-10 02:40
I changed the package name to walkdir: https://bitbucket.org/ncoghlan/walkdir/overview
msg147580 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2011-11-13 23:27
And walkdir is now a published package: http://walkdir.readthedocs.org

My plan for this issue now is to maintain walkdir as a standalone package for 2.7 and 3.2, but still add the functionality to shutil for 3.3+.

However, I'll gather feedback on the PyPI module for a while before making any changes to the 3.3 repo.
msg152379 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-01-31 12:53
With the release of 0.3, I'm pretty happy with the WalkDir design (previous versions coerced the output values to ordinary 3-tuples, now it will pass along whatever the underlying iterable produces with changing the type at all).
msg152381 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-31 12:57
(speaking of which, I'd just mention I've published pathlib as a standalone package: http://pypi.python.org/pypi/pathlib/ )
msg153294 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-13 20:12
Not sure this is the right place to mention this, but I'm realizing handling symlink loops will be interesting. It is "interesting" because, when e.g. you are globbing, your glob's results may not include the symlinks' target path; so there are cases where you need to keep the symlink in the results.
(and in the general case you may or may not want to keep some duplicates, depending on what you do with the links)

I am adding rglob() to pathlib and currently ignoring (i.e. crashing :-)) symlink loops. I would be interested in any ideas as to how handle them correctly.
msg153295 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-13 20:16
Oh, forgot to mention, the term "symlink loop" itself is ambiguous.

There are direct symlink loops: an example is a "some_dir/linky" link pointing to "../some_dir/linky". These will fail when resolving them.

There are indirect symlink loops: "some_dir/linky" pointing to "../some_dir". Resolving them works fine, but recursively walking them produces an infinite recursion.

Lots of fun :)
msg153316 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-02-14 03:22
WalkDir attempts to handle symlink loops, but the test suite doesn't currently ensure that that handling works as expected (I did some initial manual tests and haven't updated it since, though).

It's... not trivial: https://bitbucket.org/ncoghlan/walkdir/src/1b239056e9d5/walkdir.py#cl-163
msg179290 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-01-07 22:35
Nick, I think this would be a great addition (I have often seen people trying to implement this in their own code, and I certainly did it myself).
What's the status of walkdir, do you think it's mature enough to be merged?
msg179293 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-01-07 22:44
pathlib and walkdir are two nice piece of code, but do we need two modules? It would be nice to merge them into one unique module :-) (Or is it a stupid idea?)
msg179303 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-01-08 01:55
The problem with the current walkdir implementation is that without a rich path object you end up making a lot of redundant system calls. Combined with the inherent overhead of nested generators, it just *feels* bad (even in situations where worrying about the speed is truly a case of premature optimisation)

The caching in pathlib should deal with the problem of redundant system calls, so rebasing walkdir on top of that would probably be a good idea (i.e. all walkdir APIs would produce pathlib paths, and implicit convert strings they encounter to paths).
History
Date User Action Args
2013-03-06 20:01:03gregory.p.smithsetversions: + Python 3.4, - Python 3.3
2013-01-08 01:55:56ncoghlansetmessages: + msg179303
2013-01-07 22:44:06vstinnersetmessages: + msg179293
2013-01-07 22:35:13neologixsetmessages: + msg179290
2012-02-14 03:22:09ncoghlansetmessages: + msg153316
2012-02-13 20:16:35pitrousetmessages: + msg153295
2012-02-13 20:12:57pitrousetnosy: + neologix
messages: + msg153294
2012-01-31 12:57:51pitrousetmessages: + msg152381
2012-01-31 12:53:57ncoghlansetmessages: + msg152379
2011-12-16 12:44:35giampaolo.rodolasetnosy: + giampaolo.rodola
2011-11-13 23:27:13ncoghlansetmessages: + msg147580
2011-11-10 02:40:33ncoghlansetmessages: + msg147389
2011-11-09 07:24:24ncoghlansetmessages: + msg147335
title: Add shutil.filter_walk -> Improve tools for iterating over filesystem directories
2011-11-06 03:01:11ncoghlansetmessages: + msg147135
2011-11-06 01:25:02ncoghlansetmessages: + msg147132
2011-11-06 01:11:59pitrousetnosy: + pitrou
messages: + msg147131
2011-11-06 00:46:21ncoghlansetmessages: + msg147129
2011-11-06 00:43:33ncoghlansetmessages: + msg147128
2011-11-05 16:08:22eric.araujosetmessages: + msg147094
2011-11-05 16:07:58eric.araujosetnosy: + vstinner
messages: + msg147093
2011-10-21 13:52:40eric.araujosetnosy: + eric.araujo
2011-10-20 02:05:30ncoghlancreate