Issue 13229: Improve tools for iterating over filesystem directories

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/57438

classification

Title:	Improve tools for iterating over filesystem directories
Type:	enhancement	Stage:	test needed
Components:	Library (Lib)	Versions:	Python 3.4

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	eric.araujo, giampaolo.rodola, ncoghlan, neologix, pitrou, vstinner
Priority:	normal	Keywords:

Created on 2011-10-20 02:05 by ncoghlan, last changed 2022-04-11 14:57 by admin.

Messages (19)
msg145999 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2011-10-20 02:05
I needed a depth-limited, filtered search of a directory tree recently and came up with the following wrapper around os.walk that brings in a few niceties like glob-style filtering, depth limiting and symlink traversal that is safe from infinite loops. It also emits a named tuple rather than the bare tuple emitted by os.walk: http://code.activestate.com/recipes/577913-selective-directory-walking/ I think this would make a nice addition to 3.3 as shutil.filter_walk, but it need tests, documentation and reformatting as a patch before it can go anywhere.
msg147093 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2011-11-05 16:07
During the discussion about adding a chowntree function to shutil (http://mail.python.org/pipermail/python-dev/2011-May/111661.html and ), Victor suggested this: > I don't like the idea of a recursive flag. I would prefer a "map-like" function to "apply" a > function on all files of a directory. Something like shutil.apply_recursive(shutil.chown)... > ... maybe with options to choose between deep-first search and breadth-first search, filter > (filenames, file size, files only, directories only, other attributes?), directory before > files (may be need for chmod(0o000)), etc. Such a function removes the need for copytee, rmtree, dosomethingelsetree, etc. When I first read this feature request I thought it was the same idea, but after reading it again I see that you propose a function that walks and returns, not a function that walks and applies a callback. Both use cases are important, so an apply_tree function could be implemented on top of your filter_walk (I’ll open another report if we reach agreement on the idea of these two functions working together).
msg147094 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2011-11-05 16:08
s/and /and #13033/
msg147128 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2011-11-06 00:43
I should probably update that posted recipe to my latest version (which adds "excluded_files" and "excluded_dirs" parameters). However, since I've been dealing with remote filesystems where os.listdir() and os.stat() calls from the local machine aren't possible lately, I also think we may need to reconsider how this is structured and look at the idea of building a more effective pipeline model that permits more efficient modes of interaction. Let's take 'os.walk' as the base primitive - the basis of the pipeline will always be an iterator that produces 3-tuples of a base name, a list of subdirectories and a list of files. The filtering pipeline elements will require that the underlying walk include "topdown=True" and pay attention to changes in the subdirectory list. Then consider the following possible pipeline elements: def filter_dirs(walk_iter, include_filters, exclude_filters=()): def should_include(dirname): return any(fnmatch(dirname, include) for include in include_filters) def should_exclude(dirname): return any(fnmatch(dirname, include) for exclude in exclude_filters) for dirpath, subdirs, files in walk_iter: subdirs[:] = [subdir for subdir in subdirs if should_include(subdir) and not should_exclude(subdir)] yield dirpath, subdirs, files def filter_files(walk_iter, include_filters, exclude_filters=()): def should_include(dirname): return any(fnmatch(dirname, include) for include in include_filters) def should_exclude(dirname): return any(fnmatch(dirname, include) for exclude in exclude_filters) for dirpath, subdirs, files in walk_iter: files[:] = [fname for fname in files if should_include(fname) and not should_exclude(fname)] yield dirpath, subdirs, files def limit_depth(walk_iter, depth): if depth < 0: msg = "Depth limit greater than 0 ({!r} provided)" raise ValueError(msg.format(depth)) sep = os.sep for top, subdirs, files in walk_iter: yield top, subdirs, files initial_depth = top.count(sep) if depth == 0: subdirs[:] = [] break for dirpath, subdirs, files in walk_iter: yield dirpath, subdirs, files current_depth = dirpath.count(sep) - initial_depth if current_depth >= depth: subdirs[:] = [] def detect_symlink_loops(walk_iter, onloop=None): if onloop is None: def onloop(path): msg = "Symlink {!r} refers to a parent directory, skipping\n" sys.stderr.write(msg.format(path)) sys.stderr.flush() for top, subdirs, files in walk_iter: yield top, subdirs, files real_top = os.path.abspath(os.path.realpath(top)) break for dirpath, subdirs, files in walk_iter: if os.path.islink(dirpath): # We just descended into a directory via a symbolic link # Check if we're referring to a directory that is # a parent of our nominal directory relative = os.path.relpath(dirpath, top) nominal_path = os.path.join(real_top, relative) real_path = os.path.abspath(os.path.realpath(dirpath)) path_fragments = zip(nominal_path.split(sep), real_path.split(sep)) for nominal, real in path_fragments: if nominal != real: break else: if not onloop(dirpath): subdirs[:] = [] continue yield dirpath, subdirs, files And pipeline terminators: def walk_dirs(walk_iter): for dirpath, subdirs, files in walk_iter: yield dirpath def walk_files(walk_iter): for dirpath, subdirs, files in walk_iter: for fname in files: yield os.path.join(dirpath, fname) def walk_all(walk_iter): for dirpath, subdirs, files in walk_iter: yield dirpath for fname in files: yield os.path.join(dirpath, fname) The pipeline terminators could then be combined with ordinary iterable consumers like comprehensions: base_walk = detect_symlink_loops(os.walk(os.path.abspath(base_dir, followlinks=True))) depth_limited_walk = limit_depth(base_walk, 2) filtered_walk = filter_dirs(filter_files(depth_limited_walk, ".py"), ".pyp") tree_info = {path, os.stat(path) for path in walk_all(filtered_walk)}
msg147129 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2011-11-06 00:46
This needs more thought - pypi package coming soon :)
msg147131 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2011-11-06 01:11
Nick, perhaps you want to have a look at http://hg.python.org/features/pathlib/ (it doesn't have a filter_walk equivalent but it could grow one :-))
msg147132 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2011-11-06 01:25
That's one of the nicer attempts I've seen at an object-oriented path library, but I have a core problem with OOP path APIs, and it relates to the Unicode encoding/decoding problem: the ultimate purpose of path objects is almost always to either pass them to the OS, or else to another application. That exchange will almost always happen as a plain string. So when your approach is based on a more sophisticated type internally, you end up having to be very careful about all of your system boundaries, making sure that "paths" are correctly being turned into "Paths". However, one of my hopes for iterwalk will be that it won't care if the underlying walk iterator produces Path objects instead of ordinary strings, so long as those objects can be passed to fnmatch, os, etc and work correctly.
msg147135 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2011-11-06 03:01
Initial version available at: https://bitbucket.org/ncoghlan/iterwalk/src I'll get it published to PyPI once the test suite is in a slightly better state (no filesystem based tests as yet).
msg147335 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2011-11-09 07:24
A related idea is updating os.walk tuples to be a custom object offering a namedtuple style API (similar to sys.float_info). I created a separate issue for that: #13375 I've also updated the title of this issue to describe the problem it aims to address rather than the specific solution that first occurred to me.
msg147389 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2011-11-10 02:40
I changed the package name to walkdir: https://bitbucket.org/ncoghlan/walkdir/overview
msg147580 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2011-11-13 23:27
And walkdir is now a published package: http://walkdir.readthedocs.org My plan for this issue now is to maintain walkdir as a standalone package for 2.7 and 3.2, but still add the functionality to shutil for 3.3+. However, I'll gather feedback on the PyPI module for a while before making any changes to the 3.3 repo.
msg152379 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2012-01-31 12:53
With the release of 0.3, I'm pretty happy with the WalkDir design (previous versions coerced the output values to ordinary 3-tuples, now it will pass along whatever the underlying iterable produces with changing the type at all).
msg152381 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-01-31 12:57
(speaking of which, I'd just mention I've published pathlib as a standalone package: http://pypi.python.org/pypi/pathlib/ )
msg153294 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-02-13 20:12
Not sure this is the right place to mention this, but I'm realizing handling symlink loops will be interesting. It is "interesting" because, when e.g. you are globbing, your glob's results may not include the symlinks' target path; so there are cases where you need to keep the symlink in the results. (and in the general case you may or may not want to keep some duplicates, depending on what you do with the links) I am adding rglob() to pathlib and currently ignoring (i.e. crashing :-)) symlink loops. I would be interested in any ideas as to how handle them correctly.
msg153295 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-02-13 20:16
Oh, forgot to mention, the term "symlink loop" itself is ambiguous. There are direct symlink loops: an example is a "some_dir/linky" link pointing to "../some_dir/linky". These will fail when resolving them. There are indirect symlink loops: "some_dir/linky" pointing to "../some_dir". Resolving them works fine, but recursively walking them produces an infinite recursion. Lots of fun :)
msg153316 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2012-02-14 03:22
WalkDir attempts to handle symlink loops, but the test suite doesn't currently ensure that that handling works as expected (I did some initial manual tests and haven't updated it since, though). It's... not trivial: https://bitbucket.org/ncoghlan/walkdir/src/1b239056e9d5/walkdir.py#cl-163
msg179290 - (view)	Author: Charles-François Natali (neologix) *	Date: 2013-01-07 22:35
Nick, I think this would be a great addition (I have often seen people trying to implement this in their own code, and I certainly did it myself). What's the status of walkdir, do you think it's mature enough to be merged?
msg179293 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-01-07 22:44
pathlib and walkdir are two nice piece of code, but do we need two modules? It would be nice to merge them into one unique module :-) (Or is it a stupid idea?)
msg179303 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-01-08 01:55
The problem with the current walkdir implementation is that without a rich path object you end up making a lot of redundant system calls. Combined with the inherent overhead of nested generators, it just feels bad (even in situations where worrying about the speed is truly a case of premature optimisation) The caching in pathlib should deal with the problem of redundant system calls, so rebasing walkdir on top of that would probably be a good idea (i.e. all walkdir APIs would produce pathlib paths, and implicit convert strings they encounter to paths).

History
Date	User	Action	Args
2022-04-11 14:57:22	admin	set	github: 57438
2013-03-06 20:01:03	gregory.p.smith	set	versions: + Python 3.4, - Python 3.3
2013-01-08 01:55:56	ncoghlan	set	messages: + msg179303
2013-01-07 22:44:06	vstinner	set	messages: + msg179293
2013-01-07 22:35:13	neologix	set	messages: + msg179290
2012-02-14 03:22:09	ncoghlan	set	messages: + msg153316
2012-02-13 20:16:35	pitrou	set	messages: + msg153295
2012-02-13 20:12:57	pitrou	set	nosy: + neologix messages: + msg153294
2012-01-31 12:57:51	pitrou	set	messages: + msg152381
2012-01-31 12:53:57	ncoghlan	set	messages: + msg152379
2011-12-16 12:44:35	giampaolo.rodola	set	nosy: + giampaolo.rodola
2011-11-13 23:27:13	ncoghlan	set	messages: + msg147580
2011-11-10 02:40:33	ncoghlan	set	messages: + msg147389
2011-11-09 07:24:24	ncoghlan	set	messages: + msg147335 title: Add shutil.filter_walk -> Improve tools for iterating over filesystem directories
2011-11-06 03:01:11	ncoghlan	set	messages: + msg147135
2011-11-06 01:25:02	ncoghlan	set	messages: + msg147132
2011-11-06 01:11:59	pitrou	set	nosy: + pitrou messages: + msg147131
2011-11-06 00:46:21	ncoghlan	set	messages: + msg147129
2011-11-06 00:43:33	ncoghlan	set	messages: + msg147128
2011-11-05 16:08:22	eric.araujo	set	messages: + msg147094
2011-11-05 16:07:58	eric.araujo	set	nosy: + vstinner messages: + msg147093
2011-10-21 13:52:40	eric.araujo	set	nosy: + eric.araujo
2011-10-20 02:05:30	ncoghlan	create