classification
Title: Recursive directory list with pathlib.Path.iterdir
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Epic_Wink, SilentGhost, p-ganssle, pitrou, steve.dower, xtreak
Priority: normal Keywords: patch

Created on 2019-04-11 10:59 by Epic_Wink, last changed 2019-06-05 00:36 by Epic_Wink.

Pull Requests
URL Status Linked Edit
PR 12785 open python-dev, 2019-04-11 11:12
Messages (12)
msg339959 - (view) Author: Laurie Opperman (Epic_Wink) * Date: 2019-04-11 10:59
Currently, 'pathlib.Path.iterdir' can only list the contents of the instance directory. It is common to also want the contents of subdirectories recursively.

The proposal is for 'pathlib.Path.iterdir' to have an argument 'recursive' which when 'True' will cause 'iterdir' to yield contents of subdirectories recursively.

This would be trivial to implement as 'iterdir' can simply yield from subdirectories' 'iterdir'.

A decision would have to be made whether to continue to yield the subdirectories, or skip them. Another decision would be for whether each path should be resolved before checking if it is a directory to be recursed into.
msg339969 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2019-04-11 11:47
Is the behaviour you're proposing any different from using Path.rglob('*')?
msg339997 - (view) Author: Paul Ganssle (p-ganssle) * (Python committer) Date: 2019-04-11 22:10
> Is the behaviour you're proposing any different from using Path.rglob('*')?

I believe `rglob("*")` is eager, while `iterdir` is lazy.

@Epic_Wink:
> This would be trivial to implement as 'iterdir' can simply yield from subdirectories' 'iterdir'.

One thing you may need to worry about here is the fact that symlinks can have cycles, so you may need to do some cycle detection to avoid creating the dangerous possibility of infinite loops.

There's also the question of whether you want this to be a depth-first or breadth-first traversal, and whether you would want both of these to be options.
msg340003 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2019-04-12 02:45
> I believe `rglob("*")` is eager, while `iterdir` is lazy.

rglob and glob also return a generator.

Slightly related, pathlib.walk was proposed in the past in python-ideas : https://mail.python.org/pipermail/python-ideas/2017-April/045398.html
msg340012 - (view) Author: Laurie Opperman (Epic_Wink) * Date: 2019-04-12 08:12
> Is the behaviour you're proposing any different from using `Path.rglob('*')`?

By that logic, we should remove `Path.iterdir()` in favour of `Path.glob('*')`. In addition, having `iterdir` the way it is makes it easy for subclasses to extend its functionality (for example, one of my `Path` subclasses allows a callable to be passed to the `iterdir` which can filter paths)

> One thing you may need to worry about here is the fact that symlinks can have cycles, so you may need to do some cycle detection to avoid creating the dangerous possibility of infinite loops.

I agree, which is the main reason the current implementation in the pull-request is to not resolve symlinks: users can subclass and implement symlink resolving if they want

> There's also the question of whether you want this to be a depth-first or breadth-first traversal, and whether you would want both of these to be options.

As much as I want to say that I don't see a use-case for breadth-first file listing (when I list files, I expect the next file provided to be 'next to' the current file), users currently have no standard-library functionality to perform breadth-first searches as far as I know: they'd have to implement it themself or find it in a third-party library

> Slightly related, pathlib.walk was proposed in the past in python-ideas...

I've never really liked the interface to `walk`, personal preference
msg340042 - (view) Author: Paul Ganssle (p-ganssle) * (Python committer) Date: 2019-04-12 13:35
> rglob and glob also return a generator.

My mistake, I didn't notice the `sorted` in the `rglob` documentation and thought it was emitting a list.

> By that logic, we should remove `Path.iterdir()` in favour of `Path.glob('*')`.

What *is* the case for why iterdir() is justified when `Path.glob('*')` exists? Is it just discoverability? Is there some efficiency reason to do it?

Of course, removing things (which can break existing code) and failing to add them (which cannot) have two different thresholds for when they can take place, so even if we decide "iterdir() is to glob('*') as iterdir(recursive=True) is to rglob('*')", that doesn't mean that we should remove iterdir() entirely if recursive=True is not added.

> I agree, which is the main reason the current implementation in the pull-request is to not resolve symlinks: users can subclass and implement symlink resolving if they want

I don't see that on the implementation here, but we can discuss this on the PR itself. I do think that skipping *all* symlinks automatically with no option to follow them will be counter-intuitive for people.

> I've never really liked the interface to `walk`, personal preference

I kinda agree about the interface to `walk`, but it is worth noting that as we've seen in this thread, there are a bunch of plausible and slightly different ways to walk a directory: breadth-first or depth-first? following symlinks, following symlinks with cycle detection, or not following symlinks at all? emit the directory itself, or only emit its contents? It's worth taking into account that having two completely different complicated interfaces for recursively walking directories would be a usability challenge.
msg340072 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-04-12 15:51
Having spent more time than I'm proud of recursing through directories, I'd be happy enough with a convenience function that has sensible defaults.

If I want breadth-first recursion (and I often do), I'll write it myself. I have a slight preference for getting all files in a directory before going deeper (which is not what the PR does), and I think that's most consistent with the current behaviour.

I don't spend enough time dealing with symlinks to have strong opinions there, but given we have ways to resolve symlinks but not to get back to the original name (and I *have* had to deal with issues where I've needed to find the original name from the target :roll-eyes:) I'd say don't resolve anything eagerly.

If there's an easy and well-known algorithm for detecting infinite symlink recursion (e.g. resolve and check if it's a parent of itself) then do that and skip it, but don't return the targets.
msg340135 - (view) Author: Paul Ganssle (p-ganssle) * (Python committer) Date: 2019-04-13 00:13
> I don't spend enough time dealing with symlinks to have strong opinions there, but given we have ways to resolve symlinks but not to get back to the original name (and I *have* had to deal with issues where I've needed to find the original name from the target :roll-eyes:) I'd say don't resolve anything eagerly.

You mean treating symlinks to directories like files? I suppose that's a possibility, but I do think it will end up being a source of bugs around symlinking.

Admittedly, it *is* apparently what rglob('*') does (just tested it - apparently it won't follow symlinks to directories), though I think it might be a better interface to try to break cycles rather than not follow symlinks (particularly since `iterdir` currently treats symlinks to directories as directories).
msg340144 - (view) Author: Laurie Opperman (Epic_Wink) * Date: 2019-04-13 03:10
Having `iterdir(recursive=True)` recurse into symlinks to directories would mean we would either not yield those symlinks, or we yield those symlinks and all other directories.

I feel like not yielding directories is the way to go, but it's easy enough to check if a yielded path is a directory in application code.

The current implementation of using recursion to list subdirectory contents doesn't seem to allow for the obvious implementation of symlink cycle-detection: keeping track of which (real) directories have been listed.

PS: I've updated the pull-request to not follow symlinks to directories. This is not a final decision, but just updating to be in line with what I've implied up to this point
msg341059 - (view) Author: Laurie Opperman (Epic_Wink) * Date: 2019-04-29 09:07
I've updated the pull-request to list directories pointed to by listed symbolic links, preventing cyclic listing. An extra instance method `pathlib.Path._iterdir_recursive` was added for this.

We still need to decide whether to yield the directories themselves rather than just the files under those directories. Very easy to implement this.

One thing I just realised is that a symlink can point to a subdirectory further down the chain. The current implementation will list the files under that directory using the symlink as prefix rather than the full path. Shouldn't matter though.
msg342007 - (view) Author: Laurie Opperman (Epic_Wink) * Date: 2019-05-09 23:13
Would this change also have to copy implemented in the new ZipFile Pathlib API? https://bugs.python.org/issue36832
msg342687 - (view) Author: Laurie Opperman (Epic_Wink) * Date: 2019-05-17 03:44
I think I may have broken bedevere-bot by request change reviews before the PR was assigned...
History
Date User Action Args
2019-06-05 00:36:30Epic_Winksetversions: + Python 3.9, - Python 3.8
2019-05-17 03:44:07Epic_Winksetmessages: + msg342687
2019-05-09 23:13:11Epic_Winksetmessages: + msg342007
2019-04-29 09:07:16Epic_Winksetmessages: + msg341059
2019-04-13 03:10:34Epic_Winksetmessages: + msg340144
2019-04-13 00:13:09p-gansslesetmessages: + msg340135
2019-04-12 15:51:21steve.dowersetnosy: + steve.dower
messages: + msg340072
2019-04-12 13:35:19p-gansslesetmessages: + msg340042
2019-04-12 08:12:49Epic_Winksetmessages: + msg340012
2019-04-12 02:45:44xtreaksetnosy: + xtreak
messages: + msg340003
2019-04-11 22:10:45p-gansslesetnosy: + p-ganssle
messages: + msg339997
2019-04-11 11:47:11SilentGhostsetnosy: + SilentGhost
messages: + msg339969
2019-04-11 11:17:23xtreaksetnosy: + pitrou
2019-04-11 11:12:24python-devsetkeywords: + patch
stage: patch review
pull_requests: + pull_request12712
2019-04-11 10:59:34Epic_Winkcreate