Issue 36602: Recursive directory list with pathlib.Path.iterdir

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/80783

classification

Title:	Recursive directory list with pathlib.Path.iterdir
Type:	enhancement	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Epic_Wink, SilentGhost, p-ganssle, pitrou, steve.dower, xtreak
Priority:	normal	Keywords:	patch

Created on 2019-04-11 10:59 by Epic_Wink, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 12785	open	python-dev, 2019-04-11 11:12

Messages (12)
msg339959 - (view)	Author: Laurie Opperman (Epic_Wink) *	Date: 2019-04-11 10:59
Currently, 'pathlib.Path.iterdir' can only list the contents of the instance directory. It is common to also want the contents of subdirectories recursively. The proposal is for 'pathlib.Path.iterdir' to have an argument 'recursive' which when 'True' will cause 'iterdir' to yield contents of subdirectories recursively. This would be trivial to implement as 'iterdir' can simply yield from subdirectories' 'iterdir'. A decision would have to be made whether to continue to yield the subdirectories, or skip them. Another decision would be for whether each path should be resolved before checking if it is a directory to be recursed into.
msg339969 - (view)	Author: SilentGhost (SilentGhost) *	Date: 2019-04-11 11:47
Is the behaviour you're proposing any different from using Path.rglob('*')?
msg339997 - (view)	Author: Paul Ganssle (p-ganssle) *	Date: 2019-04-11 22:10
> Is the behaviour you're proposing any different from using Path.rglob('')? I believe `rglob("")` is eager, while `iterdir` is lazy. @Epic_Wink: > This would be trivial to implement as 'iterdir' can simply yield from subdirectories' 'iterdir'. One thing you may need to worry about here is the fact that symlinks can have cycles, so you may need to do some cycle detection to avoid creating the dangerous possibility of infinite loops. There's also the question of whether you want this to be a depth-first or breadth-first traversal, and whether you would want both of these to be options.
msg340003 - (view)	Author: Karthikeyan Singaravelan (xtreak) *	Date: 2019-04-12 02:45
> I believe `rglob("*")` is eager, while `iterdir` is lazy. rglob and glob also return a generator. Slightly related, pathlib.walk was proposed in the past in python-ideas : https://mail.python.org/pipermail/python-ideas/2017-April/045398.html
msg340012 - (view)	Author: Laurie Opperman (Epic_Wink) *	Date: 2019-04-12 08:12
> Is the behaviour you're proposing any different from using `Path.rglob('')`? By that logic, we should remove `Path.iterdir()` in favour of `Path.glob('')`. In addition, having `iterdir` the way it is makes it easy for subclasses to extend its functionality (for example, one of my `Path` subclasses allows a callable to be passed to the `iterdir` which can filter paths) > One thing you may need to worry about here is the fact that symlinks can have cycles, so you may need to do some cycle detection to avoid creating the dangerous possibility of infinite loops. I agree, which is the main reason the current implementation in the pull-request is to not resolve symlinks: users can subclass and implement symlink resolving if they want > There's also the question of whether you want this to be a depth-first or breadth-first traversal, and whether you would want both of these to be options. As much as I want to say that I don't see a use-case for breadth-first file listing (when I list files, I expect the next file provided to be 'next to' the current file), users currently have no standard-library functionality to perform breadth-first searches as far as I know: they'd have to implement it themself or find it in a third-party library > Slightly related, pathlib.walk was proposed in the past in python-ideas... I've never really liked the interface to `walk`, personal preference
msg340042 - (view)	Author: Paul Ganssle (p-ganssle) *	Date: 2019-04-12 13:35
> rglob and glob also return a generator. My mistake, I didn't notice the `sorted` in the `rglob` documentation and thought it was emitting a list. > By that logic, we should remove `Path.iterdir()` in favour of `Path.glob('')`. What is* the case for why iterdir() is justified when `Path.glob('')` exists? Is it just discoverability? Is there some efficiency reason to do it? Of course, removing things (which can break existing code) and failing to add them (which cannot) have two different thresholds for when they can take place, so even if we decide "iterdir() is to glob('') as iterdir(recursive=True) is to rglob('')", that doesn't mean that we should remove iterdir() entirely if recursive=True is not added. > I agree, which is the main reason the current implementation in the pull-request is to not resolve symlinks: users can subclass and implement symlink resolving if they want I don't see that on the implementation here, but we can discuss this on the PR itself. I do think that skipping all* symlinks automatically with no option to follow them will be counter-intuitive for people. > I've never really liked the interface to `walk`, personal preference I kinda agree about the interface to `walk`, but it is worth noting that as we've seen in this thread, there are a bunch of plausible and slightly different ways to walk a directory: breadth-first or depth-first? following symlinks, following symlinks with cycle detection, or not following symlinks at all? emit the directory itself, or only emit its contents? It's worth taking into account that having two completely different complicated interfaces for recursively walking directories would be a usability challenge.
msg340072 - (view)	Author: Steve Dower (steve.dower) *	Date: 2019-04-12 15:51
Having spent more time than I'm proud of recursing through directories, I'd be happy enough with a convenience function that has sensible defaults. If I want breadth-first recursion (and I often do), I'll write it myself. I have a slight preference for getting all files in a directory before going deeper (which is not what the PR does), and I think that's most consistent with the current behaviour. I don't spend enough time dealing with symlinks to have strong opinions there, but given we have ways to resolve symlinks but not to get back to the original name (and I have had to deal with issues where I've needed to find the original name from the target :roll-eyes:) I'd say don't resolve anything eagerly. If there's an easy and well-known algorithm for detecting infinite symlink recursion (e.g. resolve and check if it's a parent of itself) then do that and skip it, but don't return the targets.
msg340135 - (view)	Author: Paul Ganssle (p-ganssle) *	Date: 2019-04-13 00:13
> I don't spend enough time dealing with symlinks to have strong opinions there, but given we have ways to resolve symlinks but not to get back to the original name (and I have had to deal with issues where I've needed to find the original name from the target :roll-eyes:) I'd say don't resolve anything eagerly. You mean treating symlinks to directories like files? I suppose that's a possibility, but I do think it will end up being a source of bugs around symlinking. Admittedly, it is apparently what rglob('*') does (just tested it - apparently it won't follow symlinks to directories), though I think it might be a better interface to try to break cycles rather than not follow symlinks (particularly since `iterdir` currently treats symlinks to directories as directories).
msg340144 - (view)	Author: Laurie Opperman (Epic_Wink) *	Date: 2019-04-13 03:10
Having `iterdir(recursive=True)` recurse into symlinks to directories would mean we would either not yield those symlinks, or we yield those symlinks and all other directories. I feel like not yielding directories is the way to go, but it's easy enough to check if a yielded path is a directory in application code. The current implementation of using recursion to list subdirectory contents doesn't seem to allow for the obvious implementation of symlink cycle-detection: keeping track of which (real) directories have been listed. PS: I've updated the pull-request to not follow symlinks to directories. This is not a final decision, but just updating to be in line with what I've implied up to this point
msg341059 - (view)	Author: Laurie Opperman (Epic_Wink) *	Date: 2019-04-29 09:07
I've updated the pull-request to list directories pointed to by listed symbolic links, preventing cyclic listing. An extra instance method `pathlib.Path._iterdir_recursive` was added for this. We still need to decide whether to yield the directories themselves rather than just the files under those directories. Very easy to implement this. One thing I just realised is that a symlink can point to a subdirectory further down the chain. The current implementation will list the files under that directory using the symlink as prefix rather than the full path. Shouldn't matter though.
msg342007 - (view)	Author: Laurie Opperman (Epic_Wink) *	Date: 2019-05-09 23:13
Would this change also have to copy implemented in the new ZipFile Pathlib API? https://bugs.python.org/issue36832
msg342687 - (view)	Author: Laurie Opperman (Epic_Wink) *	Date: 2019-05-17 03:44
I think I may have broken bedevere-bot by request change reviews before the PR was assigned...

History
Date	User	Action	Args
2022-04-11 14:59:13	admin	set	github: 80783
2019-06-05 00:36:30	Epic_Wink	set	versions: + Python 3.9, - Python 3.8
2019-05-17 03:44:07	Epic_Wink	set	messages: + msg342687
2019-05-09 23:13:11	Epic_Wink	set	messages: + msg342007
2019-04-29 09:07:16	Epic_Wink	set	messages: + msg341059
2019-04-13 03:10:34	Epic_Wink	set	messages: + msg340144
2019-04-13 00:13:09	p-ganssle	set	messages: + msg340135
2019-04-12 15:51:21	steve.dower	set	nosy: + steve.dower messages: + msg340072
2019-04-12 13:35:19	p-ganssle	set	messages: + msg340042
2019-04-12 08:12:49	Epic_Wink	set	messages: + msg340012
2019-04-12 02:45:44	xtreak	set	nosy: + xtreak messages: + msg340003
2019-04-11 22:10:45	p-ganssle	set	nosy: + p-ganssle messages: + msg339997
2019-04-11 11:47:11	SilentGhost	set	nosy: + SilentGhost messages: + msg339969
2019-04-11 11:17:23	xtreak	set	nosy: + pitrou
2019-04-11 11:12:24	python-dev	set	keywords: + patch stage: patch review pull_requests: + pull_request12712
2019-04-11 10:59:34	Epic_Wink	create