This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: shutil.rmtree is inefficient due to listdir() instead of scandir()
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: 25996 Superseder:
Assigned To: Nosy List: enkore, josh.r, nh2, pitrou, serhiy.storchaka, xiang.zhang
Priority: normal Keywords: patch

Created on 2016-10-30 20:02 by enkore, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
shutil-rmtree-scandir.patch serhiy.storchaka, 2016-11-07 17:29 review
bench_rmtree.py serhiy.storchaka, 2016-11-24 08:27
Pull Requests
URL Status Linked Edit
PR 4085 merged serhiy.storchaka, 2017-10-23 13:25
Messages (9)
msg279745 - (view) Author: Marian Beermann (enkore) Date: 2016-10-30 20:02
The use of os.listdir severely limits the speed of this function on
anything except solid-state drives.

Using the new-in-Python 3.5 os.scandir should eliminate this
bottleneck.
msg279801 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2016-10-31 16:37
You need to cache the names up front because the loop is unlinking entries, and readdir isn't consistent when the directory entries are mutated between calls. https://github.com/kripken/emscripten/issues/2528

FindFirstFile/FindNextFile likely has similar issues, even if they're not consistently seen (due to DeleteFile itself not guaranteeing deletion until the last handle to the file is closed).

scandir might save some stat calls, but you'd need to convert it from generator to list before the loop begins, which would limit the savings a bit.
msg279813 - (view) Author: Marian Beermann (enkore) Date: 2016-10-31 17:40
The main issue on *nix is more likely that by using listdir you get directory order, while what you really need is inode ordering. scandir allows for that, since you get the inode from the DirEntry with no extra syscalls - especially without an open() or stat().

Other optimizations are also possible. For example opening the directory and using unlinkat() would likely shave off a bit of CPU. But the dominating factor here is likely the bad access pattern.
msg280213 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-07 17:29
Proposed patch implements shutil.rmtree using os.scandir. Needed file descriptors support in os.scandir (issue25996). I did not test how this affects the performance of shutil.rmtree.
msg281618 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-24 08:27
Benchmarks show about 20% speed up.
msg305000 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-10-25 16:25
Following Antoine's suggestion the patch now makes shutil.rmtree() using os.scandir() on all platforms.

I doubt about one thing. This patch changes os.listdir passed to the onerror handler to os.scandir. This can break a user code that checks if the first argument in onerror is os.listdir. If keep this change, it should be documented in the "Porting to 3.7" section. Alternatively, we can continue passing os.listdir if os.scandir() failed despites the fact that os.listdir no longer used.
msg305001 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-10-25 16:25
I think we should change to os.scandir.  No need to accumulate compatibility baggage like that.
msg305556 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-04 12:16
New changeset d4d79bc1ff91b04625c312f0219c89aabcd19ce4 by Serhiy Storchaka in branch 'master':
bpo-28564: Use os.scandir() in shutil.rmtree(). (#4085)
https://github.com/python/cpython/commit/d4d79bc1ff91b04625c312f0219c89aabcd19ce4
msg309219 - (view) Author: Niklas Hambüchen (nh2) Date: 2017-12-30 05:15
I've filed https://bugs.python.org/issue32453, which is about O(n^2) deletion behaviour for large directories.
History
Date User Action Args
2022-04-11 14:58:38adminsetgithub: 72750
2017-12-30 05:15:42nh2setnosy: + nh2
messages: + msg309219
2017-11-04 12:46:52serhiy.storchakasetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2017-11-04 12:16:38serhiy.storchakasetmessages: + msg305556
2017-10-25 16:25:57pitrousetmessages: + msg305001
2017-10-25 16:25:02serhiy.storchakasetnosy: + pitrou
messages: + msg305000
2017-10-23 13:25:24serhiy.storchakasetpull_requests: + pull_request4055
2016-11-24 08:27:44serhiy.storchakasetfiles: + bench_rmtree.py

messages: + msg281618
2016-11-07 17:29:17serhiy.storchakasetfiles: + shutil-rmtree-scandir.patch
keywords: + patch
messages: + msg280213

stage: patch review
2016-10-31 23:53:40serhiy.storchakasetdependencies: + Add support of file descriptor in os.scandir()
2016-10-31 17:40:23enkoresetmessages: + msg279813
2016-10-31 16:37:10josh.rsetnosy: + josh.r
messages: + msg279801
2016-10-31 11:23:29xiang.zhangsetnosy: + xiang.zhang
2016-10-30 20:40:04serhiy.storchakasetnosy: + serhiy.storchaka

type: enhancement
versions: + Python 3.7, - Python 3.5
2016-10-30 20:02:39enkorecreate