Issue 33414: Make shutil.copytree use os.scandir to take advantage of cached is_(dir|file|symlink)

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/77595

classification

Title:	Make shutil.copytree use os.scandir to take advantage of cached is_(dir\|file\|symlink)
Type:		Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.8

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:		Nosy List:	adelfino, giampaolo.rodola, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	patch

Created on 2018-05-03 04:09 by adelfino, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 6697	closed	adelfino, 2018-05-08 00:26

Messages (5)
msg316108 - (view)	Author: Andrés Delfino (adelfino) *	Date: 2018-05-03 04:09
Right now we are using os.path.is(dir\|file\|symlink) in shutil.copytree, but taking advantage of os.scandir's is_(dir\|file\|symlink) seems the way to go. I'll make a PR with to start the discussion with a proof of concept.
msg316111 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2018-05-03 07:11
Could you show an example that exposes a significant overhead of the current implementation? There were reasons of using os.scandir() in os.walk() and glob.glob(). You may scan thousands of files and performs a heavy I/O only with few of them. There was smaller reason of using it in shutil.rmtree() -- removing an entry from the directory is fast, but likely you will spent much more time for creating a tree. But I don't know reasons of using it in shutil.copytree() in which you need to performs a heavy I/O with every file and directory.
msg316122 - (view)	Author: Andrés Delfino (adelfino) *	Date: 2018-05-03 12:20
To be frank, I just searched the tree for uses of listdir() combined with isdir()/isfile()/issymlink(). Thought that using scandir() would make sense, and didn't think of a reason for not using it. That being said, I cannot state a case and I'll be happy to close the PR if there's no need for this :)
msg316279 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2018-05-08 05:47
The code with using scandir() is more complex and is different enough from the code with using listdir() for having significant risk of introducing bugs. Also using scandir() introduces a risk of leaking file descriptors. We don't rewrite the code without good reasons. If you propose performance enhancement, please provide benchmark results that expose the benefit of this change. If it is not large enough, it is not worth to do. Actually your change looks making the code slower: it reads the directory twice.
msg316298 - (view)	Author: Andrés Delfino (adelfino) *	Date: 2018-05-08 22:15
I believe you are right about being conservative with changing working code, specially when there's no proof about performance being improved. I think its best to close this ticket. I'd do it myself, but I know what resolution applies here.

History
Date	User	Action	Args
2022-04-11 14:59:00	admin	set	github: 77595
2018-05-30 03:52:53	giampaolo.rodola	set	nosy: + giampaolo.rodola
2018-05-09 10:17:47	serhiy.storchaka	set	status: open -> closed resolution: rejected stage: patch review -> resolved
2018-05-08 22:15:37	adelfino	set	messages: + msg316298
2018-05-08 05:47:20	serhiy.storchaka	set	messages: + msg316279
2018-05-08 00:26:41	adelfino	set	keywords: + patch stage: patch review pull_requests: + pull_request6416
2018-05-03 12:23:24	vstinner	set	nosy: + vstinner
2018-05-03 12:20:52	adelfino	set	messages: + msg316122
2018-05-03 07:11:37	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg316111
2018-05-03 04:09:33	adelfino	create