classification
Title: Make shutil.copytree use os.scandir to take advantage of cached is_(dir|file|symlink)
Type: Stage: resolved
Components: Library (Lib) Versions: Python 3.8
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: adelfino, giampaolo.rodola, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2018-05-03 04:09 by adelfino, last changed 2018-05-30 03:52 by giampaolo.rodola. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 6697 closed adelfino, 2018-05-08 00:26
Messages (5)
msg316108 - (view) Author: Andrés Delfino (adelfino) * (Python triager) Date: 2018-05-03 04:09
Right now we are using os.path.is(dir|file|symlink) in shutil.copytree, but taking advantage of os.scandir's is_(dir|file|symlink) seems the way to go.

I'll make a PR with to start the discussion with a proof of concept.
msg316111 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-03 07:11
Could you show an example that exposes a significant overhead of the current implementation?

There were reasons of using os.scandir() in os.walk() and glob.glob(). You may scan thousands of files and performs a heavy I/O only with few of them. There was smaller reason of using it in shutil.rmtree() -- removing an entry from the directory is fast, but likely you will spent much more time for creating a tree.

But I don't know reasons of using it in shutil.copytree() in which you need to performs a heavy I/O with every file and directory.
msg316122 - (view) Author: Andrés Delfino (adelfino) * (Python triager) Date: 2018-05-03 12:20
To be frank, I just searched the tree for uses of listdir() combined with isdir()/isfile()/issymlink(). Thought that using scandir() would make sense, and didn't think of a reason for not using it. That being said, I cannot state a case and I'll be happy to close the PR if there's no need for this :)
msg316279 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-08 05:47
The code with using scandir() is more complex and is different enough from the code with using listdir() for having significant risk of introducing bugs. Also using scandir() introduces a risk of leaking file descriptors.

We don't rewrite the code without good reasons. If you propose performance enhancement, please provide benchmark results that expose the benefit of this change. If it is not large enough, it is not worth to do.

Actually your change looks making the code slower: it reads the directory twice.
msg316298 - (view) Author: Andrés Delfino (adelfino) * (Python triager) Date: 2018-05-08 22:15
I believe you are right about being conservative with changing working code, specially when there's no proof about performance being improved.

I think its best to close this ticket. I'd do it myself, but I know what resolution applies here.
History
Date User Action Args
2018-05-30 03:52:53giampaolo.rodolasetnosy: + giampaolo.rodola
2018-05-09 10:17:47serhiy.storchakasetstatus: open -> closed
resolution: rejected
stage: patch review -> resolved
2018-05-08 22:15:37adelfinosetmessages: + msg316298
2018-05-08 05:47:20serhiy.storchakasetmessages: + msg316279
2018-05-08 00:26:41adelfinosetkeywords: + patch
stage: patch review
pull_requests: + pull_request6416
2018-05-03 12:23:24vstinnersetnosy: + vstinner
2018-05-03 12:20:52adelfinosetmessages: + msg316122
2018-05-03 07:11:37serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg316111
2018-05-03 04:09:33adelfinocreate