classification
Title: distutil findall can choke with recursive symlinks (performance)
Type: performance Stage: resolved
Components: Distutils Versions: Python 3.11, Python 3.10, Python 3.9, Python 3.8, Python 3.7, Python 3.6
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: corona10, dstufft, eric.araujo, ssbarnea
Priority: normal Keywords: patch

Created on 2021-06-23 09:20 by ssbarnea, last changed 2021-06-24 16:43 by corona10. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 26873 open ssbarnea, 2021-06-23 10:00
Messages (2)
msg396394 - Author: Sorin Sbarnea (ssbarnea) * Date: 2021-06-23 09:20
While investigating very poor pip performance when installing some code, I identified the root cause as the current implementation of distutils.filelist.findall, or more precisely the _find_all_simple function, which follows symlinks but takes no measures to prevent recursion or duplicates.

To give an idea: in my case it took 5-10 minutes to run with the CPU at 100%, for a repository with 95k files (most of them temporary files inside .tox folders). Removing the symlinks brought the run time down to ~5s.

IMHO, _find_all_simple should normalize paths and avoid returning duplicates.
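
The idea above can be sketched roughly as follows. This is only an illustration, not the patch from PR 26873 and not the distutils implementation: the name find_all_dedup is hypothetical, and it assumes that resolving each directory with os.path.realpath and pruning directories that were already visited is an acceptable way to break symlink loops.

```python
import os


def find_all_dedup(path=os.curdir):
    """Yield regular files under `path`, following symlinks.

    Hypothetical sketch: each real directory is walked at most once,
    and each real file is reported at most once.
    """
    seen_dirs = set()
    seen_files = set()
    for base, dirs, files in os.walk(path, followlinks=True):
        real_base = os.path.realpath(base)
        if real_base in seen_dirs:
            # Already walked via another path (e.g. a recursive symlink):
            # clear `dirs` in place so os.walk does not descend further.
            dirs[:] = []
            continue
        seen_dirs.add(real_base)
        for name in files:
            full = os.path.join(base, name)
            if not os.path.isfile(full):
                continue  # skip broken symlinks and special files
            real = os.path.realpath(full)
            if real in seen_files:
                continue  # same file reachable through a symlink
            seen_files.add(real)
            yield full
```

With a layout such as root/sub/loop -> root, a plain os.walk(root, followlinks=True) keeps descending through loop until the OS refuses the path, whereas this sketch stops as soon as it reaches a directory it has already resolved.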


Related: https://bugs.launchpad.net/pbr/+bug/1933311
msg396499 - Author: Dong-hee Na (corona10) * (Python committer) Date: 2021-06-24 16:42
Since distutils is deprecated by PEP 632, I would like to suggest changing the implementation to use setuptools instead.
History
Date User Action Args
2021-06-24 16:43:21 corona10 set status: open -> closed; stage: patch review -> resolved
2021-06-24 16:43:08 corona10 set resolution: wont fix
2021-06-24 16:42:56 corona10 set nosy: + corona10; messages: + msg396499
2021-06-23 10:00:09 ssbarnea set keywords: + patch; stage: patch review; pull_requests: + pull_request25448
2021-06-23 09:20:53 ssbarnea set type: performance
2021-06-23 09:20:31 ssbarnea create