
classification
Title: module importing performance regression
Type: performance
Stage: resolved
Components:
Versions: Python 3.4, Python 3.5

process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: brett.cannon, daveroundy, eric.snow, ncoghlan, pitrou
Priority: normal
Keywords:

Created on 2015-04-11 20:07 by daveroundy, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name: test.py
Uploaded: daveroundy, 2017-07-16 12:02
Description: script to demonstrate performance regression
Messages (10)
msg240491 - (view) Author: David Roundy (daveroundy) Date: 2015-04-11 20:07
I have observed a performance regression in module importing.  In Python 3.4.2, importing a module from the current directory (where the script is located) causes the entire directory to be read.  When there are many files in this directory, this can cause the script to run very slowly.

In Python 2.7.9, this behavior is not present.

It would be preferable (in my opinion) to revert the change that causes Python to read the entire script directory.
msg240492 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-11 20:37
This change is actually an optimization. The directory is only read once and its contents are then cached, which allows for much quicker imports when multiple modules are in the directory (common case of a Python package).
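
A rough sketch of the caching behaviour being described (not part of the original message; "helper" and "helper2" are hypothetical modules sitting next to the script):

    # Sketch only: the directory listing cached by importlib's FileFinder.
    # "helper" and "helper2" are hypothetical modules in the script's directory.
    import sys
    import time

    t0 = time.perf_counter()
    import helper                      # first import: the directory is listed once
    print("first import:", time.perf_counter() - t0)

    t0 = time.perf_counter()
    import helper2                     # second module, same directory: served from the cache
    print("second import:", time.perf_counter() - t0)

    # The finder (with its cached listing) is stored per directory in
    # sys.path_importer_cache; importlib.invalidate_caches() discards it.
    print(type(sys.path_importer_cache.get(sys.path[0])))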

Can you tell us more about your setup?
- how many files are in the directory
- what filesystem is used
- whether the filesystem is local or remote (e.g. network-attached)
- your OS and OS version

Also, how long is "very slowly"?
msg240493 - (view) Author: David Roundy (daveroundy) Date: 2015-04-11 20:50
I had suspected that might be the case. At this point it's mostly just a test case where I generated a lot of files to demonstrate the issue. In my test case, a hello-world script with one module import takes a minute and 40 seconds. I could make it take longer, of course, by creating more files.

I do think scaling should be a consideration when introducing optimizations, even if getdents is usually pretty fast. If the script directory is normally the last one in the search path, couldn't you skip the listing of that directory without losing your optimization?

msg240494 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-11 21:27
I was asking questions because I wanted to have more precise data. I can't reproduce here: even with 500000 files in a directory, the first import takes 0.2s, not one minute.
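
For reference, a timing test of the kind described here could look roughly like this (a reconstruction with arbitrary names and file counts, not the actual commands used):

    # Reconstruction (not the actual test): create many sibling files,
    # then time the first import of a module living in the same directory.
    import os
    import subprocess
    import sys
    import tempfile
    import textwrap

    workdir = tempfile.mkdtemp()
    for i in range(500000):                          # file count is arbitrary
        open(os.path.join(workdir, "junk%07d.txt" % i), "w").close()

    with open(os.path.join(workdir, "helper.py"), "w") as f:
        f.write("VALUE = 42\n")

    with open(os.path.join(workdir, "main.py"), "w") as f:
        f.write(textwrap.dedent("""\
            import time
            t0 = time.perf_counter()
            import helper                            # triggers the directory scan
            print("first import took", time.perf_counter() - t0, "seconds")
        """))

    subprocess.check_call([sys.executable, os.path.join(workdir, "main.py")])
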
msg240500 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-11 21:52
As for your question:

> If the script
> directory is normally the last one in the search path couldn't you
> skip the
> listing of that directory without losing your optimization?

Given the way the code is architected, that would complicate things significantly. Also it would introduce a rather unexpected discrepancy.
msg240514 - (view) Author: David Roundy (daveroundy) Date: 2015-04-12 00:20
My tests involved 8 million files on an ext4 file system.  I expect that accounts for the difference.  It's true that it's an excessive number of files, and maybe the best option is to ignore the problem.

msg240515 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-12 00:34
Indeed, that doesn't sound like something we want to support. I'm closing then.
msg298433 - (view) Author: David Roundy (daveroundy) Date: 2017-07-16 12:02
Here is a little script to demonstrate the regression (which, yes, is still bothering me).
msg298434 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-07-16 12:25
Thanks for the reproducer.  I haven't changed my mind on the resolution, as it is an extremely unlikely use case (a directory with 1e8 files is painful to manage with standard command-line tools).  I suggest you change your approach; for example, you could use a directory hashing scheme to spread the files into smaller subdirectories.
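
The directory hashing scheme mentioned here usually looks something like the following (illustrative sketch only; the names are made up):

    # Illustrative sketch of a directory-hashing layout: bucket each file into
    # one of 256 subdirectories derived from a hash of its name, so no single
    # directory ever holds millions of entries.
    import hashlib
    import os

    def hashed_path(root, filename):
        """Return root/<xx>/filename, where <xx> is taken from a hash of the name."""
        bucket = hashlib.sha1(filename.encode("utf-8")).hexdigest()[:2]
        directory = os.path.join(root, bucket)
        os.makedirs(directory, exist_ok=True)
        return os.path.join(directory, filename)

    path = hashed_path("data", "record-0001234.dat")   # e.g. data/3f/record-0001234.dat
    with open(path, "w") as f:
        f.write("payload\n")
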
msg298450 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2017-07-16 21:19
I agree with Antoine that this shouldn't change. Having said that, it wouldn't be hard to write your own finder using importlib that doesn't read the directory contents and instead checks for the file directly (and you could even set it just for your troublesome directory, so everything else keeps the performance benefit of the default finder).
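
A rough sketch of such a finder (illustrative only, not code from this issue; BIG_DIR is a placeholder for the troublesome directory, and only top-level "name.py" modules are handled):

    # Illustrative sketch, assuming Python 3.4+: a meta path finder that stats
    # candidate files in one specific directory instead of listing it.
    import importlib.abc
    import importlib.util
    import os
    import sys

    BIG_DIR = "/path/to/huge/directory"    # placeholder for the troublesome directory

    class DirectLookupFinder(importlib.abc.MetaPathFinder):
        def find_spec(self, fullname, path=None, target=None):
            if "." in fullname:            # only handle top-level modules here
                return None
            candidate = os.path.join(BIG_DIR, fullname + ".py")
            if os.path.isfile(candidate):  # a single stat() call, no directory listing
                return importlib.util.spec_from_file_location(fullname, candidate)
            return None                    # let the normal finders handle everything else

    # Put it ahead of the default path-based finder so BIG_DIR is never listed.
    sys.meta_path.insert(0, DirectLookupFinder())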

History
Date                 User              Action  Args
2022-04-11 14:58:15  admin             set     github: 68104
2017-07-16 21:19:52  brett.cannon      set     messages: + msg298450
2017-07-16 12:25:45  pitrou            set     messages: + msg298434
2017-07-16 12:02:59  daveroundy        set     files: + test.py; type: performance; messages: + msg298433; versions: + Python 3.5
2015-04-12 00:34:21  pitrou            set     status: open -> closed; resolution: wont fix; messages: + msg240515; stage: resolved
2015-04-12 00:20:59  daveroundy        set     messages: + msg240514
2015-04-11 21:52:04  pitrou            set     messages: + msg240500
2015-04-11 21:27:53  pitrou            set     messages: + msg240494
2015-04-11 20:50:13  daveroundy        set     messages: + msg240493
2015-04-11 20:37:03  pitrou            set     nosy: + pitrou; messages: + msg240492
2015-04-11 20:20:41  serhiy.storchaka  set     nosy: + brett.cannon, ncoghlan, eric.snow
2015-04-11 20:07:34  daveroundy        create