
classification
Title: module importing performance regression
Type: performance
Stage: resolved
Components:
Versions: Python 3.4, Python 3.5

process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: brett.cannon, daveroundy, eric.snow, ncoghlan, pitrou
Priority: normal
Keywords:

Created on 2015-04-11 20:07 by daveroundy, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name: test.py
Uploaded: daveroundy, 2017-07-16 12:02
Description: script to demonstrate performance regression
Messages (10)
msg240491 - (view) Author: David Roundy (daveroundy) Date: 2015-04-11 20:07
I have observed a performance regression in module importing.  In Python 3.4.2, importing a module from the current directory (where the script is located) causes the entire directory to be read.  When there are many files in this directory, this can cause the script to run very slowly.

In Python 2.7.9, this behavior is not present.

It would be preferable (in my opinion) to revert the change that causes Python to read the entire script directory.
msg240492 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-11 20:37
This change is actually an optimization. The directory is only read once and its contents are then cached, which allows for much quicker imports when multiple modules are in the directory (common case of a Python package).
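
A rough sketch of the caching behaviour being described (not part of the original message; "helper" and "helper2" are hypothetical modules sitting next to the script):

    # Sketch only: the directory listing cached by importlib's FileFinder.
    # "helper" and "helper2" are hypothetical modules in the script's directory.
    import sys
    import time

    t0 = time.perf_counter()
    import helper                      # first import: the directory is listed once
    print("first import:", time.perf_counter() - t0)

    t0 = time.perf_counter()
    import helper2                     # second module, same directory: served from the cache
    print("second import:", time.perf_counter() - t0)

    # The finder (with its cached listing) is stored per directory in
    # sys.path_importer_cache; importlib.invalidate_caches() discards it.
    print(type(sys.path_importer_cache.get(sys.path[0])))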

Can you tell us more about your setup?
- how many files are in the directory
- what filesystem is used
- whether the filesystem is local or remote (e.g. network-attached)
- your OS and OS version

Also, how long is "very slowly"?
msg240493 - (view) Author: David Roundy (daveroundy) Date: 2015-04-11 20:50
I had suspected that might be the case. At this point it's mostly just a test case where I generated a lot of files to demonstrate the issue. In my test case, a hello-world script with one module import takes a minute and 40 seconds. I could make it take longer, of course, by creating more files.

I do think scaling should be a consideration when introducing optimizations, even if getdents is usually pretty fast. If the script directory is normally the last one in the search path, couldn't you skip the listing of that directory without losing your optimization?

msg240494 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-11 21:27
I was asking questions because I wanted to have more precise data. I can't reproduce here: even with 500000 files in a directory, the first import takes 0.2s, not one minute.
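
For reference, a timing test of the kind described here could look roughly like this (a reconstruction with arbitrary names and file counts, not the actual commands used):

    # Reconstruction (not the actual test): create many sibling files,
    # then time the first import of a module living in the same directory.
    import os
    import subprocess
    import sys
    import tempfile
    import textwrap

    workdir = tempfile.mkdtemp()
    for i in range(500000):                          # file count is arbitrary
        open(os.path.join(workdir, "junk%07d.txt" % i), "w").close()

    with open(os.path.join(workdir, "helper.py"), "w") as f:
        f.write("VALUE = 42\n")

    with open(os.path.join(workdir, "main.py"), "w") as f:
        f.write(textwrap.dedent("""\
            import time
            t0 = time.perf_counter()
            import helper                            # triggers the directory scan
            print("first import took", time.perf_counter() - t0, "seconds")
        """))

    subprocess.check_call([sys.executable, os.path.join(workdir, "main.py")])
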
msg240500 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-11 21:52
As for your question:

> If the script
> directory is normally the last one in the search path couldn't you
> skip the
> listing of that directory without losing your optimization?

Given the way the code is architected, that would complicate things significantly. Also it would introduce a rather unexpected discrepancy.
msg240514 - (view) Author: David Roundy (daveroundy) Date: 2015-04-12 00:20
My tests involved 8 million files on an ext4 file system.  I expect that accounts for the difference.  It's true that it's an excessive number of files, and maybe the best option is to ignore the problem.

msg240515 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-12 00:34
Indeed, that doesn't sound like something we want to support. I'm closing then.
msg298433 - (view) Author: David Roundy (daveroundy) Date: 2017-07-16 12:02
Here is a little script to demonstrate the regression (which, yes, is still bothering me).
msg298434 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-07-16 12:25
Thanks for the reproducer.  I haven't changed my mind on the resolution, as it is an extremely unlikely use case (a directory with 1e8 files is painful to manage with standard command-line tools).  I suggest you change your approach; for example, you could use a directory hashing scheme to spread the files into smaller subdirectories.
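
The directory hashing scheme mentioned here usually looks something like the following (illustrative sketch only; the names are made up):

    # Illustrative sketch of a directory-hashing layout: bucket each file into
    # one of 256 subdirectories derived from a hash of its name, so no single
    # directory ever holds millions of entries.
    import hashlib
    import os

    def hashed_path(root, filename):
        """Return root/<xx>/filename, where <xx> is taken from a hash of the name."""
        bucket = hashlib.sha1(filename.encode("utf-8")).hexdigest()[:2]
        directory = os.path.join(root, bucket)
        os.makedirs(directory, exist_ok=True)
        return os.path.join(directory, filename)

    path = hashed_path("data", "record-0001234.dat")   # e.g. data/3f/record-0001234.dat
    with open(path, "w") as f:
        f.write("payload\n")
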
msg298450 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2017-07-16 21:19
I agree with Antoine that this shouldn't change. Having said that, it wouldn't be hard to write your own finder using importlib that doesn't read the directory contents and instead checks for the file directly (and you could even set it just for your troublesome directory, so everything else keeps the performance benefit of the default finder).
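
A rough sketch of such a finder (illustrative only, not code from this issue; BIG_DIR is a placeholder for the troublesome directory, and only top-level "name.py" modules are handled):

    # Illustrative sketch, assuming Python 3.4+: a meta path finder that stats
    # candidate files in one specific directory instead of listing it.
    import importlib.abc
    import importlib.util
    import os
    import sys

    BIG_DIR = "/path/to/huge/directory"    # placeholder for the troublesome directory

    class DirectLookupFinder(importlib.abc.MetaPathFinder):
        def find_spec(self, fullname, path=None, target=None):
            if "." in fullname:            # only handle top-level modules here
                return None
            candidate = os.path.join(BIG_DIR, fullname + ".py")
            if os.path.isfile(candidate):  # a single stat() call, no directory listing
                return importlib.util.spec_from_file_location(fullname, candidate)
            return None                    # let the normal finders handle everything else

    # Put it ahead of the default path-based finder so BIG_DIR is never listed.
    sys.meta_path.insert(0, DirectLookupFinder())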

History
Date                 User              Action  Args
2022-04-11 14:58:15  admin             set     github: 68104
2017-07-16 21:19:52  brett.cannon      set     messages: + msg298450
2017-07-16 12:25:45  pitrou            set     messages: + msg298434
2017-07-16 12:02:59  daveroundy        set     files: + test.py; type: performance; messages: + msg298433; versions: + Python 3.5
2015-04-12 00:34:21  pitrou            set     status: open -> closed; resolution: wont fix; messages: + msg240515; stage: resolved
2015-04-12 00:20:59  daveroundy        set     messages: + msg240514
2015-04-11 21:52:04  pitrou            set     messages: + msg240500
2015-04-11 21:27:53  pitrou            set     messages: + msg240494
2015-04-11 20:50:13  daveroundy        set     messages: + msg240493
2015-04-11 20:37:03  pitrou            set     nosy: + pitrou; messages: + msg240492
2015-04-11 20:20:41  serhiy.storchaka  set     nosy: + brett.cannon, ncoghlan, eric.snow
2015-04-11 20:07:34  daveroundy        create