classification
Title: Use scandir() to speed up the glob module
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.6
process
Status: closed Resolution: fixed
Dependencies: 16620 25994 Superseder:
Assigned To: serhiy.storchaka Nosy List: Dima.Tisnek, benhoyt, gvanrossum, haypo, python-dev, serhiy.storchaka, xdegaye
Priority: normal Keywords: patch

Created on 2015-11-10 11:18 by xdegaye, last changed 2016-09-07 11:10 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
glob_isdir.patch xdegaye, 2015-11-10 11:18 review
glob_isdir_2.patch xdegaye, 2015-11-10 15:35 review
glob_scandir.patch serhiy.storchaka, 2016-01-02 15:22 review
glob_scandir_2.patch serhiy.storchaka, 2016-01-03 08:23 review
glob_scandir_doc.patch xdegaye, 2016-01-08 13:53 review
glob_scandir_2_diff.patch xdegaye, 2016-01-10 14:53
glob_scandir_3.patch xdegaye, 2016-01-10 14:53 review
glob_scandir_4.patch serhiy.storchaka, 2016-01-11 08:33 review
glob_scandir_5.patch serhiy.storchaka, 2016-02-11 11:57 review
Messages (30)
msg254440 - (view) Author: Xavier de Gaye (xdegaye) * (Python committer) Date: 2015-11-10 11:18
The glob module happily joins names of regular files together with os.path.join() or attempts to list the files contained into a regular file (sic). The same 'except os.error' statement is used to handle both these cases and the case of a non readable directory.

The attached patch makes the code more correct and easier to understand.
msg254449 - (view) Author: Xavier de Gaye (xdegaye) * (Python committer) Date: 2015-11-10 15:35
This second patch rewrites the conditionals that decide whether to call _iglob() recursively, in a more natural way and without changing the behavior from the first patch.
Note that when 'dirname == pathname', then basename is empty and glob2() or glob1() are not called.

The patch also refactors the code with a new _listdir() function.
msg257347 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-01-02 15:22
In general the patch LGTM. It allows to speed up glob by 5-10% in warn test. But much more gain we can achieve by using os.scandir(). Here are results of microbenchmarks:

$ ./python -m timeit -s "from glob import glob" -- "glob('**/*', recursive=True)"
Unpatched:        201 msec per loop
Using isdir():    181 msec per loop
Using scandir():  65.2 msec per loop

$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/*', recursive=True)"
Unpatched:        2.06 sec per loop
Using isdir():    1.92 sec per loop
Using scandir():  820 msec per loop

$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/', recursive=True)"
Unpatched:        1.77 sec per loop
Using isdir():    1.61 sec per loop
Using scandir():  431 msec per loop

Yet one benefit is that iglob() now yields path names without the delay for reading the full content of a directory (see issue22167).
msg257354 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-01-02 19:49
I don't think it is a good idea to break backward compatibility here even if the globN are internal functions.  Better would be to continue to have globN functions that support the existing API and emit deprecation warnings.
msg257403 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-01-03 08:23
I think so too. I just wanted someone to confirmed that it is not overcautiousness. For now glob1() is used only in one place in the stdlib (issue16620).

But there was other problem with previous patch, the same as with current implementation of os.walk() (issue25995). It makes glob to use a lot of file descriptors.

Updated patch lefts deprecated glob1() and glob2() and makes glob to consume all scandir iterator before starting to yield values (but the problem with fd leaks on non-refcounted implementations is still left, issue25994). This doesn't affect performance, but lefts the issue with delaying (issue22167).
msg257421 - (view) Author: Xavier de Gaye (xdegaye) * (Python committer) Date: 2016-01-03 14:55
os.scandir() is called recursively in the last patch and the file descriptors are not closed until returning from the recursion.
The glob functions should fail explicitly when scandir() raises OSERROR with posix errno set to EMFILE (The process has too many files open), or even better, silently ignore only the PermissionError exception instead of ignoring OSERROR.
msg257423 - (view) Author: Xavier de Gaye (xdegaye) * (Python committer) Date: 2016-01-03 16:30
I may be missing something, anyway here are few comments on the patch:
Is the call to entry.is_symlink() in _iterdir() necessary since is_dir() follows symlinks ?
If _iterdir() would yield the DirEntry instance instead of DirEntry.name, then _rlistdir() could use x.is_dir() to know whether it is correct to iterate over _rlistdir(x.path, dironly)
msg257424 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-01-03 17:20
> os.scandir() is called recursively in the last patch and the file descriptors are not closed until returning from the recursion.

No, os.scandir() is called non-recursively in _iterdir(), and since _iterdir() results are always accumulated in a list, a recursion starts only after exhausting the os.scandir() iterator and closing the file descriptor. We need first to resolve issue25994 to close the file descriptor explicitly.

> The glob functions should fail explicitly when scandir() raises OSERROR with posix errno set to EMFILE (The process has too many files open), or even better, silently ignore only the PermissionError exception instead of ignoring OSERROR.

Patched code passes existing tests. Do you have additional tests?

> Is the call to entry.is_symlink() in _iterdir() necessary since is_dir() follows symlinks ?

Ah, I thought is_dir() doesn't follow symlinks. Yes, now the call to entry.is_symlink() is not necessary.

> If _iterdir() would yield the DirEntry instance instead of DirEntry.name, then _rlistdir() could use x.is_dir() to know whether it is correct to iterate over _rlistdir(x.path, dironly)

Yes, but this can complicate the rest of the code. _rlistdir() is called with dironly=False only when the pattern ends with '**'. I'm not sure this is enough important case for optimization. In most cases '**' is used in the middle of the pattern, and all names yielded by _iterdir() are directory names (or broken symlinks).
msg257456 - (view) Author: Xavier de Gaye (xdegaye) * (Python committer) Date: 2016-01-04 11:20
> and since _iterdir() results are always accumulated in a list

Right, I failed to note that point. And so, since the file descriptor opened by os.scandir() is closed within each call to recursive _rlistdir(), then my other comment about EMFILE does not stand as well.
msg257655 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-01-06 22:59
Related issue: #26032 "Use scandir() to speed up pathlib globbing".
msg257659 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016-01-06 23:08
(IOW once this patch has been applied maybe you can do the same for globbing in pathlib as requested in issue #26032.)
msg257758 - (view) Author: Xavier de Gaye (xdegaye) * (Python committer) Date: 2016-01-08 13:53
Adding a doc patch.
msg257912 - (view) Author: Xavier de Gaye (xdegaye) * (Python committer) Date: 2016-01-10 14:53
FWIW I have followed the idea of having _iterdir() yielding the DirEntry entry instead of the name in Serhiy's patch.  There is a slight performance gain. Now _glob0() and _glob1() do return a list of directories when dironly is true but there is now another place where OSError must be tracked, so it is not clear if this is worth it.

glob_scandir_2_diff.patch is the differential patch between glob_scandir_2.patch and glob_scandir_3.patch.

Here are the performance tests run with both patches.

$ ./python -m timeit -s "from glob import glob" -- "glob('**/*', recursive=True)"
glob_scandir_2.patch: 33.1 msec per loop
glob_scandir_3.patch: 33.8 msec per loop

$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/*', recursive=True)"
glob_scandir_2.patch: 927 msec per loop
glob_scandir_3.patch: 850 msec per loop

$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/', recursive=True)"
glob_scandir_2.patch: 423 msec per loop
glob_scandir_3.patch: 337 msec per loop
msg257930 - (view) Author: Xavier de Gaye (xdegaye) * (Python committer) Date: 2016-01-10 20:04
Oops, the list comprehension in _glob1() of glob_scandir_3.patch should be instead: names = [x.name for x in _iterdir(dirname, dironly)]
This does not change the tests results except it improves glob('/usr/lib*/**/*', True) that gives now 820 msec instead of 927 msec with glob_scandir_2.patch.
msg257938 - (view) Author: Xavier de Gaye (xdegaye) * (Python committer) Date: 2016-01-10 21:09
Actually the microbenchmarks comparison between glob_scandir_2.patch and glob_scandir_3.patch must be made by removing the call to entry.is_symlink() also in glob_scandir_2.patch.
In that case the microbenchmarks give about the same results for each patch.
Sorry for the noise.
msg257947 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016-01-11 03:14
@serhiy, I assume this is in your capable hands?

Personally I'd like to have seen the refactoring in a separate diff from the switch to scandir(), just to ensure that the refactoring does not change anything and to make it easier to read the changes for the scandir() switch.
msg257953 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-01-11 08:33
Thank you Xavier for correcting the documentation.

Here is minimal patch that switches the glob module to scandir().
msg257954 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-01-11 12:18
Results of microbenchmarks for glob_scandir_4.patch:

$ ./python -m timeit -s "from glob import glob" -- "glob('**/*', recursive=True)"
Unpatched: 275 msec per loop
Patched:   79.4 msec per loop

$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/*', recursive=True)"
Unpatched: 2.11 sec per loop
Patched:   624 msec per loop

$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/', recursive=True)"
Unpatched: 1.79 sec per loop
Patched:   281 msec per loop
msg257958 - (view) Author: Ben Hoyt (benhoyt) * Date: 2016-01-11 14:25
Nice! Good to see scandir() getting used to speed up other functions, functions that folks are using a ton already, like glob().
msg258009 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016-01-11 20:41
Looks like switching to scandir will also resolve http://bugs.python.org/issue22167 without further action.
msg258015 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-01-11 20:56
Not this version of the path. It accumulates all scandir results in a list before yielding values. Otherwise globbing will consume a lot of open file descriptors and could leak them (see issue25994, issue25995).
msg258016 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016-01-11 20:58
Oh well. Too bad. Never mind then.
msg260098 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-11 11:57
The patch is completed now. The with statement is used to avoid FD leaks.
msg273912 - (view) Author: Dima Tisnek (Dima.Tisnek) Date: 2016-08-30 12:18
Who can review Serhiy's patch?

P.S. core of the patch seems good to me, but I'm not an expert on stdlib.
msg274611 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-06 19:33
I think the change in general is approved by GvR. If there are some implementation bugs, we can fix them later.
msg274612 - (view) Author: Roundup Robot (python-dev) Date: 2016-09-06 19:35
New changeset cb7ee9d9cddd by Serhiy Storchaka in branch 'default':
Issue #25596: Optimized glob() and iglob() functions in the
https://hg.python.org/cpython/rev/cb7ee9d9cddd
msg274696 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016-09-07 01:26
Thanks Serhiy!

--Guido (mobile)

On Sep 6, 2016 12:35 PM, "Roundup Robot" <report@bugs.python.org> wrote:

>
> Roundup Robot added the comment:
>
> New changeset cb7ee9d9cddd by Serhiy Storchaka in branch 'default':
> Issue #25596: Optimized glob() and iglob() functions in the
> https://hg.python.org/cpython/rev/cb7ee9d9cddd
>
> ----------
> nosy: +python-dev
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue25596>
> _______________________________________
>
msg274770 - (view) Author: Roundup Robot (python-dev) Date: 2016-09-07 06:50
New changeset 48e573e0a610 by Serhiy Storchaka in branch 'default':
Issue #25596: Falls back to listdir in glob for bytes paths on Windows.
https://hg.python.org/cpython/rev/48e573e0a610
msg274783 - (view) Author: Dima Tisnek (Dima.Tisnek) Date: 2016-09-07 09:09
@Serhiy please comment the implications / limitations of the fallback on Windows.

Is it that scandir cannot handle bytes argument only?
If argument is unicode, but response set contains bytes paths, will that work?
msg274794 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-07 11:10
os.scandir() cannot handle bytes argument on Windows. If an argument is string, os.scandir() yields entries with string names, if an argument is bytes object, os.scandir() yields entries with bytes names. Opened issue27998 for adding support of bytes paths in os.scandir(). After this will be implemented, a workaround cb7ee9d9cddd can be reverted.
History
Date User Action Args
2017-02-07 19:05:40serhiy.storchakalinkissue19240 superseder
2016-09-07 11:10:52serhiy.storchakasetstatus: open -> closed
resolution: fixed
messages: + msg274794

stage: patch review -> resolved
2016-09-07 09:09:36Dima.Tisneksetmessages: + msg274783
2016-09-07 06:50:02python-devsetmessages: + msg274770
2016-09-07 01:26:13gvanrossumsetmessages: + msg274696
2016-09-06 19:35:28python-devsetnosy: + python-dev
messages: + msg274612
2016-09-06 19:33:31serhiy.storchakasetmessages: + msg274611
2016-08-30 12:18:16Dima.Tisneksetnosy: + Dima.Tisnek
messages: + msg273912
2016-02-11 11:57:18serhiy.storchakasetfiles: + glob_scandir_5.patch

messages: + msg260098
2016-01-11 20:58:48gvanrossumsetmessages: + msg258016
2016-01-11 20:56:44serhiy.storchakasetmessages: + msg258015
2016-01-11 20:41:10gvanrossumsetmessages: + msg258009
2016-01-11 14:25:58benhoytsetmessages: + msg257958
2016-01-11 12:22:00serhiy.storchakasettitle: Use scandir() to speed the glob module -> Use scandir() to speed up the glob module
2016-01-11 12:18:20serhiy.storchakasetmessages: + msg257954
2016-01-11 08:33:02serhiy.storchakasetfiles: + glob_scandir_4.patch

messages: + msg257953
title: regular files handled as directories in the glob module -> Use scandir() to speed the glob module
2016-01-11 03:14:18gvanrossumsetmessages: + msg257947
2016-01-10 21:09:31xdegayesetmessages: + msg257938
2016-01-10 20:04:58xdegayesetmessages: + msg257930
2016-01-10 14:53:47xdegayesetfiles: + glob_scandir_3.patch
2016-01-10 14:53:35xdegayesetfiles: + glob_scandir_2_diff.patch

messages: + msg257912
2016-01-08 13:53:03xdegayesetfiles: + glob_scandir_doc.patch

messages: + msg257758
2016-01-07 12:46:07serhiy.storchakalinkissue26032 dependencies
2016-01-07 04:11:13r.david.murraysetnosy: - r.david.murray
2016-01-06 23:08:37gvanrossumsetnosy: + gvanrossum
messages: + msg257659
2016-01-06 22:59:13hayposetmessages: + msg257655
2016-01-04 11:20:36xdegayesetmessages: + msg257456
2016-01-03 17:20:34serhiy.storchakasetmessages: + msg257424
2016-01-03 16:30:28xdegayesetmessages: + msg257423
2016-01-03 14:55:45xdegayesetmessages: + msg257421
2016-01-03 08:23:29serhiy.storchakasetfiles: + glob_scandir_2.patch

dependencies: + File descriptor leaks in os.scandir()
messages: + msg257403
2016-01-02 19:49:16r.david.murraysetnosy: + r.david.murray
messages: + msg257354
2016-01-02 15:52:32serhiy.storchakasetdependencies: + Avoid using private function glob.glob1() in msi module and tools
2016-01-02 15:22:53serhiy.storchakasetfiles: + glob_scandir.patch
priority: low -> normal

nosy: + haypo, benhoyt
messages: + msg257347
2015-11-10 15:58:18serhiy.storchakasetpriority: normal -> low
assignee: serhiy.storchaka

nosy: + serhiy.storchaka
stage: patch review
2015-11-10 15:35:35xdegayesetfiles: + glob_isdir_2.patch

messages: + msg254449
2015-11-10 11:18:07xdegayecreate