Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use scandir() to speed up the glob module #69782

Closed
xdegaye mannequin opened this issue Nov 10, 2015 · 30 comments
Closed

Use scandir() to speed up the glob module #69782

xdegaye mannequin opened this issue Nov 10, 2015 · 30 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@xdegaye
Copy link
Mannequin

xdegaye mannequin commented Nov 10, 2015

BPO 25596
Nosy @gvanrossum, @vstinner, @benhoyt, @xdegaye, @dimaqq, @serhiy-storchaka
Dependencies
  • bpo-16620: Avoid using private function glob.glob1() in msi module and tools
  • bpo-25994: File descriptor leaks in os.scandir()
  • Files
  • glob_isdir.patch
  • glob_isdir_2.patch
  • glob_scandir.patch
  • glob_scandir_2.patch
  • glob_scandir_doc.patch
  • glob_scandir_2_diff.patch
  • glob_scandir_3.patch
  • glob_scandir_4.patch
  • glob_scandir_5.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2016-09-07.11:10:52.544>
    created_at = <Date 2015-11-10.11:18:07.859>
    labels = ['type-feature', 'library']
    title = 'Use scandir() to speed up the glob module'
    updated_at = <Date 2016-09-07.11:10:52.543>
    user = 'https://github.com/xdegaye'

    bugs.python.org fields:

    activity = <Date 2016-09-07.11:10:52.543>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2016-09-07.11:10:52.544>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2015-11-10.11:18:07.859>
    creator = 'xdegaye'
    dependencies = ['16620', '25994']
    files = ['40996', '40999', '41474', '41481', '41533', '41570', '41571', '41576', '41895']
    hgrepos = []
    issue_num = 25596
    keywords = ['patch']
    message_count = 30.0
    messages = ['254440', '254449', '257347', '257354', '257403', '257421', '257423', '257424', '257456', '257655', '257659', '257758', '257912', '257930', '257938', '257947', '257953', '257954', '257958', '258009', '258015', '258016', '260098', '273912', '274611', '274612', '274696', '274770', '274783', '274794']
    nosy_count = 7.0
    nosy_names = ['gvanrossum', 'vstinner', 'benhoyt', 'xdegaye', 'python-dev', 'Dima.Tisnek', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue25596'
    versions = ['Python 3.6']

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented Nov 10, 2015

    The glob module happily joins names of regular files together with os.path.join() or attempts to list the files contained into a regular file (sic). The same 'except os.error' statement is used to handle both these cases and the case of a non readable directory.

    The attached patch makes the code more correct and easier to understand.

    @xdegaye xdegaye mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Nov 10, 2015
    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented Nov 10, 2015

    This second patch rewrites the conditionals that decide whether to call _iglob() recursively, in a more natural way and without changing the behavior from the first patch.
    Note that when 'dirname == pathname', then basename is empty and glob2() or glob1() are not called.

    The patch also refactors the code with a new _listdir() function.

    @serhiy-storchaka serhiy-storchaka self-assigned this Nov 10, 2015
    @serhiy-storchaka
    Copy link
    Member

    In general the patch LGTM. It allows to speed up glob by 5-10% in warn test. But much more gain we can achieve by using os.scandir(). Here are results of microbenchmarks:

    $ ./python -m timeit -s "from glob import glob" -- "glob('**/*', recursive=True)"
    Unpatched:        201 msec per loop
    Using isdir():    181 msec per loop
    Using scandir():  65.2 msec per loop
    
    $ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/*', recursive=True)"
    Unpatched:        2.06 sec per loop
    Using isdir():    1.92 sec per loop
    Using scandir():  820 msec per loop
    
    $ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/', recursive=True)"
    Unpatched:        1.77 sec per loop
    Using isdir():    1.61 sec per loop
    Using scandir():  431 msec per loop

    Yet one benefit is that iglob() now yields path names without the delay for reading the full content of a directory (see bpo-22167).

    @bitdancer
    Copy link
    Member

    I don't think it is a good idea to break backward compatibility here even if the globN are internal functions. Better would be to continue to have globN functions that support the existing API and emit deprecation warnings.

    @serhiy-storchaka
    Copy link
    Member

    I think so too. I just wanted someone to confirmed that it is not overcautiousness. For now glob1() is used only in one place in the stdlib (bpo-16620).

    But there was other problem with previous patch, the same as with current implementation of os.walk() (bpo-25995). It makes glob to use a lot of file descriptors.

    Updated patch lefts deprecated glob1() and glob2() and makes glob to consume all scandir iterator before starting to yield values (but the problem with fd leaks on non-refcounted implementations is still left, bpo-25994). This doesn't affect performance, but lefts the issue with delaying (bpo-22167).

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented Jan 3, 2016

    os.scandir() is called recursively in the last patch and the file descriptors are not closed until returning from the recursion.
    The glob functions should fail explicitly when scandir() raises OSERROR with posix errno set to EMFILE (The process has too many files open), or even better, silently ignore only the PermissionError exception instead of ignoring OSERROR.

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented Jan 3, 2016

    I may be missing something, anyway here are few comments on the patch:
    Is the call to entry.is_symlink() in _iterdir() necessary since is_dir() follows symlinks ?
    If _iterdir() would yield the DirEntry instance instead of DirEntry.name, then _rlistdir() could use x.is_dir() to know whether it is correct to iterate over _rlistdir(x.path, dironly)

    @serhiy-storchaka
    Copy link
    Member

    os.scandir() is called recursively in the last patch and the file descriptors are not closed until returning from the recursion.

    No, os.scandir() is called non-recursively in _iterdir(), and since _iterdir() results are always accumulated in a list, a recursion starts only after exhausting the os.scandir() iterator and closing the file descriptor. We need first to resolve bpo-25994 to close the file descriptor explicitly.

    The glob functions should fail explicitly when scandir() raises OSERROR with posix errno set to EMFILE (The process has too many files open), or even better, silently ignore only the PermissionError exception instead of ignoring OSERROR.

    Patched code passes existing tests. Do you have additional tests?

    Is the call to entry.is_symlink() in _iterdir() necessary since is_dir() follows symlinks ?

    Ah, I thought is_dir() doesn't follow symlinks. Yes, now the call to entry.is_symlink() is not necessary.

    If _iterdir() would yield the DirEntry instance instead of DirEntry.name, then _rlistdir() could use x.is_dir() to know whether it is correct to iterate over _rlistdir(x.path, dironly)

    Yes, but this can complicate the rest of the code. _rlistdir() is called with dironly=False only when the pattern ends with '**'. I'm not sure this is enough important case for optimization. In most cases '**' is used in the middle of the pattern, and all names yielded by _iterdir() are directory names (or broken symlinks).

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented Jan 4, 2016

    and since _iterdir() results are always accumulated in a list

    Right, I failed to note that point. And so, since the file descriptor opened by os.scandir() is closed within each call to recursive _rlistdir(), then my other comment about EMFILE does not stand as well.

    @vstinner
    Copy link
    Member

    vstinner commented Jan 6, 2016

    Related issue: bpo-26032 "Use scandir() to speed up pathlib globbing".

    @gvanrossum
    Copy link
    Member

    (IOW once this patch has been applied maybe you can do the same for globbing in pathlib as requested in issue bpo-26032.)

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented Jan 8, 2016

    Adding a doc patch.

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented Jan 10, 2016

    FWIW I have followed the idea of having _iterdir() yielding the DirEntry entry instead of the name in Serhiy's patch. There is a slight performance gain. Now _glob0() and _glob1() do return a list of directories when dironly is true but there is now another place where OSError must be tracked, so it is not clear if this is worth it.

    glob_scandir_2_diff.patch is the differential patch between glob_scandir_2.patch and glob_scandir_3.patch.

    Here are the performance tests run with both patches.

    $ ./python -m timeit -s "from glob import glob" -- "glob('**/*', recursive=True)"
    glob_scandir_2.patch: 33.1 msec per loop
    glob_scandir_3.patch: 33.8 msec per loop
    
    $ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/*', recursive=True)"
    glob_scandir_2.patch: 927 msec per loop
    glob_scandir_3.patch: 850 msec per loop
    
    $ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/', recursive=True)"
    glob_scandir_2.patch: 423 msec per loop
    glob_scandir_3.patch: 337 msec per loop

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented Jan 10, 2016

    Oops, the list comprehension in _glob1() of glob_scandir_3.patch should be instead: names = [x.name for x in _iterdir(dirname, dironly)]
    This does not change the tests results except it improves glob('/usr/lib*/**/*', True) that gives now 820 msec instead of 927 msec with glob_scandir_2.patch.

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented Jan 10, 2016

    Actually the microbenchmarks comparison between glob_scandir_2.patch and glob_scandir_3.patch must be made by removing the call to entry.is_symlink() also in glob_scandir_2.patch.
    In that case the microbenchmarks give about the same results for each patch.
    Sorry for the noise.

    @gvanrossum
    Copy link
    Member

    @serhiy, I assume this is in your capable hands?

    Personally I'd like to have seen the refactoring in a separate diff from the switch to scandir(), just to ensure that the refactoring does not change anything and to make it easier to read the changes for the scandir() switch.

    @serhiy-storchaka
    Copy link
    Member

    Thank you Xavier for correcting the documentation.

    Here is minimal patch that switches the glob module to scandir().

    @serhiy-storchaka serhiy-storchaka changed the title regular files handled as directories in the glob module Use scandir() to speed the glob module Jan 11, 2016
    @serhiy-storchaka
    Copy link
    Member

    Results of microbenchmarks for glob_scandir_4.patch:

    $ ./python -m timeit -s "from glob import glob" -- "glob('**/*', recursive=True)"
    Unpatched: 275 msec per loop
    Patched:   79.4 msec per loop
    
    $ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/*', recursive=True)"
    Unpatched: 2.11 sec per loop
    Patched:   624 msec per loop
    
    $ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/', recursive=True)"
    Unpatched: 1.79 sec per loop
    Patched:   281 msec per loop

    @serhiy-storchaka serhiy-storchaka changed the title Use scandir() to speed the glob module Use scandir() to speed up the glob module Jan 11, 2016
    @benhoyt
    Copy link
    Mannequin

    benhoyt mannequin commented Jan 11, 2016

    Nice! Good to see scandir() getting used to speed up other functions, functions that folks are using a ton already, like glob().

    @gvanrossum
    Copy link
    Member

    Looks like switching to scandir will also resolve http://bugs.python.org/issue22167 without further action.

    @serhiy-storchaka
    Copy link
    Member

    Not this version of the path. It accumulates all scandir results in a list before yielding values. Otherwise globbing will consume a lot of open file descriptors and could leak them (see bpo-25994, bpo-25995).

    @gvanrossum
    Copy link
    Member

    Oh well. Too bad. Never mind then.

    @serhiy-storchaka
    Copy link
    Member

    The patch is completed now. The with statement is used to avoid FD leaks.

    @dimaqq
    Copy link
    Mannequin

    dimaqq mannequin commented Aug 30, 2016

    Who can review Serhiy's patch?

    P.S. core of the patch seems good to me, but I'm not an expert on stdlib.

    @serhiy-storchaka
    Copy link
    Member

    I think the change in general is approved by GvR. If there are some implementation bugs, we can fix them later.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 6, 2016

    New changeset cb7ee9d9cddd by Serhiy Storchaka in branch 'default':
    Issue bpo-25596: Optimized glob() and iglob() functions in the
    https://hg.python.org/cpython/rev/cb7ee9d9cddd

    @gvanrossum
    Copy link
    Member

    Thanks Serhiy!

    --Guido (mobile)

    On Sep 6, 2016 12:35 PM, "Roundup Robot" <report@bugs.python.org> wrote:

    Roundup Robot added the comment:

    New changeset cb7ee9d9cddd by Serhiy Storchaka in branch 'default':
    Issue bpo-25596: Optimized glob() and iglob() functions in the
    https://hg.python.org/cpython/rev/cb7ee9d9cddd

    ----------
    nosy: +python-dev


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue25596\>


    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 7, 2016

    New changeset 48e573e0a610 by Serhiy Storchaka in branch 'default':
    Issue bpo-25596: Falls back to listdir in glob for bytes paths on Windows.
    https://hg.python.org/cpython/rev/48e573e0a610

    @dimaqq
    Copy link
    Mannequin

    dimaqq mannequin commented Sep 7, 2016

    @serhiy please comment the implications / limitations of the fallback on Windows.

    Is it that scandir cannot handle bytes argument only?
    If argument is unicode, but response set contains bytes paths, will that work?

    @serhiy-storchaka
    Copy link
    Member

    os.scandir() cannot handle bytes argument on Windows. If an argument is string, os.scandir() yields entries with string names, if an argument is bytes object, os.scandir() yields entries with bytes names. Opened bpo-27998 for adding support of bytes paths in os.scandir(). After this will be implemented, a workaround cb7ee9d9cddd can be reverted.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants