Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use scandir() to speed up pathlib globbing #70220

Closed
gvanrossum opened this issue Jan 6, 2016 · 11 comments
Closed

Use scandir() to speed up pathlib globbing #70220

gvanrossum opened this issue Jan 6, 2016 · 11 comments
Assignees
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@gvanrossum
Copy link
Member

BPO 26032
Nosy @gvanrossum, @brettcannon, @pitrou, @benhoyt, @ethanfurman, @serhiy-storchaka, @barneygale
PRs
  • bpo-43012: remove pathlib._Accessor #25701
  • Dependencies
  • bpo-25596: Use scandir() to speed up the glob module
  • bpo-25994: File descriptor leaks in os.scandir()
  • Files
  • pathlib_glob_scandir.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2016-09-07.11:18:35.524>
    created_at = <Date 2016-01-06.22:55:27.013>
    labels = ['library', 'performance']
    title = 'Use scandir() to speed up pathlib globbing'
    updated_at = <Date 2021-05-14.08:09:53.071>
    user = 'https://github.com/gvanrossum'

    bugs.python.org fields:

    activity = <Date 2021-05-14.08:09:53.071>
    actor = 'vstinner'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2016-09-07.11:18:35.524>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2016-01-06.22:55:27.013>
    creator = 'gvanrossum'
    dependencies = ['25596', '25994']
    files = ['41577']
    hgrepos = []
    issue_num = 26032
    keywords = ['patch']
    message_count = 11.0
    messages = ['257653', '257654', '257656', '257657', '257660', '257664', '257665', '257679', '257955', '259283', '274776']
    nosy_count = 8.0
    nosy_names = ['gvanrossum', 'brett.cannon', 'pitrou', 'benhoyt', 'ethan.furman', 'python-dev', 'serhiy.storchaka', 'barneygale']
    pr_nums = ['25701']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue26032'
    versions = ['Python 3.6']

    @gvanrossum
    Copy link
    Member Author

    The globbing functionality in pathlib (Path.glob() and Path.rglob()) might benefit from using the new optimized os.scandir() interface. It currently just uses os.listdir(). The Path.iterdir() method might also benefit (though less so).

    There's also a sideways connection with http://bugs.python.org/issue26031 (adding an optional stat cache) -- the cache could possibly keep the DirEntry objects and use their (hopefully cached) attributes. This is more speculative though (and what if the platform's DirEntry doesn't cache?)

    @gvanrossum gvanrossum added the performance Performance or resource usage label Jan 6, 2016
    @vstinner
    Copy link
    Member

    vstinner commented Jan 6, 2016

    Related issue: issue bpo-25596 "regular files handled as directories in the glob module".

    @ethanfurman
    Copy link
    Member

    As I recall, if the platform's DirEntry doesn't provide the cacheable attributes when first called, those attributes will be looked up (and cached) on first access.

    @vstinner
    Copy link
    Member

    vstinner commented Jan 6, 2016

    As I recall, if the platform's DirEntry doesn't provide the cacheable attributes when first called, those attributes will be looked up (and cached) on first access.

    scandir() is not magic. It simply provides info given by the OS: see readdir() on UNIX and FindFirstFile()/FindNextFile() on Windows.

    DirEntry calls os.stat() if needed, but it caches the result.

    DirEntry doc tries to explain when syscalls or required or not, depending on the requested information and the platform:
    https://docs.python.org/dev/library/os.html#os.DirEntry

    @gvanrossum
    Copy link
    Member Author

    The DirEntry docs say for most methods "In most cases, no system call is
    required" which is pretty non-committal. :-( The only firm promise is for
    inode(), which is pretty useless.

    @benhoyt
    Copy link
    Mannequin

    benhoyt mannequin commented Jan 7, 2016

    Guido, it's true that in almost all cases you get the speedup (no system call), and it's very much worth using. But the idea with the docs being non-committal is because being specific would make the docs fairly complex. I believe it's as follows for is_file/is_dir/is_symlink:

    • no system call required on Windows or Unix if the entry is not a symlink
    • unless you're on Unix with some different file system (maybe a network FS?) where d_type is DT_UNKNOWN
    • some other edge case which I've probably forgotten :-)

    Do you think the docs should try to make this more specific?

    @gvanrossum
    Copy link
    Member Author

    Ben, I think it's worth calling out what the rules are around symlinks. I'm
    guessing the info that is initially present is a subset of lstat(), so if
    that indicates it's a symlink, is_dir() and is_file() will need a stat()
    call, *unless* follow_symlinks is False.

    Another question: for symlinks, there are two different possible stat
    results: one for stat() and one for lstat(). Are these both cached
    separately? Or is only one of them? (Experimentally, they are either both
    cached or the cache remembers the follow_symlinks flag and re-fetches the
    other result.)

    Related, "this method always requires a system call", that remark seems to
    disregard the cache.

    I'd be happy to review a doc update patch if you make one.

    @vstinner
    Copy link
    Member

    vstinner commented Jan 7, 2016

    "Another question: for symlinks, there are two different possible stat
    results: one for stat() and one for lstat(). Are these both cached
    separately?"

    Hopefully, both are cached. It's directly the result of stat() and
    stat(follow_symlinks=False) which are cached (so a os.stat_result
    object).

    @serhiy-storchaka serhiy-storchaka added the stdlib Python modules in the Lib dir label Jan 7, 2016
    @serhiy-storchaka
    Copy link
    Member

    Proposed minimal patch implements globbing in pathlib using os.scandir(). Here are results of microbenchmarks:

    $ ./python -m timeit -s "from pathlib import Path; p = Path()" -- "list(p.glob('**/*'))"
    Unpatched: 598 msec per loop
    Patched:   372 msec per loop
    
    $ ./python -m timeit -s "from pathlib import Path; p = Path('/usr/')" -- "list(p.glob('lib*/**/*'))"
    Unpatched: 1.33 sec per loop
    Patched:   804 msec per loop
    
    $ ./python -m timeit -s "from pathlib import Path; p = Path('/usr/')" -- "list(p.glob('lib*/**/'))"
    Unpatched: 750 msec per loop
    Patched:   180 msec per loop

    See msg257954 in bpo-25596 for comparison with the glob module.

    @serhiy-storchaka serhiy-storchaka self-assigned this Jan 11, 2016
    @benhoyt
    Copy link
    Mannequin

    benhoyt mannequin commented Jan 31, 2016

    Guido, I've made some tweaks and improvements to the DirEntry docs here: http://bugs.python.org/issue26248 -- the idea is to fix the issues you mentioned to clarify when system calls are required with symlinks, mentioning that the results are cached separately for follow_symlinks True and False, etc.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 7, 2016

    New changeset 927665c4aaab by Serhiy Storchaka in branch 'default':
    Issue bpo-26032: Optimized globbing in pathlib by using os.scandir(); it is now
    https://hg.python.org/cpython/rev/927665c4aaab

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    performance Performance or resource usage stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants