
Faster os.walk #59405

Closed
serhiy-storchaka opened this issue Jun 27, 2012 · 7 comments
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@serhiy-storchaka
Member

BPO 15200
Nosy @pitrou, @larryhastings, @serhiy-storchaka
Files
  • faster_walk.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2012-10-17.13:20:58.981>
    created_at = <Date 2012-06-27.08:11:22.025>
    labels = ['library', 'performance']
    title = 'Faster os.walk'
    updated_at = <Date 2012-10-17.13:20:58.979>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2012-10-17.13:20:58.979>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-10-17.13:20:58.981>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2012-06-27.08:11:22.025>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['26175']
    hgrepos = []
    issue_num = 15200
    keywords = ['patch']
    message_count = 7.0
    messages = ['164127', '164137', '164141', '164143', '164170', '164172', '173170']
    nosy_count = 6.0
    nosy_names = ['pitrou', 'larry', 'Arfrever', 'neologix', 'rosslagerwall', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue15200'
    versions = ['Python 3.4']

    @serhiy-storchaka
    Member Author

    Using os.fwalk (if it is available), we can make os.walk faster.

    Microbenchmark:
    ./python -m timeit -s "from os import walk" "for x in walk('Lib'): pass"

    Results:
    Vanilla: 112 msec
    Patched: 90.5 msec
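The patch itself is not inlined here, but its core idea can be sketched as follows: delegate to os.fwalk (POSIX-only) and discard the trailing directory file descriptor so the generator matches os.walk's 3-tuple interface. This is a minimal illustration, not the actual patch code; the function name is mine.

```python
import os

def walk_via_fwalk(top):
    """Sketch of the patch's approach: reuse os.fwalk and drop the
    dirfd from each 4-tuple to emulate os.walk's 3-tuple output.
    Only works where os.fwalk exists (Unix)."""
    for dirpath, dirnames, filenames, _dirfd in os.fwalk(top):
        yield dirpath, dirnames, filenames
```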

    @serhiy-storchaka serhiy-storchaka added stdlib Python modules in the Lib dir performance Performance or resource usage labels Jun 27, 2012
    @larryhastings
    Contributor

    It's amusing that using fwalk and throwing away the last argument is faster than a handwritten implementation. On the other hand, fwalk also uses a lot of file descriptors. Users with processes which were already borderline on max file descriptors might not appreciate upgrading to find their os.walk calls suddenly failing.

    Can you figure out why fwalk is faster, and apply that advantage to walk *without* consuming so many file descriptors?

    No rush... :)

    @neologix
    Mannequin

    neologix mannequin commented Jun 27, 2012

    On the other hand, fwalk also uses a lot of file descriptors. Users
    with processes which were already borderline on max file descriptors
    might not appreciate upgrading to find their os.walk calls suddenly
    failing.

    It doesn't have to.
    Right now, it uses O(depth of the directory tree) FDs. It can be changed to only require O(1) FDs, see http://bugs.python.org/issue13734.
    For example, GNU coreutils "rm -rf" uses *at() syscalls and only requires a constant number of FDs.
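The openat()-style pattern mentioned here is exposed in Python through the dir_fd parameter of os.open. A minimal sketch (my illustration, not the coreutils or bpo-13734 code) of descending a path while holding only one file descriptor at a time:

```python
import os

def descend(path_components):
    """Open the final directory in path_components using openat()-style
    calls (os.open with dir_fd), closing each parent fd as soon as its
    child is open, so only O(1) fds are held at any moment.
    The caller must close the returned fd."""
    fd = os.open(path_components[0], os.O_RDONLY)
    for name in path_components[1:]:
        child = os.open(name, os.O_RDONLY, dir_fd=fd)
        os.close(fd)  # release the parent fd immediately
        fd = child
    return fd
```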

    Can you figure out why fwalk is faster, and apply that advantage to
    walk *without* consuming so many file descriptors?

    I didn't run any benchmark or test, but one reason why fwalk() is faster could be simply because it doesn't do as much path resolution - which is a somewhat expensive operation - thanks to the relative FD being passed.
    I guess your mileage will vary with the FS in use and the kernel version (there has been a lot of work by Nick Piggin to speed up path resolution over the last few years).

    Anyway, I think that such optimization is useless, because this micro-benchmark doesn't make much sense: when you walk a directory tree, it's usually to do something with the files/directories encountered, and as soon as you do something with them - stat(), unlink(), etc - the gain on the walking time will become negligible.
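The usage neologix describes can be sketched as below: once the walk does even one stat() per file, the per-entry syscall cost tends to dominate the traversal itself. This is an illustrative example of mine, not code from the issue.

```python
import os

def total_size(top):
    """Walk a tree and stat() every file, the workload neologix has in
    mind: the stat calls, not the directory listing, dominate runtime."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            total += os.stat(os.path.join(dirpath, name)).st_size
    return total
```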

    @larryhastings
    Contributor

    It doesn't have to.
    Right now, it uses O(depth of the directory tree) FDs.
    It can be changed to only require O(1) FDs

    But closing and reopening those file descriptors seems like it might slow it down; would it still be a performance win?

    Also, I'm not a security expert, but would the closing/reopening allow the possibility of timing attacks? If so, that might still be okay for walk which makes no guarantees about safety. (But obviously it would be unacceptable for fwalk.)

    Anyway, I think that such optimization is useless, because this
    micro-benchmark doesn't make much sense: when you walk a
    directory tree, it's usually to do something with the
    files/directories encountered, and as soon as you do something
    with them - stat(), unlink(), etc - the gain on the walking
    time will become negligible.

    I'm not sure that "usually" is true here. I suggest that "usually" people use os.walk to find *particular files* in a directory tree, generally by filename. So most of the time os.walk really is quickly iterating over directory trees doing very little.

    I think 20% is a respectable gain, and it's hard for me to say "no" to functions that make Python faster for free. (Well, for the possible cost of a slightly more expensive algorithm.) So I'm +x for now.
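The "find particular files by name" pattern Larry describes is typically written like this: no per-file syscalls beyond the walk itself, so traversal speed is the whole cost. A small illustrative helper (names are mine, not from the issue):

```python
import fnmatch
import os

def find_files(top, pattern):
    """Yield paths under top whose basename matches the glob pattern.
    Only directory listings are performed; no per-file stat() calls."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in fnmatch.filter(filenames, pattern):
            yield os.path.join(dirpath, name)
```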

    @rosslagerwall
    Mannequin

    rosslagerwall mannequin commented Jun 27, 2012

    This looks like the kind of optimization that depends hugely on what kernel you're using. Maybe on FreeBSD/Solaris/whatever, standard os.walk() is faster?

    If this micro-optimization were accepted, someone would have to be keen enough to test it in different ways on many versions of every platform to conclusively prove that it is faster in general.

    @pitrou
    Member

    pitrou commented Jun 27, 2012

    This looks like the kind of optimization that depends hugely on what
    kernel you're using.

    Agreed.
    Also, I'm worried that there might be subtle differences between walk() and fwalk() which could come and bite users if we silently redirect the former to the latter.

    (for the record, I get a 15% speedup on this Linux box)

    @serhiy-storchaka
    Member Author

    Timing of walk depends on how deep we dive into the directories.

    $ ./python -m timeit -s "from os import walk" "for x in walk('/home/serhiy/py/1/2/3/4/5/6/7/8/9/cpython/'): pass"
    10 loops, best of 3: 398 msec per loop
    $ ./python -m timeit -s "from os import fwalk" "for x in fwalk('/home/serhiy/py/1/2/3/4/5/6/7/8/9/cpython/'): pass"
    10 loops, best of 3: 249 msec per loop

    Given the above mentioned objections (consuming a lot of file descriptors, OS/FS dependency, testing burden) I withdraw my patch and close the issue. Thanks all for discussion.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022