Author benhoyt
Recipients abacabadabacaba, akira, benhoyt, giampaolo.rodola, josh.r, pitrou, socketpair, tebeka, tim.golden, vstinner
Date 2015-02-13.04:39:24
Message-id <1423802365.04.0.822831901186.issue22524@psf.upfronthosting.co.za>
Content
To continue the actual "which implementation" discussion: as I mentioned last week in http://bugs.python.org/msg235458, I think the benchmarks above show pretty clearly we should use the all-C version.

For background: PEP 471 doesn't add any new functionality, and, especially now that we have the new pathlib module, it doesn't make directory iteration syntax any nicer either. os.scandir() is all about letting the OS give you whatever info it can *for performance*. Most of the Rationale given in PEP 471 for adding scandir is that it can be so much faster than listdir + stat.
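
As a rough illustration of the "listdir + stat" pattern versus scandir, here is a minimal sketch that lists subdirectories both ways. It assumes the os.scandir() API as it later shipped in the standard library (DirEntry.name, DirEntry.is_dir(), and context-manager support from Python 3.6), not the exact API of any patch discussed here; the helper names subdirs_listdir and subdirs_scandir are hypothetical.

```python
import os
import tempfile


def subdirs_listdir(path):
    """listdir + stat: os.path.isdir() does a separate stat() per entry."""
    return sorted(
        name for name in os.listdir(path)
        if os.path.isdir(os.path.join(path, name))
    )


def subdirs_scandir(path):
    """scandir: is_dir() can usually reuse the type info the OS returned
    while reading the directory, avoiding the extra stat() call."""
    with os.scandir(path) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())


# Small demo on a throwaway directory (hypothetical names "pkg", "notes.txt").
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "pkg"))
    open(os.path.join(root, "notes.txt"), "w").close()
    listdir_result = subdirs_listdir(root)
    scandir_result = subdirs_scandir(root)
```

Both helpers return the same names; the difference is only in how many system calls are needed per entry, which is where the speedups discussed in this message come from.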

My original all-C implementation is definitely more code to review (roughly 800 lines of C vs scandir-6.patch's 400), but it's also more than twice as fast. On my Windows 7 machine (with an SSD), running benchmark.py just now:

    Original scandir-2.patch version:
    os.walk took 0.509s, scandir.walk took 0.020s -- 25.4x as fast

    New scandir-6.patch version:
    os.walk took 0.455s, scandir.walk took 0.046s -- 10.0x as fast

So the all-C implementation is about 2.5x as fast on Windows (25.4x vs 10.0x). (After both tests, just as a sanity check, I ran the ctypes version as well, and it reported about 8x as fast for both runs.)

Then on Linux, it's not a perfect comparison (different benchmarks), but it shows the same kind of trend:

    Original scandir-2.patch benchmark (http://bugs.python.org/msg228857):
    os.walk took 0.860s, scandir.walk took 0.268s -- 3.2x as fast

    New scandir-6.patch benchmark (http://bugs.python.org/msg235865) -- note that "1.3x faster" should actually read "1.3x as fast" here:
    bench: 1.3x faster (scandir: 164.9 ms, listdir: 216.3 ms)

So again, the all-C implementation is about 2.5x as fast on Linux too (3.2x vs 1.3x), with the caveat that the two figures come from different benchmarks.
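
The arithmetic behind these comparisons can be sketched as follows. This only re-derives ratios from the figures quoted in this message; the function names as_fast and percent_faster are hypothetical, and the Linux ratio mixes two different benchmarks, so treat it as a rough indication.

```python
def as_fast(old_seconds, new_seconds):
    """'N x as fast': how many times the old time fits into the new time."""
    return old_seconds / new_seconds


def percent_faster(old_seconds, new_seconds):
    """'N % faster': the same ratio expressed as a gain over 1.0x."""
    return (as_fast(old_seconds, new_seconds) - 1.0) * 100.0


# Windows: speedups over os.walk as reported by benchmark.py.
windows_relative = 25.4 / 10.0          # scandir-2 vs scandir-6

# Linux: recomputed from the quoted timings (different benchmarks).
linux_all_c = as_fast(0.860, 0.268)     # scandir-2: os.walk vs scandir.walk
linux_half_c = as_fast(216.3, 164.9)    # scandir-6: listdir vs scandir
linux_relative = linux_all_c / linux_half_c

print(f"Windows: all-C is {windows_relative:.2f}x as fast as scandir-6")
print(f"Linux:   all-C is {linux_relative:.2f}x as fast as scandir-6")
print(f"scandir-6 on Linux: {linux_half_c:.1f}x as fast, "
      f"i.e. about {percent_faster(216.3, 164.9):.0f}% faster")
```

The last line shows why "1.3x faster" is misleading: 1.3x as fast corresponds to roughly 31% faster, not 130% faster.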

And on Linux, the incremental improvement provided by scandir-6 over listdir is hardly worth it -- I'd switch to a new directory listing API to be 3.2x as fast, but not to be 1.3x as fast.

Admittedly a 10x speed gain (!) on Windows is still very much worth going for, so I'm positive about scandir even with a half-Python implementation, but hopefully the above shows fairly clearly why the all-C implementation is important, especially on Linux.

Also, if the consensus is in favour of a slower implementation with less C code, I think there are further tweaks we can make to the Python part of the code to improve things a bit more.
History
Date                 User     Action  Args
2015-02-13 04:39:25  benhoyt  set     recipients: + benhoyt, tebeka, pitrou, vstinner, giampaolo.rodola, tim.golden, abacabadabacaba, akira, socketpair, josh.r
2015-02-13 04:39:25  benhoyt  set     messageid: <1423802365.04.0.822831901186.issue22524@psf.upfronthosting.co.za>
2015-02-13 04:39:25  benhoyt  link    issue22524 messages
2015-02-13 04:39:24  benhoyt  create