This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Author vstinner
Recipients abacabadabacaba, akira, benhoyt, giampaolo.rodola, pitrou, socketpair, tim.golden, vstinner
Date 2014-10-09.10:56:15
Message-id <1412852176.42.0.317896048897.issue22524@psf.upfronthosting.co.za>
In-reply-to
Content
I cloned https://github.com/benhoyt/scandir. As I understand it, the --scandir command line option of benchmark.py accepts these choices:

- generic = calls listdir() and then uses "yield GenericDirEntry", which caches os.stat() and os.lstat() results
- python = implemented with ctypes, calling opendir/readdir and yielding PosixDirEntry objects, which use the d_type field from readdir() in is_dir(), is_file() and is_symlink(). The results of os.stat() and os.lstat() are cached
- c = "scandir_helper" (iterator) implemented in C (Python C API), yielding PosixDirEntry objects (same class as the "python" benchmark)
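To make the "generic" variant concrete, here is a minimal sketch of a listdir()-based generator that yields entries with cached stat results. It is not the code from the scandir repository: the names CachingDirEntry and generic_scandir are hypothetical, and the real GenericDirEntry may differ in its details.

```python
import os
import stat as stat_module


class CachingDirEntry:
    """Hypothetical stand-in for GenericDirEntry: caches stat results."""

    def __init__(self, dirpath, name):
        self.name = name
        self.path = os.path.join(dirpath, name)
        self._stat = None   # cached os.stat() result (follows symlinks)
        self._lstat = None  # cached os.lstat() result (does not follow)

    def stat(self, follow_symlinks=True):
        # Each system call is made at most once per entry.
        if follow_symlinks:
            if self._stat is None:
                self._stat = os.stat(self.path)
            return self._stat
        if self._lstat is None:
            self._lstat = os.lstat(self.path)
        return self._lstat

    def is_symlink(self):
        return stat_module.S_ISLNK(self.stat(follow_symlinks=False).st_mode)

    def is_dir(self, follow_symlinks=True):
        try:
            st = self.stat(follow_symlinks=follow_symlinks)
        except OSError:
            return False  # for example, a broken symlink
        return stat_module.S_ISDIR(st.st_mode)


def generic_scandir(path):
    # No d_type is available here, so every is_dir() needs a stat call.
    for name in os.listdir(path):
        yield CachingDirEntry(path, name)
```

With this design every entry still costs at least one stat call the first time is_dir() is used, which is the same work os.walk() does per entry.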


I checked with an assertion: the d_type returned by readdir() is never DT_UNKNOWN on my Fedora 20 system. Statistics for PosixDirEntry on my /usr/share tree:

- 155544 PosixDirEntry instances
- fast-path (use d_type) taken 466632 times in is_dir/is_symlink
- slow-path (need to call os.stat or os.lstat) taken 7828 times in is_dir/is_symlink
- os.stat called 7832 times
- os.lstat called 0 times

7832 is the number of symbolic links in my /usr/share tree. 95% of entries don't need stat() in scandir.walk() when using readdir().
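The fast-path/slow-path split can be sketched as follows. This is only an illustration of the idea, not the PosixDirEntry code: the function is_dir() and its signature are hypothetical, and the DT_* constants are the Linux values from <dirent.h>.

```python
import os
import stat

# d_type values from <dirent.h> on Linux.
DT_UNKNOWN = 0
DT_DIR = 4
DT_LNK = 10


def is_dir(d_type, path, follow_symlinks=True):
    """Hypothetical helper: decide from d_type when possible."""
    if d_type == DT_UNKNOWN or (follow_symlinks and d_type == DT_LNK):
        # Slow path: the file system did not report a type, or the
        # entry is a symlink whose target type must be looked up.
        try:
            st = os.stat(path) if follow_symlinks else os.lstat(path)
        except OSError:
            return False  # for example, a broken symlink
        return stat.S_ISDIR(st.st_mode)
    # Fast path: readdir() already gave the answer, no system call.
    return d_type == DT_DIR
```

Under this sketch only symlinks (and DT_UNKNOWN entries, which did not occur here) reach os.stat(), which is consistent with the count of os.stat() calls matching the number of symbolic links.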

So is_dir() and is_symlink() are called approximately 3 times per entry in total: scandir.walk() calls is_dir() and is_symlink() on each entry, and is_dir() itself calls is_symlink() by default (because the follow_symlinks parameter defaults to True).
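The ratios quoted above can be recomputed from the counters. The variable names are only for illustration; the numbers are the ones listed in this message.

```python
entries = 155544      # PosixDirEntry instances
fast_path = 466632    # fast-path hits in is_dir/is_symlink
stat_calls = 7832     # os.stat() calls, equal to the number of symlinks

# Fast-path hits per entry: exactly 3.
fast_per_entry = fast_path / entries

# Share of entries that need a stat() call (the symlinks): about 5%.
stat_share = stat_calls / entries

print(fast_per_entry)
print("%.1f%% of entries need stat()" % (100 * stat_share))
print("%.1f%% of entries do not" % (100 * (1 - stat_share)))
```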


I ran benchmark.py on Fedora 20 (Linux kernel 3.14). I have two HDDs configured as RAID0, but I don't think my disk configuration is relevant: I also have 12 GB of memory, so I expect the /usr/share tree to be fully cached. For example, "free -m" tells me that 8.8 GB are cached.

The generic implementation looks inefficient: it is almost 2 times slower than os.walk() (2.471 sec vs 1.340 sec). Is there a bug? GenericDirEntry caches the os.stat() and os.lstat() results, so it should be as fast as or faster than os.walk(), no? Or is it the cost of a generator?

The "c" implementation is 35% faster than the "python" implementation, in the sense that it takes 35% less time (python=1.170 sec, c=0.762 sec).
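For reference, the speedup figures can be derived from the timings reported in this message as shown below. The dictionary layout and variable names are just for illustration.

```python
# (os.walk seconds, scandir.walk seconds) for each --scandir choice,
# taken from the benchmark output in this message.
timings = {
    "generic": (1.340, 2.471),
    "python": (1.318, 1.170),
    "c": (1.317, 0.762),
}

# Speedup of scandir.walk over os.walk, as printed by benchmark.py.
speedups = {name: walk / scan for name, (walk, scan) in timings.items()}
for name, ratio in speedups.items():
    print("%s: %.1fx as fast as os.walk" % (name, ratio))

# "c" versus "python": fraction of time saved and speed ratio.
python_time = timings["python"][1]
c_time = timings["c"][1]
time_saved = 1 - c_time / python_time
speed_ratio = python_time / c_time
print("c takes %.0f%% less time than python" % (100 * time_saved))
print("c is %.2fx as fast as python" % speed_ratio)
```

Note that "35% less time" corresponds to a speed ratio of roughly 1.5x, so the two ways of stating the gain should not be mixed up.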


Result of benchmark:

haypo@smithers$ python3 setup.py build && for scandir in generic python c; do echo; echo "=== $scandir ==="; PYTHONPATH=build/lib.linux-x86_64-3.3/ python3 benchmark.py /usr/share -c $scandir || break; done
running build
running build_py
running build_ext

=== generic ===
Using very slow generic version of scandir
Comparing against builtin version of os.walk()
Priming the system's cache...
Benchmarking walks on /usr/share, repeat 1/3...
Benchmarking walks on /usr/share, repeat 2/3...
Benchmarking walks on /usr/share, repeat 3/3...
os.walk took 1.340s, scandir.walk took 2.471s -- 0.5x as fast

=== python ===
Using slower ctypes version of scandir
Comparing against builtin version of os.walk()
Priming the system's cache...
Benchmarking walks on /usr/share, repeat 1/3...
Benchmarking walks on /usr/share, repeat 2/3...
Benchmarking walks on /usr/share, repeat 3/3...
os.walk took 1.318s, scandir.walk took 1.170s -- 1.1x as fast

=== c ===
Using fast C version of scandir
Comparing against builtin version of os.walk()
Priming the system's cache...
Benchmarking walks on /usr/share, repeat 1/3...
Benchmarking walks on /usr/share, repeat 2/3...
Benchmarking walks on /usr/share, repeat 3/3...
os.walk took 1.317s, scandir.walk took 0.762s -- 1.7x as fast
History
Date User Action Args
2014-10-09 10:56:16 vstinner set recipients: + vstinner, pitrou, giampaolo.rodola, tim.golden, benhoyt, abacabadabacaba, akira, socketpair
2014-10-09 10:56:16 vstinner set messageid: <1412852176.42.0.317896048897.issue22524@psf.upfronthosting.co.za>
2014-10-09 10:56:16 vstinner link issue22524 messages
2014-10-09 10:56:15 vstinner create