Message228854
I cloned https://github.com/benhoyt/scandir. As I understand it, the --scandir command line option of benchmark.py offers these choices:
- generic = calls listdir() and then yields GenericDirEntry objects, which cache os.stat() and os.lstat() results
- python = implemented with ctypes: calls opendir/readdir and yields PosixDirEntry objects, which use the d_type field from readdir() in is_dir(), is_file() and is_symlink(), and cache the results of os.stat() and os.lstat()
- c = "scandir_helper" (an iterator) implemented in C (Python C API), yielding PosixDirEntry objects (same class as the "python" benchmark)
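To make the "generic" choice concrete, here is a minimal sketch of a listdir()-based iterator whose entries cache os.stat() and os.lstat() results. The names CachedDirEntry and generic_scandir are hypothetical and only illustrate the idea; this is not the code from the scandir repository.

```python
import os
import stat


class CachedDirEntry:
    """Hypothetical stand-in for a generic entry: caches stat results."""

    def __init__(self, dirpath, name):
        self.name = name
        self.path = os.path.join(dirpath, name)
        self._stat = None    # cached os.stat() result (follows symlinks)
        self._lstat = None   # cached os.lstat() result (does not follow)

    def stat(self, follow_symlinks=True):
        # Each of the two system calls is made at most once per entry.
        if follow_symlinks:
            if self._stat is None:
                self._stat = os.stat(self.path)
            return self._stat
        if self._lstat is None:
            self._lstat = os.lstat(self.path)
        return self._lstat

    def is_dir(self, follow_symlinks=True):
        try:
            return stat.S_ISDIR(self.stat(follow_symlinks).st_mode)
        except OSError:
            # For example a broken symlink or an entry removed meanwhile.
            return False

    def is_symlink(self):
        try:
            return stat.S_ISLNK(self.stat(follow_symlinks=False).st_mode)
        except OSError:
            return False


def generic_scandir(path="."):
    """Yield one caching entry per name returned by os.listdir()."""
    for name in os.listdir(path):
        yield CachedDirEntry(path, name)
```

With this design every entry still needs at least one stat call to answer is_dir(), so the cache only helps when the same question is asked more than once.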
I checked with an assertion: d_type of readdir() is never DT_UNKNOWN on my Linux Fedora 20. Statistics of PosixDirEntry on my /usr/share tree:
- 155544 PosixDirEntry instances
- fast-path (use d_type) taken 466632 times in is_dir/is_symlink
- slow-path (need to call os.stat or os.lstat) taken 7828 times in is_dir/is_symlink
- os.stat called 7832 times
- os.lstat called 0 times
7832 is the number of symbolic links in my /usr/share tree. 95% of entries don't need stat() in scandir.walk() when using readdir().
So is_dir() and is_symlink() are called approximately 3 times per entry: scandir.walk() calls is_dir() and is_symlink() on each entry, but is_dir() also calls is_symlink() by default (because the default value of the follow_symlinks parameter is True).
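The fast path and slow path described above can be sketched as follows. DTypeEntry and its counters are hypothetical and are not the PosixDirEntry code from the repository; the DT_* values are the ones from Linux <dirent.h>. The counters here count answers, so they are not meant to reproduce the exact statistics quoted in this message.

```python
import os
import stat

# d_type values from Linux <dirent.h>
DT_UNKNOWN, DT_DIR, DT_REG, DT_LNK = 0, 4, 8, 10


class DTypeEntry:
    """Hypothetical entry that prefers d_type and falls back to stat calls."""

    def __init__(self, path, d_type):
        self.path = path
        self.d_type = d_type
        self._stat = None
        self._lstat = None
        self.fast = 0   # answers taken directly from d_type
        self.slow = 0   # answers that needed an os.stat()/os.lstat() result

    def _cached_stat(self):
        if self._stat is None:
            self._stat = os.stat(self.path)
        return self._stat

    def _cached_lstat(self):
        if self._lstat is None:
            self._lstat = os.lstat(self.path)
        return self._lstat

    def is_symlink(self):
        if self.d_type != DT_UNKNOWN:
            self.fast += 1
            return self.d_type == DT_LNK
        self.slow += 1
        return stat.S_ISLNK(self._cached_lstat().st_mode)

    def is_dir(self, follow_symlinks=True):
        # With follow_symlinks=True a symlink must be resolved, so is_dir()
        # first asks is_symlink(); that is the extra call per entry.
        if self.d_type == DT_UNKNOWN or (follow_symlinks and self.is_symlink()):
            self.slow += 1
            try:
                st = self._cached_stat() if follow_symlinks else self._cached_lstat()
            except OSError:
                return False   # for example a broken symlink
            return stat.S_ISDIR(st.st_mode)
        self.fast += 1
        return self.d_type == DT_DIR
```

For a regular file or directory, one is_dir() call plus one is_symlink() call produces three d_type answers and no system call; only symlinks (or DT_UNKNOWN entries) reach os.stat().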
I ran benchmark.py on my Linux Fedora 20 (Linux kernel 3.14). I have two HDDs configured as RAID0. I don't think that my disk config is relevant: I also have 12 GB of memory, so I expect the /usr/share tree to be fully cached. For example, "free -m" tells me that 8.8 GB are cached.
The generic implementation looks inefficient: it is almost 2 times slower than os.walk(). Is there a bug? GenericDirEntry caches os.stat() and os.lstat() results, so it should be as fast as or faster than os.walk(), no? Or is it the cost of a generator?
The "c" implementation takes 35% less time than the "python" implementation (python=1.170 sec, c=0.762 sec).
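The ratios can be checked with a few lines of arithmetic. This is only a sketch using the timings from the benchmark run in this message; the helper names speedup and time_saved are made up for the illustration. The 35% figure corresponds to the reduction in run time (1 - 0.762/1.170); expressed as a speed ratio, the same pair is about 1.5x.

```python
def speedup(baseline_s, candidate_s):
    """Return how many times as fast the candidate is compared to the baseline."""
    return baseline_s / candidate_s


def time_saved(baseline_s, candidate_s):
    """Return the fraction of the baseline's run time that the candidate saves."""
    return 1.0 - candidate_s / baseline_s


# (os.walk seconds, scandir.walk seconds), copied from the benchmark output
RESULTS = {
    "generic": (1.340, 2.471),
    "python": (1.318, 1.170),
    "c": (1.317, 0.762),
}

if __name__ == "__main__":
    for name, (os_walk_s, scandir_s) in RESULTS.items():
        print(f"{name}: {speedup(os_walk_s, scandir_s):.1f}x as fast as os.walk")
    python_s, c_s = RESULTS["python"][1], RESULTS["c"][1]
    print(f"c saves {time_saved(python_s, c_s):.0%} of the python run time")
```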
Result of benchmark:
haypo@smithers$ python3 setup.py build && for scandir in generic python c; do echo; echo "=== $scandir ==="; PYTHONPATH=build/lib.linux-x86_64-3.3/ python3 benchmark.py /usr/share -c $scandir || break; done
running build
running build_py
running build_ext
=== generic ===
Using very slow generic version of scandir
Comparing against builtin version of os.walk()
Priming the system's cache...
Benchmarking walks on /usr/share, repeat 1/3...
Benchmarking walks on /usr/share, repeat 2/3...
Benchmarking walks on /usr/share, repeat 3/3...
os.walk took 1.340s, scandir.walk took 2.471s -- 0.5x as fast
=== python ===
Using slower ctypes version of scandir
Comparing against builtin version of os.walk()
Priming the system's cache...
Benchmarking walks on /usr/share, repeat 1/3...
Benchmarking walks on /usr/share, repeat 2/3...
Benchmarking walks on /usr/share, repeat 3/3...
os.walk took 1.318s, scandir.walk took 1.170s -- 1.1x as fast
=== c ===
Using fast C version of scandir
Comparing against builtin version of os.walk()
Priming the system's cache...
Benchmarking walks on /usr/share, repeat 1/3...
Benchmarking walks on /usr/share, repeat 2/3...
Benchmarking walks on /usr/share, repeat 3/3...
os.walk took 1.317s, scandir.walk took 0.762s -- 1.7x as fast

Date | User | Action | Args
2014-10-09 10:56:16 | vstinner | set | recipients: + vstinner, pitrou, giampaolo.rodola, tim.golden, benhoyt, abacabadabacaba, akira, socketpair
2014-10-09 10:56:16 | vstinner | set | messageid: <1412852176.42.0.317896048897.issue22524@psf.upfronthosting.co.za>
2014-10-09 10:56:16 | vstinner | link | issue22524 messages
2014-10-09 10:56:15 | vstinner | create |