Message 164141 - Python tracker


Author	neologix
Recipients	larry, neologix, serhiy.storchaka
Date	2012-06-27.10:20:52
Message-id	<1340792453.46.0.683006432244.issue15200@psf.upfronthosting.co.za>

Content

> On the other hand, fwalk also uses a lot of file descriptors.  Users 
> with processes which were already borderline on max file descriptors 
> might not appreciate upgrading to find their os.walk calls suddenly 
> failing.

It doesn't have to.
Right now, it uses O(depth of the directory tree) FDs. It can be changed to only require O(1) FDs, see http://bugs.python.org/issue13734.
For example, GNU coreutils "rm -rf" uses *at() syscalls and only requires a constant number of FDs.
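To make the *at() point concrete, here is a minimal sketch of an "rm -r"-style removal built on os.fwalk() and the dir_fd argument. The helper name remove_tree is hypothetical, and the sketch assumes a tree with no symlinks to directories and no error handling. Because it is built on fwalk() as described above, it still holds O(depth) FDs; it only shows what FD-relative calls look like, not the O(1) scheme.

```python
import os

def remove_tree(top):
    """Remove everything under `top`, then `top` itself (hypothetical helper)."""
    # topdown=False yields the deepest directories first, so every directory
    # is already empty by the time its parent calls rmdir() on it.
    for root, dirs, files, rootfd in os.fwalk(top, topdown=False):
        for name in files:
            # Relative to the open directory FD, like unlinkat(rootfd, name, 0).
            os.unlink(name, dir_fd=rootfd)
        for name in dirs:
            # Like unlinkat(rootfd, name, AT_REMOVEDIR).
            os.rmdir(name, dir_fd=rootfd)
    os.rmdir(top)
```

Each unlink/rmdir names only the last path component; the directory itself is identified by the FD that fwalk() yields.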

> Can you figure out why fwalk is faster, and apply that advantage to 
> walk *without* consuming so many file descriptors?

I haven't run any benchmarks or tests, but one reason fwalk() is faster could simply be that it does less path resolution - a somewhat expensive operation - thanks to the relative FD being passed.
I guess your mileage will vary with the FS in use and the kernel version (Nick Piggin has done a lot of work to speed up path resolution over the last few years or so).
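The difference in path resolution can be sketched as follows. Both function names are hypothetical, and the second one assumes a platform where os.listdir() accepts a directory FD and os.stat() supports dir_fd (e.g. Linux). In the first, the kernel has to resolve the whole of dirpath again on every call; in the second, it resolves a single component relative to an already-open directory.

```python
import os

def stat_entries_by_path(dirpath):
    """stat() each entry by full path: every call re-resolves dirpath."""
    return {name: os.stat(os.path.join(dirpath, name), follow_symlinks=False)
            for name in os.listdir(dirpath)}

def stat_entries_by_fd(dirpath):
    """stat() each entry relative to one open directory FD (fstatat-style)."""
    dir_fd = os.open(dirpath, os.O_RDONLY)
    try:
        # Only `name` is resolved per call; the directory is given by the FD.
        return {name: os.stat(name, dir_fd=dir_fd, follow_symlinks=False)
                for name in os.listdir(dir_fd)}
    finally:
        os.close(dir_fd)
```

Both return the same information; how much time the second form saves, if any, would depend on the depth of dirpath, the FS and the kernel.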

Anyway, I think such an optimization is useless, because this micro-benchmark doesn't make much sense: when you walk a directory tree, it's usually to do something with the files/directories encountered, and as soon as you do something with them - stat(), unlink(), etc. - the gain in walking time becomes negligible.
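For anyone who wants to check this on their own tree, a rough harness could look like the sketch below. All function names are hypothetical, and no timings are claimed here; the idea is only to compare the walk-only gap with the gap once each file is also stat()ed.

```python
import os
import time

def walk_only(top):
    # Count files using path-based os.walk().
    return sum(len(files) for _, _, files in os.walk(top))

def fwalk_only(top):
    # Count files using FD-based os.fwalk().
    return sum(len(files) for _, _, files, _ in os.fwalk(top))

def walk_and_stat(top):
    # Walk and stat() every file by full path.
    count = 0
    for root, _, files in os.walk(top):
        for name in files:
            os.stat(os.path.join(root, name), follow_symlinks=False)
            count += 1
    return count

def fwalk_and_stat(top):
    # Walk and stat() every file relative to the directory FD.
    count = 0
    for _, _, files, rootfd in os.fwalk(top):
        for name in files:
            os.stat(name, dir_fd=rootfd, follow_symlinks=False)
            count += 1
    return count

def timed(func, top):
    """Return (file count, elapsed seconds) for one run of func(top)."""
    start = time.perf_counter()
    count = func(top)
    return count, time.perf_counter() - start
```

Calling timed() on each of the four functions over the same tree (several times, to even out cache effects) gives the numbers to compare.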
