classification
Title: Add stat caching option to pathlib
Type: enhancement
Stage: test needed
Components:
Versions: Python 3.6
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: brett.cannon, gvanrossum, pitrou
Priority: normal
Keywords: patch

Created on 2016-01-06 22:41 by gvanrossum, last changed 2016-01-07 17:53 by brett.cannon.

Files
File name: statcache.diff
Uploaded: gvanrossum, 2016-01-06 22:41
Messages (3)
msg257651 - Author: Guido van Rossum (gvanrossum) (Python committer) Date: 2016-01-06 22:41
There are concerns that pathlib is inefficient because it doesn't cache stat() operations. For example, this code calls stat() twice for each result (once inside the glob, and a second time to answer the is_symlink() question):

  import pathlib
  p = pathlib.Path('/usr')
  # Collect every symlink found anywhere under /usr.
  links = [x for x in p.rglob('*') if x.is_symlink()]

I have a tentative patch (without tests). On my Mac it only gives modest speedups (between 5 and 20 percent) but things may be different on other platforms or for applications that make a lot of inquiries about the same path.

The API I am proposing is that by default nothing changes; to benefit from caching you must instantiate a StatCache() object and pass it to Path() constructor calls, e.g. Path('/usr', stat_cache=cache_object). All Path objects derived from this path object will share the cache. To force an uncached Path object you can use Path(p).
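
To make the shape of this proposal concrete, here is a minimal sketch. It is not the code in statcache.diff and it is not pathlib API: the internals of StatCache, the stand-in class CachedPath, and the misses counter are all assumptions made for illustration. It only shows the three behaviours described above: caching is opt-in, derived paths share the cache, and rewrapping a path yields an uncached object.

```python
import os
import stat


class StatCache:
    """Hypothetical shared cache of os.lstat() results, keyed by path string."""

    def __init__(self):
        self._results = {}
        self.misses = 0  # counts real lstat() calls, for illustration only

    def lstat(self, path):
        key = os.fspath(path)
        try:
            return self._results[key]
        except KeyError:
            self.misses += 1
            result = self._results[key] = os.lstat(key)
            return result


class CachedPath:
    """Illustrative stand-in for a Path that carries an optional StatCache."""

    def __init__(self, path, stat_cache=None):
        self.path = os.fspath(path)
        self.stat_cache = stat_cache

    def __fspath__(self):
        return self.path

    def __truediv__(self, name):
        # Derived paths share the parent's cache, as the proposal describes.
        return CachedPath(os.path.join(self.path, name), self.stat_cache)

    def lstat(self):
        if self.stat_cache is None:
            return os.lstat(self.path)  # uncached: always asks the filesystem
        return self.stat_cache.lstat(self.path)

    def is_symlink(self):
        try:
            return stat.S_ISLNK(self.lstat().st_mode)
        except FileNotFoundError:
            return False
```

With this sketch, CachedPath('/usr', stat_cache=StatCache()) / 'bin' reuses the same cache, while CachedPath(p) drops it. A cached result can go stale if the file changes on disk, which is one reason to keep the behaviour opt-in.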

The patch is incomplete; there are no tests for the new functionality (though existing tests pass) and __eq__ should be adjusted so that Path objects using different caches always compare unequal.

Question for Antoine: Did you perhaps anticipate a design like this? Each Path instance has an _accessor slot, but there is only one accessor instance defined that is used everywhere (the global _normal_accessor). So you could have avoided a bunch of complexity in the code around setting the proper _accessor unless you were planning to use multiple accessors.
msg257652 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2016-01-06 22:47
Let me first mention that the stat caching question came up during the PEP discussion, and ultimately it was decided that it's not a good idea to meld it into the Path design :-)

Early versions of pathlib were more complex as they were able to keep some file descriptors around, for example for openat() support. You can probably find them by digging in the original repo.
msg257658 - Author: Guido van Rossum (gvanrossum) (Python committer) Date: 2016-01-06 23:06
That's fair, though I don't know what kind of caching design was considered
(until Ram's suggestion on python-ideas I had thought the cache would
simply use a slot on the Path instance to hold the stat() result).
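
For contrast, the simpler per-instance design mentioned here might look roughly like the sketch below. The class SlotCachedPath and its refresh() method are hypothetical names invented for this illustration; nothing like them exists in pathlib.

```python
import os
import stat


class SlotCachedPath:
    """Hypothetical path object that keeps its own lstat() result in a slot."""

    __slots__ = ("path", "_lstat_result")

    def __init__(self, path):
        self.path = os.fspath(path)
        self._lstat_result = None

    def lstat(self):
        # Only the first call reaches the filesystem; later calls reuse the slot.
        if self._lstat_result is None:
            self._lstat_result = os.lstat(self.path)
        return self._lstat_result

    def is_symlink(self):
        return stat.S_ISLNK(self.lstat().st_mode)

    def refresh(self):
        # Drop the stored result so the next query asks the filesystem again.
        self._lstat_result = None
```

Because the result lives on the instance, two objects for the same path do not share it, and a stored result stays in place until refresh() is called, even if the file has since changed or been removed.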

If we want pathlib to become pervasive we may have to accept that its
implementation may become more complex in order to support more use cases.
(So far I'm holding off on a walk() method, but who knows for how long...)
History
Date User Action Args
2016-01-07 17:53:55 brett.cannon set nosy: + brett.cannon
2016-01-06 23:06:25 gvanrossum set messages: + msg257658
2016-01-06 22:47:01 pitrou set messages: + msg257652
2016-01-06 22:41:29 gvanrossum create