classification
Title: Add "time zone index" cache to datetime objects
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: belopolsky, ericvw, lemburg, p-ganssle, vstinner
Priority: normal Keywords: patch

Created on 2019-01-11 20:34 by p-ganssle, last changed 2019-01-14 13:59 by p-ganssle.

Pull Requests
URL Status Linked Edit
PR 11529 open p-ganssle, 2019-01-11 20:37
Messages (4)
msg333503 - (view) Author: Paul Ganssle (p-ganssle) * (Python triager) Date: 2019-01-11 20:34
When examining the performance characteristics of pytz, I realized that pytz's eager calculation of tzname, offset and DST gives it an implicit cache that makes repeated queries to .utcoffset(), .dst() and/or .tzname() much faster, though the eager calculation also makes it slower to create an aware datetime that never calls those functions - see my blog post "pytz: The Fastest Footgun in the West" [1].

I do not think that datetime should move to eager calculations (for one thing, it would be a pretty major change), but I did come up with a modest change that makes it possible to implement a Pythonic time zone provider without taking the performance hit: introducing a small, optional, set-once cache for a "time zone index" onto the datetime object.

The idea takes advantage of the fact that essentially all time zones can be described by a very small number of (offset, tzname, dst) combinations plus a function that determines which one applies. Time zone implementations can store these combinations in one or more indexable containers and implement a `tzidx(self, dt)` method returning the relevant index as a function of the datetime. We would provide a per-datetime cache by implementing a `datetime.tzidx(self)` method, which would be a memoized call to `self.tzinfo.tzidx()`, like this (ignoring error handling - a more detailed implementation can be found in the PoC PR):

def tzidx(self):
    # 0xff is the sentinel for "no cached value yet"
    if self._tzidx != 0xff:
        return self._tzidx

    tzidx = self.tzinfo.tzidx(self)
    # Only cache values that fit in the reserved byte; anything else
    # is simply recomputed on every call.
    if isinstance(tzidx, int) and 0 <= tzidx < 255:
        self._tzidx = tzidx
    return tzidx

And then `utcoffset(self, dt)`, `dst(self, dt)` and `tzname(self, dt)` could be implemented in terms of `dt.tzidx()`! This interface would be completely opt-in, and `tzinfo.tzidx` would have no default implementation.
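To make the shape of the opt-in concrete, here is a minimal sketch of such a provider. Everything here is hypothetical illustration (the zone name, the two-entry tables, and the toy month-based rule are all invented); since `datetime.tzidx()` does not exist in any released Python, the sketch calls `self.tzidx(dt)` directly where the proposed memoized `dt.tzidx()` call would go - i.e., it behaves as if every call were a cache miss:

```python
from datetime import datetime, timedelta, tzinfo

class TwoOffsetZone(tzinfo):
    """Hypothetical provider: all zone state lives in three parallel
    lists, and tzidx() picks which entry applies to a given datetime."""
    _offsets = [timedelta(hours=0), timedelta(hours=1)]
    _names = ['STD', 'DST']
    _dsts = [timedelta(0), timedelta(hours=1)]

    def tzidx(self, dt):
        # Toy rule standing in for a real transition lookup:
        # "DST" from April through October.
        return 1 if 4 <= dt.month <= 10 else 0

    # Under the proposal these would use dt.tzidx(), which memoizes
    # self.tzidx(dt) on the datetime; no such method exists today,
    # so the sketch calls self.tzidx(dt) directly.
    def utcoffset(self, dt):
        return self._offsets[self.tzidx(dt)]

    def dst(self, dt):
        return self._dsts[self.tzidx(dt)]

    def tzname(self, dt):
        return self._names[self.tzidx(dt)]

tz = TwoOffsetZone()
print(datetime(2019, 1, 15, tzinfo=tz).tzname())  # STD
print(datetime(2019, 7, 15, tzinfo=tz).tzname())  # DST
```

The point of the proposal is that the (potentially expensive) `tzidx` lookup would run at most once per datetime, while the three query methods stay cheap list indexing.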

Note that I have used 0xff as the signal value here - this is because I propose that the `tzidx` cache be limited to *only* integers in the interval [0, 255), with 255 reserved as the "not set" value. It is exceedingly unlikely that a given time zone will have more than 255 distinct values in its index, and even if it does, this implementation gracefully falls back to "every call is a cache miss".

In my tests, using a single unsigned char for `tzidx` does not increase the size of the `PyDateTime` struct, because it's using a byte that is currently part of the alignment padding anyway. This same trick was used to minimize the impact of `fold`, and I figure it's better to be conservative and enforce 0 <= tzidx < 255, since we can only do it so many times.

The last thing I'd like to note is the problem of mutability - datetime objects are supposed to be immutable, and this cache value actually mutates the datetime struct! While it's true that the in-memory value of the datetime changes, the fundamental concept of immutability is retained, since this does not affect any of the qualities of the datetime observable via the public API.

In fact (and I hope this is not too much of a digression), it is already unfortunately true that datetimes are more mutable than they would seem, because nothing prevents `tzinfo` objects from returning different values on subsequent calls to the timezone-lookup functions. What's worse, datetime's hash implementation takes its UTC offset into account! In practice it's rare for a tzinfo to ever return a different value for utcoffset(dt), but one prominent example where this could be a problem is with something like `dateutil.tz.tzlocal`, which is a local timezone object written in terms of the `time` module's time zone information - which can change if the local timezone information changes over the course of a program's run.
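The hash/offset coupling described above is easy to verify with only the stdlib `timezone` class: an aware datetime hashes according to its UTC equivalent, so the hash is computed through `tzinfo.utcoffset()`, and two datetimes denoting the same instant in different zones hash equal. It follows that if a tzinfo ever returned a different offset, the datetime's effective hash would change out from under any container holding it:

```python
from datetime import datetime, timedelta, timezone

# 13:00 at UTC+1 and 12:00 at UTC are the same instant, so they
# compare equal and must hash equal - which means __hash__ has to
# consult tzinfo.utcoffset().
plus_one = timezone(timedelta(hours=1))
a = datetime(2019, 1, 14, 13, 0, tzinfo=plus_one)
b = datetime(2019, 1, 14, 12, 0, tzinfo=timezone.utc)
assert a == b
assert hash(a) == hash(b)
```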

This change does not necessarily fix that problem or start enforcing immutability of `utcoffset`, but it does encourage a design that is *less susceptible* to these problems, since even if the return value of `tzinfo.tzidx()` changes over time for some reason, that function would only be called once per datetime.

I have an initial PoC for this implemented, and I've tested it with an accompanying implementation of `dateutil.tz` that makes use of it; it does indeed make things much faster. I look forward to your comments.

1. https://blog.ganssle.io/articles/2018/03/pytz-fastest-footgun.html
msg333504 - (view) Author: Paul Ganssle (p-ganssle) * (Python triager) Date: 2019-01-11 20:46
One other thing I might mention here is that I did explore the idea of storing this cache on the tzinfo implementation itself, but it is problematic for a number of reasons:

1. It would either need to use some sort of expiring cache (lru, ttl) or require a great deal of memory, greatly reducing the utility - the proposed implementation requires no additional memory.
2. Because the implementation of datetime.__hash__ invokes utcoffset(), it is impossible to implement utcoffset in terms of a dictionary of tz-aware datetimes. This means that you need to construct a new, naive datetime, which is a fairly slow operation and really puts a damper on the utility of the cache.
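A minimal sketch of the rejected mapping-based approach shows both problems at once (the class name and the stand-in `_lookup` method are invented for illustration):

```python
from datetime import datetime, timedelta, tzinfo

class MappingCachedZone(tzinfo):
    """Hypothetical tzinfo caching offsets in a dict on the tzinfo.
    The key must be a *naive* datetime: hashing an aware datetime
    calls utcoffset(), which would recurse back into this lookup.
    The .replace(tzinfo=None) copy on every call is the slow part,
    and the dict grows with every distinct datetime seen."""
    def __init__(self):
        self._cache = {}

    def _lookup(self, naive_dt):
        # Stand-in for an expensive transition search.
        return timedelta(hours=1)

    def utcoffset(self, dt):
        key = dt.replace(tzinfo=None)   # new naive datetime per call
        try:
            return self._cache[key]
        except KeyError:
            offset = self._cache[key] = self._lookup(key)
            return offset

    def dst(self, dt):
        return timedelta(0)

    def tzname(self, dt):
        return '+01:00'
```

The proposed on-datetime byte avoids both costs: no per-call naive copy, and no growing mapping.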

If I were designing this from scratch, I would probably implement it differently, but I think this approach is pretty close to ideal given the constraints of backwards compatibility.
msg333615 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-01-14 12:57
I dislike adding a public API for an optimization. Would it be possible to make it private? Would it make sense? tzidx => _tzidx.

> One other thing I might mention here is that I did explore the idea of storing this cache on the tzinfo implementation itself, but it is problematic for a number of reasons:
>
> 1. It would either need to use some sort of expiring cache (lru, ttl) or require a great deal of memory, greatly reducing the utility - the proposed implementation requires no additional memory.

In the tests of your PR, the tzinfo allocates memory for its cache:

            offsets = [timedelta(hours=0), timedelta(hours=1)]
            names = ['+00:00', '+01:00']
            dsts = [timedelta(hours=0), timedelta(hours=1)]

This memory isn't free. I don't see how using an index completely prevents the need to allocate memory for a cache.

Somehow, we need a method to clear the cache and to decide on a caching policy. The simplest policy is to have no limit. re.compile() uses a cache of 512 entries; functools.lru_cache uses a default limit of 128 entries.

Instead of adding a new API, would it be possible to reuse functools.lru_cache somehow?

> 2. Because the implementation of datetime.__hash__ invokes utcoffset(), it is impossible to implement utcoffset in terms of a dictionary of tz-aware datetimes. This means that you need to construct a new, naive datetime, which is a fairly slow operation and really puts a damper on the utility of the cache.

For special local timezones, would it be possible to explicitly exclude them, and restrict the cache to simple timezones (fixed offset)?
msg333622 - (view) Author: Paul Ganssle (p-ganssle) * (Python triager) Date: 2019-01-14 13:59
> I dislike adding a public API for an optimization. Would it be possible to make it private? Would it make sense? tzidx => _tzidx.

This also would have been my preference, but it is unfortunately not possible because of the way tzinfo works. tzinfo is an abstract base class, which means that any changes to how `tzinfo` subclasses work *must* be implemented by the third-party time zone providers. The tzidx() method is intended to be called *only* from `tzinfo` implementations, as an optional feature; as such, it must be public.

Maybe a better option would be to make it a magic method, to indicate that it is public but not intended to be called directly by users? `__tzidx__` or `__tzcache__`?


> In the tests of your PR, the tzinfo allocates memory for its cache:

>             offsets = [timedelta(hours=0), timedelta(hours=1)]
>             names = ['+00:00', '+01:00']
>             dsts = [timedelta(hours=0), timedelta(hours=1)]

> This memory isn't free. I don't see how using an index completely prevents the need to allocate memory for a cache.

That part is not free, but that's also not a cache, it's the underlying data for the tzinfo and is not modified over the object's lifetime. Most `tzinfo` subclasses are *already* implemented like this in one way or another, for example `dateutil.tz.tzfile` reads one of the binary tzdata files and produces a list of `ttinfo` objects, then looks up which one applies for each datetime.

The cache part of this is that *that lookup* is expensive, but also super easy to cache. In pytz this cache comes for free, because instead of using a list of offsets and figuring out which one to apply when you call `utcoffset()`, pytz creates a list of `tzinfo` objects with *fixed* offsets and eagerly attaches one to a datetime when you call `pytz.timezone.localize()/normalize()`. It's using the `tzinfo` field as a cache, but at the cost of breaking time zone semantics (which unfortunately use `is` to determine whether two datetimes are in the same zone or not), plus the cost of requiring *eager* resolution to this question rather than lazy.
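The pytz-style eager approach described above can be sketched with toy classes (this is an illustration of the *pattern*, not the real pytz API; the names and the month-based rule are invented). The zone keeps one fixed-offset tzinfo per (offset, tzname, dst) combination and attaches the right one at localize() time, so every later query is a trivial attribute read - at the cost of resolving the offset eagerly and of swapping in a different tzinfo object per offset:

```python
from datetime import datetime, timedelta, tzinfo

class _Fixed(tzinfo):
    """One fixed (offset, name, dst) combination."""
    def __init__(self, offset, name, dst):
        self._offset, self._name, self._dst = offset, name, dst
    def utcoffset(self, dt): return self._offset
    def tzname(self, dt): return self._name
    def dst(self, dt): return self._dst

class EagerZone:
    """pytz-style zone: resolve the applicable combination eagerly
    at localize() time and attach a fixed-offset tzinfo."""
    _infos = [_Fixed(timedelta(0), 'STD', timedelta(0)),
              _Fixed(timedelta(hours=1), 'DST', timedelta(hours=1))]

    def localize(self, naive_dt):
        idx = 1 if 4 <= naive_dt.month <= 10 else 0  # toy rule
        return naive_dt.replace(tzinfo=self._infos[idx])

dt = EagerZone().localize(datetime(2019, 7, 1))
print(dt.tzname())  # DST
```

The proposed tzidx cache aims to recover this "compute once, then cheap reads" behavior while keeping a single lazy tzinfo attached to the datetime.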

> Somehow, we need a method to clear the cache and decide a caching policy. The simplest policy is to have no limit. The re.compile() uses a cache of 512 entries. functools.lru_cache uses a default limit of 128 entries.

This is not necessary, actually, because of the way this cache works. This is a "write-once" cache *stored on the datetime itself*, so its cost is built into the memory cost of each datetime. In CPython it will currently consume one of the alignment bytes, so there will be no additional memory use.

> Instead of adding a new API, would it be possible to reuse functools.lru_cache somehow?

If it were possible to use an LRU cache in this situation, it would not be necessary to make a change in CPython at all. The problems with an LRU cache - or any other cache implemented as a mapping from datetime to offset - are, as I mentioned, large memory use and the cost of constructing an appropriate hash key, both of which would greatly reduce the utility of the cache. The proposed tzidx lets tzinfo implementers opt in to nearly all the advantages of pytz without any of the disadvantages.

>> 2. Because the implementation of datetime.__hash__ invokes utcoffset(), it is impossible to implement utcoffset in terms of a dictionary of tz-aware datetimes. This means that you need to construct a new, naive datetime, which is a fairly slow operation and really puts a damper on the utility of the cache.

> For special local timezones, would it be possible to explicitly exclude them, and restrict the cache to simple timezones (fixed offset)?

I think there are many problems with this approach, but I'm going to hold off on answering this, because I think even if this were possible, an on-datetime cache makes much more sense than a mapping-based cache. If you still want this addressed given my other arguments let me know and I can give you a more thorough response.

Thank you for taking the time to review my proposal, Victor. I realize that it is a bit of a wall of text accompanying a pretty big PR, which is kinda the result of the fact that I had this idea 6+ months ago and have been privately reworking and refining it (both in a code sandbox and in my head) for some time, only to dump out the results of all my internal arguments at once. I appreciate the feedback and I'm happy to do whatever I can to clarify my reasoning.
History
Date User Action Args
2019-01-14 13:59:12p-gansslesetmessages: + msg333622
2019-01-14 12:57:46vstinnersetmessages: + msg333615
2019-01-12 00:01:53ericvwsetnosy: + ericvw
2019-01-11 20:46:52p-gansslesetnosy: + lemburg, belopolsky, vstinner
messages: + msg333504
2019-01-11 20:37:38p-gansslesetkeywords: + patch
stage: patch review
pull_requests: + pull_request11122
2019-01-11 20:34:17p-gansslecreate