
Author ncoghlan
Recipients akvadrako, brett.cannon, eric.snow, ncoghlan
Date 2017-10-13.05:00:46
Message-id <1507870848.64.0.213398074469.issue31772@psf.upfronthosting.co.za>
In-reply-to
Content
Increasing the number of stat calls required for a successful import is a good reason to close the submitted PR, but I'm not sure it's a good reason to close the *issue*, as there may be other ways to solve it that don't result in an extra stat call for every successful cache hit.

Restating the problem: the pyc file format currently discards the fractional portion of the source file's mtime. This means that even if the source filesystem offers better-than-1-second timestamp resolution, the bytecode cache doesn't.

So I think it's worth asking ourselves what would happen if, instead of storing the source mtime as an integer directly, we stored "int(mtime * N) & 0xFFFF_FFFF" (i.e. still masked to the existing 32-bit field).

The source timestamp is stored in a 32-bit field, so the current pyc format is technically already subject to a variant of the 2038 epoch problem (i.e. it will wrap in 2106 and start re-using timestamps). We just don't care, since the only impact is that there's a tiny risk that we'll fail to recompile an updated source file if it hasn't changed size and we try importing it at exactly the wrong time. That window is currently 1 second every ~136 years.

That means we have a trade-off available between the size of each individual "erroneous cache hit" window, and how often we encounter that window. Some examples:

N=2: 500 ms window every ~68 years
N=10: 100 ms window every ~13.6 years
N=100: 10 ms window every ~1.36 years
N=1000: 1 ms window every ~7 weeks (~0.136 years)

The odds of a file being modified exactly ~7 weeks (down to the millisecond) after it was last compiled *and* being different without changing size are going to be lower than those of a single- (or multi-) character change being made *right now* (e.g. fixing a typo in a variable name that transposed characters, or got a letter wrong).

The most plausible way to hit a problem with the status quo is a text editor with autosave enabled combined with a test web service that has hot reloading configured.

Don't get me wrong, I think the odds of that actually happening are already very low, and the human fix is simple (make another edit, save the source file again, and grumble about computers not seeing changes that are right in front of them).


However, encountering problems with an N=100 or N=1000 multiplier seems even more implausible to me, and in cases where it was deemed a concern, PEP 552's hash-based caching seems like the solution people should be looking at anyway.
History
Date User Action Args
2017-10-13 05:00:49  ncoghlan  set  recipients: + ncoghlan, brett.cannon, eric.snow, akvadrako
2017-10-13 05:00:48  ncoghlan  set  messageid: <1507870848.64.0.213398074469.issue31772@psf.upfronthosting.co.za>
2017-10-13 05:00:48  ncoghlan  link  issue31772 messages
2017-10-13 05:00:47  ncoghlan  create