Issue 31772: SourceLoader uses stale bytecode in case of equal mtime seconds

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/75953

classification

Title:	SourceLoader uses stale bytecode in case of equal mtime seconds
Type:	behavior	Stage:	test needed
Components:	Interpreter Core	Versions:	Python 3.7

process

Status:	open	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	akvadrako, brett.cannon, eric.snow, iritkatriel, ncoghlan
Priority:	normal	Keywords:	patch

Created on 2017-10-12 13:43 by akvadrako, last changed 2022-04-11 14:58 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 3963	closed	akvadrako, 2017-10-12 13:48

Messages (18)
msg304236 - (view)	Author: Devin Bayer (akvadrako) *	Date: 2017-10-12 13:43
The current import machinery will use stale cached bytecode if the source is modified more than once per second and with equal size. A straightforward fix is to require the bytecode mtime to be greater than the source file mtime. In the unusual case where the file is written twice with the precision the filesystem records, this will ignore the cache, but at least we aren't behaving incorrectly.
msg304279 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2017-10-12 18:32
The problem with that is it will increase the number of stat calls which we have always tried to minimize due to performance worries. https://www.python.org/dev/peps/pep-0552/ has also been accepted which will take care of this specific case. So while I appreciate the work on this, Devin, I'm going to close this in favour of people who have this problem to use deterministic .pyc files instead.
msg304285 - (view)	Author: Devin Bayer (akvadrako) *	Date: 2017-10-12 20:04
Hello Brett, it would be nice if you would discuss this before closing. The referenced PEP will not solve my use case, because hash-based checking will not become the default. The issue is that by default the source file loader will be returning stale bytecode and that's incorrect behavior. If you wish to avoid the stat call you could use a higher resolution timestamp in the pyc file, or better yet add a second field indicating bytecode compilation time.
msg304305 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-10-13 05:00
Increasing the number of stat calls required for a successful import is a good reason to close the submitted PR, but I'm not sure it's a good reason to close the issue, as there may be other ways to solve it that don't result in an extra stat call for every successful cache hit. Restating the problem: the pyc file format currently discards the fractional portion of the source file mtime. This means that even if the source filesystem offers a better than 1 second timestamp resolution, the bytecode cache doesn't. So I think it's worth asking ourselves what would happen if, instead of storing the source mtime as an integer directly, we instead stored "int(mtime * N) & 0xFFFF". The source timestamp is stored in a 32-bit field, so the current pyc format is technically already subject to a variant of the 2038 epoch problem (i.e. it will wrap in 2106 and start re-using timestamps). We just don't care, since the only impact is that there's a tiny risk that we'll fail to recompile an updated source file if it hasn't changed size and we try importing it at exactly the wrong time. That window is currently 1 second every ~136 years. That means we have a trade-off available between the size of each individual "erroneous cache hit" window, and how often we encounter that window. Some examples: N=2: 500 ms window every ~68 years N=10: 100 ms window every ~13.6 years N=100: 10 ms window every ~1.36 years N=1000: 1 ms window every ~7 weeks (~0.136 years) The odds of a file being in exactly 7 weeks time after it was last compiled (down to the millisecond) and being different without changing size are going to be lower that those of a single (or N) character change being made right now (e.g. fixing a typo in a variable name that transposed characters, or got a letter wrong). A case where problems with the status quo could be most plausibly encountered is when a text editor with autosave configured is combined with a testing web service with hot reloading configured. Don't get me wrong, I think the odds of that actually happening are already very low, and the human fix is simple (make another edit, save the source file again, and grumble about computers not seeing changes that are right in front of them). However, encountering problems with an N=100 or N=1000 multiplier seems even more implausible to me, and in cases where it was deemed a concern, PEP 552's hash-based caching seems a solution people should be looking at anyway.
msg304310 - (view)	Author: Devin Bayer (akvadrako) *	Date: 2017-10-13 06:19
Thanks for the support Nick. I think your proposed idea would still result in rare but confusing behavior, which is the type of surprise Python should avoid. The hash-based pyc files doesn't seem like a solution to me, because it won't be enabled by default. And I think it's obvious the performance loss of doing so is unacceptable. If changing the bytecode format is unacceptable, though it seems like the cleanest answer, the import machinery could just avoid caching bytecode when the int(mtime) of the source file is potentially in the future.
msg304311 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-10-13 06:47
The problem with changing the bytecode format is that code other than the import machinery reads the bytecode headers, so when we change the format, we need to consider the impact on that code. (Even my multiplication proposal above suffers from that problem) A freshness check that would avoid the extra stat call, while still making the import system skeptical of the validity of the bytecode cache for just-edited sources would be to also check the source mtime against the current time: if they're the same within the resolution of the bytecode format (i.e. 1 second), then compile the code again even if the bytecode headers claims it's already up to date. That way hot reloaders would be sure to pick up multiple edits to the source file properly, and would reliably be left with the final version loaded, rather than the first version from the final second of edits.
msg304312 - (view)	Author: Devin Bayer (akvadrako) *	Date: 2017-10-13 06:51
That wouldn't always work either. If the source file is imported, then edited, then not reimported until the next second (or far in the future) the stale bytecode would still be used.
msg304314 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-10-13 07:11
If there's no hot reloader forcing a reimport for every saved edit, it's sufficiently unlikely to encounter this problem in the first place that I'm not worried about that scenario. (The time required for a human to context switch from code editing to code execution and back sees to that)
msg304316 - (view)	Author: Devin Bayer (akvadrako) *	Date: 2017-10-13 07:33
You can't demand a hot loader to react instantly and there are other use cases, like generating modules programatically. What is your objection to my proposed solution, which behaves correctly in all cases? If you are not importing modules immediately after writing them, there is no harm in skipping the cache if the file mtime differs from the current time. If you are importing modules immediately, then you want to bypass the cache in that case, since it can't be relied upon.
msg304320 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-10-13 07:49
I wasn't clear on what you meant by "potentially in the future". Now that I realise you meant "Defer refreshing the bytecode cache to the next import attempt if `int(source_mtime) == int(time.time())`, but still bypass it for the current import", then yes, I agree that would reliably resolve the problem, since all imports during the same second as the edit would bypass the cache without updating it, and the first subsequent import would refresh it a timestamp that's guaranteed to be later than the source_mtime (not just equal). Brett, what do you think? Such an adjustment to the caching logic should have next to no impact in the typical case where `int(source_mtime) < int(time.time())`, while still making it even less likely that hot reloaders will accidentally cache stale bytecode.
msg304349 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2017-10-13 17:11
To make the proposal concrete, would you then change https://github.com/python/cpython/blob/master/Lib/importlib/_bootstrap_external.py#L785 to include a `source_mtime != int(time.time())` guard? I think as long as that's the last check in the guard since that has they highest performance cost it should be okay (I don't think any platform makes calling the clock an expensive operation). And nothing personal about pro-actively closing this, Devin, but with over 6000 issues in this issue tracker alone it's easier to ask forgiveness than permission when it comes to issues that at the outset don't look like a good idea. Otherwise I would have to track this issue personally in case you never responded again (which does happen), and that just becomes a maintenance burden on me when I have 3 other Python-related projects that I'm also actively working on issues with (ignoring any issues that I get involved with on this tracker).
msg304373 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-10-14 04:33
Aye, I think that check would make the most sense, since the bytecode invalidation check is "_r_long(raw_timestamp) != source_mtime" (to allow for things like version control operations that send source timestamps backwards). A test for that could then just mock time.time() to make sure it returned a time matching the source mtime, and checked that the bytecode wasn't written.
msg304464 - (view)	Author: Devin Bayer (akvadrako) *	Date: 2017-10-16 09:50
That seems like it makes the most sense, though I would suggest a higher-level test, because caching of bytecode is an implementation detail. So, while mocking time.time to be constant: 1. write dummy module with x=1 2. import dummy 3. overwrite dummy module with x=2 4. reload(dummy) 5. assert dummy.x == 2 BTW Brett, I didn't think it was personal; that would be odd considering you don't know me. But from a new contributor's perspective it didn't look like you are taking bugs seriously. If you had pointed out that you will reopen it if I made a good case, that would be been more clear.
msg395960 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2021-06-16 21:02
Looks like this was merged here: https://github.com/akvadrako/cpython/commit/fe9b43a0928a5ef80a4daf3a50af300e5eaeda9a Can we close?
msg396025 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2021-06-17 23:52
Did you mean to link to a fork, Irit?
msg396034 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2021-06-18 08:19
No, sorry it wasn't merged. Should it be?
msg396080 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2021-06-18 19:11
No idea, I was just trying to understand what the link was meant to point at. :)
msg396086 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2021-06-18 20:55
The discussion is leading to yes, but the patch needs some more work on the test. (The link was pointing at me being past my bedtime when I pasted it :-)

History
Date	User	Action	Args
2022-04-11 14:58:53	admin	set	github: 75953
2021-06-18 20:55:53	iritkatriel	set	messages: + msg396086
2021-06-18 19:11:26	brett.cannon	set	messages: + msg396080
2021-06-18 08:19:55	iritkatriel	set	messages: + msg396034
2021-06-17 23:52:26	brett.cannon	set	status: pending -> open messages: + msg396025
2021-06-16 21:02:38	iritkatriel	set	status: open -> pending nosy: + iritkatriel messages: + msg395960 resolution: fixed
2017-10-16 09:50:57	akvadrako	set	messages: + msg304464
2017-10-14 04:33:41	ncoghlan	set	messages: + msg304373
2017-10-13 17:11:58	brett.cannon	set	status: closed -> open resolution: wont fix -> (no value) messages: + msg304349 stage: resolved -> test needed
2017-10-13 07:49:55	ncoghlan	set	messages: + msg304320
2017-10-13 07:33:55	akvadrako	set	messages: + msg304316
2017-10-13 07:11:43	ncoghlan	set	messages: + msg304314
2017-10-13 06:51:57	akvadrako	set	messages: + msg304312
2017-10-13 06:47:07	ncoghlan	set	messages: + msg304311
2017-10-13 06:19:53	akvadrako	set	messages: + msg304310
2017-10-13 05:00:48	ncoghlan	set	messages: + msg304305
2017-10-12 20:04:58	akvadrako	set	messages: + msg304285
2017-10-12 18:32:32	brett.cannon	set	status: open -> closed nosy: + brett.cannon, ncoghlan, eric.snow messages: + msg304279 resolution: wont fix stage: patch review -> resolved
2017-10-12 13:48:09	akvadrako	set	keywords: + patch stage: patch review pull_requests: + pull_request3941
2017-10-12 13:43:00	akvadrako	create