Issue 13645: import machinery vulnerable to timestamp collisions

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/57854

classification

Title:	import machinery vulnerable to timestamp collisions
Type:	behavior	Stage:	resolved
Components:	Interpreter Core, Tests	Versions:	Python 3.3

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	pitrou	Nosy List:	Arfrever, ammar2, barry, brett.cannon, eric.snow, ncoghlan, neologix, pitrou, python-dev
Priority:	normal	Keywords:	patch

Created on 2011-12-21 12:32 by pitrou, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
impsize.patch	pitrou, 2011-12-29 21:50		review
impsize2.patch	pitrou, 2012-01-12 05:27		review

Messages (19)
msg149982 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2011-12-21 12:32
test_import fails when run directly after test_coding: $ ./python -m test test_coding test_import [1/2] test_coding [2/2] test_import test test_import failed -- Traceback (most recent call last): File "/home/antoine/cpython/default/Lib/test/test_import.py", line 115, in test_creation_mode self.assertEqual(stat.S_IMODE(s.st_mode), 0o666 & ~mask) AssertionError: 436 != 420 384 is 0o600 while 420 is 0o644. (seen sporadically on some buildbots, e.g. http://www.python.org/dev/buildbot/all/builders/AMD64%20Snow%20Leopard%202%203.x/builds/1462/steps/test/logs/stdio ) The same thing happens in 3.2, although the failing test is a different one: $ ./python -m test test_coding test_import [1/2] test_coding [2/2] test_import test test_import failed -- Traceback (most recent call last): File "/home/antoine/cpython/32/Lib/test/test_import.py", line 117, in test_execute_bit_not_copied stat.S_IRUSR \| stat.S_IRGRP \| stat.S_IROTH) AssertionError: 436 != 292
msg149983 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2011-12-21 12:41
The cause is that the import machinery checks the timestamp stored in the pyc file to decide whether it must be refreshed. Depending on the system's timestamp granularity, there can be false negatives: the py file was modified twice with the same timestamp, but the pyc file isn't regenerated for the second version of the py file. Adding a time.sleep(1) call to the failing test case makes it pass. Correct fix for 3.2's tests is the following. I don't know if the import machinery can be improved not to exhibit any false negatives: diff --git a/Lib/test/test_import.py b/Lib/test/test_import.py --- a/Lib/test/test_import.py +++ b/Lib/test/test_import.py @@ -107,8 +107,9 @@ class ImportTests(unittest.TestCase): open(fname, 'w').close() os.chmod(fname, (stat.S_IRUSR \| stat.S_IRGRP \| stat.S_IROTH \| stat.S_IXUSR \| stat.S_IXGRP \| stat.S_IXOTH)) + fn = imp.cache_from_source(fname) + unlink(fn) __import__(TESTFN) - fn = imp.cache_from_source(fname) if not os.path.exists(fn): self.fail("__import__ did not result in creation of " "either a .pyc or .pyo file")
msg149984 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2011-12-21 12:54
The pyc format currently stores the modification time cast to a 32-bit int. A 3.3 iteration of the pyc format could instead store two 32-bit ints, one for the integral part and one for the fractional part (e.g. in nanoseconds). When HAVE_STAT_TV_NSEC is defined, struct stat has a "st_mtim" member which is a struct timespec providing theoretical nanosecond precision. Whether this would be useful remains to be decided.
msg149993 - (view)	Author: Charles-François Natali (neologix) *	Date: 2011-12-21 14:48
> a struct timespec providing theoretical nanosecond precision. Indeed. EXT3's timestamps have a 1s granularity, for example. Another possibility would be to store both mtime and st_size (it's the default heuristic used by rsync does to decide whether to skip a file). > Whether this would be useful remains to be decided. Dunno. On the one hand, it's frustrating to think that you can end up with a stale bytecode file forever (I've had more or less the same problem because of NFS mounts, and it can be quite puzzling when the last modification you made to the source file is not taken into account). OTOH, nobody has ever complained about this...
msg149994 - (view)	Author: Roundup Robot (python-dev)	Date: 2011-12-21 14:54
New changeset f4afc6858ed2 by Antoine Pitrou in branch '3.2': Issue #13645: fix test_import failure when run immediately after test_coding. http://hg.python.org/cpython/rev/f4afc6858ed2 New changeset a6bd166abde5 by Antoine Pitrou in branch 'default': Issue #13645: fix test_import failure when run immediately after test_coding. http://hg.python.org/cpython/rev/a6bd166abde5
msg149995 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2011-12-21 14:56
In the meantime, I've committed the test patch.
msg149997 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2011-12-21 15:13
> Indeed. EXT3's timestamps have a 1s granularity, for example. > Another possibility would be to store both mtime and st_size (it's the > default heuristic used by rsync does to decide whether to skip a > file). Good idea. It would also avoid too much Windows-specific code (subsecond precision for timestamps requires the use of a Win32 API).
msg150350 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2011-12-29 21:50
Here is a patch adding the source code size to the pyc header. The number of places where details of the pyc file format are hard coded is surprisingly high... Unfortunately, I had to modify importlib's public API (path_mtime -> path_stats). I find it unfortunate that importlib's API is vulnerable to pyc format changes.
msg150390 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2011-12-30 20:50
Why change importlib's API and instead add to it? You could add the requisite path_size() method to get the value, and assume 0 means unsupported (at least until some future version where a deprecation warning about not requiring file size comes about). But I don't see how you can get around not exposing these details as stored in some API, whether it is by method or object attribute since you can't assume stat results since support for non-file storage back-ends would be unduly burdened.
msg150432 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-01-01 15:09
> You could add the requisite path_size() method to get the value, and > assume 0 means unsupported I thought: - calling two methods means two stat calls per file, this could be slightly inefficient - if future extensions of the import mechanism require yet more stat information (for example owner or chmod), it will be yet another bunch of stat'ing methods to create (besides, calling int() on the timestamp is a loss of information, I don't understand why this must be done in path_mtime() rather than let the consumer do whatever it wants with the higher-precision timestamp)
msg150439 - (view)	Author: Charles-François Natali (neologix) *	Date: 2012-01-01 18:27
The patch looks good to me.
msg150984 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-01-09 22:18
I'm not suggesting two stat calls (in the general case); you would call one or the other depending on the magic number of the pyc file. Anyway, it would probably be best to have some method that is expected to return a specific object which embodies all the desired information for bytecode generation (and if you encompass source code with this object then you can get rid of get_source() as well). But it shouldn't be a raw stat object since not all bits of information will come from a stat call (eg. storing bytecode in a sqlite3 database) and thus require bogus data. If you want to move towards that kind of API I can support that. As for path_mtime() returning an int instead of some number that can be converted to an int, that's because I didn't plan for the "Antoine wants to muck with the .pyc format" contingency (IOW I just didn't think about it). =) It wouldn't be a big deal to change the API to take a keyword-only argument specifying you want the highest resolution number instead of specifically an int.
msg150985 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-01-09 23:05
> I'm not suggesting two stat calls (in the general case); you would > call one or the other depending on the magic number of the pyc file. The proposal is to store both mtime and size, actually, to make false positives less likely.
msg151030 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-01-10 21:40
On Mon, Jan 9, 2012 at 18:05, Antoine Pitrou <report@bugs.python.org> wrote: > > Antoine Pitrou <pitrou@free.fr> added the comment: > > > I'm not suggesting two stat calls (in the general case); you would > > call one or the other depending on the magic number of the pyc file. > > The proposal is to store both mtime and size, actually, to make false > positives less likely. Then a method that returns some object representing all of the needed info on the source file is needed, but that does NOT blindly return the stat info (eg. modification date, "file" size, and even source as bytes can be in this object). It could even have methods to generate the bytecode, etc.
msg151112 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-01-12 05:27
> Then a method that returns some object representing all of the needed info > on the source file is needed, but that does NOT blindly return the stat info Then it's simpler for the object to be a dict, and backwards compatibility is straightforward (by ignoring absent as well as unknown keys). Patch attached.
msg151139 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-01-12 17:03
LGTM (although I didn't run the unit tests and focused mainly on the importlib._bootstrap and abc changes). Only two things I would change. One is possibly deprecating path_mtime() so people don't waste time implementing it (we actually never need to remove it thanks to the ABC; otherwise we need to make sure the docs strongly state to only bother with path_stats()). The other is to say the mtime key should contain a value that is a real number (ie. float and any other numeric type that can cast to an integer). That way you get your higher resolution in the dict while still being able to use the value in writing the bytecode. And is there any efficient way to get the stat info on a file AND its contents in a single call? If so we might want a method to support that (eg. add a 'source_bytes' key or something), but I can't think of any low-level call that supports that kind of thing.
msg151180 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-01-13 17:05
> One is possibly deprecating path_mtime() so people don't waste time > implementing it (we actually never need to remove it thanks to the > ABC; otherwise we need to make sure the docs strongly state to only > bother with path_stats()). Ok, I saw I also forgot to update some importlib docs. > The other is to say the mtime key should contain a value that is a > real number (ie. float and any other numeric type that can cast to an > integer). Ok. > And is there any efficient way to get the stat info on a file AND its > contents in a single call? I don't think so. os.fstat() on an open fd looks minimally faster than os.stat() on the filename (0.5µs faster here on Linux), but opening the file has its own cost.
msg151183 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-01-13 17:53
New changeset 87331661042b by Antoine Pitrou in branch 'default': Issue #13645: pyc files now contain the size of the corresponding source http://hg.python.org/cpython/rev/87331661042b
msg151184 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-01-13 17:56
Now pushed in. Thanks for the reviews!

History
Date	User	Action	Args
2022-04-11 14:57:24	admin	set	github: 57854
2020-04-25 03:38:57	ammar2	set	pull_requests: - pull_request19023
2020-04-24 19:35:23	ammar2	set	nosy: + ammar2 pull_requests: + pull_request19023
2012-01-13 17:56:59	pitrou	set	status: open -> closed resolution: fixed messages: + msg151184 stage: patch review -> resolved
2012-01-13 17:53:53	python-dev	set	messages: + msg151183
2012-01-13 17:05:38	pitrou	set	messages: + msg151180
2012-01-12 17:03:37	brett.cannon	set	assignee: pitrou messages: + msg151139
2012-01-12 05:27:29	pitrou	set	files: + impsize2.patch messages: + msg151112
2012-01-10 21:40:23	brett.cannon	set	messages: + msg151030
2012-01-09 23:05:07	pitrou	set	messages: + msg150985
2012-01-09 22:18:09	brett.cannon	set	messages: + msg150984
2012-01-01 18:27:40	neologix	set	messages: + msg150439
2012-01-01 15:09:31	pitrou	set	messages: + msg150432
2011-12-30 20:50:46	brett.cannon	set	messages: + msg150390
2011-12-29 23:26:40	Arfrever	set	nosy: + Arfrever
2011-12-29 21:50:08	pitrou	set	files: + impsize.patch versions: - Python 3.2 title: test_import fails after test_coding -> import machinery vulnerable to timestamp collisions messages: + msg150350 keywords: + patch
2011-12-21 16:40:45	eric.snow	set	nosy: + eric.snow
2011-12-21 15:13:00	pitrou	set	messages: + msg149997
2011-12-21 14:56:41	pitrou	set	messages: + msg149995
2011-12-21 14:54:01	python-dev	set	nosy: + python-dev messages: + msg149994
2011-12-21 14:48:41	neologix	set	messages: + msg149993
2011-12-21 12:54:12	pitrou	set	messages: + msg149984
2011-12-21 12:41:18	pitrou	set	nosy: + barry messages: + msg149983 stage: patch review
2011-12-21 12:32:17	pitrou	create