classification
Title: import machinery vulnerable to timestamp collisions
Type: behavior Stage: resolved
Components: Interpreter Core, Tests Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: pitrou Nosy List: Arfrever, barry, brett.cannon, eric.snow, ncoghlan, neologix, pitrou, python-dev
Priority: normal Keywords: patch

Created on 2011-12-21 12:32 by pitrou, last changed 2012-01-13 17:56 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
impsize.patch pitrou, 2011-12-29 21:50 review
impsize2.patch pitrou, 2012-01-12 05:27 review
Messages (19)
msg149982 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-12-21 12:32
test_import fails when run directly after test_coding:

$ ./python -m test test_coding test_import
[1/2] test_coding
[2/2] test_import
test test_import failed -- Traceback (most recent call last):
  File "/home/antoine/cpython/default/Lib/test/test_import.py", line 115, in test_creation_mode
    self.assertEqual(stat.S_IMODE(s.st_mode), 0o666 & ~mask)
AssertionError: 436 != 420


384 is 0o600 while 420 is 0o644.

(seen sporadically on some buildbots, e.g.
http://www.python.org/dev/buildbot/all/builders/AMD64%20Snow%20Leopard%202%203.x/builds/1462/steps/test/logs/stdio
)


The same thing happens in 3.2, although the failing test is a different one:

$ ./python -m test test_coding test_import
[1/2] test_coding
[2/2] test_import
test test_import failed -- Traceback (most recent call last):
  File "/home/antoine/cpython/32/Lib/test/test_import.py", line 117, in test_execute_bit_not_copied
    stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
AssertionError: 436 != 292
msg149983 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-12-21 12:41
The cause is that the import machinery checks the timestamp stored in the pyc file to decide whether it must be refreshed. Depending on the system's timestamp granularity, there can be false negatives: the py file was modified twice with the same timestamp, but the pyc file isn't regenerated for the second version of the py file. Adding a time.sleep(1) call to the failing test case makes it pass.

Correct fix for 3.2's tests is the following. I don't know if the import machinery can be improved not to exhibit any false negatives:

diff --git a/Lib/test/test_import.py b/Lib/test/test_import.py
--- a/Lib/test/test_import.py
+++ b/Lib/test/test_import.py
@@ -107,8 +107,9 @@ class ImportTests(unittest.TestCase):
                 open(fname, 'w').close()
                 os.chmod(fname, (stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH |
                                  stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))
+                fn = imp.cache_from_source(fname)
+                unlink(fn)
                 __import__(TESTFN)
-                fn = imp.cache_from_source(fname)
                 if not os.path.exists(fn):
                     self.fail("__import__ did not result in creation of "
                               "either a .pyc or .pyo file")
msg149984 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-12-21 12:54
The pyc format currently stores the modification time cast to a 32-bit int. A 3.3 iteration of the pyc format could instead store two 32-bit ints, one for the integral part and one for the fractional part (e.g. in nanoseconds). When HAVE_STAT_TV_NSEC is defined, struct stat has a "st_mtim" member which is a struct timespec providing theoretical nanosecond precision.

Whether this would be useful remains to be decided.
msg149993 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-12-21 14:48
> a struct timespec providing theoretical nanosecond precision.

Indeed. EXT3's timestamps have a 1s granularity, for example.
Another possibility would be to store both mtime and st_size (it's the default heuristic used by rsync does to decide whether to skip a file).

> Whether this would be useful remains to be decided.

Dunno.
On the one hand, it's frustrating to think that you can end up with a stale bytecode file forever (I've had more or less the same problem because of NFS mounts, and it can be quite puzzling when the last modification you made to the source file is not taken into account).
OTOH, nobody has ever complained about this...
msg149994 - (view) Author: Roundup Robot (python-dev) Date: 2011-12-21 14:54
New changeset f4afc6858ed2 by Antoine Pitrou in branch '3.2':
Issue #13645: fix test_import failure when run immediately after test_coding.
http://hg.python.org/cpython/rev/f4afc6858ed2

New changeset a6bd166abde5 by Antoine Pitrou in branch 'default':
Issue #13645: fix test_import failure when run immediately after test_coding.
http://hg.python.org/cpython/rev/a6bd166abde5
msg149995 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-12-21 14:56
In the meantime, I've committed the test patch.
msg149997 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-12-21 15:13
> Indeed. EXT3's timestamps have a 1s granularity, for example.
> Another possibility would be to store both mtime and st_size (it's the
> default heuristic used by rsync does to decide whether to skip a
> file).

Good idea. It would also avoid too much Windows-specific code (subsecond
precision for timestamps requires the use of a Win32 API).
msg150350 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-12-29 21:50
Here is a patch adding the source code size to the pyc header.
The number of places where details of the pyc file format are hard coded is surprisingly high...
Unfortunately, I had to modify importlib's public API (path_mtime -> path_stats). I find it unfortunate that importlib's API is vulnerable to pyc format changes.
msg150390 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2011-12-30 20:50
Why change importlib's API and instead add to it? You could add the requisite path_size() method to get the value, and assume 0 means unsupported (at least until some future version where a deprecation warning about not requiring file size comes about). But I don't see how you can get around not exposing these details as stored in some API, whether it is by method or object attribute since you can't assume stat results since support for non-file storage back-ends would be unduly burdened.
msg150432 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-01 15:09
> You could add the requisite path_size() method to get the value, and
> assume 0 means unsupported

I thought:
- calling two methods means two stat calls per file, this could be slightly inefficient
- if future extensions of the import mechanism require yet more stat information (for example owner or chmod), it will be yet another bunch of stat'ing methods to create

(besides, calling int() on the timestamp is a loss of information, I don't understand why this must be done in path_mtime() rather than let the consumer do whatever it wants with the higher-precision timestamp)
msg150439 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-01-01 18:27
The patch looks good to me.
msg150984 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2012-01-09 22:18
I'm not suggesting two stat calls (in the general case); you would call one or the other depending on the magic number of the pyc file.

Anyway, it would probably be best to have some method that is expected to return a specific object which embodies all the desired information for bytecode generation (and if you encompass source code with this object then you can get rid of get_source() as well). But it shouldn't be a raw stat object since not all bits of information will come from a stat call (eg. storing bytecode in a sqlite3 database) and thus require bogus data. If you want to move towards that kind of API I can support that.

As for path_mtime() returning an int instead of some number that can be converted to an int, that's because I didn't plan for the "Antoine wants to muck with the .pyc format" contingency (IOW I just didn't think about it). =) It wouldn't be a big deal to change the API to take a keyword-only argument specifying you want the highest resolution number instead of specifically an int.
msg150985 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-09 23:05
> I'm not suggesting two stat calls (in the general case); you would
> call one or the other depending on the magic number of the pyc file.

The proposal is to store both mtime and size, actually, to make false
positives less likely.
msg151030 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2012-01-10 21:40
On Mon, Jan 9, 2012 at 18:05, Antoine Pitrou <report@bugs.python.org> wrote:

>
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> > I'm not suggesting two stat calls (in the general case); you would
> > call one or the other depending on the magic number of the pyc file.
>
> The proposal is to store both mtime and size, actually, to make false
> positives less likely.

Then a method that returns some object representing all of the needed info
on the source file is needed, but that does NOT blindly return the stat
info (eg. modification date, "file" size, and even source as bytes can be
in this object).  It could even have methods to generate the bytecode, etc.
msg151112 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-12 05:27
> Then a method that returns some object representing all of the needed info
> on the source file is needed, but that does NOT blindly return the stat
info

Then it's simpler for the object to be a dict, and backwards compatibility is straightforward (by ignoring absent as well as unknown keys). Patch attached.
msg151139 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2012-01-12 17:03
LGTM (although I didn't run the unit tests and focused mainly on the importlib._bootstrap and abc changes). Only two things I would change. One is possibly deprecating path_mtime() so people don't waste time implementing it (we actually never need to remove it thanks to the ABC; otherwise we need to make sure the docs strongly state to only bother with path_stats()). The other is to say the mtime key should contain a value that is a real number (ie. float and any other numeric type that can cast to an integer). That way you get your higher resolution in the dict while still being able to use the value in writing the bytecode.

And is there any efficient way to get the stat info on a file AND its contents in a single call? If so we might want a method to support that (eg. add a 'source_bytes' key or something), but I can't think of any low-level call that supports that kind of thing.
msg151180 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-13 17:05
> One is possibly deprecating path_mtime() so people don't waste time
> implementing it (we actually never need to remove it thanks to the
> ABC; otherwise we need to make sure the docs strongly state to only
> bother with path_stats()).

Ok, I saw I also forgot to update some importlib docs.

> The other is to say the mtime key should contain a value that is a
> real number (ie. float and any other numeric type that can cast to an
> integer).

Ok.

> And is there any efficient way to get the stat info on a file AND its
> contents in a single call?

I don't think so.  os.fstat() on an open fd looks minimally faster than
os.stat() on the filename (0.5µs faster here on Linux), but opening the
file has its own cost.
msg151183 - (view) Author: Roundup Robot (python-dev) Date: 2012-01-13 17:53
New changeset 87331661042b by Antoine Pitrou in branch 'default':
Issue #13645: pyc files now contain the size of the corresponding source
http://hg.python.org/cpython/rev/87331661042b
msg151184 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-13 17:56
Now pushed in. Thanks for the reviews!
History
Date User Action Args
2012-01-13 17:56:59pitrousetstatus: open -> closed
resolution: fixed
messages: + msg151184

stage: patch review -> resolved
2012-01-13 17:53:53python-devsetmessages: + msg151183
2012-01-13 17:05:38pitrousetmessages: + msg151180
2012-01-12 17:03:37brett.cannonsetassignee: pitrou
messages: + msg151139
2012-01-12 05:27:29pitrousetfiles: + impsize2.patch

messages: + msg151112
2012-01-10 21:40:23brett.cannonsetmessages: + msg151030
2012-01-09 23:05:07pitrousetmessages: + msg150985
2012-01-09 22:18:09brett.cannonsetmessages: + msg150984
2012-01-01 18:27:40neologixsetmessages: + msg150439
2012-01-01 15:09:31pitrousetmessages: + msg150432
2011-12-30 20:50:46brett.cannonsetmessages: + msg150390
2011-12-29 23:26:40Arfreversetnosy: + Arfrever
2011-12-29 21:50:08pitrousetfiles: + impsize.patch
versions: - Python 3.2
title: test_import fails after test_coding -> import machinery vulnerable to timestamp collisions
messages: + msg150350

keywords: + patch
2011-12-21 16:40:45eric.snowsetnosy: + eric.snow
2011-12-21 15:13:00pitrousetmessages: + msg149997
2011-12-21 14:56:41pitrousetmessages: + msg149995
2011-12-21 14:54:01python-devsetnosy: + python-dev
messages: + msg149994
2011-12-21 14:48:41neologixsetmessages: + msg149993
2011-12-21 12:54:12pitrousetmessages: + msg149984
2011-12-21 12:41:18pitrousetnosy: + barry

messages: + msg149983
stage: patch review
2011-12-21 12:32:17pitroucreate