classification
Title: Race condition in WatchedFileHandler leads to unhandled exception
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: vinay.sajip Nosy List: brian.curtin, neologix, phlogistonjohn, python-dev, vinay.sajip
Priority: normal Keywords: patch

Created on 2012-04-20 13:57 by phlogistonjohn, last changed 2012-05-03 15:31 by vinay.sajip. This issue is now closed.

Files
File name Uploaded Description Edit
test.py phlogistonjohn, 2012-04-20 13:57
watchedfilehandler.patch phlogistonjohn, 2012-04-20 15:10 patch
watchedfilehandler.patch phlogistonjohn, 2012-04-22 18:41 patch, 2nd try
test_logging_race.diff neologix, 2012-04-24 18:55 review
Messages (20)
msg158829 - (view) Author: JohnM (phlogistonjohn) Date: 2012-04-20 13:57
The (very handy) logging.handlers.WatchedFileHandler appears to be susceptible to a race condition that causes occasional OSErrors during the normal course of operations:


Traceback (most recent call last):
  File "test.py", line 58, in test_race
    log.warning('Foo bar %d', ii)
  File "/usr/lib64/python2.7/logging/__init__.py", line 1144, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "/usr/lib64/python2.7/logging/__init__.py", line 1250, in _log
    self.handle(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 1260, in handle
    self.callHandlers(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 1300, in callHandlers
    hdlr.handle(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 744, in handle
    self.emit(record)
  File "/usr/lib64/python2.7/logging/handlers.py", line 412, in emit
    stat = os.stat(self.baseFilename)
OSError: [Errno 2] No such file or directory: '/tmp/tmpG6_luRTestRaceWatchedFileHandler'

Looking at the implementation of the handler's emit function, I've commented on where I think the race can occur. Logrotate (or some similar application) is deleting/unlinking the logfile between the first os.path.exists and os.stat calls:

    def emit(self, record):
        """
        Emit a record.

        First check if the underlying file has changed, and if it
        has, close the old stream and reopen the file to get the
        current stream.
        """
        if not os.path.exists(self.baseFilename):
        #      ^^^ race occurs between here 
            stat = None
            changed = 1
        else:
            stat = os.stat(self.baseFilename)
            #      ^^^ and this stat call
            changed = (stat[ST_DEV] != self.dev) or (stat[ST_INO] != self.ino)
        if changed and self.stream is not None:
            self.stream.flush()
            self.stream.close()
            self.stream = self._open()
            if stat is None:
                stat = os.stat(self.baseFilename)
            self.dev, self.ino = stat[ST_DEV], stat[ST_INO]
        logging.FileHandler.emit(self, record)

I've attached a test script that attempts to recreate the issue. On my Linux system running the script is able to trigger the issue about every 7 of 10 runs.
msg158832 - (view) Author: JohnM (phlogistonjohn) Date: 2012-04-20 15:10
I'm attaching a patch with an emit() implementation that reduces the number of stat calls and uses fstat() where we can. Using fstat() should also prevent any races between opening the new file and stat'ing the path in the file system.

Using this patch, or something along these lines, should avoid the time-of-check-to-time-of-use issues that the current version's repeated stat calls are subject to.
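[Editor's note: for illustration, here is a minimal self-contained sketch of the fstat-based approach described above — not the attached patch itself. The idea: record the device/inode via fstat() on the open descriptor (which cannot race with unlink/rename), then use a single stat() on the path per write, treating failure as "file rotated away". The names FStatWatcher and reopen_if_moved are invented for this sketch.]

```python
import os
import tempfile

class FStatWatcher:
    """Sketch of the fstat-based identity check (illustrative names)."""

    def __init__(self, path):
        self.path = path
        self.stream = open(path, "a")
        self._record_ident()

    def _record_ident(self):
        # fstat() on the open descriptor: no window between open and stat.
        st = os.fstat(self.stream.fileno())
        self.dev, self.ino = st.st_dev, st.st_ino

    def reopen_if_moved(self):
        try:
            st = os.stat(self.path)  # single stat; ENOENT means rotated away
        except OSError:
            st = None
        if st is None or st.st_dev != self.dev or st.st_ino != self.ino:
            self.stream.close()
            self.stream = open(self.path, "a")
            self._record_ident()
            return True
        return False

    def write(self, msg):
        self.reopen_if_moved()
        self.stream.write(msg + "\n")
        self.stream.flush()

# Demo: simulate logrotate unlinking the file between writes.
tmpdir = tempfile.mkdtemp()
logpath = os.path.join(tmpdir, "app.log")
w = FStatWatcher(logpath)
w.write("before rotation")
os.unlink(logpath)           # "logrotate" removes the file
w.write("after rotation")    # watcher notices and reopens
reopened_contents = open(logpath).read()
```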
msg158915 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-04-21 11:15
Thanks for the report and patch.

I'm making Vinay nosy, as he's the logging expert.

A couple remarks:
- it would be nice to add your testcase to the logging tests (Lib/test/test_logging.py)
- I think you could probably save the whole fstat() result (instead of just st_dev and st_ino): this would allow you to use os.path.samestat() to check whether we're still referring to the same file, and it could maybe be useful someday (maybe use st_mtime)
- typo: "existance" -> "existence"
- I see you're catching OSError when calling fstat(): however, I don't see what error could be raised apart from EBADF (if the FD is closed, that's likely a bug that shouldn't be masked, and it will blow up on the next write() anyway) or I/O errors, which shouldn't be silenced either
msg158976 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012-04-22 17:47
There's certainly a problem to be addressed here, but integrating the test into the regular test suite represents a problem in that the test takes around 10 minutes to run, and sometimes completes successfully. I'm not sure a 10-minute extension of the time taken to run regrtest will be acceptable, but if the test is skipped unless explicitly enabled, then I'm not sure it'll add much value.

I will look at the patch and Charles-François' review comments soon.
msg158979 - (view) Author: JohnM (phlogistonjohn) Date: 2012-04-22 18:41
Thank you both for looking at this. I've added an updated version of the patch that incorporates the last two suggestions that Charles-François made.

I agree that this test may not be appropriate for the python test suite, due to the length and non-determinism of the test. I don't know what the suite's policy on monkey patching stdlib in the tests is, but monkey patching either os.path.exists or os.stat to remove the file at the appropriate time could be one way to make the test fast and deterministic. Seems a bit dirty to me though. :-)
 
I am up for adding some kind of test to the suite for the WatchedFileHandler, though.

I'm pretty ambivalent about keeping the whole stat structure around, since the samestat method is just a wrapper around what the emit function is already doing, and we'd be keeping more memory (although a small amount) around. I doubt we'd want to look at the timestamps because they could legitimately change in ways this check doesn't care about.
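[Editor's note: to illustrate the point under discussion — os.path.samestat() performs exactly the st_dev/st_ino comparison that the emit() code spells out by hand:]

```python
import os
import tempfile

# Two temp files: one compared against its own descriptor, one against
# a different file's descriptor.
fd, path = tempfile.mkstemp()
other_fd, other_path = tempfile.mkstemp()

# samestat() compares st_dev and st_ino of two stat results -- the same
# identity check emit() does with its saved dev/ino fields.
same = os.path.samestat(os.stat(path), os.fstat(fd))
different = os.path.samestat(os.stat(path), os.fstat(other_fd))

os.close(fd)
os.close(other_fd)
os.unlink(path)
os.unlink(other_path)
```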
msg158980 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-04-22 18:42
> There's certainly a problem to be addressed here, but integrating the test into the regular test suite represents a problem in that the test takes around 10 minutes to run, and sometimes completes successfully. I'm not sure a 10-minute extension of the time taken to run regrtest will be acceptable, but if the test is skipped unless explicitly enabled, then I'm not sure it'll add much value.

There's no reason to wait for 10 minutes: one can reduce the number of
iterations greatly, so that it can run in a matter of seconds. Since
the test is run by all the builders multiple times per day, should the
race condition resurface, it will trigger a buildbot failure some day.
No need for it to trigger a failure every time.
msg158982 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012-04-22 19:16
John: Thanks for the updated patch.

Charles-François: Certainly I can reduce the iterations to make the test faster. As it is, I did reproduce the failure on my dev system, but only once, after running John's test script about a dozen times: every other time, it completed successfully :-(

Isn't it simpler if I just replace the os.path.exists() calls with os.stat(), and check for ENOENT if an exception of type OSError or WindowsError occurs?
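[Editor's note: the suggestion amounts to something like the sketch below; stat_or_none is a hypothetical helper name, not code from the patch. As the follow-up message notes, this alone still leaves a window between the initial open() and the first stat().]

```python
import errno
import os
import tempfile

def stat_or_none(path):
    # Hypothetical helper: one stat() call instead of the racy
    # exists()-then-stat() pair; ENOENT means the file is gone.
    try:
        return os.stat(path)
    except OSError as e:
        if e.errno == errno.ENOENT:
            return None
        raise

# Demo: an existing path yields a stat result, a missing one yields None.
tmp = tempfile.NamedTemporaryFile(delete=False)
present = stat_or_none(tmp.name)
missing = stat_or_none(tmp.name + "-gone")
tmp.close()
os.unlink(tmp.name)
```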
msg159046 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-04-23 17:05
> Charles-François: Certainly I can reduce the iterations to make the test
> faster. As it is, I did reproduce the failure on my dev system, but only
> once, after running John's test script about a dozen times: every other
> time, it completed successfully :-(

Just reduce the sleep times, e.g.:
"""
@@ -17,7 +17,7 @@
             os.unlink(fname)
         except OSError:
             pass
-        stime = 0.04 * random.randint(0, 4)
+        stime = 0.004 * random.randint(0, 4)
         time.sleep(stime)

@@ -50,7 +50,7 @@
         log.setLevel(logging.INFO)

         for ii in range(LOGCOUNT):
-            stime = 0.05 # * random.randint(0, 4)
+            stime = 0.005 # * random.randint(0, 4)
             time.sleep(stime)
             log.warning('Foo bar %d', ii)
"""

With this change, I can trigger a failure reliably in around 1s, and
my computer is rather slow.

> Isn't it simpler if I just replace the os.path.exists() calls with
> os.stat(), and check for ENOENT if an exception of type OSError or
> WindowsError occurs?

The problem is that it would leave a race window if the file is
changed between the time it's opened (I guess in
logging.FileHandler.__init__()) and the first call to stat().
John's patch is safe in this regard, thanks to fstat().
msg159070 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012-04-23 20:02
[snip]
>With this change, I can trigger a failure reliably in around 1s, and
>my computer is rather slow.

I'm working in a VM, and although I can get John's script to fail more regularly (with the reduced timeouts and counts of 1000), a version of the test which I added to test_logging always succeeds. That code is:

    @unittest.skipUnless(threading, 'Threading required for this test.')
    def test_race(self):
        # Issue #14632 refers.
        def remove_loop(fname, tries):
            for _ in range(tries):
                try:
                    os.unlink(fname)
                except OSError:
                    pass
                time.sleep(0.004 * random.randint(0, 4))

        def cleanup(remover, fn, handler):
            handler.close()
            remover.join()
            if os.path.exists(fn):
                os.unlink(fn)

        fd, fn = tempfile.mkstemp('.log', 'test_logging-3-')
        os.close(fd)
        del_count = 1000
        log_count = 1000
        remover = threading.Thread(target=remove_loop, args=(fn, del_count))
        remover.daemon = True
        remover.start()
        h = logging.handlers.WatchedFileHandler(fn)
        self.addCleanup(cleanup, remover, fn, h)
        f = logging.Formatter('%(asctime)s: %(levelname)s: %(message)s')
        h.setFormatter(f)
        for _ in range(log_count):
            time.sleep(0.005)
            r = logging.makeLogRecord({'msg': 'testing' })
            h.handle(r)

I can't see why this always works, while John's script sometimes fails.
>The problem is that it would leave a race window if the file is
>changed between the time it's opened (I guess in
>logging.FileHandler.__init__()) and the first call to stat().
>John's patch is safe in this regard, thanks to fstat().

Oh, right - missed that.
msg159190 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-04-24 18:55
> I can't see why this always works, while John's script sometimes fails.

It does fail consistently on my machine.
I'm attaching the diff just in case.
msg159226 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-04-24 22:33
New changeset 15a33d7d2b50 by Vinay Sajip in branch '2.7':
Issue #14632: Updated WatchedFileHandler to deal with race condition. Thanks to John Mulligan for the problem report and patch.
http://hg.python.org/cpython/rev/15a33d7d2b50

New changeset 5de7c3d64f2a by Vinay Sajip in branch '3.2':
Issue #14632: Updated WatchedFileHandler to deal with race condition. Thanks to John Mulligan for the problem report and patch.
http://hg.python.org/cpython/rev/5de7c3d64f2a

New changeset 380821b47872 by Vinay Sajip in branch 'default':
Issue #14632: Updated WatchedFileHandler to deal with race condition. Thanks to John Mulligan for the problem report and patch.
http://hg.python.org/cpython/rev/380821b47872
msg159228 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012-04-24 22:42
I will leave this open for a few days and see if the buildbots throw up anything.
msg159387 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012-04-26 15:47
Buildbots seem not to be complaining, so closing.
msg159770 - (view) Author: Brian Curtin (brian.curtin) * (Python committer) Date: 2012-05-02 01:55
The test for this issue seems to fail about half of the time on Windows.


======================================================================
ERROR: test_race (test.test_logging.HandlerTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\python-dev\cpython-main\lib\test\test_logging.py", line 605, in test_race
    h = logging.handlers.WatchedFileHandler(fn, delay=delay)
  File "c:\python-dev\cpython-main\lib\logging\handlers.py", line 421, in __init__
    logging.FileHandler.__init__(self, filename, mode, encoding, delay)
  File "c:\python-dev\cpython-main\lib\logging\__init__.py", line 965, in __init__
    StreamHandler.__init__(self, self._open())
  File "c:\python-dev\cpython-main\lib\logging\__init__.py", line 987, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
PermissionError: [Errno 13] Permission denied: 'c:\\users\\brian\\appdata\\local\\temp\\test_logging-3-lxjo5t.log'


I also get this failure when running in the VS2010 branch:

======================================================================
ERROR: test_race (test.test_logging.HandlerTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\python-dev\porting\cpython\lib\test\test_logging.py", line 600, in cleanup
    os.unlink(fn)
PermissionError: [Error 32] The process cannot access the file because it is being used by
 another process: 'c:\\users\\brian\\appdata\\local\\temp\\test_logging3-6ujeil.log'
msg159820 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012-05-02 21:01
Re. the error on open(), I have no idea why it's failing. That's just open, and it appears to be trying to open a file in a directory for which one would expect you to have write access.

Re. the error on unlink(), could that be an older version of the code that you're running? There was a cleanup() function, but I have removed it (and re-tested) before closing the issue.

The 3.x buildbots appear not to be showing this failure.
msg159821 - (view) Author: Brian Curtin (brian.curtin) * (Python committer) Date: 2012-05-02 21:22
I'm seeing this with the current tip 8635825b9734.

I wouldn't trust the build slaves with a race condition test since they're incredibly slow machines, but this issue isn't about the race. That path really should be accessible so I'm not sure why it's failing. It's also failing for Richard (sbt) as noted while working on the VS2010 port.
msg159825 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012-05-03 00:14
The only thing I can think of is that the file is kept open by something else (the other thread).
msg159829 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012-05-03 01:19
The changeset you mentioned doesn't have a cleanup function in test_race (an older version used to) - so I'm not sure where that error is coming from.

It's going to be hard for me to reproduce this - I was finding it pretty hard to throw up the original race condition which led to the test being added. If you are getting the error repeatably, it would be really handy if you could tweak the test to generate some more diagnostic information - e.g. if the file already exists when the open fails (=> it's held open by something else).

Are you running indexing or anti-virus software on the machine where you get the failures? That can hold a file open when you're not expecting it.
msg159830 - (view) Author: Brian Curtin (brian.curtin) * (Python committer) Date: 2012-05-03 01:22
I have exemptions set in AV for my dev folders for exactly that reason :)
I'll try and poke around and get more info.
msg159862 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012-05-03 15:31
I noticed that in my cleanup code, I had the lines

h.close()
remover.join()

but it makes more sense for these to be in the opposite order. I've made that change and pushed it up (for 2.7, 3.2 and default).
History
Date User Action Args
2012-05-03 15:31:00 vinay.sajip set messages: + msg159862
2012-05-03 01:22:52 brian.curtin set messages: + msg159830
2012-05-03 01:19:23 vinay.sajip set messages: + msg159829
2012-05-03 00:14:38 vinay.sajip set messages: + msg159825
2012-05-02 21:22:14 brian.curtin set messages: + msg159821
2012-05-02 21:01:11 vinay.sajip set messages: + msg159820
2012-05-02 01:55:46 brian.curtin set nosy: + brian.curtin; messages: + msg159770
2012-04-26 15:47:52 vinay.sajip set status: open -> closed; messages: + msg159387
2012-04-24 22:42:10 vinay.sajip set resolution: fixed; messages: + msg159228
2012-04-24 22:33:51 python-dev set nosy: + python-dev; messages: + msg159226
2012-04-24 18:55:05 neologix set files: + test_logging_race.diff; messages: + msg159190
2012-04-23 20:02:55 vinay.sajip set messages: + msg159070
2012-04-23 17:05:35 neologix set messages: + msg159046
2012-04-22 19:16:30 vinay.sajip set messages: + msg158982
2012-04-22 18:42:18 neologix set messages: + msg158980
2012-04-22 18:41:57 phlogistonjohn set files: + watchedfilehandler.patch; messages: + msg158979
2012-04-22 17:47:04 vinay.sajip set messages: + msg158976
2012-04-21 22:51:28 vinay.sajip set assignee: vinay.sajip
2012-04-21 11:15:19 neologix set nosy: + vinay.sajip, neologix; messages: + msg158915; versions: + Python 3.2, Python 3.3
2012-04-20 15:10:10 phlogistonjohn set files: + watchedfilehandler.patch; keywords: + patch; messages: + msg158832
2012-04-20 13:57:07 phlogistonjohn create