classification
Title: file_dealloc() assumes errno is set when EOF is returned
Type: behavior Stage:
Components: Interpreter Core Versions: Python 2.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: johansen, laca, schmir
Priority: normal Keywords:

Created on 2008-05-30 21:54 by johansen, last changed 2011-06-01 20:42 by schmir.

Messages (5)
msg67560 - (view) Author: johansen (johansen) Date: 2008-05-30 21:54
We're using Python to build the new packaging system for OpenSolaris. 
Yesterday, a user reported that when they ran the pkg command, piped the
output to grep, and then typed ^C, sometimes they'd get this error:

$ pkg list | grep office
^Cclose failed: [Errno 11] Resource temporarily unavailable

We assumed that this might be a problem in the signal handling we've
employed to catch SIGPIPE; however, it turns out that the problem is in
the file_dealloc() code.

For the perversely curious, additional details may be found in the
original bug located here:

http://defect.opensolaris.org/bz/show_bug.cgi?id=2083

Essentially we found the following:

The error message is emitted from fileobject.c: file_dealloc()

The relevant portion of the routine looks like this:

static void
file_dealloc(PyFileObject *f)
{
        int sts = 0;
        if (f->weakreflist != NULL)
                PyObject_ClearWeakRefs((PyObject *) f);
        if (f->f_fp != NULL && f->f_close != NULL) {
                Py_BEGIN_ALLOW_THREADS
                sts = (*f->f_close)(f->f_fp);
                Py_END_ALLOW_THREADS
                if (sts == EOF) 
#ifdef HAVE_STRERROR
                        PySys_WriteStderr("close failed: [Errno %d] %s\n",
errno, strerror(errno)); 

In the cases we encountered, the function pointer f_close is actually a
call to sysmodule.c: _check_and_flush()

That routine looks like this:

static int
_check_and_flush (FILE *stream)
{
  int prev_fail = ferror (stream);
  return fflush (stream) || prev_fail ? EOF : 0;
}

check_and_flush calls ferror(3C) and then fflush(3C) on the FILE stream
associated with the file object.  There's just one problem here.  If it
finds an error that was previously encountered on the file stream,
there's no guarantee that errno will be valid.  Should an error be
encountered in fflush(3C), errno will get set; however, the contents of
errno are undefined should fflush() return successfully.

Here's what happens in the code I observed:

I set a write watchpoint on errno and observed the different times it
was accessed.  After sifting through a bunch of red-herrings, I found
that a call to PyThread_acquire_lock() that sets errno to 11 (EAGAIN). 
This occurs when PyThread_acquire_lock() calls sem_trywait(3C) and finds
the semaphore already locked.  Errno doesn't get accessed again until a
call to libc.so.1`isseekable() that simply saves and restores the
existing errno.

Since we've taken a ^C (SIGINT), the interpreter begins the finalization
process and eventually calls file_dealloc().  This routine calls
_check_and_flush().  In the case that I observed, ferror(3C)
returns a non-zero value but fflush(3C) completes successfully.  This
causes the routine to return EOF to the caller.  file_dealloc() assumes
that since it received an EOF an error occurred and it should call
strerror(errno).  However, since this is just returning the state of a
previous error, errno is invalid.

This is what causes the spurious EAGAIN message.  Just to be sure, I
traced the return value and errno of failed syscalls that were invoked
by the interpreter.  I was unable to observe any syscalls returning
EAGAIN.  This is because (at least on OpenSolaris) sem_trywait(3C) calls
sema_trywait(3C).  The sema_trywait returns EBUSY if the semaphore is
held and sem_trywait converts this to EAGAIN.  None of these errors are
passed out of the kernel.


It's not clear to me whether _check_and_flush(), file_dealloc(), or both
need modification.  At a minimum, it's not safe for file_dealloc() to
assume that errno is set correctly if the function underneath it is
using ferror(3C) to find the presence of an error on the stream.
msg72072 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-28 09:34
The title of your bug report might be misleading. Is the problem that
errno is misinterpreted in the error message, or that there is an error
message at all?
msg72094 - (view) Author: johansen (johansen) Date: 2008-08-28 16:24
The problem is present in Python 2.4.4, the version that we're using
here.  I'm not familiar with the versions keyword that's used here, but
that's the version for which I'm reporting this bug.

To be clear, the problem is that when file_dealloc() sees an EOF, it
assumes that an error has occurred.  It then checks errno.  However,
it's possible for file_dealloc() to get an EOF under conditions where
errno hasn't been set.  In that case, it reports an error with a stale
value of errno.  This is explained in detail in the initial report.  I
don't think the title is misleading; it's incorrect for file_dealloc()
to assume that errno is set every time it gets an EOF.
msg72095 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-28 16:33
Selon johansen <report@bugs.python.org>:
>
> The problem is present in Python 2.4.4, the version that we're using
> here.  I'm not familiar with the versions keyword that's used here, but
> that's the version for which I'm reporting this bug.

The only changes that may be brought now to the 2.4 branch are security fixes
and nothing else. Bug fixes with a slight risk of changing legitimate behaviour
like the present one, on the other hand, will happen either on the 2.6 branch or
on the 2.7 one.

> To be clear, the problem is that when file_dealloc() sees an EOF, it
> assumes that an error has occurred.  It then checks errno.  However,
> it's possible for file_dealloc() to get an EOF under conditions where
> errno hasn't been set.  In that case, it reports an error with a stale
> value of errno.  This is explained in detail in the initial report.

You still haven't explained what the correct behaviour would be, though.
Must it report an error or not? If it must, then how should this specific error
be detected and how should the error message be phrased?

Thanks in advance.
msg72098 - (view) Author: johansen (johansen) Date: 2008-08-28 17:04
As I said before:

  check_and_flush calls ferror(3C) and then fflush(3C) on the FILE
  stream associated with the file object.  There's just one problem
  here.  If it finds an error that was previously encountered on the
  file stream, there's no guarantee that errno will be valid.
  Should an error be encountered in fflush(3C), errno will get set;
  however, the contents of errno are undefined should fflush()
  return successfully.

The problem is this:

The caller of check_and_flush() will check errno if check_and_flush
returns an error.  However, check_and_flush() checks two different error
sources.  If ferror() returns an error, you can't check errno, because
that routine doesn't set errno.  However, if fflush() returns an error,
it will set errno.

In its current form, no caller of check_and_flush() should check errno
if the check_and_flush doesn't return successfully, since there's no way
of knowing whether errno will be set or not.

That doesn't make a lot of sense to me, so I would probably re-write
this code.  However, since I'm not an expert in the inner workings of
the Python interpreter, it's hard for me to suggest just how to do this.
History
Date User Action Args
2011-06-01 20:42:24schmirsetnosy: + schmir
2008-08-28 22:27:45pitrousetnosy: - pitrou
2008-08-28 17:04:49johansensetmessages: + msg72098
2008-08-28 16:33:07pitrousetmessages: + msg72095
2008-08-28 16:24:32johansensetmessages: + msg72094
versions: + Python 2.4, - Python 2.7
2008-08-28 09:34:07pitrousetpriority: normal
nosy: + pitrou
messages: + msg72072
versions: + Python 2.7, - Python 2.4
2008-08-28 06:19:51lacasetnosy: + laca
2008-05-30 21:54:10johansencreate