classification
Title: thread unsafe file objects cause crash
Type: crash Stage:
Components: Interpreter Core Versions: Python 2.6, Python 2.5
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: gregory.p.smith Nosy List: anthonybaxter, christian.heimes, georg.brandl, gregory.p.smith, jafo, janixia, jhylton, jyasskin, kevinwatters, loewis, ned.deily, pebolle, pitrou, tim.peters, trent
Priority: critical Keywords: patch

Created on 2003-10-01 06:22 by janixia, last changed 2010-10-15 06:54 by pebolle. This issue is now closed.

Files
File name Uploaded Description Edit
filecrash.py anthonybaxter, 2003-10-01 07:03 filecrash.py
filethread1.patch pitrou, 2008-03-28 18:09
filethread2.patch pitrou, 2008-03-29 00:30
filethread3.patch pitrou, 2008-03-29 19:39
filethread4.patch pitrou, 2008-03-29 20:39
filethread4-gps01.patch gregory.p.smith, 2008-04-06 06:25
python-2.7-Doc-gil.patch pebolle, 2010-10-14 20:48 minor documentation fixes
Messages (28)
msg18469 - (view) Author: Jan Olderdissen (janixia) Date: 2003-10-01 06:22
Note: This may be a dupe or a generalization of 595601.




Running below code snippet on 2.3.1 release and debug 
build on Windows 2000/XP a few times will inevitably lead 
to a crash. 2.2.2 also exhibits this behavior.




The crashing code:




------------


import thread




f=open("tmp1", "w")




def worker():


    global f


    while 1:


        f.close()


        f=open("tmp1", "w")


        f.seek (0, 0)




thread.start_new_thread(worker, ())


thread.start_new_thread(worker, ())


while 1: pass


-------------




The issue appears to be this (and similar) code sections 
from fileobject.c:




	Py_BEGIN_ALLOW_THREADS


	errno = 0;


	ret = _portable_fseek(f->f_fp, offset, whence);


	Py_END_ALLOW_THREADS




Note that due to 
Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS, 
f->f_fp can be set to NULL by file_close prior to entering 
_portable_fseek. Similar crashes can be observed with 
read, write and close itself when they are mixed with a 
concurrent close.




Obviously, the offending python code is buggy and a lock 
should be used to prevent concurrent access to f. 
However, it seems unreasonable for Python to crash 
because of it.




A mutex preventing concurrent access to the file object 
seems like a reasonable way to fix this.
msg18470 - (view) Author: Anthony Baxter (anthonybaxter) (Python triager) Date: 2003-10-01 07:03
Logged In: YES 
user_id=29957

Attaching the failing code as an attachment, because SF
mangles source code.
msg18471 - (view) Author: Jan Olderdissen (janixia) Date: 2003-10-01 07:08
Logged In: YES 
user_id=877914

My apologies for the mangled source. I suppose there isn't a 
way for me to remedy the situation.
msg18472 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2003-10-04 08:01
Logged In: YES 
user_id=21627

janixia, don't worry about the formatting - this is
partially SF's fault, too.

Would you be interested in developing and testing a patch? I
think it would be sufficient to move the f->f_fp access out
of the GIL releasage.
msg18473 - (view) Author: Jeremy Hylton (jhylton) (Python triager) Date: 2003-10-06 03:48
Logged In: YES 
user_id=31392

Patch 595601 is an attempt to address this problem, but it's
incomplete.  The file object API allows an extension to
extract to FILE * and squirrel it away.  That's clearly
unsafe, because it can't participate in a locking scheme
without re-writing extensions.

Shane Hathaway proposed another solution here:
http://mail.python.org/pipermail/python-dev/2003-June/036543.html

The problem in this case is that we cause the call to
close() to raise an exception.  I'd prefer to see the
exception raised elsewhere, because close() seldom fails and
is often closed from routines that are cleaning up at the
end.  On the other hand, this solution would be easier to
implementation, so I'm at least +0 on it.

Let's do one or the other.
msg18474 - (view) Author: Jan Olderdissen (janixia) Date: 2003-10-06 23:02
Logged In: YES 
user_id=877914

I'm inclined to go with Shane's suggested solution of 
reference counting when file access is in progress. It requires 
no synchronisation (increment, decrement and check are 
outside global lock release) and should have the smallest 
performance impact.

I don't think the FILE * extraction problem can be solved at 
all. Once the horse it out of the barn... However, for 
the "standard" case Shane's suggestion provides a neat and 
clean solution for the problem.

If the community can agree on this solution, I could be talked 
into implementing it.
msg55960 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2007-09-17 10:15
If I read the 2003 python-dev thready correctly, there isn't a solution
to this.  Does this need to go back to python-dev, or do we just call it
"wont fix"?  Or...?
msg57338 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-11-09 22:14
I'm still able to reproduce the bug in Python 2.5 (svn) and 2.6 (trunk).

import thread
f=open("tmp1", "w")

def worker():
    global f
    while 1:
        f.close()
        f=open("tmp1", "w")
        f.seek(0,0)

thread.start_new_thread(worker, ())
thread.start_new_thread(worker, ())

Unhandled exception in thread started by <function worker at 0xb7d01aac>
Traceback (most recent call last):
*** glibc detected *** ./python: malloc(): memory corruption: 0xb7efc008 ***
======= Backtrace: =========
/lib/tls/i686/cmov/libc.so.6[0xb7dbe636]
/lib/tls/i686/cmov/libc.so.6(__libc_malloc+0x90)[0xb7dbffc0]
/lib/tls/i686/cmov/libc.so.6[0xb7dad03f]
/lib/tls/i686/cmov/libc.so.6(fopen64+0x2c)[0xb7daf61c]
./python(PyTraceBack_Print+0x1a4)[0x80ef0f4]
./python(PyErr_Display+0x76)[0x80e73a6]
./python[0x80ed80d]
./python(PyObject_Call+0x27)[0x805c927]
./python(PyEval_CallObjectWithKeywords+0x6c)[0x80c151c]
./python(PyErr_PrintEx+0xbe)[0x80e7e9e]
./python[0x80f37b1]
/lib/tls/i686/cmov/libpthread.so.0[0xb7ed146b]
/lib/tls/i686/cmov/libc.so.6(clone+0x5e)[0xb7e276de]
======= Memory map: ========
08048000-0813d000 r-xp 00000000 fe:01 10586072  
/home/heimes/dev/python/release25-maint/python
0813d000-08162000 rw-p 000f4000 fe:01 10586072  
/home/heimes/dev/python/release25-maint/python
08162000-081fe000 rw-p 08162000 00:00 0          [heap]
b6a00000-b6a21000 rw-p b6a00000 00:00 0
b6a21000-b6b00000 ---p b6a21000 00:00 0
b6bc1000-b6bc2000 ---p b6bc1000 00:00 0
b6bc2000-b73c2000 rw-p b6bc2000 00:00 0
b73c2000-b73c3000 ---p b73c2000 00:00 0
b73c3000-b7bc3000 rw-p b73c3000 00:00 0
b7bc3000-b7bff000 r-xp 00000000 08:05 325941     /lib/libncurses.so.5.6
b7bff000-b7c07000 rw-p 0003b000 08:05 325941     /lib/libncurses.so.5.6
b7c07000-b7c4e000 r-xp 00000000 08:05 325837     /lib/libncursesw.so.5.6
b7c4e000-b7c56000 rw-p 00046000 08:05 325837     /lib/libncursesw.so.5.6
b7c56000-b7c82000 r-xp 00000000 08:05 325955     /lib/libreadline.so.5.2
b7c82000-b7c86000 rw-p 0002c000 08:05 325955     /lib/libreadline.so.5.2
b7c86000-b7c87000 rw-p b7c86000 00:00 0
b7c87000-b7c8a000 r-xp 00000000 fe:01 10716611  
/home/heimes/dev/python/release25-maint/build/lib.linux-i686-2.5/readline.so
b7c8a000-b7c8b000 rw-p 00003000 fe:01 10716611  
/home/heimes/dev/python/release25-maint/build/lib.linux-i686-2.5/readline.so
b7c8b000-b7c92000 r--s 00000000 08:05 557857    
/usr/lib/gconv/gconv-modules.cache
b7c92000-b7cd1000 r--p 00000000 08:05 570306    
/usr/lib/locale/de_DE.utf8/LC_CTYPE
b7cd1000-b7d54000 rw-p b7cd1000 00:00 0
b7d54000-b7e98000 r-xp 00000000 08:05 326311    
/lib/tls/i686/cmov/libc-2.6.1.so
b7e98000-b7e99000 r--p 00143000 08:05 326311    
/lib/tls/i686/cmov/libc-2.6.1.so
b7e99000-b7e9b000 rw-p 00144000 08:05 326311    
/lib/tls/i686/cmov/libc-2.6.1.so
b7e9b000-b7e9e000 rw-p b7e9b000 00:00 0
b7e9e000-b7ec1000 r-xp 00000000 08:05 326315    
/lib/tls/i686/cmov/libm-2.6.1.so
b7ec1000-b7ec3000 rw-p 00023000 08:05 326315    
/lib/tls/i686/cmov/libm-2.6.1.so
b7ec3000-b7ec5000 r-xp 00000000 08:05 326330    
/lib/tls/i686/cmov/libutil-2.6.1.so
b7ec5000-b7ec7000 rw-p 00001000 08:05 326330    
/lib/tls/i686/cmov/libutil-2.6.1.so
b7ec7000-b7ec8000 rw-p b7ec7000 00:00 0
b7ec8000-b7eca000 r-xp 00000000 08:05 326314    
/lib/tls/i686/cmov/libdl-2.6.1.so
b7eca000-b7ecc000 rw-p 00001000 08:05 326314    
/lib/tls/i686/cmov/libdl-2.6.1.so
b7ecc000-b7ee0000 r-xp 00000000 08:05 326325    
/lib/tls/i686/cmov/libpthread-2.6.1.so
b7ee0000-b7ee2000 rw-p 00013000 08:05 326325    
/lib/tls/i686/cmov/libpthread-2.6.1.so
b7ee2000-b7ee4000 rw-p b7ee2000 00:00 0
b7ef1000-b7efb000 r-xp 00000000 08:05 325908     /lib/libgcc_s.so.1
b7efb000-b7efc000 rw-p 0000a000 08:05 325908     /lib/libgcc_s.so.1
b7efc000-b7f01000 rw-p b7efc000 00:00 0
b7f01000-b7f1b000 r-xp 00000000 08:05 326530     /lib/ld-2.6.1.so
b7f1b000-b7f1d000 rw-p 00019000 08:05 326530     /lib/ld-2.6.1.so
bfcd2000-bfcee000 rw-p bfcd2000 00:00 0          [stack]
ffffe000-fffff000 r-xp 00000000 00:00 0          [vdso]

However Python 3.0 doesn't crash:

Unhandled exception in thread started by <function worker at 0x840860c>
Traceback (most recent call last):
  File "<stdin>", line 6, in worker
  File "/home/heimes/dev/python/py3k/Lib/io.py", line 1234, in seek
    self.buffer.seek(pos)
  File "/home/heimes/dev/python/py3k/Lib/io.py", line 877, in seek
    return self.raw.seek(pos, whence)
IOError: [Errno 9] Bad file descriptor
msg64579 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-03-27 11:00
A small addition to Christian's code snippet allows me to reproduce the
problem as well:

import thread

f=open("tmp1", "w")

def worker():
    global f
    while 1:
        f.close()
        f = open("tmp1", "w")
        f.seek(0,0) 

thread.start_new_thread(worker, ())
thread.start_new_thread(worker, ())

while 1:
    pass
msg64642 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-03-28 18:09
This is a preliminary patch which shows how things might be done better.
It only addresses close(), seek() and dealloc right now. However, as
mentioned in test_close_open_seek, if I raise the number of workers, I
get crashes (while test_close_open is fine). Perhaps fseek() in the
glibc is thread unsafe when operating on the same file descriptor?
msg64643 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-03-28 18:20
Another approach would be to add a dedicated lock for each PyFileObject.
This sounds a bit bad performance-wise but after all the GIL itself is a
lock, and we release and re-acquire it on each file operation, so why not?
msg64645 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-03-28 18:47
On the other hand, surprisingly enough, the flockfile/funlockfile
manpage tells me that:

       The stdio functions are thread-safe.  This is achieved by 
assigning  to  each  FILE
       object  a  lockcount  and  (if the lockcount is nonzero) an
owning thread.  For each
       library call, these functions wait until the FILE object is no
longer  locked  by  a
       different thread, then lock it, do the requested I/O, and unlock
the object again.

This leaves me wondering what is happening in the above-mentioned test.
msg64654 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-03-28 22:45
Why hadn't I read #595601 in detail, it has an explanation: 

[quoting Jeremy Hylton]
The universal newline code is squirrels the FILE * in a
local variable, which is worse.  If it happens that
another thread closes the file, at best the local
points to a closed FILE *.  But that memory could get
recycled and then there's no way to know what it points to.
[/quoting]

Even with careful coding, there's a small window between releasing the
GIL on our side, and acquiring the FILE-specific lock in the glibc,
during which the fclose() function can be invoked and release the FILE
just before we invoke another function (e.g. fseek()) on it.
msg64661 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-03-29 00:30
Some good news: I've found a way to continue with my approach and make
it working. The close() method now raises an appropriate IOError when
another method is being called from another thread. Attaching a patch,
which protects a bunch of file object methods (together with tests). 

Now the bad news: the protection logic in fileobject.c is, hmm, a bit
contrived (and I'm not even sure it's 100% correct). If someone wants to
read it and put his veto before I go further...
msg64667 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-03-29 01:27
Closed #595601 as a duplicate.
msg64719 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-03-29 19:39
Actually, my approach was not 100% correct, it failed in some rare
cases. I've come to the conclusion that using an unlock count on the
PyFileObject is the correct way to proceed. I'm now attaching a working
and complete patch which protects all methods of the PyFileObject. The
original test suite runs fine, as well as the added test cases and Tim
Peters' crasher here:
http://mail.python.org/pipermail/python-dev/2003-June/036537.html

To sum up the changes brought by this patch:
- no supplementary locking
- but each time we release the GIL to do an operation on a FILE, we
increase a dedicated counter on the PyFileObject
- when close()ing a PyFileObject, if the aforementioned counter is
non-zero, we throw an IOError rather than risking calling fclose(). By
construction this cannot happen in the PyFileObject destructor, but if
ever it happens (for example if a C extension decides to put its hands
in the PyFileObject struct), we throw a SystemError instead.
msg64724 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-03-29 20:39
Ah, I had forgotten to protect the print statement as well. Here is a
new patch :-)
msg65017 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-04-06 03:07
I'm reviewing this patch now and plan to commit it after some testing.

A couple comments:

I'd rename your sts variables to status.

Also FYI:

Your use of volatile on the int unlocked_count member of PyFileObject
does not do what you think it does and isn't needed here anyways.

Access to the variable is always protected by the GIL unlocking and
locking of which should cause an implicit memory barrier guaranteeing
that all other CPUs in the system will see the same value stored in the
structure in memory.

The C volatile keyword on the other hand does not guarantee this. 
volatile is useful for memory mapped IO but it makes no guarantees about
cache coherent access between multiple CPUs.  (the atomic types in the
recent C++ standards are meant for that)


Both of the above are trivial changes, no need for another patch.
msg65022 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-04-06 06:25
I've attached my patch that I want to commit.  The main change from
filethread4 is some cleanup in file_test to make it run a lot faster and
add verbose mode output to indicate how well it is actually testing the
problem (counting the times that close raises IOError).

One concern holding up my commit:

Will this test pass on windows?  It is opening and closing the same file
in 'w+' mode from multiple threads of the same process at once.

Can someone with a windows dev environment please apply this patch and
test it.  If it dislikes the above file behavior, can you propose a fix
for it (set windows file non-exclusive flags or whatever you're supposed
to do... the worse alternative would be to use a new filename on each
open but that could cause a nightmare of thousands of new files being
created by the test which then have to be cleaned up)?

thanks,
-gps
msg65029 - (view) Author: Trent Nelson (trent) * (Python committer) Date: 2008-04-06 10:40
Patched and tested on one of my buildbots, test_file passes without 
error with your latest Patch Greg.
msg65031 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-04-06 11:09
Ok Greg, I wasn't sure locking/unlocking the GIL would create a memory
barrier but it sounds logical after all. Your patch looks fine to me.
msg65056 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-04-06 23:14
Committed to trunk in revision 62195.

Misc/NEWS entry added.

I also added two new C API functions: PyFile_IncUseCount and
PyFile_DecUseCount along with documentation.  They should be used by any
C extension code that uses PyFile_AsFile and wants to make use of the
returned FILE* with the GIL released.

The net effect of not using them is no change from the existing behavior
(crashes would be possible) for those C extension modules.
msg71388 - (view) Author: Kevin Watters (kevinwatters) Date: 2008-08-18 21:48
I know this is long closed, but no one on the nosy list happens to have
this fix backported to 2.5, do they? :) If not, I'll attach one here
eventually...
msg71389 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-18 21:52
> I know this is long closed, but no one on the nosy list happens to have
> this fix backported to 2.5, do they? :)

I think that at the time no one was sure the patch was 100% harmless. It
also subtly changes the behaviour of close() when called while another
IO operation is in progress in another thread, which is arguably a bug
fix but can still raise an exception it wouldn't have raised in 2.5.

So all in all I'm not sure this should be backported, although it would
probably be an improvement in most cases. I'll let someone else take the
decision.
msg71474 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-08-19 19:57
The fix can not be committed to Python 2.5 because it breaks
compatibility by adding another field to the PyFileObject struct and
adding two new C API functions.
msg118715 - (view) Author: Paul Bolle (pebolle) Date: 2010-10-14 20:48
0) I ran into some (small) problems with the documentation added by revision 62195. It seems more efficient to reuse this issue to report these. Feel free to ask me to open another issue if that's not appreciated.

1) A small patch that addresses two problems with the current (ie, 2.7) documentation should be attached:
- link three occurrences of "GIL" to the GIL entry in the glossary; and
- add some example code to clarify the usage of PyFile_IncUseCount() andPyFile_DecUseCount().

2) That patch also adds a link to the "Thread State and the Global Interpreter Lock" section to the GIL entry in the Documentation index. That is a separate problem. But fixing that minor problem could also be of help to people (like me) that need to better understand the GIL aspects of those two functions.
msg118723 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2010-10-14 22:09
Paul, please open a new issue and attach your patch(s) there; it should address issues in 2.7 and 3.2 only.
msg118744 - (view) Author: Paul Bolle (pebolle) Date: 2010-10-15 06:54
> please open a new issue and attach your patch(s) there

Issue 10111 now tracks the documentation problems.
History
Date User Action Args
2010-10-15 06:54:52pebollesetmessages: + msg118744
2010-10-14 22:09:45ned.deilysetnosy: + ned.deily
messages: + msg118723
2010-10-14 20:48:58pebollesetfiles: + python-2.7-Doc-gil.patch
nosy: + pebolle
messages: + msg118715

2008-08-19 19:57:27gregory.p.smithsetmessages: + msg71474
2008-08-18 21:52:47pitrousetmessages: + msg71389
2008-08-18 21:48:43kevinwatterssetnosy: + kevinwatters
messages: + msg71388
2008-04-06 23:14:16gregory.p.smithsetstatus: open -> closed
resolution: accepted
messages: + msg65056
versions: - Python 3.0
2008-04-06 11:09:26pitrousetmessages: + msg65031
2008-04-06 10:40:51trentsetnosy: + trent
messages: + msg65029
2008-04-06 06:25:10gregory.p.smithsetfiles: + filethread4-gps01.patch
messages: + msg65022
2008-04-06 03:07:15gregory.p.smithsetassignee: tim.peters -> gregory.p.smith
versions: + Python 3.0
messages: + msg65017
nosy: + gregory.p.smith
2008-03-31 20:48:34jyasskinsetnosy: + jyasskin
2008-03-29 20:39:27pitrousetfiles: + filethread4.patch
messages: + msg64724
2008-03-29 19:39:33pitrousetfiles: + filethread3.patch
messages: + msg64719
2008-03-29 01:27:31georg.brandllinkissue595601 superseder
2008-03-29 01:27:19georg.brandlsetnosy: + georg.brandl
messages: + msg64667
2008-03-29 00:30:35pitrousetfiles: + filethread2.patch
messages: + msg64661
2008-03-28 22:45:07pitrousetmessages: + msg64654
2008-03-28 18:47:37pitrousetmessages: + msg64645
2008-03-28 18:20:44pitrousetmessages: + msg64643
2008-03-28 18:09:52pitrousetfiles: + filethread1.patch
keywords: + patch
messages: + msg64642
2008-03-27 11:00:07pitrousetnosy: + pitrou
messages: + msg64579
2008-01-04 00:06:38christian.heimessetversions: - Python 2.3
2007-11-09 22:14:34christian.heimessetnosy: + christian.heimes
messages: + msg57338
versions: + Python 2.6, Python 2.5
2007-09-17 10:16:00jafosetassignee: tim.peters
messages: + msg55960
nosy: + jafo, tim.peters
2007-08-28 18:34:36georg.brandlsetpriority: normal -> critical
type: crash
severity: normal -> major
2007-08-28 18:33:45georg.brandllinkissue1778376 superseder
2003-10-01 06:22:02janixiacreate