This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: os.listdir hangs since opendir() and closedir() do not release GIL
Type: crash Stage: resolved
Components: Library (Lib) Versions: Python 3.1, Python 3.2, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: theller Nosy List: Alexander.Belopolsky, brian.curtin, loewis, marcin.bachry, nikratio, pitrou, theller
Priority: normal Keywords: needs review, patch

Created on 2010-01-18 23:24 by nikratio, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
hello_ll.c nikratio, 2010-01-18 23:24
libfuse26.py nikratio, 2010-01-18 23:25
libfuse28.py nikratio, 2010-01-18 23:25
call_hello.py nikratio, 2010-01-18 23:46
posix-opendir-and-gil.diff marcin.bachry, 2010-01-19 21:02 release GIL before opendir()
Messages (16)
msg98038 - (view) Author: Nikolaus Rath (nikratio) * Date: 2010-01-18 23:24
The following is a very subtle bug and it took me a couple of days to reduce it to a manageable test case. So please bear with me even if it's tedious.

The problem is that in some cases the use of ctypes for a callback function freezes the entire process. Even 'gdb -p' freezes when trying to attach to the affected PID. The process can not be killed with SIGTERM.

Please consider the attached example. It needs the fuse headers installed to work (package libfuse-dev under Debian).

Step 1: Compile hello_ll.c with   

gcc -Wall -g3 -O0 -fPIC `pkg-config fuse --cflags` -c hello_ll.c
gcc -Wall -g3 -O0 -shared `pkg-config fuse --libs` -o hello_ll.so hello_ll.o

into a shared library. The do_mount() function will be called with ctypes from call_hello.py.

Step 2: In call_hello.py you need to adjust the import of libfuse to match your installed FUSE version (2.6 or 2.8). The libfuse2{6,8}.py files have been autogenerated with ctypeslib. I can provide versions for other FUSE versions as well if necessary. Note that the problem is most likely not with the definitions in this file, I checked all of them carefully by hand.


Step 3: Create an empty directory 'mnt' in the working directory

Step 4: Run call_hello.py. It will mount a simple example FUSE file system in mnt, read the directory, umount the fs and exit.

Step 5: Uncomment line 36 of call_hello.py. This will substitute the C function hello_ll_opendir() (defined in hello_ll.c) by the Python function fuse_opendir(). Note that both functions do exactly the same thing, they just call the fuse function fuse_reply_err.

Step 6: Run call_hello.py again. The program will hang completely. You can salvage the mount point with "fusermount -u -z mnt".

Step 7: Comment out line 47 in call_hello.py, so that the process itself does not try to access the mountpoint anymore.

Step 8: Run call_hello.py again. Observe that the file system mounts and umounts fine. Accessing the mountpoint with a different process does not trigger the freeze either.


In summary, the process hangs only if:

1) The opendir function is implemented as a ctypes callback

and

2) the mounting process itself tries to access the mountpoint


I suspect that the ctypes callback may somehow deadlock with the thread that tries to access the mountpoint.

If someone can tell me how to get around the hanging gdb process, i'll be happy to provide a backtrace. I have already installed debugging symbols for both Python and libfuse.


Thank you for taking the time to read up to this point :-)
msg98041 - (view) Author: Nikolaus Rath (nikratio) * Date: 2010-01-18 23:45
Reproduced with Python 3.1
msg98067 - (view) Author: Marcin Bachry (marcin.bachry) Date: 2010-01-19 21:02
Here's the backtrace:

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x00007fcd0234a6e8 in PyThread_acquire_lock (lock=0x183f3f0, waitflag=128) at Python/thread_pthread.h:349
#2  0x00007fcd02311fa4 in PyEval_RestoreThread (tstate=0x17a9b00) at Python/ceval.c:356
#3  0x00007fcd02336a28 in PyGILState_Ensure () at Python/pystate.c:592
#4  0x00007fcd013fcec1 in _CallPythonObject (cif=0x183f3f0, resp=0x80, args=0x7fccff153f40, userdata=<value optimized out>)
    at /home/marcin/Desktop/python/python2.x-hg/Modules/_ctypes/callbacks.c:220
#5  closure_fcn (cif=0x183f3f0, resp=0x80, args=0x7fccff153f40, userdata=<value optimized out>)
    at /home/marcin/Desktop/python/python2.x-hg/Modules/_ctypes/callbacks.c:374
#6  0x00007fcd014044f0 in ffi_closure_unix64_inner (closure=<value optimized out>, rvalue=<value optimized out>, reg_args=0x246, argp=0x0)
    at /home/marcin/Desktop/python/python2.x-hg/Modules/_ctypes/libffi/src/x86/ffi64.c:566
#7  0x00007fcd01404b68 in ffi_closure_unix64 () at /home/marcin/Desktop/python/python2.x-hg/Modules/_ctypes/libffi/src/x86/unix64.S:230
#8  0x00007fccff572429 in ?? () from /lib/libfuse.so.2
#9  0x00007fccff5704fa in fuse_session_loop () from /lib/libfuse.so.2
#10 0x00007fccff1576e4 in ?? ()
#11 0x00007fccff154188 in ?? ()
#12 0x00007fccff3580e0 in ?? ()
#13 0x000000000192e2f0 in ?? ()
#14 0x0000000400000000 in ?? ()
#15 0x0000000000000001 in ?? ()
#16 0x00000000016ff5e0 in ?? ()
#17 0x0000000000000001 in ?? ()
#18 0x0000000000000000 in ?? ()

All calls to native POSIX functions seem to be wrapped with Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS except of opendir() and closedir(). When I fix it (as in attached patch), call_hello.py doesn't block anymore.
msg98068 - (view) Author: Nikolaus Rath (nikratio) * Date: 2010-01-19 21:32
Wow, great! Thanks for looking into this. How did you manage to get the backtrace? As I said, on my system gdb -p just hangs with when attaching.

I'm changing the component, since it seems that it's the os module that's at fault and not ctypes.
msg98073 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-01-20 11:11
Releasing the GIL when calling C library functions (e.g. opendir()) is not a requirement, it's just an optimization for slightly better multi-threading. Also, as shown in the backtrace, PyGILState_Ensure() is called which should only try to acquire the GIL if it isn't already held by the current thread; that is, it is designed to avoid the deadlock you are witnessing.

So, I would suggest the problem lies somewhere else. Does FUSE implicitly do a fork() of the current process? It's not obvious how the example shown here works.
msg98075 - (view) Author: Nikolaus Rath (nikratio) * Date: 2010-01-20 12:12
In this simple example, FUSE does not fork and does not start any threads.

Note that PyGILState_Ensure() cannot do anything here. What happens is this:

 - call_hello.py calls FUSE in a new thread, releasing the GIL.
 - FUSE mounts the file system and waits for requests
 - Meanwhile, in the main thread, call_hello.py calls opendir(), but does not release the GIL
 - FUSE receives the opendir() system call and calls the appropriate callback function
 - If the callback function is implemented in C, everything works fine.
 - If the callback function is implemented in Python, the FUSE thread tries to acquire the GIL to call the (Python) opendir() handler. But it cannot do so, because the GIL is still held by the main thread (which is waiting for the opendir syscall to return) => deadlock.
msg98076 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-01-20 12:19
Ah, thanks for the explanation. Yes indeed the patch looks ok for the
job. You should just be aware that similar problems may appear with
other system calls. I don't think we have ever considered that common C
calls such as opendir() could call back into Python code through
libraries such as FUSE.

By the way, it seems there are Python bindings for FUSE, in two
different flavours. You might want to look into them, and perhaps to
check whether they also experience this issue.

Le mercredi 20 janvier 2010 à 12:12 +0000, Nikolaus Rath a écrit :
> Nikolaus Rath <Nikolaus@rath.org> added the comment:
> 
> In this simple example, FUSE does not fork and does not start any threads.
> 
> Note that PyGILState_Ensure() cannot do anything here. What happens is this:
> 
>  - call_hello.py calls FUSE in a new thread, releasing the GIL.
>  - FUSE mounts the file system and waits for requests
>  - Meanwhile, in the main thread, call_hello.py calls opendir(), but does not release the GIL
>  - FUSE receives the opendir() system call and calls the appropriate callback function
>  - If the callback function is implemented in C, everything works fine.
>  - If the callback function is implemented in Python, the FUSE thread tries to acquire the GIL to call the (Python) opendir() handler. But it cannot do so, because the GIL is still held by the main thread (which is waiting for the opendir syscall to return) => deadlock.
msg98077 - (view) Author: Nikolaus Rath (nikratio) * Date: 2010-01-20 12:38
I have used both of them in the past, but in the end I wrote my own bindings (currently only available as part of the http://code.google.com/p/s3ql source code, but I intend to factor it out at some point), 
since neither fuse.py (http://code.google.com/p/fusepy) nor python-fuse (fuse.sf.net) support the FUSE low level API.

That said, I am reasonably sure that in both of them are affected by this bug too. However, it may not show up in practive very often, because the high level FUSE API by default forks into a background process before mounting the file system.
msg98078 - (view) Author: Nikolaus Rath (nikratio) * Date: 2010-01-20 12:45
On 01/20/2010 07:19 AM, Antoine Pitrou wrote:
> Ah, thanks for the explanation. Yes indeed the patch looks ok for the
> job. You should just be aware that similar problems may appear with
> other system calls. I don't think we have ever considered that common C
> calls such as opendir() could call back into Python code through
> libraries such as FUSE.

Well, now that I know what to look for, tracking down more of these 
problems should be significantly faster and easier. Are you generally 
going to accept similar patches for other unprotected syscalls?

Still, I'd be extremly grateful if someone could tell me the trick how 
to create a backtrace in such a deadlock situation... Or am I the only 
one for which a simple "gdb -a" does not work?

Best,

    -Nikolaus
msg98079 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-01-20 13:03
> Well, now that I know what to look for, tracking down more of these 
> problems should be significantly faster and easier. Are you generally 
> going to accept similar patches for other unprotected syscalls?

Until now the rule of thumb was to consider only time-consuming syscalls
(as I said, the primary goal is optimizing multi-threaded I/O). But I
guess we can accept such patches if they have no downsides. Martin, what
do you think?

> Still, I'd be extremly grateful if someone could tell me the trick how 
> to create a backtrace in such a deadlock situation... Or am I the only 
> one for which a simple "gdb -a" does not work?

Perhaps you can try to crash the process (using `kill -ABRT`) and then
load the core dump using gdb.
msg98082 - (view) Author: Marcin Bachry (marcin.bachry) Date: 2010-01-20 14:02
> Still, I'd be extremly grateful if someone could tell me the trick how 
to create a backtrace in such a deadlock situation

Sorry, I should have mentioned that. In order to get backtrace you let the process freeze, attach gdb to it (it will freeze too), then go to /sys/fs/fuse/connections/ and find your fuse connection (it's most likely the one with non-zero value in "waiting" file). Then do "echo 1 > abort" and go back to gdb.
msg98170 - (view) Author: Nikolaus Rath (nikratio) * Date: 2010-01-23 00:09
The patch works fine for me too. Also, I did not discover any other such problems for other syscalls (but I did not systematically try all os.* functions).
msg98210 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-01-24 09:19
I think the bug here is really in the application, which makes a system call to fuse and the fuse callback in the same process. This isn't supported, and IMO doesn't need to be. Python makes no promises that it will accept callbacks while a system call is in progress.

That said, working around this problem in this specific case with the proposed patch is fine with me.
msg99412 - (view) Author: Nikolaus Rath (nikratio) * Date: 2010-02-16 14:25
Does this patch still need review? Both Martin and Antoine already commented that the patch is ok, so it'd be great if someone could actually apply it...
msg99996 - (view) Author: Alexander Belopolsky (Alexander.Belopolsky) Date: 2010-02-24 02:24
Another "+1" on the patch if it is needed.  Posix module consistently releases the GIL while making system calls and this is a good idea independent of this particular issue.

Looking at the patch made me wonder whether existing

                Py_BEGIN_ALLOW_THREADS
                ep = readdir(dirp);
                Py_END_ALLOW_THREADS

is safe.  Apparently it is safe as was explained in

http://mail.python.org/pipermail/python-dev/2006-April/063808.html

because "The pointer returned by readdir() points to data which may be overwritten by another call to readdir() on the same directory stream. This data is not overwritten by another call to readdir() on a different directory stream."

It may be appropriate to add a comment near the readdir() explaining why it is safe in this context.  Clearly I am not the only one who found this non-obvious.
msg115589 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-04 17:27
Committed in r84489 (3.x), r84490 (3.1) and r84491 (2.7). Thank you!
History
Date User Action Args
2022-04-11 14:56:56adminsetgithub: 51985
2010-09-04 17:27:50pitrousetstatus: open -> closed
versions: - Python 2.6
messages: + msg115589

resolution: fixed
stage: patch review -> resolved
2010-02-24 02:24:54Alexander.Belopolskysetnosy: + Alexander.Belopolsky
messages: + msg99996
2010-02-16 14:25:12nikratiosetmessages: + msg99412
2010-01-24 09:19:13loewissetmessages: + msg98210
2010-01-23 00:09:15nikratiosetmessages: + msg98170
2010-01-20 14:02:51marcin.bachrysetmessages: + msg98082
2010-01-20 13:03:50pitrousetmessages: + msg98079
2010-01-20 12:45:22nikratiosetmessages: + msg98078
title: os.listdir hangs since opendir() and closedir() do not release GIL -> os.listdir hangs since opendir() and closedir() do not release GIL
2010-01-20 12:38:04nikratiosetmessages: + msg98077
2010-01-20 12:19:18pitrousetmessages: + msg98076
2010-01-20 12:12:11nikratiosetmessages: + msg98075
2010-01-20 11:11:40pitrousetnosy: + pitrou
messages: + msg98073
2010-01-20 10:57:10pitrousetnosy: + loewis
2010-01-19 21:36:42brian.curtinsetversions: + Python 3.2
nosy: + brian.curtin

priority: normal
keywords: + needs review
stage: patch review
2010-01-19 21:33:28nikratiosettitle: ctypes freezes/deadlocks process -> os.listdir hangs since opendir() and closedir() do not release GIL
2010-01-19 21:32:01nikratiosetnosy: theller, marcin.bachry, nikratio
messages: + msg98068
components: + Library (Lib), - ctypes
versions: + Python 2.7
2010-01-19 21:02:40marcin.bachrysetfiles: + posix-opendir-and-gil.diff

nosy: + marcin.bachry
messages: + msg98067

keywords: + patch
2010-01-18 23:46:27nikratiosetfiles: + call_hello.py
2010-01-18 23:46:21nikratiosetfiles: - call_hello.py
2010-01-18 23:45:26nikratiosetmessages: + msg98041
versions: + Python 3.1
2010-01-18 23:42:11nikratiosetfiles: - call_hello.py
2010-01-18 23:42:02nikratiosetfiles: + call_hello.py
2010-01-18 23:25:12nikratiosetfiles: + libfuse28.py
2010-01-18 23:25:05nikratiosetfiles: + libfuse26.py
2010-01-18 23:24:57nikratiosetfiles: + hello_ll.c
2010-01-18 23:24:50nikratiosetfiles: + call_hello.py
2010-01-18 23:24:09nikratiocreate