
Author nirs
Recipients nirs
Date 2017-11-30.18:23:25
Message-id <1512066205.79.0.213398074469.issue32186@psf.upfronthosting.co.za>
Content
Using io.FileIO can hang all threads when accessing a file on an inaccessible
NFS server.

To reproduce this, you need to open the file like this:

    fd = os.open(filename, ...)
    fio = io.FileIO(fd, "r+", closefd=True)

Inside fileio_init there is a check_fd() call, which calls fstat() without
releasing the GIL. If fstat() blocks, all threads hang.

The expected behavior is to block only the thread that is stuck in the system
call, so the system stays responsive and can serve other tasks.
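The usual CPython idiom for this is to wrap the blocking call in
Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS, so the GIL is released while
the system call may block. A minimal sketch of the pattern (not code taken
from fileio.c):

    #include <Python.h>
    #include <sys/stat.h>

    static int
    fstat_without_gil(int fd, struct stat *st)
    {
        int res;
        /* Release the GIL: other Python threads keep running while we block. */
        Py_BEGIN_ALLOW_THREADS
        res = fstat(fd, st);
        Py_END_ALLOW_THREADS
        return res;
    }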

Here is the log showing this issue, created with the attached reproducer
script (fileio_nfs_test.py).

# python fileio_nfs_test.py mnt/fileio.out dumbo.tlv.redhat.com
2017-11-30 18:41:49,159 - (MainThread) - pid=3436
2017-11-30 18:41:49,159 - (MainThread) - Opening mnt/fileio.out
2017-11-30 18:41:49,160 - (MainThread) - OK fd=3
2017-11-30 18:41:49,161 - (MainThread) - Starting canary thread
2017-11-30 18:41:49,161 - (Canary) - Blocking access to storage
2017-11-30 18:41:49,169 - (Canary) - If this test is hang, please run: iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP
2017-11-30 18:41:49,169 - (Canary) - check 0
2017-11-30 18:41:49,169 - (MainThread) - Waiting until storage is blocked...
2017-11-30 18:41:50,170 - (Canary) - check 1
2017-11-30 18:41:51,170 - (Canary) - check 2
2017-11-30 18:41:52,171 - (Canary) - check 3
2017-11-30 18:41:53,171 - (Canary) - check 4
2017-11-30 18:41:54,172 - (Canary) - check 5
2017-11-30 18:41:55,172 - (Canary) - check 6
2017-11-30 18:41:56,172 - (Canary) - check 7
2017-11-30 18:41:57,173 - (Canary) - check 8
2017-11-30 18:41:58,173 - (Canary) - check 9
2017-11-30 18:41:59,174 - (MainThread) - Opening io.FileIO

Everything is hung now!

After some time I ran this from another shell:
 iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP

And now the script is unblocked and finishes.

2017-11-30 18:45:29,683 - (MainThread) - OK
2017-11-30 18:45:29,684 - (MainThread) - Creating mmap
2017-11-30 18:45:29,684 - (Canary) - check 10
2017-11-30 18:45:29,684 - (MainThread) - OK
2017-11-30 18:45:29,685 - (MainThread) - Filling mmap
2017-11-30 18:45:29,685 - (MainThread) - OK
2017-11-30 18:45:29,685 - (MainThread) - Writing mmap to storage
2017-11-30 18:45:29,719 - (MainThread) - OK
2017-11-30 18:45:29,719 - (MainThread) - Syncing
2017-11-30 18:45:29,719 - (MainThread) - OK
2017-11-30 18:45:29,720 - (MainThread) - Done

We have a canary thread logging every second. Once we tried to open the
FileIO, the canary thread stopped - this is possible only if the _io
extension module was holding the GIL during a blocking call.

And here is the kernel backtrace of the hung process:

# cat /proc/3436/stack
[<ffffffffc03228e4>] rpc_wait_bit_killable+0x24/0xb0 [sunrpc]
[<ffffffffc03240d4>] __rpc_execute+0x154/0x410 [sunrpc]
[<ffffffffc03257b8>] rpc_execute+0x68/0xb0 [sunrpc]
[<ffffffffc0315776>] rpc_run_task+0xf6/0x150 [sunrpc]
[<ffffffffc07eb8c3>] nfs4_call_sync_sequence+0x63/0xa0 [nfsv4]
[<ffffffffc07ec94c>] _nfs4_proc_getattr+0xcc/0xf0 [nfsv4]
[<ffffffffc07f88f2>] nfs4_proc_getattr+0x72/0xf0 [nfsv4]
[<ffffffffc07b543f>] __nfs_revalidate_inode+0xbf/0x310 [nfs]
[<ffffffffc07b5a55>] nfs_getattr+0x95/0x250 [nfs]
[<ffffffff81206146>] vfs_getattr+0x46/0x80
[<ffffffff812061c5>] vfs_fstat+0x45/0x80
[<ffffffff812068e4>] SYSC_newfstat+0x24/0x60
[<ffffffff81206ade>] SyS_newfstat+0xe/0x10
[<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

You cannot attach to the process with gdb, since it is in D state, but once
the process is unblocked, gdb takes control, and we see:

Thread 2 (Thread 0x7f97a2ea5700 (LWP 4799)):
#0  0x00007f97ab925a0b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007f97ab925a9f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f97ab925b3b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007f97abc455f5 in PyThread_acquire_lock () from /lib64/libpython2.7.so.1.0
#4  0x00007f97abc11156 in PyEval_RestoreThread () from /lib64/libpython2.7.so.1.0
#5  0x00007f97a44f9086 in time_sleep () from /usr/lib64/python2.7/lib-dynload/timemodule.so
#6  0x00007f97abc18bb0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#7  0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#8  0x00007f97abc183fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#9  0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#10 0x00007f97abc183fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#11 0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#12 0x00007f97abba494d in function_call () from /lib64/libpython2.7.so.1.0
#13 0x00007f97abb7f9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#14 0x00007f97abc135bd in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#15 0x00007f97abc1857d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#16 0x00007f97abc1857d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#17 0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#18 0x00007f97abba4858 in function_call () from /lib64/libpython2.7.so.1.0
#19 0x00007f97abb7f9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#20 0x00007f97abb8e995 in instancemethod_call () from /lib64/libpython2.7.so.1.0
#21 0x00007f97abb7f9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#22 0x00007f97abc117b7 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0
#23 0x00007f97abc496e2 in t_bootstrap () from /lib64/libpython2.7.so.1.0
#24 0x00007f97ab91fe25 in start_thread () from /lib64/libpthread.so.0
#25 0x00007f97aaf4434d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f97ac10d740 (LWP 4798)):
#0  0x00007f97aaf35154 in _fxstat () from /lib64/libc.so.6
#1  0x00007f97a470ad32 in fileio_init () from /usr/lib64/python2.7/lib-dynload/_io.so
#2  0x00007f97abbd565f in type_call () from /lib64/libpython2.7.so.1.0
#3  0x00007f97abb7f9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#4  0x00007f97abc140f6 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#5  0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#6  0x00007f97abc1b002 in PyEval_EvalCode () from /lib64/libpython2.7.so.1.0
#7  0x00007f97abc3443f in run_mod () from /lib64/libpython2.7.so.1.0
#8  0x00007f97abc355fe in PyRun_FileExFlags () from /lib64/libpython2.7.so.1.0
#9  0x00007f97abc36889 in PyRun_SimpleFileExFlags () from /lib64/libpython2.7.so.1.0
#10 0x00007f97abc47a3f in Py_Main () from /lib64/libpython2.7.so.1.0
#11 0x00007f97aae6dc05 in __libc_start_main () from /lib64/libc.so.6
#12 0x000000000040071e in _start ()

Looking at the Python source, there are two helpers in fileio.c that call
fstat() without releasing the GIL during the call:

- https://github.com/python/cpython/blob/2.7/Modules/_io/fileio.c#L145
- https://github.com/python/cpython/blob/2.7/Modules/_io/fileio.c#L161

Both helpers are called from fileio_init (the implementation of io.FileIO()):

- https://github.com/python/cpython/blob/2.7/Modules/_io/fileio.c#L332
- https://github.com/python/cpython/blob/2.7/Modules/_io/fileio.c#L366
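
A possible direction for a fix (just a sketch against the 2.7 check_fd()
helper, not a tested patch) is to release the GIL around the fstat() call,
following the same Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS pattern:

    /* Sketch only: check_fd() from Modules/_io/fileio.c, rewritten to drop
     * the GIL while fstat() may block on an unreachable NFS server.
     * errno is saved before the GIL is re-acquired. */
    static int
    check_fd(int fd)
    {
    #if defined(HAVE_FSTAT)
        struct stat buf;
        int res, err;

        Py_BEGIN_ALLOW_THREADS
        res = fstat(fd, &buf);
        err = errno;
        Py_END_ALLOW_THREADS

        if (res < 0 && err == EBADF) {
            PyObject *exc = PyObject_CallFunction(PyExc_OSError, "(is)",
                                                  EBADF, strerror(EBADF));
            PyErr_SetObject(PyExc_OSError, exc);
            Py_XDECREF(exc);
            return -1;
        }
    #endif
        return 0;
    }

dircheck() would need the same treatment around its fstat() call; releasing
the GIL there should be safe because the helper only reads the fd into a
local struct stat.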


Reported by an RHV user, see https://bugzilla.redhat.com/1518676
(the bug may be private, sorry if you cannot access it).