classification
Title: io.FileIO hang all threads if fstat blocks on inaccessible NFS server
Type: behavior Stage: resolved
Components: Extension Modules Versions: Python 3.7, Python 3.6, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, berker.peksag, nirs, stutzbach, vstinner
Priority: normal Keywords: patch

Created on 2017-11-30 18:23 by nirs, last changed 2017-12-09 08:27 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
fileio_nfs_test.py nirs, 2017-11-30 18:23
Pull Requests
URL Status Linked Edit
PR 4652 merged nirs, 2017-11-30 18:29
PR 4651 merged nirs, 2017-11-30 18:29
PR 4661 merged python-dev, 2017-12-01 01:19
Messages (9)
msg307335 - (view) Author: Nir Soffer (nirs) * Date: 2017-11-30 18:23
Using io.FileIO can hang all threads when accessing an inaccessible NFS
server.

To reproduce this, you need to open the file like this:

    fd = os.open(filename, ...)
    fio = io.FileIO(fd, "r+", closefd=True)

Inside fileio_init, there is a checkfd call, calling fstat without releasing
the GIL. This will hang all threads.

The expected behavior is blocking only the thread blocked on the system call,
so a system stay responsive and can serve other tasks.

Here is the log showing this issue, created with the attached reproducer
script (fileio_nfs_test.py).

# python fileio_nfs_test.py mnt/fileio.out dumbo.tlv.redhat.com
2017-11-30 18:41:49,159 - (MainThread) - pid=3436
2017-11-30 18:41:49,159 - (MainThread) - Opening mnt/fileio.out
2017-11-30 18:41:49,160 - (MainThread) - OK fd=3
2017-11-30 18:41:49,161 - (MainThread) - Starting canary thread
2017-11-30 18:41:49,161 - (Canary) - Blocking access to storage
2017-11-30 18:41:49,169 - (Canary) - If this test is hang, please run: iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP
2017-11-30 18:41:49,169 - (Canary) - check 0
2017-11-30 18:41:49,169 - (MainThread) - Waiting until storage is blocked...
2017-11-30 18:41:50,170 - (Canary) - check 1
2017-11-30 18:41:51,170 - (Canary) - check 2
2017-11-30 18:41:52,171 - (Canary) - check 3
2017-11-30 18:41:53,171 - (Canary) - check 4
2017-11-30 18:41:54,172 - (Canary) - check 5
2017-11-30 18:41:55,172 - (Canary) - check 6
2017-11-30 18:41:56,172 - (Canary) - check 7
2017-11-30 18:41:57,173 - (Canary) - check 8
2017-11-30 18:41:58,173 - (Canary) - check 9
2017-11-30 18:41:59,174 - (MainThread) - Opening io.FileIO

Everything is hang now!

After some time I run this from another shell:
 iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP

And now the script is unblocked and finish.

2017-11-30 18:45:29,683 - (MainThread) - OK
2017-11-30 18:45:29,684 - (MainThread) - Creating mmap
2017-11-30 18:45:29,684 - (Canary) - check 10
2017-11-30 18:45:29,684 - (MainThread) - OK
2017-11-30 18:45:29,685 - (MainThread) - Filling mmap
2017-11-30 18:45:29,685 - (MainThread) - OK
2017-11-30 18:45:29,685 - (MainThread) - Writing mmap to storage
2017-11-30 18:45:29,719 - (MainThread) - OK
2017-11-30 18:45:29,719 - (MainThread) - Syncing
2017-11-30 18:45:29,719 - (MainThread) - OK
2017-11-30 18:45:29,720 - (MainThread) - Done

We have a canary thread logging every second. Once we tried to open
the FileIO, the canary thread stopped - this is possible only if the io
extension module was holding the GIL during a blocking call.

And here is the backtrace of the hang process in the kernel:

# cat /proc/3436/stack
[<ffffffffc03228e4>] rpc_wait_bit_killable+0x24/0xb0 [sunrpc]
[<ffffffffc03240d4>] __rpc_execute+0x154/0x410 [sunrpc]
[<ffffffffc03257b8>] rpc_execute+0x68/0xb0 [sunrpc]
[<ffffffffc0315776>] rpc_run_task+0xf6/0x150 [sunrpc]
[<ffffffffc07eb8c3>] nfs4_call_sync_sequence+0x63/0xa0 [nfsv4]
[<ffffffffc07ec94c>] _nfs4_proc_getattr+0xcc/0xf0 [nfsv4]
[<ffffffffc07f88f2>] nfs4_proc_getattr+0x72/0xf0 [nfsv4]
[<ffffffffc07b543f>] __nfs_revalidate_inode+0xbf/0x310 [nfs]
[<ffffffffc07b5a55>] nfs_getattr+0x95/0x250 [nfs]
[<ffffffff81206146>] vfs_getattr+0x46/0x80
[<ffffffff812061c5>] vfs_fstat+0x45/0x80
[<ffffffff812068e4>] SYSC_newfstat+0x24/0x60
[<ffffffff81206ade>] SyS_newfstat+0xe/0x10
[<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

You cannot attach to the process with gdb, since it is in D state, but once
the process is unblocked, gbd takes control, and we see:

Thread 2 (Thread 0x7f97a2ea5700 (LWP 4799)):
#0  0x00007f97ab925a0b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007f97ab925a9f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f97ab925b3b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007f97abc455f5 in PyThread_acquire_lock () from /lib64/libpython2.7.so.1.0
#4  0x00007f97abc11156 in PyEval_RestoreThread () from /lib64/libpython2.7.so.1.0
#5  0x00007f97a44f9086 in time_sleep () from /usr/lib64/python2.7/lib-dynload/timemodule.so
#6  0x00007f97abc18bb0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#7  0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#8  0x00007f97abc183fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#9  0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#10 0x00007f97abc183fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#11 0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#12 0x00007f97abba494d in function_call () from /lib64/libpython2.7.so.1.0
#13 0x00007f97abb7f9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#14 0x00007f97abc135bd in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#15 0x00007f97abc1857d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#16 0x00007f97abc1857d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#17 0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#18 0x00007f97abba4858 in function_call () from /lib64/libpython2.7.so.1.0
#19 0x00007f97abb7f9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#20 0x00007f97abb8e995 in instancemethod_call () from /lib64/libpython2.7.so.1.0
#21 0x00007f97abb7f9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#22 0x00007f97abc117b7 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0
#23 0x00007f97abc496e2 in t_bootstrap () from /lib64/libpython2.7.so.1.0
#24 0x00007f97ab91fe25 in start_thread () from /lib64/libpthread.so.0
#25 0x00007f97aaf4434d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f97ac10d740 (LWP 4798)):
#0  0x00007f97aaf35154 in _fxstat () from /lib64/libc.so.6
#1  0x00007f97a470ad32 in fileio_init () from /usr/lib64/python2.7/lib-dynload/_io.so
#2  0x00007f97abbd565f in type_call () from /lib64/libpython2.7.so.1.0
#3  0x00007f97abb7f9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#4  0x00007f97abc140f6 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#5  0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
Missing separate debuginfos, use: debuginfo-install python-2.7.5-58.el7.x86_64
---Type <return> to continue, or q <return> to quit---
#6  0x00007f97abc1b002 in PyEval_EvalCode () from /lib64/libpython2.7.so.1.0
#7  0x00007f97abc3443f in run_mod () from /lib64/libpython2.7.so.1.0
#8  0x00007f97abc355fe in PyRun_FileExFlags () from /lib64/libpython2.7.so.1.0
#9  0x00007f97abc36889 in PyRun_SimpleFileExFlags () from /lib64/libpython2.7.so.1.0
#10 0x00007f97abc47a3f in Py_Main () from /lib64/libpython2.7.so.1.0
#11 0x00007f97aae6dc05 in __libc_start_main () from /lib64/libc.so.6
#12 0x000000000040071e in _start ()

Looking at python code - there are two helpers in fileio.c that call fstat
without releasing the GIL during the call:

- https://github.com/python/cpython/blob/2.7/Modules/_io/fileio.c#L145
- https://github.com/python/cpython/blob/2.7/Modules/_io/fileio.c#L161

And both helpers are called from fileio_init (the implementation of io.FileIO())

- https://github.com/python/cpython/blob/2.7/Modules/_io/fileio.c#L332
- https://github.com/python/cpython/blob/2.7/Modules/_io/fileio.c#L366


Reported by RHV user, see https://bugzilla.redhat.com/1518676
(the bug may be private, sorry if you cannot access it).
msg307336 - (view) Author: Nir Soffer (nirs) * Date: 2017-11-30 18:28
Forgot to mention - reproducible with python 2.7.

Similar issues exists in python 3, but I did not try to reproduce since we
are using python 2.7.

I posted patches for both 2.7 and master:
- https://github.com/python/cpython/pull/4651
- https://github.com/python/cpython/pull/4652
msg307346 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-11-30 22:14
We already release the GIL when calling lseek() in fileio.c, in the portable_lseek() function. So it makes sense to also do it in _io_FileIO_readall_impl() in the same file. os.lseek() also releases the GIL. 

I found another functions which calls lseek() without releasing the GIL:

* the Windows implementation of new_mmap_object()
* _Py_DisplaySourceLine()
* fp_setreadl() of Parser/tokenizer.c

I'm not sure that these 3 functions should be modified. In case of doubt, I prefer to not touch the code.
msg307354 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-01 01:19
New changeset 6a89481680b921e7b317c29877bdda9a6031e5ad by Victor Stinner (Nir Soffer) in branch 'master':
bpo-32186: Release the GIL during lseek and fstat (#4652)
https://github.com/python/cpython/commit/6a89481680b921e7b317c29877bdda9a6031e5ad
msg307365 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-01 06:26
New changeset 8bcd41040a5f1f9b48a86d0e21f196e4b1f90e4b by Victor Stinner (Miss Islington (bot)) in branch '3.6':
bpo-32186: Release the GIL during lseek and fstat (GH-4652) (#4661)
https://github.com/python/cpython/commit/8bcd41040a5f1f9b48a86d0e21f196e4b1f90e4b
msg307821 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-07 20:25
New changeset 830daae1c82ed33deef0086b7b6323e5be0b0cc8 by Victor Stinner (Nir Soffer) in branch '2.7':
[2.7] bpo-32186: Release the GIL during fstat and lseek calls (#4651)
https://github.com/python/cpython/commit/830daae1c82ed33deef0086b7b6323e5be0b0cc8
msg307822 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-07 20:27
The bug has been fixed in Python 2.7, 3.6 and the master branch.

Thank you Nir Soffer for the bug report and the fix!
msg307882 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2017-12-09 07:16
This has been fixed in all active branches (2.7, 3.6 and master) so I think we can close it as 'fixed'. Thanks, Nir!
msg307883 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-09 08:27
(Oops, closing was my intent of my previous comment, but I forgot it,
thanks Berker.)
History
Date User Action Args
2017-12-09 08:27:30vstinnersetmessages: + msg307883
2017-12-09 07:16:34berker.peksagsetstatus: open -> closed

type: behavior

nosy: + berker.peksag
messages: + msg307882
resolution: fixed
stage: patch review -> resolved
2017-12-07 20:27:00vstinnersetmessages: + msg307822
versions: - Python 3.5, Python 3.8
2017-12-07 20:25:41vstinnersetmessages: + msg307821
2017-12-01 06:26:33vstinnersetmessages: + msg307365
2017-12-01 01:19:21python-devsetpull_requests: + pull_request4571
2017-12-01 01:19:01vstinnersetmessages: + msg307354
2017-11-30 22:14:56vstinnersetmessages: + msg307346
2017-11-30 18:29:45nirssetpull_requests: + pull_request4564
2017-11-30 18:29:37nirssetkeywords: + patch
stage: patch review
pull_requests: + pull_request4563
2017-11-30 18:28:08nirssetnosy: + vstinner, benjamin.peterson, stutzbach
messages: + msg307336
2017-11-30 18:23:25nirscreate