io.FileIO hang all threads if fstat blocks on inaccessible NFS server #76367

Closed
nirs mannequin opened this issue Nov 30, 2017 · 9 comments
Labels
3.7 (EOL), extension-modules (C modules in the Modules dir), type-bug (an unexpected behavior, bug, or error)

Comments


nirs mannequin commented Nov 30, 2017

BPO 32186
Nosy @nirs, @vstinner, @benjaminp, @berkerpeksag
PRs
  • bpo-32186: Release the GIL during lseek and fstat #4652
  • [2.7] bpo-32186: Release the GIL during fstat and lseek calls #4651
  • [3.6] bpo-32186: Release the GIL during lseek and fstat (GH-4652) #4661
Files
  • fileio_nfs_test.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2017-12-09.07:16:34.897>
    created_at = <Date 2017-11-30.18:23:25.708>
    labels = ['extension-modules', 'type-bug', '3.7']
    title = 'io.FileIO hang all threads if fstat blocks on inaccessible NFS server'
    updated_at = <Date 2017-12-09.08:27:30.020>
    user = 'https://github.com/nirs'

    bugs.python.org fields:

    activity = <Date 2017-12-09.08:27:30.020>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-12-09.07:16:34.897>
    closer = 'berker.peksag'
    components = ['Extension Modules']
    creation = <Date 2017-11-30.18:23:25.708>
    creator = 'nirs'
    dependencies = []
    files = ['47309']
    hgrepos = []
    issue_num = 32186
    keywords = ['patch']
    message_count = 9.0
    messages = ['307335', '307336', '307346', '307354', '307365', '307821', '307822', '307882', '307883']
    nosy_count = 5.0
    nosy_names = ['nirs', 'vstinner', 'benjamin.peterson', 'stutzbach', 'berker.peksag']
    pr_nums = ['4652', '4651', '4661']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue32186'
    versions = ['Python 2.7', 'Python 3.6', 'Python 3.7']


    nirs mannequin commented Nov 30, 2017

    Using io.FileIO can hang all threads when it accesses a file on an
    inaccessible NFS server.

    To reproduce this, you need to open the file like this:

        fd = os.open(filename, ...)
        fio = io.FileIO(fd, "r+", closefd=True)

    Inside fileio_init, there is a checkfd call that invokes fstat without
    releasing the GIL. If fstat blocks, all threads hang.

    The expected behavior is to block only the thread waiting on the system
    call, so that the system stays responsive and can serve other tasks.
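    A minimal, self-contained sketch of this reproduction pattern (the open
    flags, mode, and path here are assumptions; the original snippet elides
    them):

        import io
        import os

        # Hypothetical path on a mount served by an unreachable NFS server.
        path = "mnt/fileio.out"

        # Open flags are an assumption; any fd on the blocked mount will do.
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)

        # Before the fix, fileio_init() calls fstat() on fd while holding the
        # GIL, so this constructor blocks every Python thread until the NFS
        # server responds again.
        fio = io.FileIO(fd, "r+", closefd=True)
        fio.close()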

    Here is the log showing this issue, created with the attached reproducer
    script (fileio_nfs_test.py).

    # python fileio_nfs_test.py mnt/fileio.out dumbo.tlv.redhat.com
    2017-11-30 18:41:49,159 - (MainThread) - pid=3436
    2017-11-30 18:41:49,159 - (MainThread) - Opening mnt/fileio.out
    2017-11-30 18:41:49,160 - (MainThread) - OK fd=3
    2017-11-30 18:41:49,161 - (MainThread) - Starting canary thread
    2017-11-30 18:41:49,161 - (Canary) - Blocking access to storage
    2017-11-30 18:41:49,169 - (Canary) - If this test is hang, please run: iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP
    2017-11-30 18:41:49,169 - (Canary) - check 0
    2017-11-30 18:41:49,169 - (MainThread) - Waiting until storage is blocked...
    2017-11-30 18:41:50,170 - (Canary) - check 1
    2017-11-30 18:41:51,170 - (Canary) - check 2
    2017-11-30 18:41:52,171 - (Canary) - check 3
    2017-11-30 18:41:53,171 - (Canary) - check 4
    2017-11-30 18:41:54,172 - (Canary) - check 5
    2017-11-30 18:41:55,172 - (Canary) - check 6
    2017-11-30 18:41:56,172 - (Canary) - check 7
    2017-11-30 18:41:57,173 - (Canary) - check 8
    2017-11-30 18:41:58,173 - (Canary) - check 9
    2017-11-30 18:41:59,174 - (MainThread) - Opening io.FileIO

    Everything is hung now!

    After some time I ran this from another shell:
    iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP

    Now the script is unblocked and finishes.

    2017-11-30 18:45:29,683 - (MainThread) - OK
    2017-11-30 18:45:29,684 - (MainThread) - Creating mmap
    2017-11-30 18:45:29,684 - (Canary) - check 10
    2017-11-30 18:45:29,684 - (MainThread) - OK
    2017-11-30 18:45:29,685 - (MainThread) - Filling mmap
    2017-11-30 18:45:29,685 - (MainThread) - OK
    2017-11-30 18:45:29,685 - (MainThread) - Writing mmap to storage
    2017-11-30 18:45:29,719 - (MainThread) - OK
    2017-11-30 18:45:29,719 - (MainThread) - Syncing
    2017-11-30 18:45:29,719 - (MainThread) - OK
    2017-11-30 18:45:29,720 - (MainThread) - Done

    We have a canary thread logging every second. Once we tried to open
    the FileIO, the canary thread stopped - this is possible only if the io
    extension module was holding the GIL during a blocking call.
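    A short sketch of such a canary thread (names and the log format are
    illustrative, not copied from fileio_nfs_test.py):

        import logging
        import threading
        import time

        logging.basicConfig(
            level=logging.DEBUG,
            format="%(asctime)s - (%(threadName)s) - %(message)s")

        def canary():
            # Log once per second; the stream pauses only if another thread
            # blocks while holding the GIL.
            n = 0
            while True:
                logging.debug("check %d", n)
                n += 1
                time.sleep(1)

        canary_thread = threading.Thread(target=canary, name="Canary")
        canary_thread.daemon = True
        canary_thread.start()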

    And here is the kernel backtrace of the hung process:

    # cat /proc/3436/stack
    [<ffffffffc03228e4>] rpc_wait_bit_killable+0x24/0xb0 [sunrpc]
    [<ffffffffc03240d4>] __rpc_execute+0x154/0x410 [sunrpc]
    [<ffffffffc03257b8>] rpc_execute+0x68/0xb0 [sunrpc]
    [<ffffffffc0315776>] rpc_run_task+0xf6/0x150 [sunrpc]
    [<ffffffffc07eb8c3>] nfs4_call_sync_sequence+0x63/0xa0 [nfsv4]
    [<ffffffffc07ec94c>] _nfs4_proc_getattr+0xcc/0xf0 [nfsv4]
    [<ffffffffc07f88f2>] nfs4_proc_getattr+0x72/0xf0 [nfsv4]
    [<ffffffffc07b543f>] __nfs_revalidate_inode+0xbf/0x310 [nfs]
    [<ffffffffc07b5a55>] nfs_getattr+0x95/0x250 [nfs]
    [<ffffffff81206146>] vfs_getattr+0x46/0x80
    [<ffffffff812061c5>] vfs_fstat+0x45/0x80
    [<ffffffff812068e4>] SYSC_newfstat+0x24/0x60
    [<ffffffff81206ade>] SyS_newfstat+0xe/0x10
    [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
    [<ffffffffffffffff>] 0xffffffffffffffff

    You cannot attach to the process with gdb, since it is in D state, but once
    the process is unblocked, gdb takes control, and we see:

    Thread 2 (Thread 0x7f97a2ea5700 (LWP 4799)):
    #0 0x00007f97ab925a0b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
    #1 0x00007f97ab925a9f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
    #2 0x00007f97ab925b3b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
    #3 0x00007f97abc455f5 in PyThread_acquire_lock () from /lib64/libpython2.7.so.1.0
    #4 0x00007f97abc11156 in PyEval_RestoreThread () from /lib64/libpython2.7.so.1.0
    #5 0x00007f97a44f9086 in time_sleep () from /usr/lib64/python2.7/lib-dynload/timemodule.so
    #6 0x00007f97abc18bb0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
    #7 0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
    #8 0x00007f97abc183fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
    #9 0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
    #10 0x00007f97abc183fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
    #11 0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
    #12 0x00007f97abba494d in function_call () from /lib64/libpython2.7.so.1.0
    #13 0x00007f97abb7f9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
    #14 0x00007f97abc135bd in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
    #15 0x00007f97abc1857d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
    #16 0x00007f97abc1857d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
    #17 0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
    #18 0x00007f97abba4858 in function_call () from /lib64/libpython2.7.so.1.0
    #19 0x00007f97abb7f9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
    #20 0x00007f97abb8e995 in instancemethod_call () from /lib64/libpython2.7.so.1.0
    #21 0x00007f97abb7f9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
    #22 0x00007f97abc117b7 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0
    #23 0x00007f97abc496e2 in t_bootstrap () from /lib64/libpython2.7.so.1.0
    #24 0x00007f97ab91fe25 in start_thread () from /lib64/libpthread.so.0
    #25 0x00007f97aaf4434d in clone () from /lib64/libc.so.6

    Thread 1 (Thread 0x7f97ac10d740 (LWP 4798)):
    #0 0x00007f97aaf35154 in _fxstat () from /lib64/libc.so.6
    #1 0x00007f97a470ad32 in fileio_init () from /usr/lib64/python2.7/lib-dynload/_io.so
    #2 0x00007f97abbd565f in type_call () from /lib64/libpython2.7.so.1.0
    #3 0x00007f97abb7f9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
    #4 0x00007f97abc140f6 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
    #5 0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
    #6 0x00007f97abc1b002 in PyEval_EvalCode () from /lib64/libpython2.7.so.1.0
    #7 0x00007f97abc3443f in run_mod () from /lib64/libpython2.7.so.1.0
    #8 0x00007f97abc355fe in PyRun_FileExFlags () from /lib64/libpython2.7.so.1.0
    #9 0x00007f97abc36889 in PyRun_SimpleFileExFlags () from /lib64/libpython2.7.so.1.0
    #10 0x00007f97abc47a3f in Py_Main () from /lib64/libpython2.7.so.1.0
    #11 0x00007f97aae6dc05 in __libc_start_main () from /lib64/libc.so.6
    #12 0x000000000040071e in _start ()

    Looking at the Python code, there are two helpers in fileio.c that call
    fstat without releasing the GIL, and both helpers are called from
    fileio_init (the implementation of io.FileIO()).

    Reported by an RHV user, see https://bugzilla.redhat.com/1518676
    (the bug may be private; sorry if you cannot access it).

    @nirs nirs mannequin added the 3.8, 3.7 (EOL), and extension-modules labels Nov 30, 2017

    nirs mannequin commented Nov 30, 2017

    Forgot to mention - this is reproducible with Python 2.7.

    Similar issues exist in Python 3, but I did not try to reproduce them
    since we are using Python 2.7.

    I posted patches for both 2.7 and master (see the PRs listed above).

    @vstinner (Member) commented:

    We already release the GIL when calling lseek() in fileio.c, in the portable_lseek() function. So it makes sense to also do it in _io_FileIO_readall_impl() in the same file. os.lseek() also releases the GIL.

    I found other functions that call lseek() without releasing the GIL:

    • the Windows implementation of new_mmap_object()
    • _Py_DisplaySourceLine()
    • fp_setreadl() of Parser/tokenizer.c

    I'm not sure that these 3 functions should be modified. In case of doubt, I prefer to not touch the code.


    vstinner commented Dec 1, 2017

    New changeset 6a89481 by Victor Stinner (Nir Soffer) in branch 'master':
    bpo-32186: Release the GIL during lseek and fstat (bpo-4652)
    6a89481


    vstinner commented Dec 1, 2017

    New changeset 8bcd410 by Victor Stinner (Miss Islington (bot)) in branch '3.6':
    bpo-32186: Release the GIL during lseek and fstat (GH-4652) (bpo-4661)
    8bcd410


    vstinner commented Dec 7, 2017

    New changeset 830daae by Victor Stinner (Nir Soffer) in branch '2.7':
    [2.7] bpo-32186: Release the GIL during fstat and lseek calls (bpo-4651)
    830daae


    vstinner commented Dec 7, 2017

    The bug has been fixed in Python 2.7, 3.6 and the master branch.

    Thank you Nir Soffer for the bug report and the fix!

    @vstinner vstinner removed the 3.8 label Dec 7, 2017
    @berkerpeksag (Member) commented:

    This has been fixed in all active branches (2.7, 3.6 and master) so I think we can close it as 'fixed'. Thanks, Nir!

    @berkerpeksag berkerpeksag added the type-bug label Dec 9, 2017

    vstinner commented Dec 9, 2017

    (Oops, closing this was the intent of my previous comment, but I forgot
    to do it. Thanks, Berker.)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022