Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some fstat() calls do not release the GIL, possibly hanging all threads #77202

Closed
nirs mannequin opened this issue Mar 7, 2018 · 17 comments
Closed

Some fstat() calls do not release the GIL, possibly hanging all threads #77202

nirs mannequin opened this issue Mar 7, 2018 · 17 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@nirs
Copy link
Mannequin

nirs mannequin commented Mar 7, 2018

BPO 33021
Nosy @Yhg1s, @brettcannon, @ncoghlan, @nirs, @pitrou, @vstinner, @rbtcollins, @pmp-p, @ericsnowcurrently, @serhiy-storchaka, @1st1, @MojoVampire, @ZackerySpytz
PRs
  • bpo-33021: Release the GIL during fstat() calls #6019
  • [2.7] bpo-33021: Release the GIL during fstat() calls (GH-6019) #6020
  • bpo-33021: Fix GCC 7 warning (-Wmaybe-uninitialized) in mmapmodule.c #6117
  • [3.7] bpo-33021: Release the GIL during fstat() calls (GH-6019) #6159
  • [3.6] bpo-33021: Release the GIL during fstat() calls (GH-6019) #6160
  • Files
  • mmap_nfs_test.py
  • mmap_size_nfs_test.py
  • fdopen_nfs_test.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-03-20.19:40:42.831>
    created_at = <Date 2018-03-07.23:03:24.256>
    labels = ['3.8', '3.7', 'library', 'performance']
    title = 'Some fstat() calls do not release the GIL, possibly hanging all threads'
    updated_at = <Date 2018-03-20.19:40:42.831>
    user = 'https://github.com/nirs'

    bugs.python.org fields:

    activity = <Date 2018-03-20.19:40:42.831>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-03-20.19:40:42.831>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2018-03-07.23:03:24.256>
    creator = 'nirs'
    dependencies = []
    files = ['47490', '47491', '47492']
    hgrepos = []
    issue_num = 33021
    keywords = ['patch']
    message_count = 17.0
    messages = ['313407', '313409', '313417', '313419', '313434', '313627', '313836', '313837', '313839', '313980', '313981', '313982', '314024', '314083', '314158', '314159', '314161']
    nosy_count = 13.0
    nosy_names = ['twouters', 'brett.cannon', 'ncoghlan', 'nirs', 'pitrou', 'vstinner', 'rbcollins', 'pmpp', 'eric.snow', 'serhiy.storchaka', 'yselivanov', 'josh.r', 'ZackerySpytz']
    pr_nums = ['6019', '6020', '6117', '6159', '6160']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue33021'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

    @nirs
    Copy link
    Mannequin Author

    nirs mannequin commented Mar 7, 2018

    If the file descriptor is on a non-responsive NFS server, calling
    fstat() can block for long time, hanging all threads. Most of the fstat()
    calls release the GIL around the call, but some calls seems to be
    forgotten.

    In python 3, the calls are handled now by _py_fstat(), releasing the GIL
    internally, but some calls use _py_fstat_noraise() which does not release
    the GIL. Most of the calls to _py_fstat_noraise() release the GIL around
    the call, except these 2 calls, affecting users of:

    • mmap.mmap()
    • os.urandom()
    • random.seed()

    In python there are more fstat() calls to fix, affecting users of:

    • imp.load_dynamic()
    • imp.load_source()
    • mmap.mmap()
    • mmapobject.size()
    • os.fdopen()
    • os.urandom()
    • random.seed()

    @nirs nirs mannequin added 3.7 (EOL) end of life 3.8 only security fixes stdlib Python modules in the Lib dir performance Performance or resource usage labels Mar 7, 2018
    @pmp-p
    Copy link
    Mannequin

    pmp-p mannequin commented Mar 7, 2018

    is fstat thread safe ?

    @MojoVampire
    Copy link
    Mannequin

    MojoVampire mannequin commented Mar 8, 2018

    fstat is async signal safe, and I suspect it's thread safe in general, though usage does rely on the file descriptor remaining valid. If the fd in question is closed in another thread after the GIL is released, fstat would fail; if a new file is opened and assigned the same fd, it would give erroneous results. But I don't think any of that would lead to fundamentally incorrect behavior (after all, you could do the same thing manually by caching an fd then closing the file).

    @pmp-p
    Copy link
    Mannequin

    pmp-p mannequin commented Mar 8, 2018

    josh.r. i think you are right, i was worried if a nfs sillyrename is in progress, for eg a lock file ,then server hangs but thread lift the GIL and allow another thread to try to start to fstat the same path.

    @nirs
    Copy link
    Mannequin Author

    nirs mannequin commented Mar 8, 2018

    Python cannot protect raw file descriptor from bad multi-threaded
    application. For example the application may close a file descriptor twice
    which may lead to closing unrelated file descriptor created by another
    thread just after it was closed, before the second close.

    This issue affects all function using raw file descriptors, and we cannot
    protect them with the GIL.

    Even if fstat was not thread safe, we cannot protect it using the GIl
    since this blocks the entire application until fstat returns.

    @pitrou
    Copy link
    Member

    pitrou commented Mar 11, 2018

    New changeset 4484f9d by Antoine Pitrou (Nir Soffer) in branch 'master':
    bpo-33021: Release the GIL during fstat() calls (GH-6019)
    4484f9d

    @ZackerySpytz
    Copy link
    Mannequin

    ZackerySpytz mannequin commented Mar 14, 2018

    Commit 4484f9d causes GCC 7.2.0 to emit a warning.

    cpython/Modules/mmapmodule.c: In functionnew_mmap_object’:
    cpython/Modules/mmapmodule.c:1126:18: warning: ‘fstat_resultmay be used uninitialized in this function [-Wmaybe-uninitialized]
         if (fd != -1 && fstat_result == 0 && S_ISREG(status.st_mode)) {

    A PR is attached.

    @pitrou
    Copy link
    Member

    pitrou commented Mar 14, 2018

    New changeset d6e1404 by Antoine Pitrou (Zackery Spytz) in branch 'master':
    bpo-33021: Fix GCC 7 warning (-Wmaybe-uninitialized) in mmapmodule.c (bpo-6117)
    d6e1404

    @pitrou
    Copy link
    Member

    pitrou commented Mar 14, 2018

    With Benjamin we've decided that this wouldn't happen in 2.7. Performance improvements don't warrant a backport.

    Thank you Nir for reporting and posting the patches!

    @pitrou pitrou removed the 3.7 (EOL) end of life label Mar 14, 2018
    @pitrou pitrou closed this as completed Mar 14, 2018
    @nirs
    Copy link
    Mannequin Author

    nirs mannequin commented Mar 16, 2018

    Antoine, thanks for fixing this on master! but I don't think this issue
    can be closed yet.

    First, the issue is not a performance but reliability. I probably made
    bad choice when I marked this as performance.

    When you call mmap.mmap() in one thread, the entire process hangs for
    an hour because the file descriptor is on a non-responsive NFS server.

    With the fix, only the thread accessing the file descriptor is affected.
    The rest of the system can function normally.

    Second, the issue affects python 2.7, which is the production version on
    many servers, and will be for many years e.g. on RHEL/CentOS 7. I think
    it is important to fix this issue for these users.

    Here is examples of the issue using reproducer scripts I uploaded to the
    bug.

    When mmap.mmap block, the entire process hangs. I unblocked the process from
    another shell by removing the iptables rule.

    # python bpo-33021/mmap_nfs_test.py mnt dumbo.tlv.redhat.com
    2018-03-17 01:17:57,846 - (MainThread) - Starting canary thread
    2018-03-17 01:17:57,846 - (Canary) - Blocking access to storage
    2018-03-17 01:17:57,857 - (Canary) - If this test is hang, please run: iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP
    2018-03-17 01:17:57,857 - (Canary) - check 0
    2018-03-17 01:17:58,858 - (Canary) - check 1
    2018-03-17 01:17:59,858 - (Canary) - check 2
    2018-03-17 01:18:00,859 - (Canary) - check 3
    2018-03-17 01:18:01,859 - (Canary) - check 4
    2018-03-17 01:18:02,859 - (Canary) - check 5
    2018-03-17 01:18:03,860 - (Canary) - check 6
    2018-03-17 01:18:04,860 - (Canary) - check 7
    2018-03-17 01:18:05,861 - (Canary) - check 8
    2018-03-17 01:18:06,861 - (Canary) - check 9
    2018-03-17 01:18:07,862 - (Canary) - check 10
    2018-03-17 01:18:07,868 - (MainThread) - Calling mmap.mmap

    (I remove the iptables rule here)

    2018-03-17 01:18:57,683 - (MainThread) - OK
    2018-03-17 01:18:57,683 - (MainThread) - Done
    2018-03-17 01:18:57,683 - (Canary) - check 11

    When mmapobject.size() was called, the entire process was hang. I unblocked the
    process from another shell by removing the iptables rule.

    # python bpo-33021/mmap_size_nfs_test.py mnt dumbo.tlv.redhat.com
    2018-03-17 01:22:17,991 - (MainThread) - Starting canary thread
    2018-03-17 01:22:17,992 - (Canary) - Blocking access to storage
    2018-03-17 01:22:18,001 - (Canary) - If this test is hang, please run: iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP
    2018-03-17 01:22:18,001 - (Canary) - check 0
    2018-03-17 01:22:19,002 - (Canary) - check 1
    2018-03-17 01:22:20,002 - (Canary) - check 2
    2018-03-17 01:22:21,002 - (Canary) - check 3
    2018-03-17 01:22:22,003 - (Canary) - check 4
    2018-03-17 01:22:23,003 - (Canary) - check 5
    2018-03-17 01:22:24,004 - (Canary) - check 6
    2018-03-17 01:22:25,004 - (Canary) - check 7
    2018-03-17 01:22:26,004 - (Canary) - check 8
    2018-03-17 01:22:27,005 - (Canary) - check 9
    2018-03-17 01:22:28,005 - (MainThread) - Calling mmapobject.size

    (I removed the ipatables rule here)

    2018-03-17 01:23:38,701 - (MainThread) - OK
    2018-03-17 01:23:38,701 - (MainThread) - Done
    2018-03-17 01:23:38,701 - (Canary) - check 10

    I found that os.fdopen issue does not affect RHEL/CentOS 7, because they
    use python 2.7.5, and the issue was introduced in python 2.7.7, in:

    commit 5c863bf
    Author: Benjamin Peterson <benjamin@python.org>
    Date: Mon Apr 14 19:45:46 2014 -0400

    when an exception is raised in fdopen, never close the fd (changing on my mind on bpo-21191)
    

    This issue affects Fedora (python 2.7.14) and probably other distros using
    latest python 2.7.

    Here is example run show how this affects Fedora 27:

    # python fdopen_nfs_test.py mnt dumbo.tlv.redhat.com
    2018-03-17 01:43:52,718 - (MainThread) - Starting canary thread
    2018-03-17 01:43:52,718 - (Canary) - Blocking access to storage
    2018-03-17 01:43:52,823 - (Canary) - If this test is hang, please run: iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP
    2018-03-17 01:43:52,824 - (Canary) - check 0
    2018-03-17 01:43:53,824 - (Canary) - check 1
    2018-03-17 01:43:54,824 - (Canary) - check 2
    2018-03-17 01:43:55,825 - (Canary) - check 3
    2018-03-17 01:43:56,825 - (Canary) - check 4
    2018-03-17 01:43:57,825 - (Canary) - check 5
    2018-03-17 01:43:58,826 - (Canary) - check 6
    2018-03-17 01:43:59,826 - (Canary) - check 7
    2018-03-17 01:44:00,826 - (Canary) - check 8
    2018-03-17 01:44:01,827 - (Canary) - check 9
    2018-03-17 01:44:02,827 - (Canary) - check 10
    2018-03-17 01:44:02,834 - (MainThread) - Calling os.fdopen

    (remove iptbales rule, and force-unmount here)

    2018-03-17 01:50:25,853 - (MainThread) - OK
    2018-03-17 01:50:25,854 - (Canary) - check 11
    2018-03-17 01:50:25,895 - (MainThread) - Done
    Traceback (most recent call last):
      File "fdopen_nfs_test.py", line 75, in <module>
        os.unlink(filename)
    OSError: [Errno 2] No such file or directory: 'mnt/test'

    So, I think we should:

    • backport to 3.7, 3.6
    • reconsider backport to 2.7, at least for mmap and os.fdopen.

    I can prepare the backports and split the 2.7 patch if this helps.

    @nirs
    Copy link
    Mannequin Author

    nirs mannequin commented Mar 16, 2018

    Attaching reproducer for mmapobject.size()

    @nirs
    Copy link
    Mannequin Author

    nirs mannequin commented Mar 16, 2018

    Attaching reproducer for os.fdopen()

    @ncoghlan
    Copy link
    Contributor

    Regarding 2.7: if folks want this fixed on RHEL/CentOS, then they need to talk to Red Hat about it, not the upstream CPython devs. I used to work there, and was told multiple times by Red Hat executives that none of their customers actually used Python, so the Python ecosystem wasn't of any strategic interest to them, and hence zero funding was available for full-time upstream CPython maintenance econtributions.

    As such, I no longer consider it acceptable for anyone to ask community volunteers to compensate for Red Hat's abject negligence and executive incompetence.

    @rbtcollins
    Copy link
    Member

    Re: backporting this. I think backporting to 3.6/3.7 makes a lot of sense - bugfix and prerelease respectively.

    For 2.7, this bug has been around for ages, the patch is small, and I have no objection - but the RM has already said no, so I guess not happening :).

    That said, if I was analyzing this from scratch I'd look at the pypi download stats to assess what footprint 2.7 still has, and whether taking this fix there would make objective sense.

    While it is a genuine bug - and a nice catch - I have to say that running any Python with mmapped data off of NFS is a supremely bad idea :).

    @pitrou
    Copy link
    Member

    pitrou commented Mar 20, 2018

    Reopening for 3.x backports, as per Robert's advice.

    @pitrou pitrou added the 3.7 (EOL) end of life label Mar 20, 2018
    @pitrou pitrou reopened this Mar 20, 2018
    @pitrou
    Copy link
    Member

    pitrou commented Mar 20, 2018

    New changeset 56cce1c by Antoine Pitrou (Miss Islington (bot)) in branch '3.7':
    [3.7] bpo-33021: Release the GIL during fstat() calls (GH-6019) (GH-6159)
    56cce1c

    @pitrou
    Copy link
    Member

    pitrou commented Mar 20, 2018

    New changeset f3e6ead by Antoine Pitrou (Miss Islington (bot)) in branch '3.6':
    [3.6] bpo-33021: Release the GIL during fstat() calls (GH-6019) (GH-6160)
    f3e6ead

    @pitrou pitrou closed this as completed Mar 20, 2018
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes performance Performance or resource usage stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants