Some fstat() calls do not release the GIL, possibly hanging all threads #77202

nirs · 2018-03-07T23:03:24Z

BPO	33021
Nosy	@Yhg1s, @brettcannon, @ncoghlan, @nirs, @pitrou, @vstinner, @rbtcollins, @pmp-p, @ericsnowcurrently, @serhiy-storchaka, @1st1, @MojoVampire, @ZackerySpytz
PRs	bpo-33021: Release the GIL during fstat() calls #6019 [2.7] bpo-33021: Release the GIL during fstat() calls (GH-6019) #6020 bpo-33021: Fix GCC 7 warning (-Wmaybe-uninitialized) in mmapmodule.c #6117 [3.7] bpo-33021: Release the GIL during fstat() calls (GH-6019) #6159 [3.6] bpo-33021: Release the GIL during fstat() calls (GH-6019) #6160
Files	mmap_nfs_test.py mmap_size_nfs_test.py fdopen_nfs_test.py

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2018-03-20.19:40:42.831>
created_at = <Date 2018-03-07.23:03:24.256>
labels = ['3.8', '3.7', 'library', 'performance']
title = 'Some fstat() calls do not release the GIL, possibly hanging all threads'
updated_at = <Date 2018-03-20.19:40:42.831>
user = 'https://github.com/nirs'

bugs.python.org fields:

activity = <Date 2018-03-20.19:40:42.831>
actor = 'pitrou'
assignee = 'none'
closed = True
closed_date = <Date 2018-03-20.19:40:42.831>
closer = 'pitrou'
components = ['Library (Lib)']
creation = <Date 2018-03-07.23:03:24.256>
creator = 'nirs'
dependencies = []
files = ['47490', '47491', '47492']
hgrepos = []
issue_num = 33021
keywords = ['patch']
message_count = 17.0
messages = ['313407', '313409', '313417', '313419', '313434', '313627', '313836', '313837', '313839', '313980', '313981', '313982', '314024', '314083', '314158', '314159', '314161']
nosy_count = 13.0
nosy_names = ['twouters', 'brett.cannon', 'ncoghlan', 'nirs', 'pitrou', 'vstinner', 'rbcollins', 'pmpp', 'eric.snow', 'serhiy.storchaka', 'yselivanov', 'josh.r', 'ZackerySpytz']
pr_nums = ['6019', '6020', '6117', '6159', '6160']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue33021'
versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

nirs · 2018-03-07T23:03:24Z

If the file descriptor is on a non-responsive NFS server, calling
fstat() can block for long time, hanging all threads. Most of the fstat()
calls release the GIL around the call, but some calls seems to be
forgotten.

In python 3, the calls are handled now by _py_fstat(), releasing the GIL
internally, but some calls use _py_fstat_noraise() which does not release
the GIL. Most of the calls to _py_fstat_noraise() release the GIL around
the call, except these 2 calls, affecting users of:

mmap.mmap()
os.urandom()
random.seed()

In python there are more fstat() calls to fix, affecting users of:

imp.load_dynamic()
imp.load_source()
mmap.mmap()
mmapobject.size()
os.fdopen()
os.urandom()
random.seed()

pmp-p · 2018-03-07T23:21:46Z

is fstat thread safe ?

MojoVampire · 2018-03-08T02:24:30Z

fstat is async signal safe, and I suspect it's thread safe in general, though usage does rely on the file descriptor remaining valid. If the fd in question is closed in another thread after the GIL is released, fstat would fail; if a new file is opened and assigned the same fd, it would give erroneous results. But I don't think any of that would lead to fundamentally incorrect behavior (after all, you could do the same thing manually by caching an fd then closing the file).

pmp-p · 2018-03-08T03:01:59Z

josh.r. i think you are right, i was worried if a nfs sillyrename is in progress, for eg a lock file ,then server hangs but thread lift the GIL and allow another thread to try to start to fstat the same path.

nirs · 2018-03-08T11:28:39Z

Python cannot protect raw file descriptor from bad multi-threaded
application. For example the application may close a file descriptor twice
which may lead to closing unrelated file descriptor created by another
thread just after it was closed, before the second close.

This issue affects all function using raw file descriptors, and we cannot
protect them with the GIL.

Even if fstat was not thread safe, we cannot protect it using the GIl
since this blocks the entire application until fstat returns.

pitrou · 2018-03-11T23:39:24Z

New changeset 4484f9d by Antoine Pitrou (Nir Soffer) in branch 'master':
bpo-33021: Release the GIL during fstat() calls (GH-6019)
4484f9d

ZackerySpytz · 2018-03-14T19:09:47Z

Commit 4484f9d causes GCC 7.2.0 to emit a warning.

cpython/Modules/mmapmodule.c: In function ‘new_mmap_object’:
cpython/Modules/mmapmodule.c:1126:18: warning: ‘fstat_result’ may be used uninitialized in this function [-Wmaybe-uninitialized]
     if (fd != -1 && fstat_result == 0 && S_ISREG(status.st_mode)) {

A PR is attached.

pitrou · 2018-03-14T20:08:10Z

New changeset d6e1404 by Antoine Pitrou (Zackery Spytz) in branch 'master':
bpo-33021: Fix GCC 7 warning (-Wmaybe-uninitialized) in mmapmodule.c (bpo-6117)
d6e1404

pitrou · 2018-03-14T20:10:31Z

With Benjamin we've decided that this wouldn't happen in 2.7. Performance improvements don't warrant a backport.

Thank you Nir for reporting and posting the patches!

nirs · 2018-03-16T23:57:42Z

Antoine, thanks for fixing this on master! but I don't think this issue
can be closed yet.

First, the issue is not a performance but reliability. I probably made
bad choice when I marked this as performance.

When you call mmap.mmap() in one thread, the entire process hangs for
an hour because the file descriptor is on a non-responsive NFS server.

With the fix, only the thread accessing the file descriptor is affected.
The rest of the system can function normally.

Second, the issue affects python 2.7, which is the production version on
many servers, and will be for many years e.g. on RHEL/CentOS 7. I think
it is important to fix this issue for these users.

Here is examples of the issue using reproducer scripts I uploaded to the
bug.

When mmap.mmap block, the entire process hangs. I unblocked the process from
another shell by removing the iptables rule.

# python bpo-33021/mmap_nfs_test.py mnt dumbo.tlv.redhat.com
2018-03-17 01:17:57,846 - (MainThread) - Starting canary thread
2018-03-17 01:17:57,846 - (Canary) - Blocking access to storage
2018-03-17 01:17:57,857 - (Canary) - If this test is hang, please run: iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP
2018-03-17 01:17:57,857 - (Canary) - check 0
2018-03-17 01:17:58,858 - (Canary) - check 1
2018-03-17 01:17:59,858 - (Canary) - check 2
2018-03-17 01:18:00,859 - (Canary) - check 3
2018-03-17 01:18:01,859 - (Canary) - check 4
2018-03-17 01:18:02,859 - (Canary) - check 5
2018-03-17 01:18:03,860 - (Canary) - check 6
2018-03-17 01:18:04,860 - (Canary) - check 7
2018-03-17 01:18:05,861 - (Canary) - check 8
2018-03-17 01:18:06,861 - (Canary) - check 9
2018-03-17 01:18:07,862 - (Canary) - check 10
2018-03-17 01:18:07,868 - (MainThread) - Calling mmap.mmap

(I remove the iptables rule here)

2018-03-17 01:18:57,683 - (MainThread) - OK
2018-03-17 01:18:57,683 - (MainThread) - Done
2018-03-17 01:18:57,683 - (Canary) - check 11

When mmapobject.size() was called, the entire process was hang. I unblocked the
process from another shell by removing the iptables rule.

# python bpo-33021/mmap_size_nfs_test.py mnt dumbo.tlv.redhat.com
2018-03-17 01:22:17,991 - (MainThread) - Starting canary thread
2018-03-17 01:22:17,992 - (Canary) - Blocking access to storage
2018-03-17 01:22:18,001 - (Canary) - If this test is hang, please run: iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP
2018-03-17 01:22:18,001 - (Canary) - check 0
2018-03-17 01:22:19,002 - (Canary) - check 1
2018-03-17 01:22:20,002 - (Canary) - check 2
2018-03-17 01:22:21,002 - (Canary) - check 3
2018-03-17 01:22:22,003 - (Canary) - check 4
2018-03-17 01:22:23,003 - (Canary) - check 5
2018-03-17 01:22:24,004 - (Canary) - check 6
2018-03-17 01:22:25,004 - (Canary) - check 7
2018-03-17 01:22:26,004 - (Canary) - check 8
2018-03-17 01:22:27,005 - (Canary) - check 9
2018-03-17 01:22:28,005 - (MainThread) - Calling mmapobject.size

(I removed the ipatables rule here)

2018-03-17 01:23:38,701 - (MainThread) - OK
2018-03-17 01:23:38,701 - (MainThread) - Done
2018-03-17 01:23:38,701 - (Canary) - check 10

I found that os.fdopen issue does not affect RHEL/CentOS 7, because they
use python 2.7.5, and the issue was introduced in python 2.7.7, in:

commit 5c863bf
Author: Benjamin Peterson <benjamin@python.org>
Date: Mon Apr 14 19:45:46 2014 -0400

when an exception is raised in fdopen, never close the fd (changing on my mind on bpo-21191)

This issue affects Fedora (python 2.7.14) and probably other distros using
latest python 2.7.

Here is example run show how this affects Fedora 27:

# python fdopen_nfs_test.py mnt dumbo.tlv.redhat.com
2018-03-17 01:43:52,718 - (MainThread) - Starting canary thread
2018-03-17 01:43:52,718 - (Canary) - Blocking access to storage
2018-03-17 01:43:52,823 - (Canary) - If this test is hang, please run: iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP
2018-03-17 01:43:52,824 - (Canary) - check 0
2018-03-17 01:43:53,824 - (Canary) - check 1
2018-03-17 01:43:54,824 - (Canary) - check 2
2018-03-17 01:43:55,825 - (Canary) - check 3
2018-03-17 01:43:56,825 - (Canary) - check 4
2018-03-17 01:43:57,825 - (Canary) - check 5
2018-03-17 01:43:58,826 - (Canary) - check 6
2018-03-17 01:43:59,826 - (Canary) - check 7
2018-03-17 01:44:00,826 - (Canary) - check 8
2018-03-17 01:44:01,827 - (Canary) - check 9
2018-03-17 01:44:02,827 - (Canary) - check 10
2018-03-17 01:44:02,834 - (MainThread) - Calling os.fdopen

(remove iptbales rule, and force-unmount here)

2018-03-17 01:50:25,853 - (MainThread) - OK
2018-03-17 01:50:25,854 - (Canary) - check 11
2018-03-17 01:50:25,895 - (MainThread) - Done
Traceback (most recent call last):
  File "fdopen_nfs_test.py", line 75, in <module>
    os.unlink(filename)
OSError: [Errno 2] No such file or directory: 'mnt/test'

So, I think we should:

backport to 3.7, 3.6
reconsider backport to 2.7, at least for mmap and os.fdopen.

I can prepare the backports and split the 2.7 patch if this helps.

nirs · 2018-03-16T23:58:26Z

Attaching reproducer for mmapobject.size()

nirs · 2018-03-16T23:58:46Z

Attaching reproducer for os.fdopen()

ncoghlan · 2018-03-18T06:04:33Z

Regarding 2.7: if folks want this fixed on RHEL/CentOS, then they need to talk to Red Hat about it, not the upstream CPython devs. I used to work there, and was told multiple times by Red Hat executives that none of their customers actually used Python, so the Python ecosystem wasn't of any strategic interest to them, and hence zero funding was available for full-time upstream CPython maintenance econtributions.

As such, I no longer consider it acceptable for anyone to ask community volunteers to compensate for Red Hat's abject negligence and executive incompetence.

rbtcollins · 2018-03-19T08:25:55Z

Re: backporting this. I think backporting to 3.6/3.7 makes a lot of sense - bugfix and prerelease respectively.

For 2.7, this bug has been around for ages, the patch is small, and I have no objection - but the RM has already said no, so I guess not happening :).

That said, if I was analyzing this from scratch I'd look at the pypi download stats to assess what footprint 2.7 still has, and whether taking this fix there would make objective sense.

While it is a genuine bug - and a nice catch - I have to say that running any Python with mmapped data off of NFS is a supremely bad idea :).

pitrou · 2018-03-20T18:48:09Z

Reopening for 3.x backports, as per Robert's advice.

pitrou · 2018-03-20T19:16:33Z

New changeset 56cce1c by Antoine Pitrou (Miss Islington (bot)) in branch '3.7':
[3.7] bpo-33021: Release the GIL during fstat() calls (GH-6019) (GH-6159)
56cce1c

pitrou · 2018-03-20T19:40:21Z

New changeset f3e6ead by Antoine Pitrou (Miss Islington (bot)) in branch '3.6':
[3.6] bpo-33021: Release the GIL during fstat() calls (GH-6019) (GH-6160)
f3e6ead

nirs mannequin added 3.7 (EOL) end of life 3.8 only security fixes stdlib Python modules in the Lib dir performance Performance or resource usage labels Mar 7, 2018

pitrou removed the 3.7 (EOL) end of life label Mar 14, 2018

pitrou closed this as completed Mar 14, 2018

pitrou added the 3.7 (EOL) end of life label Mar 20, 2018

pitrou reopened this Mar 20, 2018

pitrou closed this as completed Mar 20, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some fstat() calls do not release the GIL, possibly hanging all threads #77202

Some fstat() calls do not release the GIL, possibly hanging all threads #77202

nirs mannequin commented Mar 7, 2018

nirs mannequin commented Mar 7, 2018

pmp-p mannequin commented Mar 7, 2018

MojoVampire mannequin commented Mar 8, 2018

pmp-p mannequin commented Mar 8, 2018

nirs mannequin commented Mar 8, 2018

pitrou commented Mar 11, 2018

ZackerySpytz mannequin commented Mar 14, 2018

pitrou commented Mar 14, 2018

pitrou commented Mar 14, 2018

nirs mannequin commented Mar 16, 2018

nirs mannequin commented Mar 16, 2018

nirs mannequin commented Mar 16, 2018

ncoghlan commented Mar 18, 2018

rbtcollins commented Mar 19, 2018

pitrou commented Mar 20, 2018

pitrou commented Mar 20, 2018

pitrou commented Mar 20, 2018

Some fstat() calls do not release the GIL, possibly hanging all threads #77202

Some fstat() calls do not release the GIL, possibly hanging all threads #77202

Comments

nirs mannequin commented Mar 7, 2018

nirs mannequin commented Mar 7, 2018

pmp-p mannequin commented Mar 7, 2018

MojoVampire mannequin commented Mar 8, 2018

pmp-p mannequin commented Mar 8, 2018

nirs mannequin commented Mar 8, 2018

pitrou commented Mar 11, 2018

ZackerySpytz mannequin commented Mar 14, 2018

pitrou commented Mar 14, 2018

pitrou commented Mar 14, 2018

nirs mannequin commented Mar 16, 2018

nirs mannequin commented Mar 16, 2018

nirs mannequin commented Mar 16, 2018

ncoghlan commented Mar 18, 2018

rbtcollins commented Mar 19, 2018

pitrou commented Mar 20, 2018

pitrou commented Mar 20, 2018

pitrou commented Mar 20, 2018