This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Some fstat() calls do not release the GIL, possibly hanging all threads
Type: performance Stage: resolved
Components: Library (Lib) Versions: Python 3.8, Python 3.7, Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: ZackerySpytz, brett.cannon, eric.snow, josh.r, ncoghlan, nirs, pitrou, pmpp, rbcollins, serhiy.storchaka, twouters, vstinner, yselivanov
Priority: normal Keywords: patch

Created on 2018-03-07 23:03 by nirs, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
mmap_nfs_test.py nirs, 2018-03-16 23:57
mmap_size_nfs_test.py nirs, 2018-03-16 23:58
fdopen_nfs_test.py nirs, 2018-03-16 23:58
Pull Requests
URL Status Linked Edit
PR 6019 merged nirs, 2018-03-07 23:07
PR 6020 closed nirs, 2018-03-07 23:14
PR 6117 merged ZackerySpytz, 2018-03-14 19:07
PR 6159 merged miss-islington, 2018-03-20 18:51
PR 6160 merged miss-islington, 2018-03-20 18:51
Messages (17)
msg313407 - (view) Author: Nir Soffer (nirs) * Date: 2018-03-07 23:03
If the file descriptor is on a non-responsive NFS server, calling
fstat() can block for long time, hanging all threads. Most of the fstat()
calls release the GIL around the call, but some calls seems to be
forgotten.

In python 3, the calls are handled now by _py_fstat(), releasing the GIL
internally, but some calls use _py_fstat_noraise() which does not release
the GIL. Most of the calls to _py_fstat_noraise() release the GIL around
the call, except these 2 calls, affecting users of:

- mmap.mmap()
- os.urandom()
- random.seed()

In python there are more fstat() calls to fix, affecting users of:

- imp.load_dynamic()
- imp.load_source()
- mmap.mmap()
- mmapobject.size()
- os.fdopen()
- os.urandom()
- random.seed()
msg313409 - (view) Author: pmp-p (pmpp) * Date: 2018-03-07 23:21
is fstat thread safe ?
msg313417 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2018-03-08 02:24
fstat is async signal safe, and I suspect it's thread safe in general, though usage does rely on the file descriptor remaining valid. If the fd in question is closed in another thread after the GIL is released, fstat would fail; if a new file is opened and assigned the same fd, it would give erroneous results. But I don't think any of that would lead to fundamentally incorrect behavior (after all, you could do the same thing manually by caching an fd then closing the file).
msg313419 - (view) Author: pmp-p (pmpp) * Date: 2018-03-08 03:01
josh.r. i think you are right, i was worried if a nfs sillyrename is in progress, for eg a lock file ,then server hangs but thread lift the GIL and allow another thread to try to start to fstat the same path.
msg313434 - (view) Author: Nir Soffer (nirs) * Date: 2018-03-08 11:28
Python cannot protect raw file descriptor from bad multi-threaded
application. For example the application may close a file descriptor twice
which may lead to closing unrelated file descriptor created by another
thread just after it was closed, before the second close.

This issue affects all function using raw file descriptors, and we cannot
protect them with the GIL.

Even if fstat was not thread safe, we cannot protect it using the GIl
since this blocks the entire application until fstat returns.
msg313627 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-03-11 23:39
New changeset 4484f9dca9149da135bbae035f10a50d20d1cbbb by Antoine Pitrou (Nir Soffer) in branch 'master':
bpo-33021: Release the GIL during fstat() calls (GH-6019)
https://github.com/python/cpython/commit/4484f9dca9149da135bbae035f10a50d20d1cbbb
msg313836 - (view) Author: Zackery Spytz (ZackerySpytz) * (Python triager) Date: 2018-03-14 19:09
Commit 4484f9dca9149da135bbae035f10a50d20d1cbbb causes GCC 7.2.0 to emit a warning.

cpython/Modules/mmapmodule.c: In function ‘new_mmap_object’:
cpython/Modules/mmapmodule.c:1126:18: warning: ‘fstat_result’ may be used uninitialized in this function [-Wmaybe-uninitialized]
     if (fd != -1 && fstat_result == 0 && S_ISREG(status.st_mode)) {

A PR is attached.
msg313837 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-03-14 20:08
New changeset d6e140466142018ddbb7541185348be2b833a2ce by Antoine Pitrou (Zackery Spytz) in branch 'master':
bpo-33021: Fix GCC 7 warning (-Wmaybe-uninitialized) in mmapmodule.c (#6117)
https://github.com/python/cpython/commit/d6e140466142018ddbb7541185348be2b833a2ce
msg313839 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-03-14 20:10
With Benjamin we've decided that this wouldn't happen in 2.7.  Performance improvements don't warrant a backport.

Thank you Nir for reporting and posting the patches!
msg313980 - (view) Author: Nir Soffer (nirs) * Date: 2018-03-16 23:57
Antoine, thanks for fixing this on master! but I don't think this issue
can be closed yet.

First, the issue is not a performance but reliability. I probably made 
bad choice when I marked this as performance.

When you call mmap.mmap() in one thread, the entire process hangs for
an hour because the file descriptor is on a non-responsive NFS server.

With the fix, only the thread accessing the file descriptor is affected.
The rest of the system can function normally.

Second, the issue affects python 2.7, which is the production version on
many servers, and will be for many years e.g. on RHEL/CentOS 7. I think
it is important to fix this issue for these users.

Here is examples of the issue using reproducer scripts I uploaded to the
bug.

When mmap.mmap block, the entire process hangs. I unblocked the process from 
another shell by removing the iptables rule.

# python bpo-33021/mmap_nfs_test.py mnt dumbo.tlv.redhat.com
2018-03-17 01:17:57,846 - (MainThread) - Starting canary thread
2018-03-17 01:17:57,846 - (Canary) - Blocking access to storage
2018-03-17 01:17:57,857 - (Canary) - If this test is hang, please run: iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP
2018-03-17 01:17:57,857 - (Canary) - check 0
2018-03-17 01:17:58,858 - (Canary) - check 1
2018-03-17 01:17:59,858 - (Canary) - check 2
2018-03-17 01:18:00,859 - (Canary) - check 3
2018-03-17 01:18:01,859 - (Canary) - check 4
2018-03-17 01:18:02,859 - (Canary) - check 5
2018-03-17 01:18:03,860 - (Canary) - check 6
2018-03-17 01:18:04,860 - (Canary) - check 7
2018-03-17 01:18:05,861 - (Canary) - check 8
2018-03-17 01:18:06,861 - (Canary) - check 9
2018-03-17 01:18:07,862 - (Canary) - check 10
2018-03-17 01:18:07,868 - (MainThread) - Calling mmap.mmap

(I remove the iptables rule here)

2018-03-17 01:18:57,683 - (MainThread) - OK
2018-03-17 01:18:57,683 - (MainThread) - Done
2018-03-17 01:18:57,683 - (Canary) - check 11

When mmapobject.size() was called, the entire process was hang. I unblocked the
process from another shell by removing the iptables rule.

# python bpo-33021/mmap_size_nfs_test.py mnt dumbo.tlv.redhat.com
2018-03-17 01:22:17,991 - (MainThread) - Starting canary thread
2018-03-17 01:22:17,992 - (Canary) - Blocking access to storage
2018-03-17 01:22:18,001 - (Canary) - If this test is hang, please run: iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP
2018-03-17 01:22:18,001 - (Canary) - check 0
2018-03-17 01:22:19,002 - (Canary) - check 1
2018-03-17 01:22:20,002 - (Canary) - check 2
2018-03-17 01:22:21,002 - (Canary) - check 3
2018-03-17 01:22:22,003 - (Canary) - check 4
2018-03-17 01:22:23,003 - (Canary) - check 5
2018-03-17 01:22:24,004 - (Canary) - check 6
2018-03-17 01:22:25,004 - (Canary) - check 7
2018-03-17 01:22:26,004 - (Canary) - check 8
2018-03-17 01:22:27,005 - (Canary) - check 9
2018-03-17 01:22:28,005 - (MainThread) - Calling mmapobject.size

(I removed the ipatables rule here)

2018-03-17 01:23:38,701 - (MainThread) - OK
2018-03-17 01:23:38,701 - (MainThread) - Done
2018-03-17 01:23:38,701 - (Canary) - check 10

I found that os.fdopen issue does not affect RHEL/CentOS 7, because they
use python 2.7.5, and the issue was introduced in python 2.7.7, in:

commit 5c863bf93809cefeb4469512eadac291b7046051
Author: Benjamin Peterson <benjamin@python.org>
Date:   Mon Apr 14 19:45:46 2014 -0400

    when an exception is raised in fdopen, never close the fd (changing on my mind on #21191)

This issue affects Fedora (python 2.7.14) and probably other distros using
latest python 2.7.

Here is example run show how this affects Fedora 27:

# python fdopen_nfs_test.py mnt dumbo.tlv.redhat.com
2018-03-17 01:43:52,718 - (MainThread) - Starting canary thread
2018-03-17 01:43:52,718 - (Canary) - Blocking access to storage
2018-03-17 01:43:52,823 - (Canary) - If this test is hang, please run: iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP
2018-03-17 01:43:52,824 - (Canary) - check 0
2018-03-17 01:43:53,824 - (Canary) - check 1
2018-03-17 01:43:54,824 - (Canary) - check 2
2018-03-17 01:43:55,825 - (Canary) - check 3
2018-03-17 01:43:56,825 - (Canary) - check 4
2018-03-17 01:43:57,825 - (Canary) - check 5
2018-03-17 01:43:58,826 - (Canary) - check 6
2018-03-17 01:43:59,826 - (Canary) - check 7
2018-03-17 01:44:00,826 - (Canary) - check 8
2018-03-17 01:44:01,827 - (Canary) - check 9
2018-03-17 01:44:02,827 - (Canary) - check 10
2018-03-17 01:44:02,834 - (MainThread) - Calling os.fdopen

(remove iptbales rule, and force-unmount here) 

2018-03-17 01:50:25,853 - (MainThread) - OK
2018-03-17 01:50:25,854 - (Canary) - check 11
2018-03-17 01:50:25,895 - (MainThread) - Done
Traceback (most recent call last):
  File "fdopen_nfs_test.py", line 75, in <module>
    os.unlink(filename)
OSError: [Errno 2] No such file or directory: 'mnt/test'


So, I think we should:
- backport to 3.7, 3.6
- reconsider backport to 2.7, at least for mmap and os.fdopen.

I can prepare the backports and split the 2.7 patch if this helps.
msg313981 - (view) Author: Nir Soffer (nirs) * Date: 2018-03-16 23:58
Attaching reproducer for mmapobject.size()
msg313982 - (view) Author: Nir Soffer (nirs) * Date: 2018-03-16 23:58
Attaching reproducer for os.fdopen()
msg314024 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-03-18 06:04
Regarding 2.7: if folks want this fixed on RHEL/CentOS, then they need to talk to Red Hat about it, not the upstream CPython devs. I used to work there, and was told multiple times by Red Hat executives that none of their customers actually used Python, so the Python ecosystem wasn't of any strategic interest to them, and hence zero funding was available for full-time upstream CPython maintenance econtributions.

As such, I no longer consider it acceptable for anyone to ask community volunteers to compensate for Red Hat's abject negligence and executive incompetence.
msg314083 - (view) Author: Robert Collins (rbcollins) * (Python committer) Date: 2018-03-19 08:25
Re: backporting this. I think backporting to 3.6/3.7 makes a lot of sense - bugfix and prerelease respectively.

For 2.7, this bug has been around for ages, the patch is small, and I have no objection - but the RM has already said no, so I guess not happening :).

That said, if I was analyzing this from scratch I'd look at the pypi download stats to assess what footprint 2.7 still has, and whether taking this fix there would make objective sense.

While it is a genuine bug - and a nice catch - I have to say that running any Python with mmapped data off of NFS is a supremely bad idea :).
msg314158 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-03-20 18:48
Reopening for 3.x backports, as per Robert's advice.
msg314159 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-03-20 19:16
New changeset 56cce1ca84344f519f7ed01024434c083031d354 by Antoine Pitrou (Miss Islington (bot)) in branch '3.7':
[3.7] bpo-33021: Release the GIL during fstat() calls (GH-6019) (GH-6159)
https://github.com/python/cpython/commit/56cce1ca84344f519f7ed01024434c083031d354
msg314161 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-03-20 19:40
New changeset f3e6eadbcf4f3e0fe53f4784485b1c8464c7d282 by Antoine Pitrou (Miss Islington (bot)) in branch '3.6':
[3.6] bpo-33021: Release the GIL during fstat() calls (GH-6019) (GH-6160)
https://github.com/python/cpython/commit/f3e6eadbcf4f3e0fe53f4784485b1c8464c7d282
History
Date User Action Args
2022-04-11 14:58:58adminsetgithub: 77202
2018-03-20 19:40:42pitrousetstatus: open -> closed
stage: patch review -> resolved
2018-03-20 19:40:20pitrousetmessages: + msg314161
2018-03-20 19:16:33pitrousetmessages: + msg314159
2018-03-20 18:51:43miss-islingtonsetpull_requests: + pull_request5917
2018-03-20 18:51:07miss-islingtonsetstage: resolved -> patch review
pull_requests: + pull_request5916
2018-03-20 18:48:09pitrousetstatus: closed -> open

messages: + msg314158
versions: + Python 3.6, Python 3.7
2018-03-19 08:25:55rbcollinssetnosy: + rbcollins
messages: + msg314083
2018-03-18 06:04:33ncoghlansetmessages: + msg314024
2018-03-16 23:58:46nirssetfiles: + fdopen_nfs_test.py

messages: + msg313982
2018-03-16 23:58:26nirssetfiles: + mmap_size_nfs_test.py

messages: + msg313981
2018-03-16 23:57:41nirssetfiles: + mmap_nfs_test.py

messages: + msg313980
2018-03-14 20:10:30pitrousetstatus: open -> closed
versions: - Python 2.7, Python 3.6, Python 3.7
messages: + msg313839

resolution: fixed
stage: patch review -> resolved
2018-03-14 20:08:10pitrousetmessages: + msg313837
2018-03-14 19:09:47ZackerySpytzsetnosy: + ZackerySpytz
messages: + msg313836
2018-03-14 19:07:24ZackerySpytzsetpull_requests: + pull_request5880
2018-03-11 23:39:24pitrousetnosy: + pitrou
messages: + msg313627
2018-03-08 11:28:39nirssetmessages: + msg313434
2018-03-08 03:01:58pmppsetmessages: + msg313419
2018-03-08 02:24:30josh.rsetnosy: + josh.r
messages: + msg313417
2018-03-07 23:21:46pmppsetnosy: + pmpp
messages: + msg313409
2018-03-07 23:14:13nirssetpull_requests: + pull_request5787
2018-03-07 23:07:33nirssetkeywords: + patch
stage: patch review
pull_requests: + pull_request5786
2018-03-07 23:03:24nirscreate