classification
Title: [2.7] Race in PyThread_release_lock on macOS - can lead to memory corruption and deadlock
Type: behavior Stage: resolved
Components: Interpreter Core, macOS Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: benjamin.peterson Nosy List: arigo, beazley, benjamin.peterson, brian.curtin, dabeaz, flox, jyasskin, kristjan.jonsson, loewis, mariosoft, navytux, ned.deily, paul.moore, pitrou, python-dev, r.david.murray, rhettinger, ronaldoussoren, sbt, scoder, shihao, techtonik, tim.golden, tim.peters, torsten, vstinner, ysj.ray
Priority: normal Keywords: patch

Created on 2019-09-11 11:17 by navytux, last changed 2019-10-03 08:06 by navytux. This issue is now closed.

Files
File name Uploaded Description
pylock_bug.pyx navytux, 2019-09-11 11:17
Pull Requests
URL Status Linked
PR 16047 merged navytux, 2019-09-12 13:24
Messages (15)
msg351826 - (view) Author: Kirill Smelkov (navytux) * Date: 2019-09-11 11:17
Hello up there. I believe I've discovered a race in PyThread_release_lock on
Python2.7 that, on systems where POSIX semaphores are not available and Python
locks are implemented with mutexes and condition variables, can lead to MEMORY
CORRUPTION and DEADLOCK. The particular system I've discovered the bug on is
macOS Mojave 10.14.6.

The bug is already fixed on Python3 and the fix for Python2 is easy:

    git cherry-pick 187aa545165d

Thanks beforehand,
Kirill


Bug description

( Please see attached pylock_bug.pyx for the program that triggers the bug for real. )

On Darwin, even though it is a POSIX system, Python uses a
mutex + condition variable to implement its lock, and, as of 2019-08-28, the
Py2.7 implementation still contains a synchronization bug, even though a
similar issue was fixed for Py3 in 2012: the condition is signalled after the
mutex is unlocked, while the correct protocol is to signal the condition with
the mutex held:

  https://github.com/python/cpython/blob/v2.7.16-127-g0229b56d8c0/Python/thread_pthread.h#L486-L506
  https://github.com/python/cpython/commit/187aa545165d (py3 fix)

PyPy has the same bug for both pypy2 and pypy3:

  https://bitbucket.org/pypy/pypy/src/578667b3fef9/rpython/translator/c/src/thread_pthread.c#lines-443:465
  https://bitbucket.org/pypy/pypy/src/5b42890d48c3/rpython/translator/c/src/thread_pthread.c#lines-443:465

Signalling a condition outside of its corresponding mutex is allowed by
POSIX, but in the Python context it can lead to at least memory corruption
once the whole lifetime of a Python-level lock is considered. For example, the
following logical scenario:

      T1                                          T2

  sema = Lock()
  sema.acquire()

                                              sema.release()

  sema.acquire()
  free(sema)

  ...


can translate to the following C-level calls:

      T1                                          T2

  # sema = Lock()
  sema = malloc(...)
  sema.locked = 0
  pthread_mutex_init(&sema.mut)
  pthread_cond_init (&sema.lock_released)

  # sema.acquire()
  pthread_mutex_lock(&sema.mut)
  # sees sema.locked == 0
  sema.locked = 1
  pthread_mutex_unlock(&sema.mut)


                                              # sema.release()
                                              pthread_mutex_lock(&sema.mut)
                                              sema.locked = 0
                                              pthread_mutex_unlock(&sema.mut)

                      # OS scheduler gets in and relinquishes control from T2
                      # to another process
                                              ...

  # second sema.acquire()
  pthread_mutex_lock(&sema.mut)
  # sees sema.locked == 0
  sema.locked = 1
  pthread_mutex_unlock(&sema.mut)

  # free(sema)
  pthread_mutex_destroy(&sema.mut)
  pthread_cond_destroy (&sema.lock_released)
  free(sema)


  # ...
  e.g. malloc() which returns memory where sema was

                                              ...
                      # OS scheduler returns control to T2
                      # sema.release() continues
                      #
                      # BUT sema was already freed and writing to anywhere
                      # inside sema block CORRUPTS MEMORY. In particular if
                      # _another_ python-level lock was allocated where sema
                      # block was, writing into the memory can have effect on
                      # further synchronization correctness and in particular
                      # lead to deadlock on lock that was next allocated.
                                              pthread_cond_signal(&sema.lock_released)

Note that T2's pthread_cond_signal(&sema.lock_released) CORRUPTS MEMORY, as it
is called after the sema memory has already been freed and potentially
reallocated for another object.

The fix is to move pthread_cond_signal so that it is called with the
corresponding mutex held:

  # sema.release()
  pthread_mutex_lock(&sema.mut)
  sema.locked = 0
  pthread_cond_signal(&sema.lock_released)
  pthread_mutex_unlock(&sema.mut)

by cherry-picking commit 187aa545165d ("Signal condition variables with the
mutex held. Destroy condition variables before their mutexes").


Bug history

The bug has been present since 1994 - since at least [1]. It was discussed in
2001 with the original code author [2], but the code was still considered
race-free. In 2010 the question of where pthread_cond_signal should go -
before or after pthread_mutex_unlock - was discussed, with the rationale of
avoiding thread bouncing [3,4,5], and in 2012 pthread_cond_signal was moved
under the mutex, but only for CPython3 [6,7].

In 2019 the bug was (re-)discovered while testing Pygolang[8] on macOS with
CPython2 and PyPy2 and PyPy3.

[1] https://github.com/python/cpython/commit/2c8cb9f3d240
[2] https://bugs.python.org/issue433625
[3] https://bugs.python.org/issue8299#msg103224
[4] https://bugs.python.org/issue8410#msg103313
[5] https://bugs.python.org/issue8411#msg113301
[6] https://bugs.python.org/issue15038#msg163187
[7] https://github.com/python/cpython/commit/187aa545165d
[8] https://pypi.org/project/pygolang
msg351917 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2019-09-11 14:57
I may be wrong, but I believe that the bug requires using the C API (not just pure Python code).  This is because Python-level lock objects have their own lifetime, and should never be freed while another thread is in PyThread_release_lock() with them.

Nevertheless, the example shows that using this C API "correctly" is very hard.  Most direct users of the C API could run into the same problem in theory.
msg351943 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2019-09-11 15:30
Interesting.  Yet another reason to always do condition signalling with the lock held, as is good practice to avoid race conditions.  That's the whole point of condition variables.
msg351970 - (view) Author: Kirill Smelkov (navytux) * Date: 2019-09-11 16:39
Thanks for the feedback. Yes, since for a Python-level lock
PyThread_release_lock() is called with the GIL held:

https://github.com/python/cpython/blob/v2.7.16-129-g58d61efd4cd/Modules/threadmodule.c#L69-L82

the GIL effectively serves as the synchronization device between T2
releasing the lock and T1 proceeding after the second sema.acquire(), when it
gets to execute the Python-level `del sema`.

However

a) there is no sign that this aspect - that release must be called with the
   GIL held - is explicitly relied upon by the PyThread_release_lock() code, and

b) e.g. _testcapimodule.c already has a test which calls
   PyThread_release_lock() with the GIL released:

   https://github.com/python/cpython/blob/v2.7.16-129-g58d61efd4cd/Modules/_testcapimodule.c#L1972-L2053
   https://github.com/python/cpython/blob/v2.7.16-129-g58d61efd4cd/Modules/_testcapimodule.c#L1998-L2002

Thus I believe there is a bug in PyThread_release_lock(), and we were just
lucky not to hit it, thanks to the GIL and Python-level usage.

For reference, I indeed started to observe the problem when I moved the locks
and other code that implements channels in Pygolang from the Python to the C level:

https://lab.nexedi.com/kirr/pygolang/commit/69db91bf
https://lab.nexedi.com/kirr/pygolang/commit/3b241983?expand_all_diffs=1
msg351972 - (view) Author: Kirill Smelkov (navytux) * Date: 2019-09-11 16:41
And it is indeed better to always do pthread_cond_signal() with the mutex
held. Many pthread implementations defer the actual wakeup to the associated
mutex unlock, so there should be no performance penalty here, and correctness
is much easier to reason about.
msg352025 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2019-09-11 21:57
imho POSIX made a mistake in allowing signal/broadcast outside the mutex;
otherwise an implementation could rely on the mutex for its internal state
manipulation. I have my own fast condition-variable library implemented using
semaphores, and it is simple to do if one requires the mutex to be held for
the signal event.

Condition variable semantics are otherwise quite brilliant. For example,
allowing spurious wakeups to occur permits, again, a much simpler
implementation.
msg352054 - (view) Author: Kirill Smelkov (navytux) * Date: 2019-09-12 07:44
I agree it seems like a design mistake. Not only does it lead to suboptimal
implementations but, more importantly, it shifts misuse risks onto the user.
msg352058 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2019-09-12 08:52
I agree with your analysis.  I guess you (or someone) needs to write an explicit pull request, even if it just contains 187aa545165d cherry-picked.  (I'm not a core dev any more nowadays)
msg352159 - (view) Author: Kirill Smelkov (navytux) * Date: 2019-09-12 13:26
Ok, I did https://github.com/python/cpython/pull/16047.
msg353723 - (view) Author: Mariusz Wawrzonkowski (mariosoft) Date: 2019-10-02 07:20
I can confirm that I hit the deadlock for the reason described above. When I change the PyThread_release_lock source locally to signal before releasing the mutex, the deadlock no longer occurs. I have tested this on a heavy multithreading workload.
msg353820 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-03 07:06
New changeset c5abd63e94fcea10bdfa7a20842c6af3248bbd74 by Victor Stinner (Kirill Smelkov) in branch '2.7':
bpo-38106: Fix race in pthread PyThread_release_lock() (GH-16047)
https://github.com/python/cpython/commit/c5abd63e94fcea10bdfa7a20842c6af3248bbd74
msg353821 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-03 07:10
I merged Kirill Smelkov's fix. Thanks. I close the issue.

I'm surprised that we still find new bugs in this code, which is supposed to be battle-tested! Maybe recent Darwin changes made the bug more likely.

PyPy: it's now your turn to fix it ;-)
msg353824 - (view) Author: Kirill Smelkov (navytux) * Date: 2019-10-03 07:56
Victor, thanks for merging.

> I'm surprised that we still find new bugs in this code, which is supposed to
> be battle-tested! Maybe recent Darwin changes made the bug more likely.

As discussed above (https://bugs.python.org/issue38106#msg351917,
https://bugs.python.org/issue38106#msg351970), due to the GIL the bug cannot
be triggered from pure-Python code; it can be hit only by using the C API
directly. I indeed started to observe the problem after moving the locks and
other code that implements channels in Pygolang from the Python to the C level
(see "0.0.3" in https://pypi.org/project/pygolang/#pygolang-change-history).

The bug has been present since 1994 and, from my point of view, went
undiscovered because the locking functionality was not hammer-tested enough.
The bug also could not be explained without taking the lock's lifetime into
account: without create/destroy, the lock/unlock sequence alone is race-free.
https://bugs.python.org/issue433625 confirms that.

I cannot say whether recent macOS changes are relevant. My primary platform
is Linux and I only recently started to use macOS under QEMU for testing.
However, from my brief study of https://github.com/apple/darwin-libpthread
I believe the difference between macOS and Linux in scheduling related to
pthread condition variable signalling has been there for a long time.

> PyPy: it's now your turn to fix it ;-)

PyPy people fixed the bug the same day it was reported:

https://bitbucket.org/pypy/pypy/issues/3072

:)

Kirill

P.S. Mariusz, thanks also for your feedback.
msg353825 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-03 08:02
> PyPy people fixed the bug the same day it was reported: ...

Oh ok, so the CPython commit message is already outdated :-)

"PyPy has the same bug for both pypy2 and pypy3 (...)"
msg353826 - (view) Author: Kirill Smelkov (navytux) * Date: 2019-10-03 08:06
:) Yes and no. PyPy has not made a new release with the fix yet.
History
Date User Action Args
2019-10-03 08:06:54navytuxsetmessages: + msg353826
2019-10-03 08:02:19vstinnersetmessages: + msg353825
2019-10-03 07:56:11navytuxsetmessages: + msg353824
2019-10-03 07:10:00vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg353821

stage: patch review -> resolved
2019-10-03 07:06:57vstinnersetmessages: + msg353820
2019-10-02 07:20:26mariosoftsetnosy: + mariosoft
messages: + msg353723
2019-09-16 14:47:30terry.reedysetnosy: - terry.reedy
2019-09-16 08:10:18vstinnersettitle: Race in PyThread_release_lock - can lead to memory corruption and deadlock -> [2.7] Race in PyThread_release_lock on macOS - can lead to memory corruption and deadlock
2019-09-12 13:26:45navytuxsetmessages: + msg352159
2019-09-12 13:24:05navytuxsetkeywords: + patch
stage: patch review
pull_requests: + pull_request15669
2019-09-12 08:52:12arigosetmessages: + msg352058
2019-09-12 07:44:52navytuxsetmessages: + msg352054
2019-09-11 21:57:19kristjan.jonssonsetmessages: + msg352025
2019-09-11 16:41:30navytuxsetmessages: + msg351972
2019-09-11 16:39:46navytuxsetmessages: + msg351970
2019-09-11 15:30:36kristjan.jonssonsetmessages: + msg351943
2019-09-11 14:57:08arigosetmessages: + msg351917
2019-09-11 12:55:44gvanrossumsetassignee: benjamin.peterson
title: Race in PyThread_release_lock - can lead to MEMORY CORRUPTION and DEADLOCK -> Race in PyThread_release_lock - can lead to memory corruption and deadlock
nosy: - gvanrossum
2019-09-11 11:17:33navytuxcreate