classification
Title: [2.7] test_threading.ConditionTests.test_notify() hangs randomly on Python 2.7
Type: Stage: resolved
Components: Tests Versions: Python 3.7, Python 3.6, Python 3.5, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: kristjan.jonsson, pitrou, serhiy.storchaka, vstinner
Priority: normal Keywords:

Created on 2017-06-21 22:20 by vstinner, last changed 2017-06-26 23:04 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
notify_bug.py vstinner, 2017-06-21 23:17
Pull Requests
URL Status Linked Edit
PR 2320 merged vstinner, 2017-06-22 00:01
PR 2334 merged serhiy.storchaka, 2017-06-22 19:57
PR 2351 merged serhiy.storchaka, 2017-06-23 10:40
PR 2352 merged serhiy.storchaka, 2017-06-23 10:41
PR 2353 merged serhiy.storchaka, 2017-06-23 10:45
Messages (18)
msg296591 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-21 22:20
Example where the test hangs after at the 48th run:
----
[haypo@freebsd ~/prog/python/2.7]$ ./python -m test -F -m test.test_threading.ConditionTests.test_notify -v test_threading 
(...)
0:00:10 [ 47] test_threading
test_notify (test.test_threading.ConditionTests) ... ok
(...)
0:00:10 [ 48] test_threading
test_notify (test.test_threading.ConditionTests) ... 
----

It looks like a timing issue / race condition since the likehood of the bug seems to depend on the system load.

Traceback using my watchdog:
https://github.com/python/cpython/pull/2317

---
0:00:57 [285] test_threading
test_notify (test.test_threading.ConditionTests) ... *** STACKTRACE - START ***
# ThreadID: 34391913984
File: "/usr/home/haypo/prog/python/2.7/Lib/test/lock_tests.py", line 37, in task
  f()
File: "/usr/home/haypo/prog/python/2.7/Lib/test/lock_tests.py", line 360, in f
  cond.wait()
File: "/usr/home/haypo/prog/python/2.7/Lib/threading.py", line 340, in wait
  waiter.acquire()

# ThreadID: 34392086528
File: "/usr/home/haypo/prog/python/2.7/Lib/test/lock_tests.py", line 37, in task
  f()
File: "/usr/home/haypo/prog/python/2.7/Lib/test/lock_tests.py", line 360, in f
  cond.wait()
File: "/usr/home/haypo/prog/python/2.7/Lib/threading.py", line 340, in wait
  waiter.acquire()

# ThreadID: 34384994304
File: "/usr/home/haypo/prog/python/2.7/Lib/runpy.py", line 174, in _run_module_as_main
  "__main__", fname, loader, pkg_name)
File: "/usr/home/haypo/prog/python/2.7/Lib/runpy.py", line 72, in _run_code
  exec code in run_globals
File: "/usr/home/haypo/prog/python/2.7/Lib/test/__main__.py", line 3, in <module>
  regrtest.main_in_temp_cwd()
File: "/usr/home/haypo/prog/python/2.7/Lib/test/regrtest.py", line 2030, in main_in_temp_cwd
  main()
File: "/usr/home/haypo/prog/python/2.7/Lib/test/regrtest.py", line 812, in main
  result = local_runtest()
File: "/usr/home/haypo/prog/python/2.7/Lib/test/regrtest.py", line 798, in local_runtest
  testdir=testdir)
File: "/usr/home/haypo/prog/python/2.7/Lib/test/regrtest.py", line 1007, in runtest
  return runtest_inner(test, verbose, quiet, huntrleaks, pgo, testdir)
File: "/usr/home/haypo/prog/python/2.7/Lib/test/regrtest.py", line 1196, in runtest_inner
  indirect_test()
File: "/usr/home/haypo/prog/python/2.7/Lib/test/test_threading.py", line 929, in test_main
  ThreadingExceptionTests,
File: "/usr/home/haypo/prog/python/2.7/Lib/test/support/__init__.py", line 1571, in run_unittest
  _run_suite(suite)
File: "/usr/home/haypo/prog/python/2.7/Lib/test/support/__init__.py", line 1530, in _run_suite
  result = runner.run(suite)
File: "/usr/home/haypo/prog/python/2.7/Lib/unittest/runner.py", line 151, in run
  test(result)
File: "/usr/home/haypo/prog/python/2.7/Lib/unittest/suite.py", line 70, in __call__
  return self.run(*args, **kwds)
File: "/usr/home/haypo/prog/python/2.7/Lib/unittest/suite.py", line 108, in run
  test(result)
File: "/usr/home/haypo/prog/python/2.7/Lib/unittest/suite.py", line 70, in __call__
  return self.run(*args, **kwds)
File: "/usr/home/haypo/prog/python/2.7/Lib/unittest/suite.py", line 108, in run
  test(result)
File: "/usr/home/haypo/prog/python/2.7/Lib/unittest/case.py", line 393, in __call__
  return self.run(*args, **kwds)
File: "/usr/home/haypo/prog/python/2.7/Lib/unittest/case.py", line 329, in run
  testMethod()
File: "/usr/home/haypo/prog/python/2.7/Lib/test/lock_tests.py", line 401, in test_notify
  self._check_notify(cond)
File: "/usr/home/haypo/prog/python/2.7/Lib/test/lock_tests.py", line 384, in _check_notify
  _wait()
File: "/usr/home/haypo/prog/python/2.7/Lib/test/lock_tests.py", line 16, in _wait
  time.sleep(0.01)

# ThreadID: 34384996864
File: "/usr/home/haypo/prog/python/2.7/Lib/threading.py", line 774, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/home/haypo/prog/python/2.7/Lib/threading.py", line 801, in __bootstrap_inner
  self.run()
File: "/usr/home/haypo/prog/python/2.7/Lib/test/regrtest.py", line 1971, in run
  self.dump_threads()
File: "/usr/home/haypo/prog/python/2.7/Lib/test/regrtest.py", line 1997, in dump_threads
  self.dump_thread(stack)

# ThreadID: 34392083968
File: "/usr/home/haypo/prog/python/2.7/Lib/test/lock_tests.py", line 37, in task
  f()
File: "/usr/home/haypo/prog/python/2.7/Lib/test/lock_tests.py", line 360, in f
  cond.wait()
File: "/usr/home/haypo/prog/python/2.7/Lib/threading.py", line 340, in wait
  waiter.acquire()

# ThreadID: 34392080128
File: "/usr/home/haypo/prog/python/2.7/Lib/test/lock_tests.py", line 37, in task
  f()
File: "/usr/home/haypo/prog/python/2.7/Lib/test/lock_tests.py", line 360, in f
  cond.wait()
File: "/usr/home/haypo/prog/python/2.7/Lib/threading.py", line 340, in wait
  waiter.acquire()

*** STACKTRACE - END ***
---
msg296593 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-21 22:38
test.test_threading.ConditionTests.test_notify() tests the threading._Condition.

threading._Condition uses internally a threading._RLock.

threading._RLock uses an internal lock created with thread.allocate_lock().

thread.allocate_lock() calls internally PyThread_allocate_lock(). On my Linux (Fedora 25) and FreeBSD (FreeBSD 11 VM) machines, PyThread_allocate_lock() calls sem_init(): POSIX pthread semaphores.

In Python 3, RLock was implemented in C for speed (the new Python 3.0 io module was slow because of RLock performance): see bpo-3001.

I know at least one race condition in Python 2.7 RLock: bpo-13697, RLock bug with signals.
msg296594 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-21 22:40
Oh, I reproduced the bug on Linux as well by running "./python -m test -F -m test.test_threading.ConditionTests.test_notify -v test_threading" 6 times in different terminals: the test hangs in two terminal after 2508 and 2262 runs.
msg296598 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-21 23:17
notify_bug.py: script close totest.test_threading.ConditionTests.test_notify() but different:

* _wait() sleeps 1 nanosecond rather than 10 ms
* log many messages into stdout (file descriptor 1) using os.write()

It's quite easy to reproduce the bug with this script.
msg296601 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-21 23:29
See also bpo-8799 "Hang in lib/test/test_threading.py" and the commit 020af2a2bc4708215360a3793b5a1790e15d05dd which added two _wait() calls to ConditionTests._check_notify().
msg296604 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-21 23:53
The bug occurs when notify(n) is called while less than n threads are waiting in cond.wait(): when some threads didn't reach cond.wait() yet. The synchronization code is very fragile and depends on time. The issue was worked again in the master banch by adding more _wait() calls...

To really fix the race condition, reliable synchronization primitives should be used, like threading.Event.
msg296606 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-22 00:01
It's getting late here, so I chose to first backport the commit 020af2a2bc4708215360a3793b5a1790e15d05dd to Python 2.7, to reduce the likehood of this bug on 2.7 buildbots.
msg296621 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2017-06-22 11:23
My favorite topic :)

You could use threading.Barrier() which is designed to synchronize N threads doing this kind of lock-step processing.  The problem is that the Barrier() is implemented using Condition variables, so for unit-testing, Condition Variables, this presents a conundrum...
msg296622 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-06-22 11:39
Yes, the whole reason I wrote the tests like this is to avoid depending on Python synchronization primitives inside the tests for Python synchronization primitives.
msg296623 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-22 11:48
I still didn't see a code, but perhaps we could use a spin-lock. Every child thread increases the global counter, and the main thread waits until the counter equal to the number of child threads (sleeping for short time in a loop) before calling notify().
msg296653 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-22 20:05
Actually this approach already is used in this test. PR 2334 uses it to ensure all workers into cond.wait().
msg296695 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-23 10:36
New changeset 32cb968a2edd482891c33b6f2ebae10f1d305424 by Serhiy Storchaka in branch 'master':
bpo-30727: Fix a race condition in test_threading. (#2334)
https://github.com/python/cpython/commit/32cb968a2edd482891c33b6f2ebae10f1d305424
msg296698 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-23 10:52
New changeset c1d5345679eaa5fccd719b1c130140eecc8ba4c8 by Serhiy Storchaka in branch '3.6':
[3.6] bpo-30727: Fix a race condition in test_threading. (GH-2334) (#2351)
https://github.com/python/cpython/commit/c1d5345679eaa5fccd719b1c130140eecc8ba4c8
msg296699 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-23 10:52
New changeset e2aec8e691d8acb08373889d9af48a5b1d03b689 by Serhiy Storchaka in branch '3.5':
[3.5] bpo-30727: Fix a race condition in test_threading. (GH-2334) (#2352)
https://github.com/python/cpython/commit/e2aec8e691d8acb08373889d9af48a5b1d03b689
msg296700 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-23 11:13
New changeset 7709b4d57b433ef027a2e7e63b4cab3fc9ad910d by Serhiy Storchaka in branch '2.7':
[2.7] bpo-30727: Fix a race condition in test_threading. (GH-2334). (#2353)
https://github.com/python/cpython/commit/7709b4d57b433ef027a2e7e63b4cab3fc9ad910d
msg296701 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-23 11:17
In Python 3 the test was failed if decrease the sleep time in _wait() to 0.001. Patched test no longer fail for any small sleep intervals.
msg296706 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-23 12:53
> In Python 3 the test was failed if decrease the sleep time in _wait() to 0.001. Patched test no longer fail for any small sleep intervals.

I used sleep(1e-9), 1 nanosecond, for my tests :-)

I confirm that I'm unable to reproduce the bug on 2.7 and master
branches anymore. Well done Serhiy!
msg296968 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-26 23:04
Hum, my commit was not reported in the issue:

commit da6d305b6fcd49ba1224b1fd2131d7648a5be6b9
Author: Victor Stinner <victor.stinner@gmail.com>
Date:   Thu Jun 22 02:09:40 2017 +0200

    bpo-8799: Reduce timing sensitivity of condition test by explicitly (#2320)
    delaying the main thread so that it doesn't race ahead of the workers.
    (cherry picked from commit 020af2a2bc4708215360a3793b5a1790e15d05dd)
History
Date User Action Args
2017-06-26 23:04:03vstinnersetmessages: + msg296968
2017-06-23 12:53:29vstinnersetmessages: + msg296706
2017-06-23 11:17:09serhiy.storchakasetstatus: open -> closed
versions: + Python 3.5, Python 3.6, Python 3.7
messages: + msg296701

resolution: fixed
stage: resolved
2017-06-23 11:13:41serhiy.storchakasetmessages: + msg296700
2017-06-23 10:52:24serhiy.storchakasetmessages: + msg296699
2017-06-23 10:52:08serhiy.storchakasetmessages: + msg296698
2017-06-23 10:45:53serhiy.storchakasetpull_requests: + pull_request2399
2017-06-23 10:41:39serhiy.storchakasetpull_requests: + pull_request2398
2017-06-23 10:40:05serhiy.storchakasetpull_requests: + pull_request2397
2017-06-23 10:36:39serhiy.storchakasetmessages: + msg296695
2017-06-22 20:05:48serhiy.storchakasetmessages: + msg296653
2017-06-22 19:57:13serhiy.storchakasetpull_requests: + pull_request2378
2017-06-22 11:48:07serhiy.storchakasetmessages: + msg296623
2017-06-22 11:39:11pitrousetmessages: + msg296622
2017-06-22 11:23:27kristjan.jonssonsetnosy: + kristjan.jonsson
messages: + msg296621
2017-06-22 00:01:22vstinnersetmessages: + msg296606
pull_requests: + pull_request2372
2017-06-21 23:53:38vstinnersetmessages: + msg296604
2017-06-21 23:29:17vstinnersetmessages: + msg296601
2017-06-21 23:17:10vstinnersetfiles: + notify_bug.py

messages: + msg296598
2017-06-21 22:46:56vstinnersettitle: [2.7] test_threading.ConditionTests.test_notify() hangs randomly on FreeBSD on Python 2.7 -> [2.7] test_threading.ConditionTests.test_notify() hangs randomly on Python 2.7
2017-06-21 22:40:26vstinnersetmessages: + msg296594
2017-06-21 22:38:56vstinnersetnosy: + pitrou, serhiy.storchaka
2017-06-21 22:38:43vstinnersetmessages: + msg296593
2017-06-21 22:20:05vstinnercreate