Hang in lib/test/test_threading.py #53045

kristjanvalur · 2010-05-24T09:44:16Z

BPO	8799
Nosy	@csernazs, @pitrou, @kristjanvalur, @vstinner
PRs	[2.7] bpo-8799: Reduce timing sensitivity of condition test by explicitly #2320
Files	locktest.patch lock_tests.diff tpconditiondoc.patch lock_tests.patch: updated ro

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2018-07-16.22:25:49.874>
created_at = <Date 2010-05-24.09:44:15.789>
labels = ['type-bug', 'library']
title = 'Hang in lib/test/test_threading.py'
updated_at = <Date 2018-07-16.22:25:49.873>
user = 'https://github.com/kristjanvalur'

bugs.python.org fields:

activity = <Date 2018-07-16.22:25:49.873>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2018-07-16.22:25:49.874>
closer = 'vstinner'
components = ['Library (Lib)']
creation = <Date 2010-05-24.09:44:15.789>
creator = 'kristjan.jonsson'
dependencies = []
files = ['17447', '25161', '25170', '32541']
hgrepos = []
issue_num = 8799
keywords = ['patch']
message_count = 28.0
messages = ['106349', '106355', '106360', '157844', '157847', '157942', '157944', '157968', '157972', '157973', '157976', '157980', '157984', '157987', '158013', '158017', '158019', '158020', '158022', '158023', '158062', '158076', '202319', '202420', '202614', '296610', '296611', '321776']
nosy_count = 7.0
nosy_names = ['csernazs', 'pitrou', 'kristjan.jonsson', 'vstinner', 'jyasskin', 'neologix', 'python-dev']
pr_nums = ['2320']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue8799'
versions = ['Python 3.3']

kristjanvalur · 2010-05-24T09:44:14Z

The tests for the ConditionVariable a fragile. On a slow computer, or when running a DEBUG build, the main thread can issue a condition.notify() call, even though the worker threads have not settled in on a wait() call. This causes the test to stall.
The provided patch "fixes" this by adding a short _wait() before the notify.

pitrou · 2010-05-24T11:43:21Z

Strange, I have never noticed this (even on the buildbots).

kristjanvalur · 2010-05-24T12:11:45Z

Well, I saw this occasionally on my dual core laptop when working on the RWLock. A notify() before a target does a wait() is ineffective and it would wait indefinitely after the notify(5) call. This isn't a "proper" fix (I had another, more complex, where we would keep a counter for the number of successful "waiters" and wait for that to reach the target value) but it is in the spirit of many of these tests, where we rely on _wait() to be sufficient to reach a certain state.

kristjanvalur · 2012-04-09T13:29:01Z

Here is a new patch.

I´ve simplified and relaxed test_notify() for Condition objects. Condition variables don't guarantee that there won't be any spurious wakeups so the test must maintain internal bookkeeping so that it doesn't break with a different implementation of Condition.
The main thread now properly waits for the workers to "settle" in the test.
I've added two generic tests of Condition objects in their "natural environment" as building blocks for bigger things, a Queue and a Lock.

pitrou · 2012-04-09T15:01:45Z

Condition variables don't guarantee that there won't be any spurious
wakeups

What do you mean? The implementation doesn't seem prone to spurious wakeups, and the docs don't say so either.

I've added two generic tests of Condition objects in their "natural
environment" as building blocks for bigger things, a Queue and a Lock.

Why would you build a Lock out of a Condition?

kristjanvalur · 2012-04-10T10:05:20Z

We shouldn't be testing implementation details. The Condition variables are canonically prone to "spurious wakeups" and "stolen wakeups". The fact that the current implementation doesn't have them shouldn't place that burden on future implementations of threading.Condition on this platform.
From the docs: "Note: Condition variables can be, depending on the implementation, subject to both spurious wakeups (when wait() returns without a notify() call) and stolen wakeups (when another thread acquires the lock before the awoken thread.) For this reason, it is always necessary to verify the state the thread is waiting for when wait() returns and optionally repeat the call as often as necessary."

You would build a lock yourself if you wanted exact control over its behaviour. In this case, we are just doing what threading.Lock() does, but it could be more complicated. At any rate, it proves a useful toy usage of a condition variable in the context of making sure that it works properly as part of the unit tests.

I could leave the original test in place and just add the new ones, but the old test is at least provably a subject to a race condition as per the original issue.

pitrou · 2012-04-10T10:38:59Z

The Condition variables are canonically prone to "spurious wakeups"
and "stolen wakeups".

No, they aren't. Just because POSIX says they are doesn't mean *our*
condition variables are the same. Spurious wakeups are an annoyance, and
our implementation AFAICT never exhibited them.

From the docs: "Note: Condition variables can be, depending on the
implementation, subject to both spurious wakeups (when wait() returns
without a notify() call) and stolen wakeups (when another thread
acquires the lock before the awoken thread.) For this reason, it is
always necessary to verify the state the thread is waiting for when
wait() returns and optionally repeat the call as often as necessary."

Ah, thanks, indeed. Except that...
this was added by yourself in 483bbebc57bf, after bpo-10260, but
*without* being part of the original patch that you uploaded on that
issue.
So this never got reviewed and was instead sneaked in the docs in a
commit of yours.
Unless other people disagree, I think this addition should be reverted.

kristjanvalur · 2012-04-10T17:19:39Z

Stolen wakeups are a fact of life though, even in cpython's implementation of condition variables. And for the user of a condition variable the difference is so subtle as to not warrant special mention.

This is, btw, not just a posix thing. It is same on windows, java, and in fact, comes from the original definition of a condition variable as part of the "monitor" pattern. Condition variables are defined have this caveat to allow for flexibility in their implementation.
Look anywhere in the internets and you will find this, including here:
http://www.cs.berkeley.edu/~kubitron/courses/cs162/hand-outs/synch.html

The canonical and correct way to use a condition variable is to repeatedly wait() and verify that the wakeup condition holds. The wait_for() method abstracts this into a convenient helper. There is no need to befuddle the issue by special casing the python version as being guaranteed to not have suprious wakeups, but otherwise do have the stolen thread problem. Quite the contrary, we should be careful to never overspecify our interfaces so that we may not change their ipmlementation details later.

A good summary of the situation and the reason for the specification can be found here:
http://stackoverflow.com/questions/8594591/why-does-pthread-cond-wait-have-spurious-wakeups

The comment I added to the documentation clarifies the reason why other examples in the docs were using while loops and teaches shows people how to use these constructs in a robust manner. Removing it is a bad idea and I think you will find others will agree. Take it up on python-dev if you want.

Unless you disagree, I'll add these test cases in addition to the broken one, which can then be fixed or removed later if it turns out to be problematic to other people.

neologix · 2012-04-10T18:01:50Z

> The Condition variables are canonically prone to "spurious wakeups"
> and "stolen wakeups".

No, they aren't. Just because POSIX says they are doesn't mean *our*
condition variables are the same. Spurious wakeups are an annoyance, and
our implementation AFAICT never exhibited them.

I agree with Antoine.
Spurious wakeups are an implementation issue (mainly has to do with
efficiency on SMP systems and interrupted syscalls), but are
definitely not a part of the "standard" API. Since our implementation
doesn't exhibit this problem, there's no reason to scare and confuse
people.
For example, our locks are special in the sense that they can be
released by a thread other than the one that acquired it, etc, and
this behavior is documented.

However, while I think the spurious wakeups reference could be
removed, I think that the advice of repeatedly chcking the invariant
shoud be kept, because it is always desirable (stolen wakeup, and
better logic).

pitrou · 2012-04-10T18:03:33Z

Stolen wakeups are a fact of life though, even in cpython's
implementation of condition variables. And for the user of a
condition variable the difference is so subtle as to not warrant
special mention.

I don't think it's a subtle difference. Being woken up too late is not
the same as being woken up spuriously.
Also, the term "stolen wakeup" makes it unnecessarily dramatic. There is
no "stealing" involved here.

Look anywhere in the internets and you will find this, including here:
http://www.cs.berkeley.edu/~kubitron/courses/cs162/hand-outs/synch.html

Thanks, will take a look.

Quite the contrary, we should be careful to never overspecify our
interfaces so that we may not change their ipmlementation details
later.

The interface is already, de facto, specified by its one and only
implementation, which has been in use for a long time. It's also
specified by its documentation, before it was silently modified by you.

There's probably code out there that relies on wait() not returning
prematurately.

A good summary of the situation and the reason for the specification
can be found here:
http://stackoverflow.com/questions/8594591/why-does-pthread-cond-wait-have-spurious-wakeups

So the reason appears to be:

    “Spurious wakeups may sound strange, but on some multiprocessor
    systems, making condition wakeup completely predictable might
    substantially slow all condition variable operations.”

Python is a high-level language, so making condition variable ops
"substantially" slower is not a blocking issue for us.

What follows is even less enlightening:

    “The intent was to force correct/robust code by requiring
    predicate loops. This was  driven by the provably correct
    academic contingent among the "core threadies" in the working
    group, though I don't think anyone really disagreed with the
    intent once they understood what it meant.”

The comment I added to the documentation clarifies the reason why
other examples in the docs were using while loops and teaches shows
people how to use these constructs in a robust manner.

Teaching people to use loops is fine. But there's no need to mislead
them with inaccurate descriptions of the implementation's behaviour.

Unless you disagree, I'll add these test cases in addition to the
broken one, which can then be fixed or removed later if it turns out
to be problematic to other people.

Really, can you stop being aggressive? You haven't proved that test is
wrong, and it passes consistently fine on all buildbots.

If you don't agree with what is being tested, that's another matter. But
please accept that the test is ok w.r.t. Condition's specified
behaviour.

pitrou · 2012-04-10T18:39:35Z

I'm proposing the following changes to the threading docs. Most of them are cleanups and small improvements, but I also rewrote the offending paragraph, and made the wait_for example more proeminent.

neologix · 2012-04-10T20:21:11Z

I'm proposing the following changes to the threading docs. Most of them are cleanups and small improvements, but I also rewrote the offending paragraph, and made the wait_for example more proeminent.

Looks good to me.

python-dev · 2012-04-10T20:57:09Z

New changeset 2f51dca92883 by Antoine Pitrou in branch '3.2':
Issue bpo-8799: Fix and improve the threading.Condition documentation.
http://hg.python.org/cpython/rev/2f51dca92883

pitrou · 2012-04-10T20:59:47Z

Ok, I've fixed the docs in 3.2 and 3.3.

kristjanvalur · 2012-04-11T09:19:08Z

A few comments:
1)
with cv:
make_an_item_available()
+ cv.notify()

one of the benefits of wait_for() is that it automates the tricky timekeeping needed if one wants an somewhat accurate timeout, you may want to mention this specifically.
the offending part. I see that you don't want to use the technical terms of spurious wakeups or stolen wakeups. That's ok, but the resulting explanation is somwhat bogus. I suggest the following:

The ``while`` loop checking for the application's condition is necessary
because :meth:`~Condition.wait` can return after an arbitrary long time, and the condition which prompted the :meth:`~Condition.notify` call may not yet, or no longer, hold true. This is an inherent property of multi-threaded programming and the use of this pattern is essential for the robust use of Condition objects.

pitrou · 2012-04-11T10:12:20Z

A few comments:
1)
with cv:
make_an_item_available()
  cv.notify()

Did I forget this? Ow.

one of the benefits of wait_for() is that it automates the tricky
timekeeping needed if one wants an somewhat accurate timeout, you may
want to mention this specifically.

Ok.

the offending part. I see that you don't want to use the technical
terms of spurious wakeups or stolen wakeups. That's ok, but the
resulting explanation is somwhat bogus. I suggest the following:

The ``while`` loop checking for the application's condition is
necessary
because :meth:`~Condition.wait` can return after an arbitrary long
time, and the condition which prompted the :meth:`~Condition.notify`
call may not yet, or no longer, hold true. This is an inherent
property of multi-threaded programming and the use of this pattern is
essential for the robust use of Condition objects.

But, once again, "the condition may not yet hold true" is false.
However, the rest of the suggestion looks good, thanks.

kristjanvalur · 2012-04-11T10:50:35Z

But, once again, "the condition may not yet hold true" is false.
In our current implementation, yes. But it is intentionally left undefined in the specification of the condition variable protocol, for very good reasons.
While I'm fine with not mentioning it in the docs, I would be very much against us actually specifying the opposite (that early wakeups never occur) because this will unnecessarily limit our options. Since the while() loop pattern is already recommended because of the "stolen wakeup" problem, leaving the "early/spurious" wakeup behaviour" undefined is wise, since there is nothing to be gained by actually guaranteeing that there will be no early wakeups. This is just good software engineering practice.

This is also why we, IMHO, shouldn't rely on this behaviour in the unittests. One should never write tests that depend on unspecified behaviour, again, good engineering practice.

pitrou · 2012-04-11T11:00:01Z

Le mercredi 11 avril 2012 à 10:50 +0000, Kristján Valur Jónsson a
écrit :

> But, once again, "the condition may not yet hold true" is false.
In our current implementation, yes. But it is intentionally left
undefined in the specification of the condition variable protocol, for
very good reasons.

No, it is not "left undefined". If the documentation doesn't say
spurious wakeups may occur, then they are not supposed to occur.
Predictable behaviour is a good thing for users.

While I'm fine with not mentioning it in the docs, I would be very
much against us actually specifying the opposite (that early wakeups
never occur) because this will unnecessarily limit our options.

Which options?

This is also why we, IMHO, shouldn't rely on this behaviour in the
unittests.

Disagreed. Unit tests should definitely protect against the introduction
of bugs (willingly or not). And unpredictable behaviour is usually
considered a bug.

If you think the condition variable specification should be changed, you
can always ask for approval on python-dev. But I don't even see the
point: you are not demonstrating any *practical* advantage in doing so.

neologix · 2012-04-11T11:32:43Z

[...]

Disagreed. Unit tests should definitely protect against the introduction
of bugs (willingly or not). And unpredictable behaviour is usually
considered a bug.

If you think the condition variable specification should be changed, you
can always ask for approval on python-dev. But I don't even see the
point: you are not demonstrating any *practical* advantage in doing so.

One can imagine, for example, that another implementation (or maybe
CPython in a later version) exposes native condition variables on
POSIX, instead of emulating them (e.g. what happended for TLS
implementation).
In that case, we could get spurious wakeups, and code not designed to
cope with them would break.

pitrou · 2012-04-11T11:40:33Z

Le mercredi 11 avril 2012 à 11:32 +0000, Charles-François Natali a
écrit :

One can imagine, for example, that another implementation (or maybe
CPython in a later version) exposes native condition variables on
POSIX, instead of emulating them (e.g. what happended for TLS
implementation).

That's right, but spurious wakeups would be a good argument against
accepting such an implementation. Or that implementation should be fixed
to avoid them.

That said, we couldn't use the POSIX implementation anyway, since our
API allows the user to choose the type of lock, while POSIX AFAICT would
only accept a POSIX mutex (which doesn't have the same semantics as an
arbitrary Python lock).

pitrou · 2012-04-11T17:44:28Z

Ok, doc improved in 9d4109af8f3b.

neologix · 2012-04-11T21:23:30Z

Ok, doc improved in 9d4109af8f3b.

LGTM.

Kristján, how about updating your patch to only fix the original
problem you spotted (notify() called before wait()), then we can see
the remaining parts?

kristjanvalur · 2013-11-07T09:18:57Z

Ah, here we are some 18 months later. Let me have another go.

kristjanvalur · 2013-11-08T14:13:39Z

Here is an updated patch that only adds the necessary safety delay to the main thread. It also explains these in the comments.

python-dev · 2013-11-11T11:32:23Z

New changeset 9e48c9538a7f by Kristjan Valur Jonsson in branch 'default':
Issue bpo-8799: Reduce timing sensitivity of condition test by explicitly
http://hg.python.org/cpython/rev/9e48c9538a7f

vstinner · 2017-06-22T00:09:42Z

New changeset da6d305 by Victor Stinner in branch '2.7':
bpo-8799: Reduce timing sensitivity of condition test by explicitly (bpo-2320)
da6d305

vstinner · 2017-06-22T00:11:03Z

The race condition is not dead, at least in Python 2.7 :-) I backported the change for bpo-30727: "[2.7] test_threading.ConditionTests.test_notify() hangs randomly on Python 2.7".

vstinner · 2018-07-16T22:25:50Z

This issue discussed different things:

test_threading was unable: I didn't see such failure the last 12 months, and I backported enhancements from master to 2.7, so I consider this issue as fixed
documentation of threading locks: fixed
Except on Windows, Python 2.7 uses its own implementation of TLS: this issue has been fixed in Python 3.

All discussed issues have been fixed, so I close the issue.

kristjanvalur mannequin added easy stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels May 24, 2010

akuchling removed the easy label Nov 6, 2013

vstinner closed this as completed Jul 16, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hang in lib/test/test_threading.py #53045

Hang in lib/test/test_threading.py #53045

kristjanvalur mannequin commented May 24, 2010

kristjanvalur mannequin commented May 24, 2010

pitrou commented May 24, 2010

kristjanvalur mannequin commented May 24, 2010

kristjanvalur mannequin commented Apr 9, 2012

pitrou commented Apr 9, 2012

kristjanvalur mannequin commented Apr 10, 2012

pitrou commented Apr 10, 2012

kristjanvalur mannequin commented Apr 10, 2012

neologix mannequin commented Apr 10, 2012

pitrou commented Apr 10, 2012

pitrou commented Apr 10, 2012

neologix mannequin commented Apr 10, 2012

python-dev mannequin commented Apr 10, 2012

pitrou commented Apr 10, 2012

kristjanvalur mannequin commented Apr 11, 2012

pitrou commented Apr 11, 2012

kristjanvalur mannequin commented Apr 11, 2012

pitrou commented Apr 11, 2012

neologix mannequin commented Apr 11, 2012

pitrou commented Apr 11, 2012

pitrou commented Apr 11, 2012

neologix mannequin commented Apr 11, 2012

kristjanvalur mannequin commented Nov 7, 2013

kristjanvalur mannequin commented Nov 8, 2013

python-dev mannequin commented Nov 11, 2013

vstinner commented Jun 22, 2017

vstinner commented Jun 22, 2017

vstinner commented Jul 16, 2018

Hang in lib/test/test_threading.py #53045

Hang in lib/test/test_threading.py #53045

Comments

kristjanvalur mannequin commented May 24, 2010

kristjanvalur mannequin commented May 24, 2010

pitrou commented May 24, 2010

kristjanvalur mannequin commented May 24, 2010

kristjanvalur mannequin commented Apr 9, 2012

pitrou commented Apr 9, 2012

kristjanvalur mannequin commented Apr 10, 2012

pitrou commented Apr 10, 2012

kristjanvalur mannequin commented Apr 10, 2012

neologix mannequin commented Apr 10, 2012

pitrou commented Apr 10, 2012

pitrou commented Apr 10, 2012

neologix mannequin commented Apr 10, 2012

python-dev mannequin commented Apr 10, 2012

pitrou commented Apr 10, 2012

kristjanvalur mannequin commented Apr 11, 2012

pitrou commented Apr 11, 2012

kristjanvalur mannequin commented Apr 11, 2012

pitrou commented Apr 11, 2012

neologix mannequin commented Apr 11, 2012

pitrou commented Apr 11, 2012

pitrou commented Apr 11, 2012

neologix mannequin commented Apr 11, 2012

kristjanvalur mannequin commented Nov 7, 2013

kristjanvalur mannequin commented Nov 8, 2013

python-dev mannequin commented Nov 11, 2013

vstinner commented Jun 22, 2017

vstinner commented Jun 22, 2017

vstinner commented Jul 16, 2018