Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hang in lib/test/test_threading.py #53045

Closed
kristjanvalur mannequin opened this issue May 24, 2010 · 28 comments
Closed

Hang in lib/test/test_threading.py #53045

kristjanvalur mannequin opened this issue May 24, 2010 · 28 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@kristjanvalur
Copy link
Mannequin

kristjanvalur mannequin commented May 24, 2010

BPO 8799
Nosy @csernazs, @pitrou, @kristjanvalur, @vstinner
PRs
  • [2.7] bpo-8799: Reduce timing sensitivity of condition test by explicitly #2320
  • Files
  • locktest.patch
  • lock_tests.diff
  • tpconditiondoc.patch
  • lock_tests.patch: updated ro
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-07-16.22:25:49.874>
    created_at = <Date 2010-05-24.09:44:15.789>
    labels = ['type-bug', 'library']
    title = 'Hang in lib/test/test_threading.py'
    updated_at = <Date 2018-07-16.22:25:49.873>
    user = 'https://github.com/kristjanvalur'

    bugs.python.org fields:

    activity = <Date 2018-07-16.22:25:49.873>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-07-16.22:25:49.874>
    closer = 'vstinner'
    components = ['Library (Lib)']
    creation = <Date 2010-05-24.09:44:15.789>
    creator = 'kristjan.jonsson'
    dependencies = []
    files = ['17447', '25161', '25170', '32541']
    hgrepos = []
    issue_num = 8799
    keywords = ['patch']
    message_count = 28.0
    messages = ['106349', '106355', '106360', '157844', '157847', '157942', '157944', '157968', '157972', '157973', '157976', '157980', '157984', '157987', '158013', '158017', '158019', '158020', '158022', '158023', '158062', '158076', '202319', '202420', '202614', '296610', '296611', '321776']
    nosy_count = 7.0
    nosy_names = ['csernazs', 'pitrou', 'kristjan.jonsson', 'vstinner', 'jyasskin', 'neologix', 'python-dev']
    pr_nums = ['2320']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue8799'
    versions = ['Python 3.3']

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented May 24, 2010

    The tests for the ConditionVariable a fragile. On a slow computer, or when running a DEBUG build, the main thread can issue a condition.notify() call, even though the worker threads have not settled in on a wait() call. This causes the test to stall.
    The provided patch "fixes" this by adding a short _wait() before the notify.

    @kristjanvalur kristjanvalur mannequin added easy stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels May 24, 2010
    @pitrou
    Copy link
    Member

    pitrou commented May 24, 2010

    Strange, I have never noticed this (even on the buildbots).

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented May 24, 2010

    Well, I saw this occasionally on my dual core laptop when working on the RWLock. A notify() before a target does a wait() is ineffective and it would wait indefinitely after the notify(5) call. This isn't a "proper" fix (I had another, more complex, where we would keep a counter for the number of successful "waiters" and wait for that to reach the target value) but it is in the spirit of many of these tests, where we rely on _wait() to be sufficient to reach a certain state.

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Apr 9, 2012

    Here is a new patch.

    1. I´ve simplified and relaxed test_notify() for Condition objects. Condition variables don't guarantee that there won't be any spurious wakeups so the test must maintain internal bookkeeping so that it doesn't break with a different implementation of Condition.
    2. The main thread now properly waits for the workers to "settle" in the test.
    3. I've added two generic tests of Condition objects in their "natural environment" as building blocks for bigger things, a Queue and a Lock.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 9, 2012

    Condition variables don't guarantee that there won't be any spurious
    wakeups

    What do you mean? The implementation doesn't seem prone to spurious wakeups, and the docs don't say so either.

    I've added two generic tests of Condition objects in their "natural
    environment" as building blocks for bigger things, a Queue and a Lock.

    Why would you build a Lock out of a Condition?

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Apr 10, 2012

    We shouldn't be testing implementation details. The Condition variables are canonically prone to "spurious wakeups" and "stolen wakeups". The fact that the current implementation doesn't have them shouldn't place that burden on future implementations of threading.Condition on this platform.
    From the docs: "Note: Condition variables can be, depending on the implementation, subject to both spurious wakeups (when wait() returns without a notify() call) and stolen wakeups (when another thread acquires the lock before the awoken thread.) For this reason, it is always necessary to verify the state the thread is waiting for when wait() returns and optionally repeat the call as often as necessary."

    1. You would build a lock yourself if you wanted exact control over its behaviour. In this case, we are just doing what threading.Lock() does, but it could be more complicated. At any rate, it proves a useful toy usage of a condition variable in the context of making sure that it works properly as part of the unit tests.

    I could leave the original test in place and just add the new ones, but the old test is at least provably a subject to a race condition as per the original issue.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 10, 2012

    The Condition variables are canonically prone to "spurious wakeups"
    and "stolen wakeups".

    No, they aren't. Just because POSIX says they are doesn't mean *our*
    condition variables are the same. Spurious wakeups are an annoyance, and
    our implementation AFAICT never exhibited them.

    From the docs: "Note: Condition variables can be, depending on the
    implementation, subject to both spurious wakeups (when wait() returns
    without a notify() call) and stolen wakeups (when another thread
    acquires the lock before the awoken thread.) For this reason, it is
    always necessary to verify the state the thread is waiting for when
    wait() returns and optionally repeat the call as often as necessary."

    Ah, thanks, indeed. Except that...
    this was added by yourself in 483bbebc57bf, after bpo-10260, but
    *without* being part of the original patch that you uploaded on that
    issue.
    So this never got reviewed and was instead sneaked in the docs in a
    commit of yours.
    Unless other people disagree, I think this addition should be reverted.

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Apr 10, 2012

    Stolen wakeups are a fact of life though, even in cpython's implementation of condition variables. And for the user of a condition variable the difference is so subtle as to not warrant special mention.

    This is, btw, not just a posix thing. It is same on windows, java, and in fact, comes from the original definition of a condition variable as part of the "monitor" pattern. Condition variables are defined have this caveat to allow for flexibility in their implementation.
    Look anywhere in the internets and you will find this, including here:
    http://www.cs.berkeley.edu/~kubitron/courses/cs162/hand-outs/synch.html

    The canonical and correct way to use a condition variable is to repeatedly wait() and verify that the wakeup condition holds. The wait_for() method abstracts this into a convenient helper. There is no need to befuddle the issue by special casing the python version as being guaranteed to not have suprious wakeups, but otherwise do have the stolen thread problem. Quite the contrary, we should be careful to never overspecify our interfaces so that we may not change their ipmlementation details later.

    A good summary of the situation and the reason for the specification can be found here:
    http://stackoverflow.com/questions/8594591/why-does-pthread-cond-wait-have-spurious-wakeups

    The comment I added to the documentation clarifies the reason why other examples in the docs were using while loops and teaches shows people how to use these constructs in a robust manner. Removing it is a bad idea and I think you will find others will agree. Take it up on python-dev if you want.

    Unless you disagree, I'll add these test cases in addition to the broken one, which can then be fixed or removed later if it turns out to be problematic to other people.

    @neologix
    Copy link
    Mannequin

    neologix mannequin commented Apr 10, 2012

    > The Condition variables are canonically prone to "spurious wakeups"
    > and "stolen wakeups".

    No, they aren't. Just because POSIX says they are doesn't mean *our*
    condition variables are the same. Spurious wakeups are an annoyance, and
    our implementation AFAICT never exhibited them.

    I agree with Antoine.
    Spurious wakeups are an implementation issue (mainly has to do with
    efficiency on SMP systems and interrupted syscalls), but are
    definitely not a part of the "standard" API. Since our implementation
    doesn't exhibit this problem, there's no reason to scare and confuse
    people.
    For example, our locks are special in the sense that they can be
    released by a thread other than the one that acquired it, etc, and
    this behavior is documented.

    However, while I think the spurious wakeups reference could be
    removed, I think that the advice of repeatedly chcking the invariant
    shoud be kept, because it is always desirable (stolen wakeup, and
    better logic).

    @pitrou
    Copy link
    Member

    pitrou commented Apr 10, 2012

    Stolen wakeups are a fact of life though, even in cpython's
    implementation of condition variables. And for the user of a
    condition variable the difference is so subtle as to not warrant
    special mention.

    I don't think it's a subtle difference. Being woken up too late is not
    the same as being woken up spuriously.
    Also, the term "stolen wakeup" makes it unnecessarily dramatic. There is
    no "stealing" involved here.

    Look anywhere in the internets and you will find this, including here:
    http://www.cs.berkeley.edu/~kubitron/courses/cs162/hand-outs/synch.html

    Thanks, will take a look.

    Quite the contrary, we should be careful to never overspecify our
    interfaces so that we may not change their ipmlementation details
    later.

    The interface is already, de facto, specified by its one and only
    implementation, which has been in use for a long time. It's also
    specified by its documentation, before it was silently modified by you.

    There's probably code out there that relies on wait() not returning
    prematurately.

    A good summary of the situation and the reason for the specification
    can be found here:
    http://stackoverflow.com/questions/8594591/why-does-pthread-cond-wait-have-spurious-wakeups

    So the reason appears to be:

        “Spurious wakeups may sound strange, but on some multiprocessor
        systems, making condition wakeup completely predictable might
        substantially slow all condition variable operations.”
    

    Python is a high-level language, so making condition variable ops
    "substantially" slower is not a blocking issue for us.

    What follows is even less enlightening:

        “The intent was to force correct/robust code by requiring
        predicate loops. This was  driven by the provably correct
        academic contingent among the "core threadies" in the working
        group, though I don't think anyone really disagreed with the
        intent once they understood what it meant.”
    

    The comment I added to the documentation clarifies the reason why
    other examples in the docs were using while loops and teaches shows
    people how to use these constructs in a robust manner.

    Teaching people to use loops is fine. But there's no need to mislead
    them with inaccurate descriptions of the implementation's behaviour.

    Unless you disagree, I'll add these test cases in addition to the
    broken one, which can then be fixed or removed later if it turns out
    to be problematic to other people.

    Really, can you stop being aggressive? You haven't proved that test is
    wrong, and it passes consistently fine on all buildbots.

    If you don't agree with what is being tested, that's another matter. But
    please accept that the test is ok w.r.t. Condition's specified
    behaviour.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 10, 2012

    I'm proposing the following changes to the threading docs. Most of them are cleanups and small improvements, but I also rewrote the offending paragraph, and made the wait_for example more proeminent.

    @neologix
    Copy link
    Mannequin

    neologix mannequin commented Apr 10, 2012

    I'm proposing the following changes to the threading docs. Most of them are cleanups and small improvements, but I also rewrote the offending paragraph, and made the wait_for example more proeminent.

    Looks good to me.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Apr 10, 2012

    New changeset 2f51dca92883 by Antoine Pitrou in branch '3.2':
    Issue bpo-8799: Fix and improve the threading.Condition documentation.
    http://hg.python.org/cpython/rev/2f51dca92883

    @pitrou
    Copy link
    Member

    pitrou commented Apr 10, 2012

    Ok, I've fixed the docs in 3.2 and 3.3.

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Apr 11, 2012

    A few comments:
    1)
    with cv:
    make_an_item_available()
    + cv.notify()

    1. one of the benefits of wait_for() is that it automates the tricky timekeeping needed if one wants an somewhat accurate timeout, you may want to mention this specifically.

    2. the offending part. I see that you don't want to use the technical terms of spurious wakeups or stolen wakeups. That's ok, but the resulting explanation is somwhat bogus. I suggest the following:

    The ``while`` loop checking for the application's condition is necessary
    because :meth:`~Condition.wait` can return after an arbitrary long time, and the condition which prompted the :meth:`~Condition.notify` call may not yet, or no longer, hold true. This is an inherent property of multi-threaded programming and the use of this pattern is essential for the robust use of Condition objects.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 11, 2012

    A few comments:
    1)
    with cv:
    make_an_item_available()

    •   cv.notify()
      

    Did I forget this? Ow.

    1. one of the benefits of wait_for() is that it automates the tricky
      timekeeping needed if one wants an somewhat accurate timeout, you may
      want to mention this specifically.

    Ok.

    1. the offending part. I see that you don't want to use the technical
      terms of spurious wakeups or stolen wakeups. That's ok, but the
      resulting explanation is somwhat bogus. I suggest the following:

    The ``while`` loop checking for the application's condition is
    necessary
    because :meth:`~Condition.wait` can return after an arbitrary long
    time, and the condition which prompted the :meth:`~Condition.notify`
    call may not yet, or no longer, hold true. This is an inherent
    property of multi-threaded programming and the use of this pattern is
    essential for the robust use of Condition objects.

    But, once again, "the condition may not yet hold true" is false.
    However, the rest of the suggestion looks good, thanks.

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Apr 11, 2012

    But, once again, "the condition may not yet hold true" is false.
    In our current implementation, yes. But it is intentionally left undefined in the specification of the condition variable protocol, for very good reasons.
    While I'm fine with not mentioning it in the docs, I would be very much against us actually specifying the opposite (that early wakeups never occur) because this will unnecessarily limit our options. Since the while() loop pattern is already recommended because of the "stolen wakeup" problem, leaving the "early/spurious" wakeup behaviour" undefined is wise, since there is nothing to be gained by actually guaranteeing that there will be no early wakeups. This is just good software engineering practice.

    This is also why we, IMHO, shouldn't rely on this behaviour in the unittests. One should never write tests that depend on unspecified behaviour, again, good engineering practice.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 11, 2012

    Le mercredi 11 avril 2012 à 10:50 +0000, Kristján Valur Jónsson a
    écrit :

    > But, once again, "the condition may not yet hold true" is false.
    In our current implementation, yes. But it is intentionally left
    undefined in the specification of the condition variable protocol, for
    very good reasons.

    No, it is not "left undefined". If the documentation doesn't say
    spurious wakeups may occur, then they are not supposed to occur.
    Predictable behaviour is a good thing for users.

    While I'm fine with not mentioning it in the docs, I would be very
    much against us actually specifying the opposite (that early wakeups
    never occur) because this will unnecessarily limit our options.

    Which options?

    This is also why we, IMHO, shouldn't rely on this behaviour in the
    unittests.

    Disagreed. Unit tests should definitely protect against the introduction
    of bugs (willingly or not). And unpredictable behaviour is usually
    considered a bug.

    If you think the condition variable specification should be changed, you
    can always ask for approval on python-dev. But I don't even see the
    point: you are not demonstrating any *practical* advantage in doing so.

    @neologix
    Copy link
    Mannequin

    neologix mannequin commented Apr 11, 2012

    [...]

    Disagreed. Unit tests should definitely protect against the introduction
    of bugs (willingly or not). And unpredictable behaviour is usually
    considered a bug.

    If you think the condition variable specification should be changed, you
    can always ask for approval on python-dev. But I don't even see the
    point: you are not demonstrating any *practical* advantage in doing so.

    One can imagine, for example, that another implementation (or maybe
    CPython in a later version) exposes native condition variables on
    POSIX, instead of emulating them (e.g. what happended for TLS
    implementation).
    In that case, we could get spurious wakeups, and code not designed to
    cope with them would break.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 11, 2012

    Le mercredi 11 avril 2012 à 11:32 +0000, Charles-François Natali a
    écrit :

    One can imagine, for example, that another implementation (or maybe
    CPython in a later version) exposes native condition variables on
    POSIX, instead of emulating them (e.g. what happended for TLS
    implementation).

    That's right, but spurious wakeups would be a good argument against
    accepting such an implementation. Or that implementation should be fixed
    to avoid them.

    That said, we couldn't use the POSIX implementation anyway, since our
    API allows the user to choose the type of lock, while POSIX AFAICT would
    only accept a POSIX mutex (which doesn't have the same semantics as an
    arbitrary Python lock).

    @pitrou
    Copy link
    Member

    pitrou commented Apr 11, 2012

    Ok, doc improved in 9d4109af8f3b.

    @neologix
    Copy link
    Mannequin

    neologix mannequin commented Apr 11, 2012

    Ok, doc improved in 9d4109af8f3b.

    LGTM.

    Kristján, how about updating your patch to only fix the original
    problem you spotted (notify() called before wait()), then we can see
    the remaining parts?

    @akuchling akuchling removed the easy label Nov 6, 2013
    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Nov 7, 2013

    Ah, here we are some 18 months later. Let me have another go.

    @kristjanvalur
    Copy link
    Mannequin Author

    kristjanvalur mannequin commented Nov 8, 2013

    Here is an updated patch that only adds the necessary safety delay to the main thread. It also explains these in the comments.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 11, 2013

    New changeset 9e48c9538a7f by Kristjan Valur Jonsson in branch 'default':
    Issue bpo-8799: Reduce timing sensitivity of condition test by explicitly
    http://hg.python.org/cpython/rev/9e48c9538a7f

    @vstinner
    Copy link
    Member

    New changeset da6d305 by Victor Stinner in branch '2.7':
    bpo-8799: Reduce timing sensitivity of condition test by explicitly (bpo-2320)
    da6d305

    @vstinner
    Copy link
    Member

    The race condition is not dead, at least in Python 2.7 :-) I backported the change for bpo-30727: "[2.7] test_threading.ConditionTests.test_notify() hangs randomly on Python 2.7".

    @vstinner
    Copy link
    Member

    This issue discussed different things:

    • test_threading was unable: I didn't see such failure the last 12 months, and I backported enhancements from master to 2.7, so I consider this issue as fixed
    • documentation of threading locks: fixed
    • Except on Windows, Python 2.7 uses its own implementation of TLS: this issue has been fixed in Python 3.

    All discussed issues have been fixed, so I close the issue.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants