Title: 3.3 _thread lock.acquire() timeout and threading.Event().wait() do not wake for certain values on Windows
Type: behavior Stage: patch review
Components: Library (Lib), Windows Versions: Python 3.3, Python 3.4
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: alex.gronholm, brian.curtin, kristjan.jonsson, mark.dickinson, pitrou, python-dev, r.david.murray, raruler, tim.golden
Priority: normal Keywords: patch

Created on 2014-02-22 22:20 by raruler, last changed 2014-05-08 11:21 by kristjan.jonsson. This issue is now closed.

File name Uploaded Description Edit
issue20737.condvar.patch tim.golden, 2014-05-07 10:20 review
issue20737.condvar.2.patch tim.golden, 2014-05-07 19:11 review
condwait.patch kristjan.jonsson, 2014-05-07 21:49 review
condwait.patch kristjan.jonsson, 2014-05-07 21:51 review tim.golden, 2014-05-08 10:15
Messages (24)
msg211966 - (view) Author: (raruler) Date: 2014-02-22 22:20
I've tried this with both the 32-bit and 64-bit versions of 3.3.4 on two Windows 7 x64 machines. 

threading.Event().wait(2148) and a lock obtained from _thread used as lock.acquire(True, 2148) will never return. Anything under 2148 seems to work just fine, but anything 2148 or larger never wakes up.

The same call works on 3.3.4/x64 on OS X.
msg218004 - (view) Author: Tim Golden (tim.golden) * (Python committer) Date: 2014-05-06 19:04
In thread_nt.h:EnterNonRecursiveMutex the millisecond version (here: 2148000) of the seconds you passed in are converted to microseconds (so: 2148000000).

This is then passed to condvar.h:PyCOND_TIMEDWAIT which expects a long, whose 32-bit limit is 2147483647. So it wraps around and becomes -2146967296. 

That value is then divided by 1000 once again to become milliseconds (-2146967) and passed to condvar.h:_PyCOND_WAIT_MS which expects a DWORD, which is a synonym for an unsigned long. Thus the signed value becomes an unsigned 4292820329. Which then passed in as the millisecond timeout to WaitForSingleObjectEx.

So that's what's happening; but I'm really not sure at what stage a change should be made. It looks to me as if PyCOND_TIMEDWAIT should be accepting something more than a long, but I'm really not competent to assess the impact of such a change.
msg218045 - (view) Author: Tim Golden (tim.golden) * (Python committer) Date: 2014-05-07 10:20
The attached patch uses DWORD (essentially: unsigned long) in condvar.h:PyCOND_TIMEDWAIT.

Adding Kristjan as it was his code.
msg218047 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2014-05-07 11:06
changing long to DWORD doesn't really fix the overflow issue.
The fundamental problem is that some of the apis, e.g. WaitForSingleObject have a DWORD maximum.  so, we cannot support sleep times longer than some particular time.

Microseconds was chosen in the api because that is the resolution of the api in pthreads.

IMHO, I think it is okay to have an implicit ceiling on the timeout, e.g. some 4000 seconds.  We can add a caveat somewhere that anyone intending to sleep for extended periods of time should be prepared for a timeout occurring early, and should have his own timing logic to deal with that.

My suggestion then is to 
a) change the apis to DWORD
b) add a macro something like PyCOND_MAX_WAIT set to 2^32-1
c) properly clip the argument where we call this cunfion, e.g. in lock.acquire.
msg218055 - (view) Author: Tim Golden (tim.golden) * (Python committer) Date: 2014-05-07 14:14
Thanks for the feedback, Kristjan. You're obviously correct in that we
can't account for timeouts greater than DWORD-size milliseconds and your
proposed solution looks reasonable.

However, I'd like to close off *this* particular issue which turns on
the implicit and, presumably unintended, conversion between unsigned and
signed long in the condition variable code. Can you see any adverse
effect from moving to DWORD parameters per my patch?
msg218056 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-05-07 14:24
An implicit ceiling of 4000 seconds on the timeout?  I routinely use timeouts of approximately 24 hours in calls to Event().wait().  What am I misunderstanding?  If I'm not misunderstanding, then no, I don't think that change would be acceptable.
msg218058 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2014-05-07 15:15
Tim, how about changing the variable to "unsigned long"?  I'd like the signature of the function to be the same for all platforms.

This will change the code and allow waits for up to 4000 seconds.

There is still an overflow problem present, though.+

David, in general the maximum wait times of these primitives are platform specific.  If you don't want any ceiling, then we would have to add code all over the place (in C) to do looping timeouts.  Not sure which is better, to do it in c, or to accept in python that waits may timeout earlier than specified.
msg218059 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-05-07 16:02
I have production code on Windows using python2.7 that calls Event().wait() with a timeout of approximately 24 hours, and it works just fine.  Having that no longer work is, IMO, an unacceptable regression.  (I'm ready to move this code to python3 as soon as cx_Freeze supports Windows services under python3).
msg218060 - (view) Author: Tim Golden (tim.golden) * (Python committer) Date: 2014-05-07 16:20
Just to be clear: the change *I'm* proposing for this issue has nothing 
to do with limiting the wait, artificially or otherwise. It's simply 
undoing an unintended conversion from unsigned to signed and back again, 
whicih currently causes any wait of more than 2147 seconds to hang 
pretty much indefinitely.

It looks to me as though Kristjan's "4000" is off by an order of 
magnitude: the parameter to WaitFor... is a DWORD number of 
milliseconds. That allows for 2**31 -1 milliseconds which is something 
short of 50 days, I believe.
msg218071 - (view) Author: Tim Golden (tim.golden) * (Python committer) Date: 2014-05-07 19:11
Updated patch with unsigned long applied throughout
msg218077 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2014-05-07 21:11
Hi there.
When I said "4000", that was because of the conversion to microseconds which happens early on.  

I'm not trying to be difficult here Tim, it's just that you've pointed out a problem and I'd like us to have a comprehensive fix.

"unsigned long", I realized, is also not super, because on unix that can be either 32 or 64 bits :)

The reason 24 hour waits work on 2.7 is that the conversion to microseconds is never done, rather it uses a DWORD of milliseconds.  I agree that this is a regression that needs fixing.  Even if there is a theroetical maximum, it should be higher than that :)

My latest suggestion?  Let's just go ahead and use a "double" for the argument in PyCOND_TIMEDWAIT().

We then have two conversion cases:
1) to a DWORD of milliseconds for both windows apis.  Here we should truncate to the max size of a DWORD
2) to the timeval used on pthreads.

for 1, that can be done like:
if (ds*1e3 > (double)DWORD_MAX)
  ms = DWORD_MAX;
  ms = (DWORD)(ds*1e3)

for 2, modifying the PyCOND_ADD_MICROSECONDS macro into something like:

do {
 long oldsec, sec, usec;
 assert(ds >= 0.0);
 // truncate ds into theoretical maximum
 if (ds > (double)long_max)
    ds = (double)long_max; // whatever that may be
 sec = (long)ds;
 usec = (long)((ds - (double)sec) * 1e6))
 oldsec = tv.tv_sec;
 tv.tv_usec += usec;
 tv.tv_sec += sec;
 if (usec >= 1000000) {
   tv.tv_usec -= 1000000;
   tv.tv_sec += 1;
 if (tv.tv_sec < oldsec)
    /* detect overflow */
    tv.sec = max_long;

I'm not super experienced with integer arithmetic like this or the pitfalls of overflow, so this might need some pondering.  Perhaps it is better to do the tv_sec and tv_usec arithmetic in doubles before converting them back.

Does this sound ok?

Let me see if I can cook up an alternative patch.
msg218078 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-05-07 21:15
I'm adding Mark Dickinson to nosy to see if our math expert has a comment on the arithmetic :)
msg218079 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2014-05-07 21:40
Ah, I saw this code here in thread_nt.h:

if ((DWORD) milliseconds != milliseconds)
            Py_FatalError("Timeout too large for a DWORD, "
                           "please check PY_TIMEOUT_MAX");

the PyCOND_TIMEDWAIT is currently only used by the GIL code and by the locks on NT.  The GIL code assumes microsecond resolution.  So we need to stick to that, at least.  But the locking code assumes at least DWORD worth of milliseconds.
msg218080 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2014-05-07 21:41
so, I suggest that we modify the API to use "Py_LONG_LONG usec"

Does that sound reasonable?
msg218081 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2014-05-07 21:49
Here is a proposed alternative patch.
No additional checks, just a wider "Py_LONG_LONG us" wide enough to accommodate 32 bits of milliseconds as before.
msg218082 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2014-05-07 21:51
fix patch, was using git format....
msg218096 - (view) Author: Tim Golden (tim.golden) * (Python committer) Date: 2014-05-08 07:29
+1 for Kristjan's latest patch. And thanks for working this through, Kristjan: I'd missed the fact that the microseconds conversion could itself overflow even an unsigned long.
msg218098 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2014-05-08 09:12
Thanks.  Can you confirm that it resolves the issue?  I'll get it checked in once I get the regrtest suite run.
msg218099 - (view) Author: Tim Golden (tim.golden) * (Python committer) Date: 2014-05-08 09:32
msg218100 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2014-05-08 09:33
I see, I wasn't able to compile it yesterday when I did it :)
msg218104 - (view) Author: Tim Golden (tim.golden) * (Python committer) Date: 2014-05-08 10:15
I can confirm that the attached times out after 2150 seconds (ie
30+ minutes) with your (tweaked) patch applied:

python 2150

Running Debug|Win32 interpreter...
2014-05-08 10:33:53.670091
Expected to time out by 2014-05-08 11:09:43.670091
Timed Out
2014-05-08 11:09:43.765079

I don't see us adding a test which would lengthen the test suite run by
more than 30 minutes!
msg218105 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2014-05-08 10:16
nope, let's not do that :)
msg218106 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-05-08 10:37
New changeset 80b76c6fad44 by Kristján Valur Jónsson in branch 'default':
The PyCOND_TIMEDWAIT must use microseconds for the timeout argument
msg218108 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-05-08 11:19
New changeset ab5e2b0fba15 by Kristján Valur Jónsson in branch '3.3':
The PyCOND_TIMEDWAIT must use microseconds for the timeout argument

New changeset 7764bb7f2983 by Kristján Valur Jónsson in branch '3.4':
Merging from 3.3: The PyCOND_TIMEDWAIT must use microseconds for the timeout argument
Date User Action Args
2014-05-08 11:21:27kristjan.jonssonsetstatus: open -> closed
resolution: fixed
2014-05-08 11:19:31python-devsetmessages: + msg218108
2014-05-08 10:37:49python-devsetnosy: + python-dev
messages: + msg218106
2014-05-08 10:16:51kristjan.jonssonsetmessages: + msg218105
2014-05-08 10:15:38tim.goldensetfiles: +

messages: + msg218104
2014-05-08 09:33:38kristjan.jonssonsetmessages: + msg218100
2014-05-08 09:32:12tim.goldensetmessages: + msg218099
2014-05-08 09:12:08kristjan.jonssonsetmessages: + msg218098
2014-05-08 07:29:01tim.goldensetmessages: + msg218096
2014-05-07 21:51:03kristjan.jonssonsetfiles: + condwait.patch

messages: + msg218082
2014-05-07 21:49:37kristjan.jonssonsetfiles: + condwait.patch

messages: + msg218081
2014-05-07 21:41:23kristjan.jonssonsetmessages: + msg218080
2014-05-07 21:40:21kristjan.jonssonsetmessages: + msg218079
2014-05-07 21:15:05r.david.murraysetnosy: + mark.dickinson
messages: + msg218078
2014-05-07 21:11:51kristjan.jonssonsetmessages: + msg218077
2014-05-07 19:11:24tim.goldensetfiles: + issue20737.condvar.2.patch

messages: + msg218071
2014-05-07 16:20:54tim.goldensetmessages: + msg218060
2014-05-07 16:02:48r.david.murraysetmessages: + msg218059
2014-05-07 15:15:06kristjan.jonssonsetmessages: + msg218058
2014-05-07 14:24:31r.david.murraysetmessages: + msg218056
2014-05-07 14:14:34tim.goldensetmessages: + msg218055
2014-05-07 11:06:20kristjan.jonssonsetmessages: + msg218047
2014-05-07 10:20:31tim.goldensetstage: patch review
2014-05-07 10:20:18tim.goldensetfiles: + issue20737.condvar.patch

nosy: + kristjan.jonsson
messages: + msg218045

keywords: + patch
2014-05-06 19:04:20tim.goldensetmessages: + msg218004
2014-02-24 22:51:18alex.gronholmsetnosy: + alex.gronholm
2014-02-23 06:19:35r.david.murraysetnosy: + r.david.murray
2014-02-22 22:24:48pitrousetnosy: + tim.golden, brian.curtin

versions: + Python 3.4
2014-02-22 22:20:26rarulercreate