classification
Title: queue.Queue(timeout=0.001) avg delay Windows:14.5ms, Ubuntu: 0.063ms
Type: performance Stage:
Components: Windows Versions: Python 3.7, Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Gammaguy, eryksun, josh.r, paul.moore, rhettinger, steve.dower, terry.reedy, tim.golden, vstinner, xtreak, zach.ware
Priority: normal Keywords:

Created on 2018-08-28 19:20 by Gammaguy, last changed 2018-09-03 04:25 by terry.reedy.

Messages (11)
msg324277 - (view) Author: Paul Overing (Gammaguy) Date: 2018-08-28 19:20
I have found that using timeout with a python queue.Queue() results in an average delay of  14.5ms in Windows over Ubuntu in python 3.6.5/3.7, I subtracted the 1 ms timeout.

I have also tried datetime.datetime.now() to measure the time, with similar results. 

Does that make sense?


import queue
import time

max=0
min=10000
num = 10000
sum = 0
for x in range(10000):
    q = queue.Queue()
    start = time.perf_counter()
    try:
        msg = q.get(block=True, timeout=.001)
    except:
        end = time.perf_counter()

    chng = end-start -.001

    if chng > max:
        max = chng
    if chng < min:
        min = chng
    sum = sum + chng

print(f"Avg: {sum/num}")
print(f"Min: {min}")
print(f"Max: {max}")

Results:

Window 10, Python 3.6.5/3.7
Avg: 0.014549868429997701
Min: 0.001674142000000046
Max: 0.13890251500000284

Ubuntu 16.04, Python 3.6.5/3.7
Avg: 6.275181290839012e-05
Min: 4.89290034456644e-05
Max: 8.690999695681965e-05
msg324299 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-08-29 04:14
Victor or Steve, do you know the resolution of perf_counter() and monotonic() on Windows?
msg324308 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-08-29 08:30
> Victor or Steve, do you know the resolution of perf_counter() and monotonic() on Windows?

You can use:

>>> time.get_clock_info('perf_counter')
namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)

perf_counter() has usually a resolution much smaller than 1 ms (ex: 100 ns), whereas monotonic() has a bad resolution (around 15.6 ms).

Rounding matters for the monotonic clock on Windows :-) Python should round away from zero (ROUND_UP) for timeouts.

--

> I have found that using timeout with a python queue.Queue() results in an average delay of  14.5ms in Windows over Ubuntu in python 3.6.5/3.7, I subtracted the 1 ms timeout.

This behaviour looks correct to me. I suggest to close the issue.

Python rely on Windows API. Queue.get() uses a threading.Lock. On Windows, lock.acquire(timeout=timeout) uses EnterNonRecursiveMutex() which expects a number of a milliseconds. Even if you succeed to enhance the resolution of the system clock, you will still be stuck to a resolution of 1 ms because of the Windows API.

I'm not sure of what is your underlying question. Does it look inefficient to you? You can use threads or asyncio to write more efficient code. But this is out of the scope of the bug tracker :-)

I suggest to close this issue as "not a bug".
msg324320 - (view) Author: Paul Overing (Gammaguy) Date: 2018-08-29 12:24
Thank you for your quick response.  But 1ms resolution or not, 14.5ms does not seem reasonable given that in Ubuntu it is 0.049ms. Ubuntu is on average 295x faster.  

Also the variability in times is much higher in Windows max/avg = 9.547, avg/min =  8.69.

It makes the windows queue.Queue implementations with timeouts unusable in low latency environments and should be documented as such. As it is slow and highly variable.
msg324328 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-08-29 15:56
>  It makes the windows queue.Queue implementations with timeouts unusable in low latency environments and should be documented as such. As it is slow and highly variable.

Well, it seems like you have a solution which is called Linux, no?
msg324342 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2018-08-29 17:44
Victor, that was a little overboard. By that logic, there doesn't need to be a Windows version of Python.

That said, Paul doesn't seem to understand that the real resolution limit isn't 1 ms; that's the lower limit on arguments to the API, but the real limit is the system clock, which has a granularity in the 10-16 ms range. It's a problem with Windows in general, and the cure is worse than the disease.

Per https://msdn.microsoft.com/en-us/library/windows/desktop/ms724411(v=vs.85).aspx , the resolution of the system timer is typically in the range of 10 milliseconds to 16 milliseconds.

Per https://docs.microsoft.com/en-us/windows/desktop/Sync/wait-functions#wait-functions-and-time-out-intervals :

> Wait Functions and Time-out Intervals

> The accuracy of the specified time-out interval depends on the resolution of the system clock. The system clock "ticks" at a constant rate. If the time-out interval is less than the resolution of the system clock, the wait may time out in less than the specified length of time. If the time-out interval is greater than one tick but less than two, the wait can be anywhere between one and two ticks, and so on.

All the Windows synchronization primitives (e.g. WaitForSingleObjectEx https://docs.microsoft.com/en-us/windows/desktop/api/synchapi/nf-synchapi-waitforsingleobjectex , which is what ultimately implements timed lock acquisition on Windows) are based on the system clock, so without drastic measures, it's impossible to get better granularity than the 10-16 ms of the default system clock configuration.

The link on "Wait Functions and Time-out Intervals" does mention that this granularity *can* be increased, but it recommends against fine-grained tuning (so you can't just tweak it before a wait and undo the tweak after; the only safe thing to do is change it on program launch and undo it on program exit). Even then, it's a bad idea for Python to use it; per timeBeginPeriod's own docs ( https://docs.microsoft.com/en-us/windows/desktop/api/timeapi/nf-timeapi-timebeginperiod ):

> This function affects a global Windows setting. Windows uses the lowest value (that is, highest resolution) requested by any process. Setting a higher resolution can improve the accuracy of time-out intervals in wait functions. However, it can also reduce overall system performance, because the thread scheduler switches tasks more often. High resolutions can also prevent the CPU power management system from entering power-saving modes. Setting a higher resolution does not improve the accuracy of the high-resolution performance counter.

Basically, to improve the resolution of timed lock acquisition, we'd have to change the performance profile of the entire OS while Python was running, likely increasing power usage and possibly reducing performance. Global solutions to local problems are a bad idea.

The most reasonable solution to the problem is to simply document it (maybe not for queue.Queue, but for the threading module). Possibly even provide an attribute in the threading module similar to  threading.TIMEOUT_MAX that reports the system clock's granularity for informational purposes (might need to be a function so it reports the potentially changing granularity).

Other, less reasonable solutions, would be:

1. Expose a function (with prominent warnings about not using it in a fine grained manner, and the effects on power management and performance) that would increase the system clock granularity as much as possible timeGetDevCaps reports possible (possibly limited to a user provided suggestion, so while the clock could go to 1 ms resolution, the user could request only 5 ms resolution to reduce the costs of doing so). Requires some additional state (whether timeBeginPeriod has been called, and with what values) so timeEndPeriod can be called properly before each adjustment and when Python exits. Pro is the code is *relatively* simple and would mostly fix the problem. Cons are that it wouldn't be super discoverable (unless we put notes in every place that uses timeouts, not just in threading docs), it encourages bad behavior (one application deciding its needs are more important that conserving power), and we'd have to be *really* careful to pair our calls universally (timeEndPeriod must be called, even when other cleanup is skipped, such as when calling os._exit; AFAICT, the docs imply that per-process adjustments to the clock aren't undone even when the process completes, which means failure to pair all calls would leave the system with a suboptimal system clock resolution that would remain in effect until rebooted).

2. (Likely a terrible idea, and like option 1, should be explicitly opt-in, not enabled by default) Offer the option to have Python lock timeouts only use WaitForSingleObjectEx only to sleep to within one system clock tick of the target time (and not at all if the timeout is less than the clock resolution), then, before reacquiring the GIL, perform a time slice yielding busy loop until you pass the target time (as determined by a higher resolution clock than the system clock). Bad for power management, bad for single core machines (where even with time slice yielding, you're still constantly getting scheduled), etc.
msg324343 - (view) Author: Paul Overing (Gammaguy) Date: 2018-08-29 18:01
Thank you to everyone for looking at my issue esp Josh.r for the detailed response. I now understand the issue and I just wanted to bring it to the group's attention for everyone's benefit.  Vstinner is correct; Linux is my solution as it is my production environment but I was using my windows laptop for dev as I've been doing frequent Asia trips lately and noticed the difference.
msg324355 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2018-08-29 22:53
Changing the system timer resolution doesn't appear to help, given the current design combined with what looks like a bug in Windows (at least in Windows 10). We can set the system clock resolution below a millisecond via NtSetTimerResolution. I tested at 0.4882 ms, which allowed sleeping for 0.5 ms via NtDelayExectution. (NT interval resolution is 100 ns, nominally). But Python's threading Lock timeout is still limited to about 15-16 ms. I'd expect 1 ms resolution given we're ultimately calling WaitForSingleObjectEx via PyCOND_TIMEDWAIT -> _PyCOND_WAIT_MS.

It appears the problem is that GetTickCount[64] continues to increment by 15.625 ms (i.e. by 15 to 16), which looks like a bug to me, but may be a compatibility behavior. We use the tick count in EnterNonRecursiveMutex in the following block of code:

    } else if (milliseconds != 0) {
        /* wait at least until the target */
        DWORD now, target = GetTickCount() + milliseconds;
        while (mutex->locked) {
            if (PyCOND_TIMEDWAIT(&mutex->cv, &mutex->cs, (long long)milliseconds*1000) < 0) {
                result = WAIT_FAILED;
                break;
            }
            now = GetTickCount();
            if (target <= now)
                break;
            milliseconds = target-now;
        }
    }
msg324359 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-08-29 23:19
I dislike the idea of documentation the Windows clocks in the queue documentation, it's not the right place. I dislike the idea of documentating Windows "internals" in the Python documentation. You should see Python as a "thin wrapper" to the operating system. If you have an issue on a platform, you should look at the platform documentation. Python already provides time.get_clock_info('monotonic').

From what I read, changing the system clock resolution is a bad idea, since the Windows kernel is not designed for that, and it can reduce the battery life.

About Windows vs Linux, as I wrote, you are looking at the wrong way. queue.get() is not a bottleneck if you use threads, asyncio, multiprocessing or whatever.

Again, this issue is not a bug. And I dislike the idea of documentating the behaviour. So I still consider that the issue should be closed as "not a bug". But I leave it open since it seems like you want to use it to discuss Windows clocks ;-)
msg324396 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-08-30 17:51
Agreed with not putting platform-specific details everywhere, but in this case we can probably have a generic wording for "Blocks for `timeout` seconds as best as is available on the current operating system".

It is a general problem across all timeouts, so someone probably ought to do a full pass to find anywhere it matters. I'm not volunteering for that though!

About the only other possible fix is to figure out the current resolution, and replace any waits with shorter timeouts with a spin-lock. But in general I would rather leave this to the operating system and just document that we can't always overrule decisions made by the OS.
msg324491 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-09-03 04:25
Another possibility might be to not use the Windows timeout clock, at least not for short timeouts.  The following shows that tk does about 970 1 millesecond timeouts in 1 second (on my machine).
---
import tkinter as tk

root = tk.Tk()
cbid = None
cbcount = 0

def cb():
    global cbid, cbcount
    cbcount += 1
    cbid = root.after(1, cb)

def cbcancel():
    print(cbcount)
    root.after_cancel(cbid)

root.after(1000, cbcancel)
cbid = root.after(1, cb)
root.mainloop()
---

Here is a proof-of-concept queue-get-with-timeout function with sub-millesecond resolution.
---
import tkinter as tk
from queue import Queue, Empty
from time import perf_counter

q = Queue()
fails = 0
value = '<raise Empty>'

def qcb():
    try:
        global value
        value = q.get(block=False)
    except Empty:
        global fails
        fails += 1
        if perf_counter() < stop:
            root.after(0, qcb)
            return
    root.destroy()

def qget_to(timeout):
    global root, stop
    root = tk.Tk()
    root.withdraw()
    stop = perf_counter() + timeout
    qid = root.after(0, qcb)
    #q.put(1)
    root.mainloop()
    print('failures:', fails, 'value:', value)

qget_to(.001)
---

With the put commented out, there are 27 fails (on my machine).  When a value is already available, there are none.  The loop could be parameterized to repeatedly call the no-block version of any similar function.  Large enough time-outs could be partially fulfilled using the system timeout function.
History
Date User Action Args
2018-09-03 04:25:45terry.reedysetnosy: + terry.reedy
messages: + msg324491
2018-08-30 17:51:25steve.dowersetmessages: + msg324396
2018-08-29 23:19:15vstinnersetmessages: + msg324359
2018-08-29 22:53:19eryksunsetnosy: + eryksun
messages: + msg324355
2018-08-29 18:01:21Gammaguysetmessages: + msg324343
2018-08-29 17:44:19josh.rsetnosy: + josh.r
messages: + msg324342
2018-08-29 15:56:26vstinnersetmessages: + msg324328
2018-08-29 12:24:28Gammaguysetmessages: + msg324320
2018-08-29 08:30:04vstinnersetmessages: + msg324308
2018-08-29 04:14:44rhettingersetnosy: + rhettinger, vstinner
messages: + msg324299
2018-08-28 19:44:30xtreaksetnosy: + xtreak
2018-08-28 19:20:21Gammaguycreate