Issue 34535
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python Developer Guide.
Created on 2018-08-28 19:20 by Gammaguy, last changed 2022-04-11 14:59 by admin.
Messages (14) | |||
---|---|---|---|
msg324277 - (view) | Author: Paul Overing (Gammaguy) | Date: 2018-08-28 19:20 | |
I have found that using a timeout with a Python queue.Queue() results in an average delay of 14.5 ms on Windows relative to Ubuntu, in Python 3.6.5/3.7 (I subtracted the 1 ms timeout). I have also tried datetime.datetime.now() to measure the time, with similar results. Does that make sense?

```python
import queue
import time

max = 0
min = 10000
num = 10000
sum = 0
for x in range(10000):
    q = queue.Queue()
    start = time.perf_counter()
    try:
        msg = q.get(block=True, timeout=.001)
    except queue.Empty:
        end = time.perf_counter()
        chng = end - start - .001
        if chng > max:
            max = chng
        if chng < min:
            min = chng
        sum = sum + chng

print(f"Avg: {sum/num}")
print(f"Min: {min}")
print(f"Max: {max}")
```

Results:

Windows 10, Python 3.6.5/3.7
Avg: 0.014549868429997701
Min: 0.001674142000000046
Max: 0.13890251500000284

Ubuntu 16.04, Python 3.6.5/3.7
Avg: 6.275181290839012e-05
Min: 4.89290034456644e-05
Max: 8.690999695681965e-05
|||
msg324299 - (view) | Author: Raymond Hettinger (rhettinger) * | Date: 2018-08-29 04:14 | |
Victor or Steve, do you know the resolution of perf_counter() and monotonic() on Windows? |
|||
msg324308 - (view) | Author: STINNER Victor (vstinner) * | Date: 2018-08-29 08:30 | |
> Victor or Steve, do you know the resolution of perf_counter() and monotonic() on Windows?

You can use:

```
>>> time.get_clock_info('perf_counter')
namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
```

perf_counter() usually has a resolution much finer than 1 ms (e.g. 100 ns), whereas monotonic() has a coarse resolution (around 15.6 ms). Rounding matters for the monotonic clock on Windows :-) Python should round away from zero (ROUND_UP) for timeouts.

--

> I have found that using timeout with a python queue.Queue() results in an average delay of 14.5ms in Windows over Ubuntu in python 3.6.5/3.7, I subtracted the 1 ms timeout.

This behaviour looks correct to me; I suggest closing the issue. Python relies on the Windows API. Queue.get() uses a threading.Lock. On Windows, lock.acquire(timeout=timeout) uses EnterNonRecursiveMutex(), which expects a number of milliseconds. Even if you succeed in raising the resolution of the system clock, you will still be limited to a resolution of 1 ms by the Windows API.

I'm not sure what your underlying question is. Does it look inefficient to you? You can use threads or asyncio to write more efficient code, but that is out of scope for the bug tracker :-) I suggest closing this issue as "not a bug".
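The millisecond-level granularity Victor describes can be observed directly from Python. The following is an illustrative measurement sketch (not CPython code); the overshoot it prints depends entirely on the OS timer, and the 1 ms timeout simply mirrors the original report:

```python
import threading
import time

lock = threading.Lock()
lock.acquire()  # hold the lock so the timed acquire below must time out

timeout = 0.001  # 1 ms, as in the original report
start = time.perf_counter()
acquired = lock.acquire(timeout=timeout)  # times out: the lock is held
elapsed = time.perf_counter() - start

print(f"acquired={acquired}, requested={timeout * 1000:.1f} ms, "
      f"actual={elapsed * 1000:.3f} ms, "
      f"overshoot={(elapsed - timeout) * 1000:.3f} ms")
```

On Linux the overshoot is typically tens of microseconds; on Windows it can be an order of magnitude larger, for the reasons discussed in this thread.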
|||
msg324320 - (view) | Author: Paul Overing (Gammaguy) | Date: 2018-08-29 12:24 | |
Thank you for your quick response. But 1 ms resolution or not, 14.5 ms does not seem reasonable given that on Ubuntu it is 0.049 ms; Ubuntu is on average 295x faster. The variability is also much higher on Windows (max/avg = 9.547, avg/min = 8.69). That makes the Windows queue.Queue implementation with timeouts unusable in low-latency environments, and it should be documented as such, since it is slow and highly variable.
|||
msg324328 - (view) | Author: STINNER Victor (vstinner) * | Date: 2018-08-29 15:56 | |
> It makes the windows queue.Queue implementations with timeouts unusable in low latency environments and should be documented as such. As it is slow and highly variable. Well, it seems like you have a solution which is called Linux, no? |
|||
msg324342 - (view) | Author: Josh Rosenberg (josh.r) * | Date: 2018-08-29 17:44 | |
Victor, that was a little overboard. By that logic, there doesn't need to be a Windows version of Python.

That said, Paul doesn't seem to understand that the real resolution limit isn't 1 ms; that's the lower limit on arguments to the API, but the real limit is the system clock, which has a granularity in the 10-16 ms range. It's a problem with Windows in general, and the cure is worse than the disease.

Per https://msdn.microsoft.com/en-us/library/windows/desktop/ms724411(v=vs.85).aspx , the resolution of the system timer is typically in the range of 10 to 16 milliseconds. Per https://docs.microsoft.com/en-us/windows/desktop/Sync/wait-functions#wait-functions-and-time-out-intervals :

> Wait Functions and Time-out Intervals
> The accuracy of the specified time-out interval depends on the resolution of the system clock. The system clock "ticks" at a constant rate. If the time-out interval is less than the resolution of the system clock, the wait may time out in less than the specified length of time. If the time-out interval is greater than one tick but less than two, the wait can be anywhere between one and two ticks, and so on.

All the Windows synchronization primitives (e.g. WaitForSingleObjectEx, https://docs.microsoft.com/en-us/windows/desktop/api/synchapi/nf-synchapi-waitforsingleobjectex , which is what ultimately implements timed lock acquisition on Windows) are based on the system clock, so without drastic measures it's impossible to get better granularity than the 10-16 ms of the default system clock configuration. The link on "Wait Functions and Time-out Intervals" does mention that this granularity *can* be increased, but it recommends against fine-grained tuning (so you can't just tweak it before a wait and undo the tweak after; the only safe thing to do is change it on program launch and undo it on program exit).

Even then, it's a bad idea for Python to use it; per timeBeginPeriod's own docs ( https://docs.microsoft.com/en-us/windows/desktop/api/timeapi/nf-timeapi-timebeginperiod ):

> This function affects a global Windows setting. Windows uses the lowest value (that is, highest resolution) requested by any process. Setting a higher resolution can improve the accuracy of time-out intervals in wait functions. However, it can also reduce overall system performance, because the thread scheduler switches tasks more often. High resolutions can also prevent the CPU power management system from entering power-saving modes. Setting a higher resolution does not improve the accuracy of the high-resolution performance counter.

Basically, to improve the resolution of timed lock acquisition, we'd have to change the performance profile of the entire OS while Python was running, likely increasing power usage and possibly reducing performance. Global solutions to local problems are a bad idea.

The most reasonable solution is to simply document it (maybe not for queue.Queue, but for the threading module), and possibly to provide an attribute in the threading module, similar to threading.TIMEOUT_MAX, that reports the system clock's granularity for informational purposes (it might need to be a function so it can report the potentially changing granularity).

Other, less reasonable, solutions would be:

1. Expose a function (with prominent warnings about not using it in a fine-grained manner, and about the effects on power management and performance) that would increase the system clock granularity as much as timeGetDevCaps reports possible (possibly limited to a user-provided suggestion, so while the clock could go to 1 ms resolution, the user could request only 5 ms to reduce the costs of doing so). This requires some additional state (whether timeBeginPeriod has been called, and with what values) so timeEndPeriod can be called properly before each adjustment and when Python exits. The pro is that the code is *relatively* simple and would mostly fix the problem. The cons are that it wouldn't be very discoverable (unless we put notes in every place that uses timeouts, not just the threading docs), it encourages bad behavior (one application deciding its needs are more important than conserving power), and we'd have to be *really* careful to pair our calls universally (timeEndPeriod must be called even when other cleanup is skipped, such as when calling os._exit; AFAICT, the docs imply that per-process adjustments to the clock aren't undone even when the process completes, which means failure to pair all calls would leave the system with a suboptimal system clock resolution that would remain in effect until rebooted).

2. (Likely a terrible idea, and like option 1, should be explicitly opt-in, not enabled by default.) Offer the option to have Python lock timeouts use WaitForSingleObjectEx only to sleep to within one system clock tick of the target time (and not at all if the timeout is less than the clock resolution), then, before reacquiring the GIL, perform a time-slice-yielding busy loop until the target time is passed (as determined by a higher-resolution clock than the system clock). Bad for power management, bad for single-core machines (where even with time-slice yielding, you're still constantly getting scheduled), etc.
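Josh's option 2 (sleep coarsely via the OS, then spin on a high-resolution clock) can be sketched portably in pure Python. This is only an illustration of the idea, not anything in CPython; the helper name `wait_until` and the 20 ms coarse-sleep margin are assumptions chosen for the example:

```python
import time

def wait_until(deadline, coarse_margin=0.020):
    """Wait until `deadline` (a time.perf_counter() value).

    Coarse phase: time.sleep() until within `coarse_margin` seconds of the
    deadline, tolerating the OS timer granularity.  Fine phase: busy-loop
    on perf_counter(), yielding the time slice on each iteration.
    """
    while True:
        remaining = deadline - time.perf_counter()
        if remaining <= 0:
            return
        if remaining > coarse_margin:
            # Let the OS timer handle the bulk of the wait.
            time.sleep(remaining - coarse_margin)
        else:
            # Spin, but yield our time slice so we don't hog a core.
            time.sleep(0)

start = time.perf_counter()
wait_until(start + 0.005)  # 5 ms target
elapsed = time.perf_counter() - start
print(f"waited {elapsed * 1000:.3f} ms")
```

Unlike a plain timed wait, this never returns before the deadline as measured by perf_counter(), at the cost of the busy-loop drawbacks Josh describes.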
|||
msg324343 - (view) | Author: Paul Overing (Gammaguy) | Date: 2018-08-29 18:01 | |
Thank you to everyone for looking at my issue, especially josh.r for the detailed response. I now understand the issue; I just wanted to bring it to the group's attention for everyone's benefit. Vstinner is correct: Linux is my solution, as it is my production environment, but I was using my Windows laptop for dev, as I've been doing frequent Asia trips lately, and noticed the difference.
|||
msg324355 - (view) | Author: Eryk Sun (eryksun) * | Date: 2018-08-29 22:53 | |
Changing the system timer resolution doesn't appear to help, given the current design combined with what looks like a bug in Windows (at least in Windows 10). We can set the system clock resolution below a millisecond via NtSetTimerResolution. I tested at 0.4882 ms, which allowed sleeping for 0.5 ms via NtDelayExecution. (The NT interval resolution is nominally 100 ns.) But Python's threading Lock timeout is still limited to about 15-16 ms. I'd expect 1 ms resolution, given that we're ultimately calling WaitForSingleObjectEx via PyCOND_TIMEDWAIT -> _PyCOND_WAIT_MS.

It appears the problem is that GetTickCount[64] continues to increment by 15.625 ms (i.e. by 15 to 16), which looks like a bug to me, but may be a compatibility behavior. We use the tick count in EnterNonRecursiveMutex in the following block of code:

```c
} else if (milliseconds != 0) {
    /* wait at least until the target */
    DWORD now, target = GetTickCount() + milliseconds;
    while (mutex->locked) {
        if (PyCOND_TIMEDWAIT(&mutex->cv, &mutex->cs,
                             (long long)milliseconds * 1000) < 0) {
            result = WAIT_FAILED;
            break;
        }
        now = GetTickCount();
        if (target <= now)
            break;
        milliseconds = target - now;
    }
}
```
|||
msg324359 - (view) | Author: STINNER Victor (vstinner) * | Date: 2018-08-29 23:19 | |
I dislike the idea of documenting the Windows clocks in the queue documentation; it's not the right place. I dislike the idea of documenting Windows "internals" in the Python documentation. You should see Python as a "thin wrapper" around the operating system. If you have an issue on a platform, you should look at the platform documentation. Python already provides time.get_clock_info('monotonic').

From what I read, changing the system clock resolution is a bad idea, since the Windows kernel is not designed for that, and it can reduce battery life.

About Windows vs Linux: as I wrote, you are looking at it the wrong way. queue.get() is not a bottleneck if you use threads, asyncio, multiprocessing, or whatever.

Again, this issue is not a bug, and I dislike the idea of documenting the behaviour. So I still consider that the issue should be closed as "not a bug", but I leave it open since it seems like you want to use it to discuss Windows clocks ;-)
|||
msg324396 - (view) | Author: Steve Dower (steve.dower) * | Date: 2018-08-30 17:51 | |
Agreed with not putting platform-specific details everywhere, but in this case we can probably have a generic wording for "Blocks for `timeout` seconds as best as is available on the current operating system". It is a general problem across all timeouts, so someone probably ought to do a full pass to find anywhere it matters. I'm not volunteering for that though! About the only other possible fix is to figure out the current resolution, and replace any waits with shorter timeouts with a spin-lock. But in general I would rather leave this to the operating system and just document that we can't always overrule decisions made by the OS. |
|||
msg324491 - (view) | Author: Terry J. Reedy (terry.reedy) * | Date: 2018-09-03 04:25 | |
Another possibility might be to not use the Windows timeout clock, at least not for short timeouts. The following shows that tk runs about 970 one-millisecond timeouts in 1 second (on my machine).

```python
import tkinter as tk

root = tk.Tk()
cbid = None
cbcount = 0

def cb():
    global cbid, cbcount
    cbcount += 1
    cbid = root.after(1, cb)

def cbcancel():
    print(cbcount)
    root.after_cancel(cbid)

root.after(1000, cbcancel)
cbid = root.after(1, cb)
root.mainloop()
```

Here is a proof-of-concept queue-get-with-timeout function with sub-millisecond resolution.

```python
import tkinter as tk
from queue import Queue, Empty
from time import perf_counter

q = Queue()
fails = 0
value = '<raise Empty>'

def qcb():
    try:
        global value
        value = q.get(block=False)
    except Empty:
        global fails
        fails += 1
        if perf_counter() < stop:
            root.after(0, qcb)
            return
    root.destroy()

def qget_to(timeout):
    global root, stop
    root = tk.Tk()
    root.withdraw()
    stop = perf_counter() + timeout
    qid = root.after(0, qcb)
    #q.put(1)
    root.mainloop()
    print('failures:', fails, 'value:', value)

qget_to(.001)
```

With the put commented out, there are 27 fails (on my machine). When a value is already available, there are none. The loop could be parameterized to repeatedly call the no-block version of any similar function. Large enough timeouts could be partially fulfilled using the system timeout function.
|||
msg388948 - (view) | Author: Eryk Sun (eryksun) * | Date: 2021-03-17 17:33 | |
See the related discussion in bpo-41299.

The system interrupt period can be lowered to 1 ms with timeBeginPeriod() -- or lower with the undocumented NtSetTimerResolution(). This affects the resolution of dispatcher waits such as Sleep() and WaitForSingleObject(), as well as the resolution of Query[Unbiased]InterruptTime(). But apparently GetTickCount[64]() has a fixed update frequency of 64 ticks/second (15.625 ms period), regardless of the system interrupt period. The combination of WaitForSingleObject() with GetTickCount64() has marginally better resolution and consistent performance if the system interrupt period is lowered to 1 ms.

If a wait needs to be resumable in order to handle SIGINT, and it should perform consistently with other waits, then there are two choices. One option is to base the timeout on Query[Unbiased]InterruptTime(), for which the resolution depends on the system interrupt period. Conversely, if we want all waits to perform consistently and independently of the system interrupt period, then they should all depend on GetTickCount64(). This could be implemented in one place, such as Modules/signalmodule.c, with the addition of _Py_Sleep(), _Py_WaitForSingleObject(), and _Py_WaitForMultipleObjects(). On the main thread, these wait functions would include the SIGINT event and resume the wait if the SIGINT handler doesn't raise an exception.
|||
msg388950 - (view) | Author: Steve Dower (steve.dower) * | Date: 2021-03-17 17:44 | |
Given that extra info, I'd say we're fine to document that our timeouts can't do any better than the OS, which "for example, is typically around 15ms on Windows", and recommend using non-blocking calls instead. |
|||
msg388961 - (view) | Author: Eryk Sun (eryksun) * | Date: 2021-03-17 19:29 | |
> Given that extra info, I'd say we're fine to document that our timeouts
> can't do any better than the OS, which "for example, is typically
> around 15ms on Windows", and recommend using non-blocking calls
> instead.

The 15.625 ms resolution limit is fine, as long as performance is predictable. I don't like the random inconsistency introduced by extending only certain waits, in different ways, to support SIGINT and/or waits longer than 49.7 days. For example, time.sleep() doesn't ignore WAIT_TIMEOUT to recompute the remaining time, so it's not subject to the resolution limit that's imposed by GetTickCount64().

I'd prefer a common implementation of _Py_Sleep, _Py_WaitForSingleObject, and _Py_WaitForMultipleObjects, in order to be able to definitively state that all wait timeouts are unconditionally limited to the resolution reported by time.get_clock_info('monotonic').resolution; are not limited to 49.7 days; and can be interrupted by Ctrl+C in the main thread -- except for waiting on I/O. (There's an open issue to enable Ctrl+C to cancel synchronous I/O in the main thread -- such as reading from a pipe.)
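The resolution value Eryk refers to can be inspected from any Python session; a quick sketch (the printed values vary by platform and clock implementation):

```python
import time

# Report the implementation and resolution of the clocks discussed
# in this thread; on Windows, 'monotonic' is the coarse one.
for name in ("monotonic", "perf_counter", "time"):
    info = time.get_clock_info(name)
    print(f"{name}: implementation={info.implementation}, "
          f"resolution={info.resolution}, monotonic={info.monotonic}")
```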
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:59:05 | admin | set | github: 78716 |
2021-03-18 02:39:28 | rhettinger | set | assignee: docs@python type: performance -> behavior components: + Documentation, - Windows nosy: + docs@python |
2021-03-17 19:35:38 | vstinner | set | nosy: - vstinner |
2021-03-17 19:29:15 | eryksun | set | messages: + msg388961 |
2021-03-17 17:44:41 | steve.dower | set | messages: + msg388950 |
2021-03-17 17:33:36 | eryksun | set | messages: + msg388948 versions: + Python 3.8, Python 3.9, Python 3.10, - Python 3.6, Python 3.7 |
2018-09-03 04:25:45 | terry.reedy | set | nosy: + terry.reedy messages: + msg324491 |
2018-08-30 17:51:25 | steve.dower | set | messages: + msg324396 |
2018-08-29 23:19:15 | vstinner | set | messages: + msg324359 |
2018-08-29 22:53:19 | eryksun | set | nosy: + eryksun messages: + msg324355 |
2018-08-29 18:01:21 | Gammaguy | set | messages: + msg324343 |
2018-08-29 17:44:19 | josh.r | set | nosy: + josh.r messages: + msg324342 |
2018-08-29 15:56:26 | vstinner | set | messages: + msg324328 |
2018-08-29 12:24:28 | Gammaguy | set | messages: + msg324320 |
2018-08-29 08:30:04 | vstinner | set | messages: + msg324308 |
2018-08-29 04:14:44 | rhettinger | set | nosy: + rhettinger, vstinner messages: + msg324299 |
2018-08-28 19:44:30 | xtreak | set | nosy: + xtreak |
2018-08-28 19:20:21 | Gammaguy | create |