timeit accuracy could be better #67881

Closed

rbtcollins opened this issue Mar 17, 2015 · 5 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@rbtcollins
Member

BPO 23693
Nosy @tim-one, @vstinner, @rbtcollins, @serhiy-storchaka

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2016-11-03.01:59:53.545>
created_at = <Date 2015-03-17.22:34:37.934>
labels = ['type-feature', 'library']
title = 'timeit accuracy could be better'
updated_at = <Date 2016-11-03.01:59:53.473>
user = 'https://github.com/rbtcollins'

bugs.python.org fields:

activity = <Date 2016-11-03.01:59:53.473>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2016-11-03.01:59:53.545>
closer = 'vstinner'
components = ['Library (Lib)']
creation = <Date 2015-03-17.22:34:37.934>
creator = 'rbcollins'
dependencies = []
files = []
hgrepos = []
issue_num = 23693
keywords = []
message_count = 5.0
messages = ['238353', '238361', '238364', '268067', '279959']
nosy_count = 4.0
nosy_names = ['tim.peters', 'vstinner', 'rbcollins', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'third party'
stage = 'needs patch'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue23693'
versions = ['Python 3.6']

@rbtcollins
Member Author

@rbtcollins rbtcollins added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Mar 17, 2015
@serhiy-storchaka
Member

See also bpo-21988.

@vstinner
Member

Not only am I too lazy to compute the number of loops and repeats manually, but I also don't trust myself. It's even worse when someone publishes results of a micro-benchmark: I don't trust how the benchmark was calibrated. In my experience, micro-benchmarks are polluted by noise in timings, so the results are not reliable.

benchmark.py's calibration is based on time, whereas timeit uses hardcoded constants (loops=1000000, repeat=3) that can be overridden on the command line.
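
For comparison, here is roughly what those hardcoded defaults look like through the stdlib API (the measured statement below is just a placeholder):

    import timeit

    # stdlib defaults made explicit: number=1000000, repeat=3
    results = timeit.repeat("sum(range(100))", number=1000000, repeat=3)
    print(min(results))  # timeit traditionally reports the best (minimum) value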

benchmark.py has 3 main parameters:

  • minimum duration of a single run (--min-time): 100 ms by default
  • maximum total duration of the benchmark: 1 second by default; benchmark.py does its best to respect it, but the run can end up longer
  • minimum number of repeats: 5 by default

The minimum duration is increased if the clock resolution is bad (1 ms or more); that's the case for time.clock() on Windows under Python 2, for example. Extract of benchmark.py:

    min_time = max(self.config.min_time, timer_precision * 100)
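
A rough way to measure that timer precision yourself (an illustrative sketch, not benchmark.py's actual code):

    import time

    def timer_precision(timer=time.perf_counter, trials=100):
        # Smallest non-zero delta observed between two consecutive clock reads.
        best = float("inf")
        for _ in range(trials):
            t1 = timer()
            t2 = timer()
            while t2 == t1:  # spin until the clock actually ticks
                t2 = timer()
            best = min(best, t2 - t1)
        return best

    # Same idea as the extract above: one run should last at least ~100 clock ticks.
    min_time = max(0.1, timer_precision() * 100)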

The estimation of the number of loops is not reliable, but it's written to be "fast": since I run a micro-benchmark many times, I don't want to wait too long. The result is not a power of 10, but an arbitrary integer. Usually, when running benchmark.py multiple times, the number of loops is different each time. It's not really a big issue, but it probably makes results more difficult to compare.
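
As a rough illustration of the calibration idea (a simplified sketch, not the actual benchmark.py code):

    import time

    def calibrate_loops(func, min_time=0.1, max_time=1.0, min_repeat=5):
        # Grow the loop count until a single run of func() lasts at least min_time.
        loops = 1
        while True:
            start = time.perf_counter()
            for _ in range(loops):
                func()
            elapsed = time.perf_counter() - start
            if elapsed >= min_time:
                break
            loops *= 2
        # Scale the count down if min_repeat runs would exceed the max_time budget.
        if elapsed * min_repeat > max_time:
            loops = max(1, int(loops * max_time / (elapsed * min_repeat)))
        return loops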

My constraint is max_time. The tested function may not scale linearly with the number of loops (time = time_one_iteration * loops).

https://bitbucket.org/haypo/misc/src/348bfd6108e9985b3c2298d2745eb5ddfe7042e6/python/benchmark.py?at=default#cl-416

Repeating a test at least 5 times is a compromise between the stability of the result and the total duration of the benchmark.

Feel free to reuse my code to enhance timeit.py.

@vstinner
Member

vstinner commented Jun 9, 2016

Hi,

I'm developing a new implementation of timeit which should be more reliable:
http://perf.readthedocs.io/en/latest/

  • Run 25 processes instead of just 1
  • Compute average and standard deviation rather than the minimum
  • Don't disable the garbage collector
  • Skip the first timing to "warmup" the benchmark

Using the minimum and disabling the garbage collector are bad practices; they are not reliable:

  • multiple processes are needed to test different random hash functions, since Python's hash function is randomized by default in Python 3
  • Linux also randomizes the address space by default (ASLR), so the exact timing of memory accesses differs in each process
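
A minimal sketch of that multi-process approach (illustrative only, not perf's actual implementation; the timed statement is a placeholder):

    import statistics
    import subprocess
    import sys
    import textwrap

    # Each child interpreter gets its own hash seed and address-space layout (ASLR).
    CHILD_CODE = textwrap.dedent("""
        import timeit
        for _ in range(4):  # 1 warmup run + 3 timed runs
            print(timeit.timeit("sum(range(100))",
                                setup="import gc; gc.enable()",  # keep the GC enabled
                                number=100000))
    """)

    samples = []
    for _ in range(25):  # 25 fresh processes instead of just 1
        out = subprocess.check_output([sys.executable, "-c", CHILD_CODE],
                                      universal_newlines=True)
        runs = [float(line) for line in out.split()]
        samples.extend(runs[1:])  # drop the warmup value

    print("mean = %.6f s, stdev = %.6f s"
          % (statistics.mean(samples), statistics.stdev(samples)))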

My blog post "My journey to stable benchmark, part 3 (average)" explains in depth the multiple issues with using the minimum:
https://haypo.github.io/journey-to-stable-benchmark-average.html

My perf module is very young; it's still a work in progress, but it should already be better than timeit. It works on Python 2.7 and 3 (I tested 3.4).

We may pick the best ideas into the timeit module.

See also my article explaining how to tune Linux to reduce the "noise" of the operating system on microbenchmarks:
https://haypo.github.io/journey-to-stable-benchmark-system.html

@vstinner
Member

vstinner commented Nov 3, 2016

I wrote a whole new project, "perf", to fix the root causes of this issue. It includes a timeit command. I suggest using "perf timeit" rather than "timeit" because perf is more reliable:
http://perf.readthedocs.io/en/latest/cli.html#timeit
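
For example (assuming the perf module is installed from PyPI; the timed statement is just a placeholder):

    python3 -m perf timeit "sum(range(100))"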

@vstinner vstinner closed this as completed Nov 3, 2016
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022