
Use __builtin_clzl for bits_in_digit if available #73968

Closed
niklasf mannequin opened this issue Mar 10, 2017 · 19 comments
Labels: 3.10 (only security fixes), interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage)

Comments

niklasf mannequin commented Mar 10, 2017

BPO 29782
Nosy @mdickinson, @pitrou, @vstinner, @serhiy-storchaka, @mlouielu, @niklasf
PRs
  • bpo-29782: Use __builtin_clzl for bits_in_digit if available #594
  • bpo-29782: Consolidate _Py_Bit_Length() #20739
Files
  • bench_bit_length.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2020-06-15.12:44:55.090>
    created_at = <Date 2017-03-10.10:53:56.987>
    labels = ['interpreter-core', '3.10', 'performance']
    title = 'Use __builtin_clzl for bits_in_digit if available'
    updated_at = <Date 2020-06-15.13:40:04.205>
    user = 'https://github.com/niklasf'

    bugs.python.org fields:

    activity = <Date 2020-06-15.13:40:04.205>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-06-15.12:44:55.090>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2017-03-10.10:53:56.987>
    creator = 'niklasf'
    dependencies = []
    files = ['46759']
    hgrepos = []
    issue_num = 29782
    keywords = []
    message_count = 19.0
    messages = ['289353', '289400', '289401', '289404', '289406', '289407', '290620', '290621', '290622', '290973', '292090', '292105', '292106', '292114', '292115', '292139', '371035', '371539', '371543']
    nosy_count = 6.0
    nosy_names = ['mark.dickinson', 'pitrou', 'vstinner', 'serhiy.storchaka', 'louielu', 'niklasf']
    pr_nums = ['594', '20739']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue29782'
    versions = ['Python 3.10']


    niklasf mannequin commented Mar 10, 2017

    Baseline performance (9e6ac83):

    $ ./python -m timeit "12345678 == 12345678.0"
    5000000 loops, best of 5: 40 nsec per loop
    $ ./python -m timeit "1 == 1.0"
    10000000 loops, best of 5: 38.8 nsec per loop
    $ ./python -m timeit "(1234578987654321).bit_length()"
    10000000 loops, best of 5: 39.4 nsec per loop

    Upcoming PR:

    $ ./python -m timeit "12345678 == 12345678.0"
    10000000 loops, best of 5: 34.3 nsec per loop
    $ ./python -m timeit "1 == 1.0"
    10000000 loops, best of 5: 34.4 nsec per loop
    $ ./python -m timeit "(1234578987654321).bit_length()"
    10000000 loops, best of 5: 36.4 nsec per loop
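
    A minimal sketch of the underlying idea (illustrative only, not the actual patch): for a non-zero value, the bit length is the width of the type minus the leading-zero count reported by the builtin; __builtin_clzl() is undefined for zero, so that case needs a guard.

        #include <limits.h>

        /* Illustrative helper, not CPython code: bit length via the
           count-leading-zeros builtin. */
        static int
        bit_length_via_clzl(unsigned long x)
        {
            if (x == 0) {
                return 0;               /* __builtin_clzl(0) is undefined */
            }
            return (int)(sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(x);
        }

    For example, bit_length_via_clzl(12345678UL) returns 24, matching (12345678).bit_length() in Python.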

    @niklasf (mannequin) added the '3.7 (EOL) end of life', 'interpreter-core (Objects, Python, Grammar, and Parser dirs)', and 'performance (Performance or resource usage)' labels Mar 10, 2017
    @mdickinson
    Member

    Thanks for the idea, and the PR!

    To be useful, this would need a bit of tweaking: we can't assume that a digit is always unsigned long (in fact, I'd expect that to be rare on 64-bit non-Windows systems, where unsigned long typically has 64 bits, and digit should be using a 32-bit type), so we'd need to identify and use the most appropriate variant of clz/clzl/clzll somehow.

    I think it would also make sense to move the detection of the existence of __builtin_clz and friends into the configuration machinery: these builtins aren't restricted to GCC (clang supports them as well), and that would allow other platforms to provide their own substitutes by modifying pyport.h. Ideally, the configuration machinery #defines a HAVE_CLZ variable, and then in longobject.c we do an #ifdef HAVE_CLZ ...

    We also need to trade off the additional complication in the implementation against the benefits: though we certainly care about speed, speed at all costs isn't the target here; readability, portability and maintainability of the source count, too. It'll probably be easier to weigh those factors once there's an implementation that addresses the above points.
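
    A minimal sketch of the arrangement described above, assuming a hypothetical configure-defined macro (HAVE_BUILTIN_CLZ is a made-up name) and a stand-in typedef for CPython's digit; the sizeof comparison is a compile-time constant, so the compiler keeps only the matching branch:

        #include <limits.h>
        #include <stdint.h>

        typedef uint32_t digit;   /* stand-in for CPython's 30-bit digit type */

        #ifdef HAVE_BUILTIN_CLZ   /* hypothetical macro set by configure */
        static int
        bits_in_digit(digit d)
        {
            if (d == 0) {
                return 0;   /* the clz builtins are undefined for 0 */
            }
            /* Pick the builtin whose operand type is wide enough for a digit. */
            if (sizeof(digit) <= sizeof(unsigned int)) {
                return (int)(sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(d);
            }
            return (int)(sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(d);
        }
        #else
        /* ... keep the existing shift-and-lookup loop as the portable fallback ... */
        #endif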

    @serhiy-storchaka
    Member

    I think we can assume that digit is no larger than unsigned long, otherwise PyLong_AsLong() and the like wouldn't work.

    @mdickinson
    Member

    True enough. It looks as though someone (cough) already documented that restriction, too: https://github.com/mdickinson/cpython/blob/v3.6.0/Include/longintrepr.h#L28-L30


    niklasf mannequin commented Mar 10, 2017

    Thanks for the review.

    I pushed a change to check whether clz can be used (sizeof(digit) <= sizeof(unsigned int)), and otherwise use clzl. I believe the latter should be the most common case, since unsigned long has 32 bits. As you say, unsigned long long should never be needed.

    By the way, mathmodule.c currently duplicates the function: https://github.com/python/cpython/blob/master/Modules/mathmodule.c#L1317. It might be worth factoring it out, but I don't know where to put it.
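
    For reference, the duplicated routine in both files is roughly the following shift-and-lookup loop (reconstructed as an approximation, not a verbatim copy of either file):

        /* Approximate shape of bits_in_digit() in Objects/longobject.c and its
           twin in Modules/mathmodule.c: shift the value down in chunks, then
           finish with a 32-entry lookup table. */
        static const unsigned char BitLengthTable[32] = {
            0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4,
            5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5
        };

        static int
        bit_length_lookup(unsigned long d)
        {
            int d_bits = 0;
            while (d >= 32) {
                d_bits += 6;
                d >>= 6;
            }
            d_bits += (int)BitLengthTable[d];
            return d_bits;
        }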


    niklasf mannequin commented Mar 10, 2017

    Oops. Actually clz should commonly be enough, and platforms where clz and clzl differ (i.e. where unsigned int and unsigned long have different sizes) should be rare.

    @vstinner
    Member

    The change is an optimization, so it requires a benchmark, which you provided. OK. But the speedup is a few nanoseconds on a function that currently takes 40 nanoseconds. It's not really what I would call significant:

    $ ./python -m perf compare_to ref.json patch.json  --table
    +---------------------+---------+-----------------------------+
    | Benchmark           | ref     | patch                       |
    +=====================+=========+=============================+
    | int==float 8 digits | 39.2 ns | 37.9 ns: 1.03x faster (-3%) |
    +---------------------+---------+-----------------------------+
    | int==float 1 digit  | 38.0 ns | 37.9 ns: 1.00x faster (-0%) |
    +---------------------+---------+-----------------------------+
    | int.bit_length()    | 42.0 ns | 41.2 ns: 1.02x faster (-2%) |
    +---------------------+---------+-----------------------------+

    (See attached bench_bit_length.py script.)

    So I'm not really convinced that the change is useful. Is bit_length() used in hot loops?

    @vstinner
    Member

    $ ./python -m timeit "12345678 == 12345678.0"
    5000000 loops, best of 5: 40 nsec per loop

    By the way, I checked the bytecode to make sure that the compiler doesn't optimize that. I'm surprised that it's not replaced with True!

    Is there a reason to perform the test at runtime?

    @serhiy-storchaka
    Member

    There are no good reasons to optimize this case at compile time. This is a very obscure way of writing True.


    pitrou commented Apr 1, 2017

    I don't think a 3% improvement on a micro-benchmark is worth it (this will translate into 0% improvement on real-world workloads). Are there other uses of _Py_bit_length() that show bigger benefits?

    @vstinner
    Member

    I concur with Antoine and suggest rejecting this issue.

    @serhiy-storchaka
    Member

    In some microbenchmarks this can give a speedup of up to 15%.

    $ ./python.patched -m perf timeit -q --compare-to=./python.default -s "a = list(map(float, range(10000)))" "12345 in a"
    Mean +- std dev: [python.default] 1.28 ms +- 0.11 ms -> [python.patched] 1.12 ms +- 0.07 ms: 1.15x faster (-13%)

    @serhiy-storchaka
    Member

    But even a 15% speedup in particular microbenchmarks looks too small to me for such a complex change.

    @mdickinson
    Member

    Closing.


    mlouielu mannequin commented Apr 22, 2017

    If this issue is being closed as "not a big performance improvement", maybe the XXX comment in mathmodule.c should be taken out?

    """
    /* XXX: This routine does more or less the same thing as

    • bits_in_digit() in Objects/longobject.c. Someday it would be nice to
    • consolidate them. On BSD, there's a library function called fls()
    • that we could use, and GCC provides __builtin_clz().
      */
      """

    @vstinner
    Member

    I'm in favor of documenting, in comments, decisions not to micro-optimize such code. I did something similar in ceval.c for 1+1.


    vstinner commented Jun 8, 2020

    I'm reopening the issue since PR 20739 was created.

    @vstinner reopened this Jun 8, 2020
    @vstinner
    Member

    New changeset 794e7d1 by Niklas Fiekas in branch 'master':
    bpo-29782: Consolidate _Py_Bit_Length() (GH-20739)
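    The merged change follows the pattern discussed earlier in the thread: one shared helper that prefers the compiler builtin and keeps a portable fallback. A hedged sketch of the idea (names and the fallback are approximations, not the verbatim contents of commit 794e7d1):

        #include <limits.h>

        /* Sketch of a consolidated bit-length helper that longobject.c and
           mathmodule.c could both call instead of duplicating the loop. */
        static inline int
        shared_bit_length(unsigned long x)
        {
        #if defined(__GNUC__) || defined(__clang__)
            if (x == 0) {
                return 0;   /* __builtin_clzl(0) is undefined */
            }
            return (int)(sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(x);
        #else
            /* Portable fallback: count bits one at a time. */
            int bits = 0;
            while (x != 0) {
                bits++;
                x >>= 1;
            }
            return bits;
        #endif
        }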

    @vstinner added the '3.10 (only security fixes)' label and removed the '3.7 (EOL) end of life' label Jun 15, 2020
    @vstinner
    Member

    Thanks Niklas Fiekas for your tenacity :-)

    @ezio-melotti transferred this issue from another repository Apr 10, 2022