
Use __builtin_clzl for bits_in_digit if available #73968

Closed
niklasf mannequin opened this issue Mar 10, 2017 · 19 comments
Labels: 3.10 (only security fixes), interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage)

Comments

niklasf mannequin commented Mar 10, 2017

BPO 29782
Nosy @mdickinson, @pitrou, @vstinner, @serhiy-storchaka, @mlouielu, @niklasf
PRs
  • bpo-29782: Use __builtin_clzl for bits_in_digit if available #594
  • bpo-29782: Consolidate _Py_Bit_Length() #20739
Files
  • bench_bit_length.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2020-06-15.12:44:55.090>
    created_at = <Date 2017-03-10.10:53:56.987>
    labels = ['interpreter-core', '3.10', 'performance']
    title = 'Use __builtin_clzl for bits_in_digit if available'
    updated_at = <Date 2020-06-15.13:40:04.205>
    user = 'https://github.com/niklasf'

    bugs.python.org fields:

    activity = <Date 2020-06-15.13:40:04.205>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-06-15.12:44:55.090>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2017-03-10.10:53:56.987>
    creator = 'niklasf'
    dependencies = []
    files = ['46759']
    hgrepos = []
    issue_num = 29782
    keywords = []
    message_count = 19.0
    messages = ['289353', '289400', '289401', '289404', '289406', '289407', '290620', '290621', '290622', '290973', '292090', '292105', '292106', '292114', '292115', '292139', '371035', '371539', '371543']
    nosy_count = 6.0
    nosy_names = ['mark.dickinson', 'pitrou', 'vstinner', 'serhiy.storchaka', 'louielu', 'niklasf']
    pr_nums = ['594', '20739']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue29782'
    versions = ['Python 3.10']


    niklasf mannequin commented Mar 10, 2017

    Baseline performance (9e6ac83):

    $ ./python -m timeit "12345678 == 12345678.0"
    5000000 loops, best of 5: 40 nsec per loop
    $ ./python -m timeit "1 == 1.0"
    10000000 loops, best of 5: 38.8 nsec per loop
    $ ./python -m timeit "(1234578987654321).bit_length()"
    10000000 loops, best of 5: 39.4 nsec per loop

    Upcoming PR:

    $ ./python -m timeit "12345678 == 12345678.0"
    10000000 loops, best of 5: 34.3 nsec per loop
    $ ./python -m timeit "1 == 1.0"
    10000000 loops, best of 5: 34.4 nsec per loop
    $ ./python -m timeit "(1234578987654321).bit_length()"
    10000000 loops, best of 5: 36.4 nsec per loop
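
    A minimal sketch of the underlying idea (illustrative only, not the actual patch): for a non-zero value, the bit length is the width of the type minus the leading-zero count reported by the builtin; __builtin_clzl() is undefined for zero, so that case needs a guard.

        #include <limits.h>

        /* Illustrative helper, not CPython code: bit length via the
           count-leading-zeros builtin. */
        static int
        bit_length_via_clzl(unsigned long x)
        {
            if (x == 0) {
                return 0;               /* __builtin_clzl(0) is undefined */
            }
            return (int)(sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(x);
        }

    For example, bit_length_via_clzl(12345678UL) returns 24, matching (12345678).bit_length() in Python.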

    @niklasf (mannequin) added the '3.7 (EOL) end of life', 'interpreter-core (Objects, Python, Grammar, and Parser dirs)', and 'performance (Performance or resource usage)' labels Mar 10, 2017
    @mdickinson
    Member

    Thanks for the idea, and the PR!

    To be useful, this would need a bit of tweaking: we can't assume that a digit is always unsigned long (in fact, I'd expect that to be rare on 64-bit non-Windows systems, where unsigned long typically has 64 bits, and digit should be using a 32-bit type), so we'd need to identify and use the most appropriate variant of clz/clzl/clzll somehow.

    I think it would also make sense to move the detection of the existence of __builtin_clz and friends into the configuration machinery: these builtins aren't restricted to GCC (clang supports them as well), and that would allow other platforms to provide their own substitutes by modifying pyport.h. Ideally, the configuration machinery #defines a HAVE_CLZ variable, and then in longobject.c we do an #ifdef HAVE_CLZ ...

    We also need to trade off the additional complication in the implementation against the benefits: though we certainly care about speed, speed at all costs isn't the target here; readability, portability and maintainability of the source count, too. It'll probably be easier to weigh those factors once there's an implementation that addresses the above points.
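
    A minimal sketch of the arrangement described above, assuming a hypothetical configure-defined macro (HAVE_BUILTIN_CLZ is a made-up name) and a stand-in typedef for CPython's digit; the sizeof comparison is a compile-time constant, so the compiler keeps only the matching branch:

        #include <limits.h>
        #include <stdint.h>

        typedef uint32_t digit;   /* stand-in for CPython's 30-bit digit type */

        #ifdef HAVE_BUILTIN_CLZ   /* hypothetical macro set by configure */
        static int
        bits_in_digit(digit d)
        {
            if (d == 0) {
                return 0;   /* the clz builtins are undefined for 0 */
            }
            /* Pick the builtin whose operand type is wide enough for a digit. */
            if (sizeof(digit) <= sizeof(unsigned int)) {
                return (int)(sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(d);
            }
            return (int)(sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(d);
        }
        #else
        /* ... keep the existing shift-and-lookup loop as the portable fallback ... */
        #endif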

    @serhiy-storchaka
    Member

    I think we can assume that digit is no larger than unsigned long, otherwise PyLong_AsLong() and the like wouldn't work.

    @mdickinson
    Member

    True enough. It looks as though someone (cough) already documented that restriction, too: https://github.com/mdickinson/cpython/blob/v3.6.0/Include/longintrepr.h#L28-L30


    niklasf mannequin commented Mar 10, 2017

    Thanks for the review.

    I pushed a change to check whether clz can be used (sizeof(digit) <= sizeof(unsigned int)), and otherwise use clzl. I believe the latter should be the most common case, since unsigned long has 32 bits. As you say, unsigned long long should never be needed.

    By the way, mathmodule.c currently duplicates the function: https://github.com/python/cpython/blob/master/Modules/mathmodule.c#L1317. It might be worth factoring it out, but I don't know where to put it.
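
    For reference, the duplicated routine in both files is roughly the following shift-and-lookup loop (reconstructed as an approximation, not a verbatim copy of either file):

        /* Approximate shape of bits_in_digit() in Objects/longobject.c and its
           twin in Modules/mathmodule.c: shift the value down in chunks, then
           finish with a 32-entry lookup table. */
        static const unsigned char BitLengthTable[32] = {
            0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4,
            5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5
        };

        static int
        bit_length_lookup(unsigned long d)
        {
            int d_bits = 0;
            while (d >= 32) {
                d_bits += 6;
                d >>= 6;
            }
            d_bits += (int)BitLengthTable[d];
            return d_bits;
        }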


    niklasf mannequin commented Mar 10, 2017

    Oops. Actually clz should commonly be enough, and platforms where clz and clzl differ (i.e. where unsigned int and unsigned long have different sizes) should be rare.

    @vstinner
    Member

    The change is an optimization, so it requires a benchmark, which you provided. OK. But the speedup is a few nanoseconds on a function that currently takes 40 nanoseconds. It's not really what I would call significant:

    $ ./python -m perf compare_to ref.json patch.json  --table
    +---------------------+---------+-----------------------------+
    | Benchmark           | ref     | patch                       |
    +=====================+=========+=============================+
    | int==float 8 digits | 39.2 ns | 37.9 ns: 1.03x faster (-3%) |
    +---------------------+---------+-----------------------------+
    | int==float 1 digit  | 38.0 ns | 37.9 ns: 1.00x faster (-0%) |
    +---------------------+---------+-----------------------------+
    | int.bit_length()    | 42.0 ns | 41.2 ns: 1.02x faster (-2%) |
    +---------------------+---------+-----------------------------+

    (See attached bench_bit_length.py script.)

    So I'm not really convinced that the change is useful. Is bit_length() used in hot loops?

    @vstinner
    Member

    $ ./python -m timeit "12345678 == 12345678.0"
    5000000 loops, best of 5: 40 nsec per loop

    By the way, I checked the bytecode to make sure that the compiler doesn't optimize that. I'm surprised that it's not replaced with True!

    Is there a reason to perform the test at runtime?

    @serhiy-storchaka
    Member

    There are no good reasons to optimize this case at compile time. This is a very obscure way of writing True.


    pitrou commented Apr 1, 2017

    I don't think a 3% improvement on a micro-benchmark is worth it (this will translate into 0% improvement on real-world workloads). Are there other uses of _Py_bit_length() that show bigger benefits?

    @vstinner
    Member

    I concur with Antoine and suggest rejecting this issue.

    @serhiy-storchaka
    Member

    In some microbenchmarks this can give a speedup of up to 15%.

    $ ./python.patched -m perf timeit -q --compare-to=./python.default -s "a = list(map(float, range(10000)))" "12345 in a"
    Mean +- std dev: [python.default] 1.28 ms +- 0.11 ms -> [python.patched] 1.12 ms +- 0.07 ms: 1.15x faster (-13%)

    @serhiy-storchaka
    Member

    But even a 15% speedup in particular microbenchmarks looks too small to me for such a complex change.

    @mdickinson
    Member

    Closing.


    mlouielu mannequin commented Apr 22, 2017

    If this issue is being closed as "not a big performance improvement", maybe the XXX comment in mathmodule.c should be taken out?

    """
    /* XXX: This routine does more or less the same thing as

    • bits_in_digit() in Objects/longobject.c. Someday it would be nice to
    • consolidate them. On BSD, there's a library function called fls()
    • that we could use, and GCC provides __builtin_clz().
      */
      """

    @vstinner
    Member

    I'm in favor of documenting, in comments, decisions not to micro-optimize such code. I did something similar in ceval.c for 1+1.


    vstinner commented Jun 8, 2020

    I'm reopening the issue since PR 20739 was created.

    @vstinner reopened this Jun 8, 2020
    @vstinner
    Member

    New changeset 794e7d1 by Niklas Fiekas in branch 'master':
    bpo-29782: Consolidate _Py_Bit_Length() (GH-20739)
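    The merged change follows the pattern discussed earlier in the thread: one shared helper that prefers the compiler builtin and keeps a portable fallback. A hedged sketch of the idea (names and the fallback are approximations, not the verbatim contents of commit 794e7d1):

        #include <limits.h>

        /* Sketch of a consolidated bit-length helper that longobject.c and
           mathmodule.c could both call instead of duplicating the loop. */
        static inline int
        shared_bit_length(unsigned long x)
        {
        #if defined(__GNUC__) || defined(__clang__)
            if (x == 0) {
                return 0;   /* __builtin_clzl(0) is undefined */
            }
            return (int)(sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(x);
        #else
            /* Portable fallback: count bits one at a time. */
            int bits = 0;
            while (x != 0) {
                bits++;
                x >>= 1;
            }
            return bits;
        #endif
        }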

    @vstinner added the '3.10 (only security fixes)' label and removed the '3.7 (EOL) end of life' label Jun 15, 2020
    @vstinner
    Member

    Thanks Niklas Fiekas for your tenacity :-)

    @ezio-melotti transferred this issue from another repository Apr 10, 2022