Micro optimization for range.index if step is 1 #86068

Closed
corona10 opened this issue Oct 1, 2020 · 18 comments
Assignees
corona10
Labels
3.10 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@corona10
Member

corona10 commented Oct 1, 2020

BPO 41902
Nosy @mdickinson, @vstinner, @methane, @serhiy-storchaka, @ammaraskar, @corona10, @pablogsal, @shihai1991
PRs
  • bpo-41902: Micro optimization for range.index if step is 1 #22479
  • bpo-41902: Add fast path for long_div if the divisor is one #22480
  • bpo-41902: Micro optimization for compute_item of range #22492
Files
  • bench_range_index.py
  • bench_range_step.py
  • bench_range_compute_item.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/corona10'
    closed_at = <Date 2020-10-21.02:39:25.414>
    created_at = <Date 2020-10-01.13:42:45.832>
    labels = ['interpreter-core', '3.10', 'performance']
    title = 'Micro optimization for range.index if step is 1'
    updated_at = <Date 2020-10-21.02:39:25.414>
    user = 'https://github.com/corona10'

    bugs.python.org fields:

    activity = <Date 2020-10-21.02:39:25.414>
    actor = 'corona10'
    assignee = 'corona10'
    closed = True
    closed_date = <Date 2020-10-21.02:39:25.414>
    closer = 'corona10'
    components = ['Interpreter Core']
    creation = <Date 2020-10-01.13:42:45.832>
    creator = 'corona10'
    dependencies = []
    files = ['49482', '49483', '49485']
    hgrepos = []
    issue_num = 41902
    keywords = ['patch']
    message_count = 18.0
    messages = ['377752', '377753', '377755', '377757', '377759', '377762', '377765', '377770', '377773', '377775', '377786', '377794', '377795', '377796', '377799', '377806', '379178', '379183']
    nosy_count = 8.0
    nosy_names = ['mark.dickinson', 'vstinner', 'methane', 'serhiy.storchaka', 'ammar2', 'corona10', 'pablogsal', 'shihai1991']
    pr_nums = ['22479', '22480', '22492']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue41902'
    versions = ['Python 3.10']

    @corona10
    Member Author

    corona10 commented Oct 1, 2020

If a range is declared without an explicit step, range.index does not need to perform the division at all.

Here are benchmark results both with and without an explicit step.

With my patch, there is no performance degradation for range.index when the step is not one, and a 19% improvement when the step is the default value (1):

    Mean +- std dev: [range_master] 1.11 us +- 0.01 us -> [range_opt] 896 ns +- 23 ns: 1.24x faster (-19%)

    Mean +- std dev: [range_step_master] 1.12 us +- 0.02 us -> [range_step_opt] 1.11 us +- 0.01 us: 1.01x faster (-1%)
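
The essence of the change is a pointer comparison against the shared small-int object _PyLong_One before falling back to PyNumber_FloorDivide. Below is a minimal sketch of the idea, not the literal GH-22479 diff; compute_range_index is a hypothetical helper name used only for illustration:

static PyObject *
compute_range_index(rangeobject *r, PyObject *ob)
{
    /* idx = ob - r->start */
    PyObject *idx = PyNumber_Subtract(ob, r->start);
    if (idx == NULL) {
        return NULL;
    }
    if (r->step == _PyLong_One) {
        /* The default step 1 is the cached small int, so a pointer
           comparison is enough to detect it, and
           (ob - start) // 1 == (ob - start): the division can be skipped. */
        return idx;
    }
    /* General case: idx = (ob - start) // step */
    PyObject *result = PyNumber_FloorDivide(idx, r->step);
    Py_DECREF(idx);
    return result;
}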

@corona10 corona10 self-assigned this Oct 1, 2020
@corona10 corona10 added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage 3.10 only security fixes labels Oct 1, 2020
    @corona10
    Member Author

    corona10 commented Oct 1, 2020

    An example of the case that benefits from the optimization:

    >>> a = range(0, 11111)
    >>> a.index(3)
    3

    @vstinner
    Member

    vstinner commented Oct 1, 2020

    PR 22479 avoids calling PyNumber_FloorDivide(a, b) if b == 1 (if b == _PyLong_One): it makes range.index(a, 1) call 214 ns faster. I'm surprised that PyNumber_FloorDivide(a, 1) takes 214 ns. Does it spend most time in binary_op1()? Or PyNumber_FloorDivide()?

    long_div(a, 1) is quite simple:

    CHECK_BINOP(a, b);

    if (Py_ABS(Py_SIZE(a)) == 1 && Py_ABS(Py_SIZE(b)) == 1) {
        return fast_floor_div((PyLongObject*)a, (PyLongObject*)b);
    }

    with:

    static PyObject *
    fast_floor_div(PyLongObject *a, PyLongObject *b)
    {
        sdigit left = a->ob_digit[0];
        sdigit right = b->ob_digit[0];
        sdigit div;
        if (Py_SIZE(a) == Py_SIZE(b)) {
            div = left / right;
        }
        else {
            div = -1 - (left - 1) / right;
        }
        return PyLong_FromLong(div);
    }

    Do we need another fast-path in long_div(a, b) when b == _PyLong_One? Just return a in this case.
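
    Concretely, that suggestion would amount to a few lines near the top of long_div(), right after CHECK_BINOP. A minimal sketch, not an actual patch; the PyLong_CheckExact() guard is my own assumption, so that int subclasses such as bool still take the normal path and return an exact int:

    CHECK_BINOP(a, b);

    /* Hypothetical fast path: a // 1 == a, so dividing by the shared
       small-int 1 could simply return the dividend. */
    if (b == _PyLong_One && PyLong_CheckExact(a)) {
        Py_INCREF(a);
        return a;
    }
    /* ... fall through to fast_floor_div() and the general algorithm ... */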

    @corona10
    Member Author

    corona10 commented Oct 1, 2020

    A more practical use case:

    >>> a = range(-2, 10)
    >>> a.index(2)
    4

    @vstinner
    Member

    vstinner commented Oct 1, 2020

    About your benchmark, do you build Python with PGO+LTO optimizations? What is your OS and CPU? Do you use CPU isolation on Linux? It would be good to use PGO+LTO optimizations.

    @corona10
    Member Author

    corona10 commented Oct 1, 2020

    do you build Python with PGO+LTO optimizations

    Nope, but I will try to re-run the benchmark with PGO+LTO.

    What is your OS and CPU? Do you use CPU isolation on Linux?

    macOS, Intel i9 with 8 cores.

    @corona10
    Member Author

    corona10 commented Oct 1, 2020

    For the record, with PGO+LTO optimizations:

    Mean +- std dev: [range_master] 112 ns +- 3 ns -> [range_opt] 99.3 ns +- 1.8 ns: 1.13x faster (-12%)

    @corona10
    Member Author

    corona10 commented Oct 1, 2020

    PR 22479 and PR 22480 take different approaches.

    I (and Victor) want to determine which approach is better.

    PR 22480 would affect overall long-division performance.
    PR 22479 assumes that the step=1 case is very common (e.g. range(100), range(-100, 100)).

    @mdickinson
    Member

    Do we need another fast-path in long_div(a, b) when b == _PyLong_One? Just return a in this case.

    I'd much prefer not. Every extra fast path check costs time for the general case, and there's no strong reason to expect people to be dividing by one. The range code seems like the right place for this optimization, not the long-divide code.

    @vstinner
    Member

    vstinner commented Oct 1, 2020

    Mark: "I'd much prefer not. Every extra fast path check costs time for the general case, and there's no strong reason to expect people to be dividing by one. The range code seems like the right place for this optimization, not the long-divide code."

    In this case, I suggest adding a comment in long_div() explaining why there is no fast path there. Otherwise, I am likely to suggest the exact same optimization again in 6 months :-D

    I'm thinking of something similar to my comment in ceval.c:

            case TARGET(BINARY_ADD): {
                PyObject *right = POP();
                PyObject *left = TOP();
                PyObject *sum;
                /* NOTE(haypo): Please don't try to micro-optimize int+int on
                   CPython using bytecode, it is simply worthless.
                   See http://bugs.python.org/issue21955 and
                   http://bugs.python.org/issue10044 for the discussion. In short,
                   no patch shown any impact on a realistic benchmark, only a minor
                   speedup on microbenchmarks. */

    This comment is the outcome of 2 years of hard work by many developers :-D See bpo-21955.

    @ammaraskar
    Member

    Out of curiosity, what is the motivation for this optimization?

    Is there reason to believe that range.index is performed often enough that it's worth the extra code and maintenance cost here?

    @corona10
    Member Author

    corona10 commented Oct 2, 2020

    Is there reason to believe that range.index is performed often enough that it's worth the extra code and maintenance cost here?

    There are at least three reasons:

    1. Pointer-comparison fast paths are a fairly common pattern in the CPython codebase, e.g. list.count (see the sketch after this list).
    2. When a range object is created, a step of 1 (the cached small-int singleton) covers a high percentage of use cases:
      • range(100) -> step = 1
      • range(-100, 100) -> step = 1
      • range(-100, 100, 2) -> step != 1
    3. The fast-path code does not add significant maintenance burden in this case, yet it brings a 12% performance improvement.
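
    As an illustration of the first reason, the identity-first pattern compares pointers before paying for a full equality check. A simplified sketch, loosely based on list_count() in Objects/listobject.c (error handling and details trimmed):

    static PyObject *
    list_count(PyListObject *self, PyObject *value)
    {
        Py_ssize_t count = 0;
        for (Py_ssize_t i = 0; i < Py_SIZE(self); i++) {
            PyObject *obj = self->ob_item[i];
            if (obj == value) {
                /* Identical object: skip the (potentially slow)
                   rich comparison entirely. */
                count++;
                continue;
            }
            Py_INCREF(obj);
            int cmp = PyObject_RichCompareBool(obj, value, Py_EQ);
            Py_DECREF(obj);
            if (cmp > 0) {
                count++;
            }
            else if (cmp < 0) {
                return NULL;
            }
        }
        return PyLong_FromSsize_t(count);
    }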

    @corona10
    Member Author

    corona10 commented Oct 2, 2020

    I found another case to optimize, and it looks like a more practical use case than range.index: compute_item(), sketched below.
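
    compute_item() is what range uses to produce r[i], conceptually evaluating start + i*step with PyLong arithmetic. The same _PyLong_One identity check lets it skip the multiplication when the step is one. A minimal sketch of the idea (the merged GH-22492 may differ in detail):

    static PyObject *
    compute_item(rangeobject *r, PyObject *i)
    {
        /* PyLong equivalent of: new = r->start + (i * r->step) */
        if (r->step == _PyLong_One) {
            /* step == 1: start + i * 1 == start + i, so skip the multiply. */
            return PyNumber_Add(r->start, i);
        }
        PyObject *incr = PyNumber_Multiply(i, r->step);
        if (incr == NULL) {
            return NULL;
        }
        PyObject *result = PyNumber_Add(r->start, incr);
        Py_DECREF(incr);
        return result;
    }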

    @corona10
    Member Author

    corona10 commented Oct 2, 2020

    Mean +- std dev: [master-compute] 317 ns +- 3 ns -> [bpo-41902-compute] 287 ns +- 6 ns: 1.11x faster (-10%)

    @corona10
    Member Author

    corona10 commented Oct 2, 2020

    @vstinner @pablogsal @mdickinson

    On my local machine (without CPU isolation), PR 22480 does not show a measurable performance regression:

    import pyperf
    
    runner = pyperf.Runner()
    runner.timeit(name="bench long divide",
                  stmt="""
    for i in range(1, 256):
        a = 10000 // i
    """)

    However, I think my benchmark does not cover the worst case.

    I would need to set up a CPU-isolated environment, but my resources are limited
    (I would need a Linux machine with sufficient permissions, which I don't have).

    PR 22479 and PR 22480 rest on two different assumptions.

    PR 22479: PyNumber_FloorDivide is a heavy operation; if the caller knows it is unnecessary (here, dividing by one), it should be avoided.
    PR 22480: PyNumber_FloorDivide itself should handle unnecessary operations smartly.

    In conclusion, I'm +1 on Mark's position:

    • PR 22480: even when the divisor is one, if the dividend does not pass PyLong_CheckExact it does not take the optimized path, so it cannot cover every 'divide by one' case and can still cost performance.
    • PR 22479: the optimization always applies when the user does not set an explicit step value, and the same holds for PR 22492.

    @pablogsal
    Member

    I would need to carefully look at the PRs to estimate the maintenance cost, but my initial impression is that all these operations happen very rarely in regular code and will probably have no impact on macro benchmarks. In general, I dislike extra branches because they can quickly become the source of slightly different semantics.

    @methane
    Member

    methane commented Oct 21, 2020

    New changeset 25492a5 by Dong-hee Na in branch 'master':
    bpo-41902: Micro optimization for compute_item of range (GH-22492)
    25492a5

    @corona10
    Member Author

    New changeset c0f22fb by Dong-hee Na in branch 'master':
    bpo-41902: Micro optimization for range.index if step is 1 (GH-22479)
    c0f22fb

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022