Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make statistics.median_grouped more efficient #70190

Closed
upendra-k14 mannequin opened this issue Jan 3, 2016 · 6 comments
Closed

make statistics.median_grouped more efficient #70190

upendra-k14 mannequin opened this issue Jan 3, 2016 · 6 comments
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@upendra-k14
Copy link
Mannequin

upendra-k14 mannequin commented Jan 3, 2016

BPO 26002
Nosy @rhettinger, @stevendaprano
Files
  • median_grouped.patch: Improving the performance of counting number of occurence of 'x' and finding leftmost position of 'x'
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2016-05-06.12:04:27.375>
    created_at = <Date 2016-01-03.14:21:58.850>
    labels = ['library', 'performance']
    title = 'make statistics.median_grouped more efficient'
    updated_at = <Date 2016-05-06.12:04:27.375>
    user = 'https://bugs.python.org/upendra-k14'

    bugs.python.org fields:

    activity = <Date 2016-05-06.12:04:27.375>
    actor = 'berker.peksag'
    assignee = 'none'
    closed = True
    closed_date = <Date 2016-05-06.12:04:27.375>
    closer = 'berker.peksag'
    components = ['Library (Lib)']
    creation = <Date 2016-01-03.14:21:58.850>
    creator = 'upendra-k14'
    dependencies = []
    files = ['41484']
    hgrepos = []
    issue_num = 26002
    keywords = ['patch']
    message_count = 6.0
    messages = ['257419', '259095', '259102', '259104', '260104', '264847']
    nosy_count = 4.0
    nosy_names = ['rhettinger', 'steven.daprano', 'python-dev', 'upendra-k14']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue26002'
    versions = ['Python 3.6']

    @upendra-k14
    Copy link
    Mannequin Author

    upendra-k14 mannequin commented Jan 3, 2016

    statistics.median_grouped currently uses cf=data.index(x) to find the leftmost position of x in data ( line number 445 ). Here, data.index() has linear time complexity, which may not be good for long lists.

    But, here since the list 'data' is sorted beforehand, we can use binary search ( bisect_left() ) to find the leftmost occurrence of 'x' in 'data'.

    Similarly, for counting number of occurrence of 'x' in data after sorting, we can find the position of rightmost occurrence of 'x' in data[l1....len(data)], where l1 is position of leftmost occurrence of 'x' (line number 447 )

    @upendra-k14 upendra-k14 mannequin added stdlib Python modules in the Lib dir performance Performance or resource usage labels Jan 3, 2016
    @upendra-k14
    Copy link
    Mannequin Author

    upendra-k14 mannequin commented Jan 28, 2016

    Can someone please review my patch? I think it can increase the performance significantly of the median_grouped() function. But, I don't have any standard way of checking if it would improve or will be an overkill for the median_grouped() function?

    @rhettinger
    Copy link
    Contributor

    This code looks good and will certainly reduce the number of comparisons in non-trivial cases.

    FWIW, It looks like the index functions are lifted directly from the bisect docs.

    Steven, any objections?

    @upendra-k14
    Copy link
    Mannequin Author

    upendra-k14 mannequin commented Jan 28, 2016

    Yeah, I slightly modified the functions from the bisect docs to suit for the purpose here.

    @stevendaprano
    Copy link
    Member

    Looks good to me.

    I've run some quick timing tests, and for very small lists, there's no significant difference, but for larger lists I'm getting up to a 50% speedup. Nicely done, thank you.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 4, 2016

    New changeset 7b2fafd78c1d by Steven D'Aprano in branch 'default':
    bpo-26002 and 25974
    https://hg.python.org/cpython/rev/7b2fafd78c1d

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    performance Performance or resource usage stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants