Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend the PyDict C API to handle cases where the hash value is known #65300

Closed
rhettinger opened this issue Mar 30, 2014 · 16 comments
Closed

Extend the PyDict C API to handle cases where the hash value is known #65300

rhettinger opened this issue Mar 30, 2014 · 16 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@rhettinger
Copy link
Contributor

BPO 21101
Nosy @rhettinger, @terryjreedy, @pitrou, @alex, @bitdancer, @serhiy-storchaka, @MojoVampire
Files
  • known_hash.diff: Add two functions to dictobject
  • applied_known_hash.diff: Sample application to the collections module
  • double_counter_hash.py: Demonstrate unnecessary double hashing in a counter
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2014-05-03.23:42:02.179>
    created_at = <Date 2014-03-30.08:47:09.553>
    labels = ['interpreter-core', 'type-feature']
    title = 'Extend the PyDict C API to handle cases where the hash value is known'
    updated_at = <Date 2014-05-03.23:42:02.178>
    user = 'https://github.com/rhettinger'

    bugs.python.org fields:

    activity = <Date 2014-05-03.23:42:02.178>
    actor = 'rhettinger'
    assignee = 'none'
    closed = True
    closed_date = <Date 2014-05-03.23:42:02.179>
    closer = 'rhettinger'
    components = ['Interpreter Core']
    creation = <Date 2014-03-30.08:47:09.553>
    creator = 'rhettinger'
    dependencies = []
    files = ['34666', '34667', '34668']
    hgrepos = []
    issue_num = 21101
    keywords = ['patch']
    message_count = 16.0
    messages = ['215175', '215180', '215186', '215187', '215192', '215560', '215561', '215562', '215570', '215573', '215575', '215578', '216633', '216635', '217845', '217848']
    nosy_count = 9.0
    nosy_names = ['rhettinger', 'terry.reedy', 'pitrou', 'Arfrever', 'alex', 'r.david.murray', 'python-dev', 'serhiy.storchaka', 'josh.r']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue21101'
    versions = ['Python 3.5']

    @rhettinger rhettinger added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Mar 30, 2014
    @rhettinger
    Copy link
    Contributor Author

    Propose adding two functions, PyDict_GetItem_KnownHash() and PyDict_SetItem_KnownHash().

    It is reasonably common to make two successive dictionary accesses with the same key. This results in calling the hash function twice to compute the same result.

    For example, the technique can be used to speed-up collections.Counter (see the attached patch to show how). In that patch, the hash is computed once, then used twice (to retrieve the prior count and to store the new count.

    There are many other places in the standard library that could benefit:

    [Modules/posixmodule.c](https://github.com/python/cpython/blob/main/Modules/posixmodule.c) 1254
    [Modules/pyexpat.c](https://github.com/python/cpython/blob/main/Modules/pyexpat.c) 343 and 1788 and 1798
    [Modules/_json.c](https://github.com/python/cpython/blob/main/Modules/_json.c) 628 and 1446 and 1515 and 1697
    [Modules/selectmodule.c](https://github.com/python/cpython/blob/main/Modules/selectmodule.c) 465
    [Modules/zipmodule.c](https://github.com/python/cpython/blob/main/Modules/zipmodule.c) 137
    [Objects/typeobject.c](https://github.com/python/cpython/blob/main/Objects/typeobject.c) 6678 and 6685
    [Objects/unicodeobject.c](https://github.com/python/cpython/blob/main/Objects/unicodeobject.c) 14997
    [Python/_warnings.c](https://github.com/python/cpython/blob/main/Python/_warnings.c) 195
    [Python/compile.c](https://github.com/python/cpython/blob/main/Python/compile.c) 1134
    [Python/import.c](https://github.com/python/cpython/blob/main/Python/import.c) 1046 and 1066
    [Python/symtable](https://github.com/python/cpython/blob/main/Python/symtable) 671 and 687 and 1068
    

    A similar technique has been used for years in the Objects/setobject.c internals as a way to eliminate unnecessary calls to PyObject_Hash() during set-to-set and set-to-dict operations.

    The benefit is biggest for objects such as tuples or user-defined classes that have to recompute the hash on every call on PyObject_Hash().

    @pitrou
    Copy link
    Member

    pitrou commented Mar 30, 2014

    Is there any benefit in making them public API functions?

    @rhettinger
    Copy link
    Contributor Author

    Is there any benefit in making them public API functions?

    Originally, I was going to suggest them as internal functions, but the variety of use cases in the standard library suggested that third-party C extensions would benefit as well.

    Since this isn't functionality a user can create themselves, a public function is the only way to expose this service.

    @alex
    Copy link
    Member

    alex commented Mar 30, 2014

    Would it be reasonable to develop a Python API for this? If C functions have a need to do this, surely Python code does as well.

    @rhettinger rhettinger changed the title Extend the PyDict C API to handle cases where the hash value in known Extend the PyDict C API to handle cases where the hash value is known Mar 30, 2014
    @rhettinger
    Copy link
    Contributor Author

    Would it be reasonable to develop a Python API for this?

    I suspect that in pure Python, the overhead would exceed the benefit.

    Current code:

    d[key] = d[key] + 1

    Effort to save a double hash:

       h = hash(key)
       c = d.getitem_known_hash(key, hash)
       d.setitem_known_hash(key, hash, c + 1)

    In PyPy, the second code sample might actually be faster that the first, but all the other pythons suffer from interpreter overhead, a double dict lookup for the hash() function, two dict lookups just to find the new methods, allocating a bound method, and forming two argument tuples.

    There is also an API design issue. The pure python core datatype APIs are designed in a way to encourage higher level thinking (that is why we can't directly manage hash table sparsity or list over-allocation for example).

    In contrast, the C API is aimed at users who seek additional control and optimization at a lower level than the core language provides. We tend to expose a number of tools at the C level that would be inappropriate for higher-level code.

    @terryjreedy
    Copy link
    Member

    While the question is reasonable, I agree with Raymond's answer. As a python programmer, I would not like to see
    d.setitem_known_hash(key, hash, d.getitem_known_hash(key, hash) + 1)

    Of course, "d[key] += 1" already solves the double lookup issue at the Python level. Moreover, it abbreviates the form, rather than expanding it, which is appropriate since it abbreviates the computation.

    You could optimize get-set even more than the current proposal by saving a reference to the slot corresponding to a key rather than the hash that leads to a slot. Exposing a slot reference probably breaks encapsulation too much. This could be avoided by another alternative: add PyDict_Mod(ify)Item(mapping, key, func). It would combine get and set: find slot, get item, set func(item), and return whatever SetItem does on success/failure.

    @alex
    Copy link
    Member

    alex commented Apr 4, 2014

    d[key] += 1 still does two dict lookups, and invokes the hash function twice:

    >>> class X(object):
    ...   def __hash__(self):
    ...     print "hashed"
    ...     return 0
    ...   def __eq__(self, other):
    ...     return True
    ...
    >>> d = {X(): 0}
    hashed
    >>> d[X()]
    hashed
    0
    >>> d[X()] = 3
    hashed
    >>> d[X()] += 1
    hashed
    hashed

    @pitrou
    Copy link
    Member

    pitrou commented Apr 4, 2014

    Of course, "d[key] += 1" already solves the double lookup issue at the
    Python level.

    What the hell are you talking about?

    @terryjreedy
    Copy link
    Member

    "What the hell I am talking about" is what the doc says. 'd[key]' is written just once and "is evaluated just once".
    https://docs.python.org/3/reference/simple_stmts.html#augmented-assignment-statements

    PS: Try being a bit more polite.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 4, 2014

    PS: Try being a bit more polite.

    You could definitely do some research before posting erroneous
    statements (this one isn't difficult to check, as Alex showed).
    Especially when the other posters (Alex and Raymond) are a lot more
    competent than you on the topic at hand.

    If you actually try to *reason* about it, there is no other way for:
    x[k] += <some_expr>

    to work in the general case than to execute
    x.__setitem__(k, x.__getitem__(k) + <some_expr>)

    So, yes, the lookup is done twice, because it currently can't work
    otherwise.

    @bitdancer
    Copy link
    Member

    Antoine, being polite never hurts. Terry is a valuable member of the community and sure, he sometimes makes mistakes (or trusts the docs too much?). So do the the rest of us.

    @terryjreedy
    Copy link
    Member

    Raymond identified a need and a possible solution. The important part of my post was suggesting another possible solution. Please focus on that.

    @rhettinger
    Copy link
    Contributor Author

    Antoine, do you support adding these as part of the public API? If not, I can make them private.

    I think the functions are broadly useful, but no one has ever asked for this functionality either.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 17, 2014

    I think we can start with making them private. Do you know of any third-party code bases which may be interested in the speedup?

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 3, 2014

    New changeset 39f475aa0163 by Raymond Hettinger in branch 'default':
    bpo-21101: Internal API for dict getitem and setitem where the hash value is known.
    http://hg.python.org/cpython/rev/39f475aa0163

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 3, 2014

    New changeset 592a57682ced by Raymond Hettinger in branch 'default':
    Issue bpo-21101: Eliminate double hashing in the C code for collections.Counter().
    http://hg.python.org/cpython/rev/592a57682ced

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants