Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dict: Use smaller entry for Unicode-key only dict. #91001

Closed
methane opened this issue Feb 24, 2022 · 6 comments
Closed

dict: Use smaller entry for Unicode-key only dict. #91001

methane opened this issue Feb 24, 2022 · 6 comments
Labels
3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@methane
Copy link
Member

methane commented Feb 24, 2022

BPO 46845
Nosy @rhettinger, @terryjreedy, @methane, @markshannon
PRs
  • bpo-46845: Reduce dict size when all keys are Unicode. #31564
  • bpo-46845: Move check for str-only keys in LOAD_GLOBAL to specialization time. #31659
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2022-03-01.23:10:28.750>
    created_at = <Date 2022-02-24.07:40:55.587>
    labels = ['interpreter-core', '3.11', 'performance']
    title = 'dict: Use smaller entry for Unicode-key only dict.'
    updated_at = <Date 2022-03-03.14:10:01.042>
    user = 'https://github.com/methane'

    bugs.python.org fields:

    activity = <Date 2022-03-03.14:10:01.042>
    actor = 'Mark.Shannon'
    assignee = 'none'
    closed = True
    closed_date = <Date 2022-03-01.23:10:28.750>
    closer = 'methane'
    components = ['Interpreter Core']
    creation = <Date 2022-02-24.07:40:55.587>
    creator = 'methane'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 46845
    keywords = ['patch']
    message_count = 6.0
    messages = ['413892', '414040', '414045', '414091', '414135', '414317']
    nosy_count = 4.0
    nosy_names = ['rhettinger', 'terry.reedy', 'methane', 'Mark.Shannon']
    pr_nums = ['31564', '31659']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue46845'
    versions = ['Python 3.11']

    @methane
    Copy link
    Member Author

    methane commented Feb 24, 2022

    Currently, PyDictKeyEntry is 24bytes (hash, key, and value).

    We can drop the hash from entry when all keys are unicode, because unicode objects caches hash already.

    This will cause some performance regression on microbenchmark because dict need one more indirect access to compare hash value.

    On the other hand, this will reduce some RAM usage. Additionally, unlike docstrings and annotations, this includes much **hot** RAM. It will make Python more cache efficient.

    This is work in progress code: methane#43
    pypeformance result is in the PR too.

    @methane methane added 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Feb 24, 2022
    @terryjreedy
    Copy link
    Member

    CPython, at least, allows users to insert non-string keys in namespace dicts that are conceptually string-key only.

    >>> globals()[0] = 'zero'
    >>> globals()[0]
    'zero'
    >>> vars()
    {'__name__': '__main__', ..., 0: 'zero'}
    
    [This is for consenting adults only, as it prevents sorting keys and string-only operations on keys.
    >>> dir()
    ...
    TypeError: '<' not supported between instances of 'int' and 'str']
     
    Do you propose to
    1. Only use StringKeyDicts when non-string keys are not possible?  (Where would this be?)
    2. Switch to a normal dict when a non-string key is added?  (But likely not switch back when the last non-string key is removed.)
    3. Deprecate and remove the option to add non-string keys to namespace dicts?  (Proposed and rejected at least once as not gaining much.)

    @methane
    Copy link
    Member Author

    methane commented Feb 25, 2022

    Do you propose to

    1. Only use StringKeyDicts when non-string keys are not possible? (Where
      would this be?)
    2. Switch to a normal dict when a non-string key is added? (But likely
      not switch back when the last non-string key is removed.)
    3. Deprecate and remove the option to add non-string keys to namespace
      dicts? (Proposed and rejected at least once as not gaining much.)
    1. We already do such hack for key sharing dict.
      And yes, deleting non string key doesn't switch back. d[0]=0; del d[0];
      loop must be amortized O(1).
      Only dict.clear() switches back.

    @methane
    Copy link
    Member Author

    methane commented Feb 26, 2022

    In most case, first PyDict_SetItem decides which format should be used.

    But _PyDict_NewPresized() can be a problem. It creates a hash table before inserting the first key, when 5 < (expected size) < 87382.

    In CPython code base, _PyDict_NewPresized() is called from three places:

    1. call.c: Building kwargs dict -- all key should be Unicode.
    2. ceval.c: BUILD_MAP and BUILD_CONST_KEY_MAP -- there is no guarantee that all keys are Unicode.

    Current pull request assumes the dict keys are unicode-only key. So building dict from non-Unicode keys become slower.

    $ ./python -m pyperf timeit --compare-to ../cpython/python -- '{(1,2):3, (4,5):6, (7,8):9, (10,11):12, (13,14):15, (16,17):18}'
    /home/inada-n/work/python/cpython/python: ..................... 233 ns +- 1 ns
    /home/inada-n/work/python/dict-compact/python: ..................... 328 ns +- 6 ns
    
    Mean +- std dev: [/home/inada-n/work/python/cpython/python] 233 ns +- 1 ns -> [/home/inada-n/work/python/dict-compact/python] 328 ns +- 6 ns: 1.41x slower
    

    There are some approaches to fix this problem:

    1. Don't use _PyDict_NewPresized() in BUILD_MAP, BUILD_CONST_KEY_MAP
    $ ./python -m pyperf timeit --compare-to ../cpython/python -- '{(1,2):3, (4,5):6, (7,8):9, (10,11):12, (13,14):15, (16,17):18}'
    /home/inada-n/work/python/cpython/python: ..................... 233 ns +- 1 ns
    /home/inada-n/work/python/dict-compact/python: ..................... 276 ns +- 1 ns
    
    Mean +- std dev: [/home/inada-n/work/python/cpython/python] 233 ns +- 1 ns -> [/home/inada-n/work/python/dict-compact/python] 276 ns +- 1 ns: 1.18x slower
    

    I think this performance regression is acceptable level.

    1. Add an argument unicode to _PyDict_NewPresized(). -- Breaks some 3rd party codes using internal APIs.
    2. Add a new internal C API such that _PyDict_NewPresizedUnicodeKey(). -- Most conservative.
    3. Add a new internal C API that creates dict form keys and values for extreme performance, like this:

    // Create a new dict from keys and values.
    // Items are received as {keys[i*keys_offset]: values[i*values_offset] for i in range(length)}.
    // When distinct=1, this function skips checking duplicated keys.
    // So pass distinct=1 unless you can guarantee that there is no duplicated keys.
    PyObject *
    PyDict_FromKeysAndValues(PyObject **keys, Py_ssize_t keys_offset, PyObject **values, Py_ssize_t values_offset, Py_ssize_t lenghh, int distincit)
    {
    }

    @methane
    Copy link
    Member Author

    methane commented Feb 27, 2022

    I added _PyDict_FromItems() to the PR.
    It checks that all keys are Unicode or not before creating dict.
    _PyDict_NewPresized() just returns general-purpose dict. But it isn't used from CPython core. It is just kept for compatibility (for Cython).

    $ ./python -m pyperf timeit --compare-to ../cpython/python -- '{"k1":1, "k2":2, "k3":3, "k4":4, "k5":5, "k6":6}'
    /home/inada-n/work/python/cpython/python: ..................... 198 ns +- 5 ns
    /home/inada-n/work/python/dict-compact/python: ..................... 213 ns +- 6 ns
    
    Mean +- std dev: [/home/inada-n/work/python/cpython/python] 198 ns +- 5 ns -> [/home/inada-n/work/python/dict-compact/python] 213 ns +- 6 ns: 1.07x slower
    

    Overhead of checking keys types is not so large.
    Additionally, we can reduce some code from ceval.c.

    @methane
    Copy link
    Member Author

    methane commented Mar 1, 2022

    New changeset 9833bb9 by Inada Naoki in branch 'main':
    bpo-46845: Reduce dict size when all keys are Unicode (GH-31564)
    9833bb9

    @methane methane closed this as completed Mar 1, 2022
    @methane methane closed this as completed Mar 1, 2022
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants