This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author methane
Recipients Mark.Shannon, methane, rhettinger, terry.reedy
Date 2022-02-26.11:15:16
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1645874117.01.0.113083918904.issue46845@roundup.psfhosted.org>
In-reply-to
Content
In most case, first PyDict_SetItem decides which format should be used.

But _PyDict_NewPresized() can be a problem. It creates a hash table before inserting the first key, when 5 < (expected size) < 87382.

In CPython code base, _PyDict_NewPresized() is called from three places:

1. call.c: Building kwargs dict -- all key should be Unicode.
2. ceval.c: BUILD_MAP and BUILD_CONST_KEY_MAP -- there is no guarantee that all keys are Unicode.


Current pull request assumes the dict keys are unicode-only key. So building dict from non-Unicode keys become slower.

```
$ ./python -m pyperf timeit --compare-to ../cpython/python -- '{(1,2):3, (4,5):6, (7,8):9, (10,11):12, (13,14):15, (16,17):18}'
/home/inada-n/work/python/cpython/python: ..................... 233 ns +- 1 ns
/home/inada-n/work/python/dict-compact/python: ..................... 328 ns +- 6 ns

Mean +- std dev: [/home/inada-n/work/python/cpython/python] 233 ns +- 1 ns -> [/home/inada-n/work/python/dict-compact/python] 328 ns +- 6 ns: 1.41x slower
```

There are some approaches to fix this problem:

1. Don't use _PyDict_NewPresized() in BUILD_MAP, BUILD_CONST_KEY_MAP

```
$ ./python -m pyperf timeit --compare-to ../cpython/python -- '{(1,2):3, (4,5):6, (7,8):9, (10,11):12, (13,14):15, (16,17):18}'
/home/inada-n/work/python/cpython/python: ..................... 233 ns +- 1 ns
/home/inada-n/work/python/dict-compact/python: ..................... 276 ns +- 1 ns

Mean +- std dev: [/home/inada-n/work/python/cpython/python] 233 ns +- 1 ns -> [/home/inada-n/work/python/dict-compact/python] 276 ns +- 1 ns: 1.18x slower
```

I think this performance regression is acceptable level.

2. Add an argument `unicode` to _PyDict_NewPresized(). -- Breaks some 3rd party codes using internal APIs.
3. Add a new internal C API such that _PyDict_NewPresizedUnicodeKey(). -- Most conservative.
4. Add a new internal C API that creates dict form keys and values for extreme performance, like this:

// Create a new dict from keys and values.
// Items are received as `{keys[i*keys_offset]: values[i*values_offset] for i in range(length)}`.
// When distinct=1, this function skips checking duplicated keys.
// So pass distinct=1 unless you can guarantee that there is no duplicated keys.
PyObject *
PyDict_FromKeysAndValues(PyObject **keys, Py_ssize_t keys_offset, PyObject **values, Py_ssize_t values_offset, Py_ssize_t lenghh, int distincit)
{
}
History
Date User Action Args
2022-02-26 11:15:17methanesetrecipients: + methane, rhettinger, terry.reedy, Mark.Shannon
2022-02-26 11:15:17methanesetmessageid: <1645874117.01.0.113083918904.issue46845@roundup.psfhosted.org>
2022-02-26 11:15:17methanelinkissue46845 messages
2022-02-26 11:15:16methanecreate