classification
Title: Add a macro for dict size
Type: enhancement Stage: resolved
Components: Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: inada.naoki, python-dev, rhettinger, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2016-12-13 13:07 by serhiy.storchaka, last changed 2016-12-16 15:09 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
_PyDict_GET_SIZE.patch serhiy.storchaka, 2016-12-13 13:06 review
_PyDict_GET_SIZE-2.patch serhiy.storchaka, 2016-12-13 18:56 review
PyDict_GET_SIZE-3.patch serhiy.storchaka, 2016-12-14 09:59 review
Messages (15)
msg283099 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-13 13:06
In additional to C API functions PyTuple_Size() and PyList_Size() there are fast macros PyTuple_GET_SIZE() and PyList_GET_SIZE() for getting the size of a tuple or a list. But for dicts there is just PyDict_Size(). It is not just slower than a macro, but can return an error (actually this never happens in CPython core and extensions).

Proposed patch adds new private macro _PyDict_GET_SIZE() and makes the code using it instead of PyDict_Size(). I don't expect significant performance gain except perhaps few checks that a dict is empty. The main advantage to me is that not checking the result for error no longer looks as potential bug.
msg283123 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-12-13 16:56
It looks like the struct of a dict is already arranged in a way that it could switch to PyObject_VAR_HEAD which would let you use Py_SIZE everywhere.
msg283131 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-13 18:56
Following patch also switches the struct of a dict to PyObject_VAR_HEAD.

Should we use Py_SIZE or _PyDict_GET_SIZE? It is easier to replace _PyDict_GET_SIZE with Py_SIZE than restore dict-specific name from some of Py_SIZE-s.
msg283158 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2016-12-14 02:20
I didn't know about PyObject_VAR_HEAD much.
The comment of the macro says:

/* PyObject_VAR_HEAD defines the initial segment of all variable-size
 * container objects.  These end with a declaration of an array with 1
 * element, but enough space is malloc'ed so that the array actually
 * has room for ob_size elements.  Note that ob_size is an element count,
 * not necessarily a byte count.
 */

dict doesn't end with array.
Does Py_SIZE() support recommended, like Python's len()?
msg283163 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-14 05:40
list doesn't end with array too, but uses PyObject_VAR_HEAD. tp_itemsize is 0, therefore no additional space at the end is allocated.

Maybe this is overuse of PyObject_VAR_HEAD.
msg283174 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-14 08:54
> It looks like the struct of a dict is already arranged in a way that it could switch to PyObject_VAR_HEAD

Can someone check if it has an impact of the size of the structure (because of the complex rules of alignment)? If the structure has the same size, I'm in favor of using PyObject_VAR_HEAD.

I reviewed _PyDict_GET_SIZE-2.patch: I like the overall change, but I added many comments.

When changes are not direct replacement ma_used => _PyDict_GET_SIZE(), please push the changes in a separated commit when you will push the whole change (I'm thinking to a micro-optimization in ceval.c).

In typeobject.c, you added a check to test if *dictptr is a dict or not. This change seems strange to me. I'm not sure about the right behaviour here. You might open a separated issue for this change. I don't know if an exception should be raised or not.

And my comments on the macro itself:

Please add a tiny inline documentation:

   /* Get the number of items of a dictionary. */

I don't think that it's worth it to make the macro private, just name it PyDict_GET_SIZE(). The macro is not defined in the limited API, and we already expose like "everything" in the default API... No need to make the API harder to use. I was always surprised to have to check if the result if PyDict_Size() is negative and that no macro existed.

Your change is a large and so hard to review. I would be more confident if you add an assertion here, as done in crazy macros of unicodeobject.h:

   #define _PyDict_GET_SIZE(mp) (assert(PyDict_Check(mp),Py_SIZE(mp))

We can add such assertion since it's a new macro. I'm not sure that we can do it in existing macros like PyTuple_GET_SIZE(), since these macros may be abused to modify the size, not only get it.
msg283175 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-14 08:57
> Should we use Py_SIZE or _PyDict_GET_SIZE?

I prefer _PyDict_GET_SIZE() just to make the code more readable: it helps to repeat the type of variables.

Moreover, it can give you a free check on the type if you agree my suggestion to add an assertion into the macro directly.

Technically, it's already possible to use Py_SIZE() on lists and tuples, but PyTuple_GET_SIZE() and PyList_GET_SIZE() are used almost everywhere, again, for readability.
msg283176 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-14 09:08
> dict doesn't end with array.

Right, but...

Recently I looked at dict internals. As a newcomer, I have to confess that it's currently really hard to understand the purpose of each dict field: they are many "sizes": size of the hash table, number of usable entries, number of used entries, number of items in the dictionary, etc.

I like the idea of using the standard ob_size field (PyVarObject) to make it more explicitl that this field is the expected result of len(obj).

Technically, I don't know any function inside Python core which rely on the fact that object data is at the end of the main memory block.

Other builtin Python types:

* tuple: PyVarObject
* list: PyVarObject
* bytes: PyVarObject
* bytearray: PyVarObject
* memoryview: PyVarObject
* set: "used" field
* str: "length" field

The str type is super optimized for memory footprint, there are technical reasons for not used PyVarObject here, like backward compatibility.

It may make sense to modify PySetObject to use PyVarObject as well, but that's a different topic :-)
msg283181 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-14 09:59
Updated patch addresses Victor's issue. _PyDict_GET_SIZE is renamed to PyDict_GET_SIZE and now it includes an assertion. This is good argument for introducing this macro against using PyDict_Size (without checking the result for error) and Py_SIZE (which doesn't check the type).

> Can someone check if it has an impact of the size of the structure (because of the complex rules of alignment)?

There are special tests for that.

Actually I think that switching to PyObject_VAR_HEAD is different issue. The patch can be pushed without changes to dictobject.c and the structure of a dict.
msg283197 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-14 15:15
I wrote a short script to check the size of dict:
---
import sys

def size(obj):
    print(sys.getsizeof(obj))

size({})
size({i:i for i in range(3)})
size({i:i for i in range(10)})
size({i:i for i in range(100)})
---

On Linux x86_64, the sizes don't change with the patch:

240
240
368
4704

A size increase would be a regression, but it's not the case, so it's fine ;-)
msg283200 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-14 15:37
PyDict_GET_SIZE-3.patch: Except of the changes in _datetimemodule.c and typeobject.c on *dictptr type, the patch LGTM. I suggest to open a new issue for these two specific changes, since it changes the behaviour and it's a weird corner case unrelated to this issue.
msg283201 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-14 15:59
I don't think the changes in _datetimemodule.c and typeobject.c change the behavior.
msg283204 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-14 16:08
Ignore my previous comment, PyDict_GET_SIZE-3.patch LGTM.


> I don't think the changes in _datetimemodule.c and typeobject.c change the behavior.

Oh wait, it seems like I misunderstood the new code.

Modules/_datetimemodule.c:

         dictptr = _PyObject_GetDictPtr(self);
-        if (dictptr && *dictptr && PyDict_Size(*dictptr)) {
+        if (dictptr && *dictptr &&
+            (!PyDict_Check(*dictptr) || PyDict_GET_SIZE(*dictptr))) {
             state = *dictptr;

If *dictptr is set and is not a dict, the test is true for the old and new code. Hum, I misunderstood the new code: I understood that (!PyDict_Check(*dictptr) || PyDict_GET_SIZE(*dictptr)) is false is *dictptr is not a dict, but it's true.

Hum, the only difference is that the old code raises an exception (SystemError), the new code doesn't raise an exception.

Since I'm not able to create an object with a __dict__ attribute which is not a dict, I don't really case of the small behaviour change.
msg283400 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-12-16 14:19
New changeset dbf72357cb4a by Serhiy Storchaka in branch 'default':
Issue #28959: Added private macro PyDict_GET_SIZE for retrieving the size of dict.
https://hg.python.org/cpython/rev/dbf72357cb4a
msg283406 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-16 15:09
After further analysis I conclude that instance dict can be only NULL or a dict. Therefore PyDict_Size() is correctly used in typeobject.c and _datetimemodule.c. It never fails, and additional PyDict_Check() is not needed.

Committed changes don't include switching a dict structure to PyVarObject. Separate issue28988 is opened for this.

Thanks Raymond, Inada and Victor for your reviews.
History
Date User Action Args
2016-12-16 15:09:04serhiy.storchakasetstatus: open -> closed
resolution: fixed
messages: + msg283406

stage: patch review -> resolved
2016-12-16 14:19:17python-devsetnosy: + python-dev
messages: + msg283400
2016-12-14 16:08:41vstinnersetmessages: + msg283204
2016-12-14 15:59:23serhiy.storchakasetmessages: + msg283201
2016-12-14 15:37:43vstinnersetmessages: + msg283200
2016-12-14 15:15:03vstinnersetmessages: + msg283197
2016-12-14 09:59:13serhiy.storchakasetfiles: + PyDict_GET_SIZE-3.patch

messages: + msg283181
2016-12-14 09:08:18vstinnersetmessages: + msg283176
2016-12-14 08:57:47vstinnersetmessages: + msg283175
2016-12-14 08:54:33vstinnersetmessages: + msg283174
2016-12-14 05:40:02serhiy.storchakasetmessages: + msg283163
2016-12-14 02:20:16inada.naokisetmessages: + msg283158
2016-12-13 18:56:46serhiy.storchakasetfiles: + _PyDict_GET_SIZE-2.patch

messages: + msg283131
2016-12-13 16:56:40rhettingersetnosy: + rhettinger
messages: + msg283123
2016-12-13 13:07:01serhiy.storchakacreate