New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The __module__ attribute of non-heap classes is not interned #70043
Comments
One of purposes of the STACK_GLOBAL opcode introduced in pickle protocol 4 is to avoid repeating module name for different globals of the same module. >>> pickletools.dis(pickletools.optimize(pickle.dumps([sys.getsizeof, sys.intern], 4)))
0: \x80 PROTO 4
2: \x95 FRAME 33
11: ] EMPTY_LIST
12: ( MARK
13: \x8c SHORT_BINUNICODE 'sys'
18: \x94 MEMOIZE (as 0)
19: \x8c SHORT_BINUNICODE 'getsizeof'
30: \x93 STACK_GLOBAL
31: h BINGET 0
33: \x8c SHORT_BINUNICODE 'intern'
41: \x93 STACK_GLOBAL
42: e APPENDS (MARK at 12)
43: . STOP
highest protocol among opcodes = 4 But this doesn't work with the itertools module. >>> pickletools.dis(pickletools.optimize(pickle.dumps([itertools.chain, itertools.accumulate], 4)))
0: \x80 PROTO 4
2: \x95 FRAME 47
11: ] EMPTY_LIST
12: ( MARK
13: \x8c SHORT_BINUNICODE 'itertools'
24: \x8c SHORT_BINUNICODE 'chain'
31: \x93 STACK_GLOBAL
32: \x8c SHORT_BINUNICODE 'itertools'
43: \x8c SHORT_BINUNICODE 'accumulate'
55: \x93 STACK_GLOBAL
56: e APPENDS (MARK at 12)
57: . STOP
highest protocol among opcodes = 4 That is because the __module__ attribute of itertools members is not interned. >>> sys.getsizeof.__module__ is sys.intern.__module__
True
>>> itertools.chain.__module__ is itertools.chain.__module__
False In addition to inefficient pickle this perhaps leads to small performance hit on accessing the __module__ attribute or using its value as dictionary key. |
Actually this is not specific to itertools. Every non-heap class not from the builtins module is affected. Proposed patch just interns the __module__ value. The patch also cleans up the code. |
Interning the __module__ string causes small performance hit: $ ./python -m timeit -s "from itertools import chain" -- "chain.__module__; chain.__module__; chain.__module__; chain.__module__; chain.__module__; chain.__module__; chain.__module__; chain.__module__; chain.__module__; chain.__module__" Unpatched: 1.93 usec per loop This can be avoided if cache created string in type's __dict__. Following patch makes __module__ retrieving for non-heap types as fast as for heap types: Patched2: 0.871 usec per loop |
Could anyone please make a review? |
I don't understand this code: |
Good catch Benjamin! There is yet one bug -- the type type already has the __module__ attribute, but it is a descriptor, not a string. Updated patch fixes these bugs. |
I don't understand why you need to check the validity of tp_dict at all. We generally assume it's a dict. |
Indeed, the PyDict_Check() check can be omitted. |
I don't think it can be NULL either. On Wed, Sep 7, 2016, at 11:36, Serhiy Storchaka wrote:
|
It can be NULL in very rare cases. See for example bpo-26906. |
I think it's better not write into the type dict in a getter. You might just use PyUnicode_InternFromString every time. FWIW, __name__ also has this behavior. |
This is what my first path does. But this slows down retrieving the __module__ attribute (from 0.2 usec to 0.4 usec on my computer). Maybe I haven't bother. Interning __name__ and __qualname__ is less important, because different functions and classes usually have different names. |
I'm not too worried about slowing down __module__ especially since it's On Thu, Sep 8, 2016, at 14:39, Serhiy Storchaka wrote:
|
New changeset 861ddad3e0c1 by Serhiy Storchaka in branch 'default': |
Thank you for helpful review Benjamin. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: