classification
Title: property_descr_get reuse argument tuple
Type: enhancement Stage:
Components: Interpreter Core Versions: Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: barry, eric.smith, eric.snow, llllllllll, python-dev, rhettinger, serhiy.storchaka
Priority: low Keywords: patch

Created on 2015-04-10 20:10 by llllllllll, last changed 2015-05-24 13:12 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
namedtuple.patch llllllllll, 2015-04-10 20:10 namedtuple in C. review
namedtuple-indexer.patch llllllllll, 2015-04-11 07:11 review
namedtuple-indexer-update.patch llllllllll, 2015-04-12 18:48 review
namedtuple-indexer-with-news.patch llllllllll, 2015-04-14 16:25 review
property.patch llllllllll, 2015-04-27 00:44 review
with-static-tuple.patch llllllllll, 2015-04-27 12:38 review
Messages (25)
msg240447 - (view) Author: Joe Jevnik (llllllllll) * Date: 2015-04-10 20:10
I am working on implementing namedtuple in C and am almost there; however, on the path to full compatibility, I ran into a refcount issue somewhere. Hopefully someone can help me work this out.

To describe the issue: when I run the collections tests I most frequently segfault in a non-namedtuple test; sometimes, however, the suite runs (and passes), which probably means I am using an unowned object somewhere. I am new to the CPython API, so I would love to go through this with someone to learn more about how it all works.

I am currently at PyCon and would absolutely love to speak to someone in person. I will be at the CPython sprints for at least one day.
msg240448 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2015-04-10 20:44
What's the motivating use case for this?
msg240449 - (view) Author: Joe Jevnik (llllllllll) * Date: 2015-04-10 21:20
Ideally, namedtuple is used to make your code cleaner: using "magic" indices is less clear than using a named index in a lot of cases. Because namedtuple exists mainly to make code more readable, I don't think it should have an impact on the runtime performance of the code. This means that namedtuple should be a very thin wrapper around tuple. Currently, namedtuple named item access is much slower than elementwise access. I have this as a standalone package (there are some changes in the diff I posted to achieve full backwards compatibility) here: https://pypi.python.org/pypi/cnamedtuple/0.1.5, which shows some profiling runs of this code. The notable result to me is named item access being around twice as fast.

Another issue we ran into at work is that we found ways to get the exec call in namedtuple to execute arbitrary code; while this would not be a concern for most people, given the way we got it to happen, it made us want to look at other ways of writing namedtuple. I looked through some older discussion and found that non-exec solutions were rejected in the past for performance reasons.
msg240450 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2015-04-10 21:29
Have you thought of just exposing Objects/structseq.c?

Before you put much time into this, I'd get Raymond's acceptance of whatever approach you want to take. It might be best to raise it on python-ideas.
msg240451 - (view) Author: Joe Jevnik (llllllllll) * Date: 2015-04-10 21:54
Would the idea be to deprecate namedtuple in favor of a public structseq that is exposed through collections, or to change structseq to fit the namedtuple API?
msg240452 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2015-04-10 22:25
I haven't thought it through; it just seems very similar to a C namedtuple.
msg240465 - (view) Author: Joe Jevnik (llllllllll) * Date: 2015-04-11 07:11
I stripped down the patch to only the descriptor like we had discussed.
msg240535 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-04-12 04:25
Can you post before and afters timings of the patch?
msg240565 - (view) Author: Joe Jevnik (llllllllll) * Date: 2015-04-12 18:48
# Original version / new python implementation
./python -m timeit -s "from collections import namedtuple;a = namedtuple('a', 'a b c')(1, 2, 3)" "a.a"
10000000 loops, best of 3: 0.07 usec per loop


# C implementation
./python -m timeit -s "from collections import namedtuple;a = namedtuple('a', 'a b c')(1, 2, 3)" "a.a"
10000000 loops, best of 3: 0.028 usec per loop


The fallback is the same implementation that is currently used, so this should have no effect on pypi.
msg240568 - (view) Author: Joe Jevnik (llllllllll) * Date: 2015-04-12 20:56
sorry, I meant pypy
msg240934 - (view) Author: Joe Jevnik (llllllllll) * Date: 2015-04-14 16:25
I am updating the patch to include an entry in Misc/NEWS.
msg242082 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-04-26 21:33
FWIW, the current property(itemgetter(index)) code has a Python creation step, but the actual attribute lookup and dispatch are done entirely in C (no pure Python steps around the attribute lookup).

Rather than making a user-visible C hack directly to namedtuple, any optimization effort should be directed at improving the speed of property() and itemgetter().

Here are some quick notes to get you started:

* The overhead of property() is almost nothing.
* The code for property_descr_get() in Objects/descrobject.c
* It has two nearly zero cost predictable branches
* So the only real work is a call to 
  PyObject_CallFunctionObjArgs(gs->prop_get, obj, NULL);
* which then calls both
       objargs_mktuple(vargs) 
  and 
       PyObject_Call(callable, args, NULL);
* That can be sped-up by using 
       PyTuple_New(1)
  and a direct call to  PyObject_Call()
* The logic in PyObject_Call can potentially be tightened
  in the context of a property(itemgetter(index)) call.
  Look to see whether recursion check is necessary
  (itemgetter on a tuple subclass that doesn't extend __getitem__
   is non-recursive)
* If so, then the entire call to PyObject_Call() in property 
  can potentially be simplified to:
   result = (*call)(func, arg, kw);

I haven't looked too closely at this, but I think you get the general idea.  If the speed of property(itemgetter(index)) is the bottleneck in your code, the starting point is to unwind the exact series of C steps performed to see if any of them can be simplified.  For the most part, the code in property() and itemgetter() was implemented using the simplest parts of the C API rather than the fastest.  The chain of calls isn't specialized for the common use case (i.e. property_descr_get() needing exactly one argument rather than a variable-length arg tuple, and itemgetter doing a known integer offset on a list or tuple rather than the less common case of generic types and a possible tuple of indices).

We should start by optimizing what we've already got.  That will have a benefit beyond named tuples (faster itemgetters for sorting and faster property gets for the entire language).  

It also helps us avoid making the NT code less familiar (using custom private APIs rather than generic, well-known components).  

It also reduces the risk of breaking code that relies on the published implementation of named tuple attribute lookups (for example, I've seen deployed code that customizes the attribute docstrings like this):
   Point = namedtuple('Point', ['x', 'y']) 
   Point.x = property(Point.x.fget, doc='abscissa')
   Point.y = property(Point.y.fget, doc='ordinate')
   coordinate = Point(x=250, y=43)
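[For reference, the dispatch chain described above can be reproduced by hand. The following is a simplified sketch of what the 3.5-era namedtuple code generation produced, not the actual generated class; every attribute access goes property_descr_get -> itemgetter_call -> tuple indexing.]

```python
from operator import itemgetter

# Hand-rolled equivalent of namedtuple('Point', ['x', 'y']) circa 3.5:
# each field is a property wrapping an itemgetter.
class Point(tuple):
    __slots__ = ()

    def __new__(cls, x, y):
        return tuple.__new__(cls, (x, y))

    x = property(itemgetter(0), doc='Alias for field number 0')
    y = property(itemgetter(1), doc='Alias for field number 1')

p = Point(250, 43)
print(p.x, p.y)        # 250 43
print(type(Point.x))   # <class 'property'>
```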
msg242085 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-04-26 23:04
One other thought: the itemgetter.__call__ method is already pretty thin:

    if (!PyArg_UnpackTuple(args, "itemgetter", 1, 1, &obj))
        return NULL;
    if (nitems == 1)
        return PyObject_GetItem(obj, ig->item);

But you could add a special case for single integer index being applied to a known sequence.  Extract the Py_ssize_t index in itemgetter_new and store it in the itemgetterobject.  Then add a fast path in itemgetter.__call__.  Roughly something like this:

  if (ig->index != -1 &&
      PyTuple_Check(obj) && 
      nitems == 1 &&
      PyTuple_GET_SIZE(obj) > ig->index) {
       result = PySequence_Fast_GET_ITEM(obj, ig->index);
       Py_INCREF(result);
       return result;
  }
 
Perhaps also add a check to make sure the tuple subclass hasn't overridden the __getitem__ method (see an example of how to do this in the code for Modules/_collectionsmodule.c::_count_elements()).
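[A pure-Python sketch of the hazard that subclass check guards against, for illustration only: a tuple subclass can override __getitem__, so reading the internal array directly would bypass the override.]

```python
class WeirdTuple(tuple):
    # A tuple subclass that overrides item access.
    def __getitem__(self, i):
        return 'overridden'

wt = WeirdTuple((1, 2, 3))
print(wt[0])                     # 'overridden'
# A C fast path using PySequence_Fast_GET_ITEM would have returned 1 here,
# which is the equivalent of bypassing the override:
print(tuple.__getitem__(wt, 0))  # 1
```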

Something along these lines would allow all the steps to be inlined and would eliminate all the usual indirections inherent in the abstract API.

Another alternative is to use the PySequence API instead of the PyTuple API.  That trades away a little of the speed-up in return for speeding-up itemgetter on lists as well as tuples.
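[To illustrate from the Python side why the fast path must be gated on a single integer index applied to a genuine sequence — a small sketch, not part of the patch: itemgetter also supports multiple indices and mapping keys, which must stay on the generic PyObject_GetItem path.]

```python
from operator import itemgetter

get_b = itemgetter(1)       # single index: the case the C fast path targets
get_ac = itemgetter(0, 2)   # tuple of indices: must take the generic path

row = ('a', 'b', 'c')
print(get_b(row))    # 'b'
print(get_ac(row))   # ('a', 'c')

# itemgetter is also generic over mappings, so the fast path must check
# PyTuple_Check(obj) before indexing the internal array directly.
print(itemgetter('key')({'key': 'value'}))   # 'value'
```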
msg242089 - (view) Author: Joe Jevnik (llllllllll) * Date: 2015-04-27 00:44
I was unable to see a performance increase by playing with the itemgetter.__call__ code; however, updating the property code seemed to show a small improvement. I think that for simple indexing the cost of checking whether the object is a sequence outweighs the faster dispatch (when using PySequence_GetItem). I can play with this further.

* default
[joejev@Sheila cpython]$ ./python -m timeit -s "from collections import namedtuple as n;a = n('n', 'a b c')(1, 2, 3)" "a.a"
10000000 loops, best of 3: 0.101 usec per loop

* patch
[joejev@Sheila cpython]$ ./python -m timeit -s "from collections import namedtuple as n;a = n('n', 'a b c')(1, 2, 3)" "a.a"
10000000 loops, best of 3: 0.0942 usec per loop
msg242091 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-04-27 01:30
If you have a chance, run a C profiler so we can see whether most of the time is being spent in attribute lookup for the current property(itemgetter) approach versus your nt-indexer approach.  Without a profile, I have only my instincts that the overhead is a mix of indirections and function call overhead (both solvable by in-lining), over-generalization in PyObject_GetItem() (solvable by specializing to a tuple subclass), and variable-length argument lists (solvable by using PyTuple_New(1)).

Ideally, I would like something that speeds up broader swaths of the language and doesn't obfuscate the otherwise clean generated code.  ISTM that the C code for both property() and itemgetter() still has room to optimize the common case.
msg242093 - (view) Author: Joe Jevnik (llllllllll) * Date: 2015-04-27 02:19
This was very exciting; I have never run gprof before, so just to make sure I did this correctly, I will list my steps:

1. compile with the -pg flag set
2. run the test with ./python -m timeit ...
3. $ gprof python gmon.out > profile.out

Here is default:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 22.48      0.87     0.87                             PyEval_EvalFrameEx
  9.82      1.25     0.38                             PyObject_CallFunctionObjArgs
  6.85      1.52     0.27                             PyObject_GenericGetAttr
  6.46      1.77     0.25                             tupledealloc
  5.56      1.98     0.22                             PyArg_UnpackTuple
  5.17      2.18     0.20                             PyNumber_AsSsize_t
  5.17      2.38     0.20                             tuplesubscript
  5.04      2.58     0.20                             PyObject_Call
  4.91      2.77     0.19                             _PyType_Lookup
  4.65      2.95     0.18                             PyTuple_New
  4.65      3.13     0.18                             PyObject_GetItem
  4.39      3.30     0.17                             itemgetter_call
  1.94      3.37     0.08                             PyObject_GetAttr
  1.81      3.44     0.07                             PyObject_GC_UnTrack
  1.81      3.51     0.07                             _PyTuple_DebugMallocStats
  1.03      3.55     0.04                             PyErr_Occurred
  1.03      3.59     0.04                             property_descr_set
  1.03      3.63     0.04                             tuplerepr
  0.78      3.66     0.03                             PyLong_AsSsize_t
  0.78      3.69     0.03                             PyObject_GC_Track
  0.52      3.71     0.02                             property_descr_get
  0.52      3.73     0.02                             repeat_next
  0.52      3.75     0.02                             repeat_traverse
  0.39      3.77     0.02                             PyArg_ValidateKeywordArguments
  0.39      3.78     0.02                             _Py_CheckFunctionResult
  0.26      3.79     0.01                             PyCFunction_NewEx
  0.26      3.80     0.01                             PyCode_New
  0.26      3.81     0.01                             PyErr_Restore
  0.26      3.82     0.01                             PyType_GetSlot
  0.26      3.83     0.01                             _PyObject_CallMethodIdObjArgs
  0.26      3.84     0.01                             attrgetter_new
  0.26      3.85     0.01                             update_one_slot
  0.13      3.86     0.01                             _PyObject_GenericGetAttrWithDict
  0.13      3.86     0.01                             _PyObject_SetAttrId
  0.13      3.87     0.01                             gc_isenabled
  0.13      3.87     0.01                             visit_decref


Here is my patch:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 26.63      1.02     1.02                             PyEval_EvalFrameEx
  8.88      1.36     0.34                             PyObject_GenericGetAttr
  7.83      1.66     0.30                             tupledealloc
  6.27      1.90     0.24                             PyObject_Call
  5.74      2.12     0.22                             PyTuple_New
  5.48      2.33     0.21                             property_descr_get
  5.22      2.53     0.20                             _PyType_Lookup
  4.44      2.70     0.17                             tuplesubscript
  4.18      2.86     0.16                             PyArg_UnpackTuple
  3.92      3.01     0.15                             PyNumber_AsSsize_t
  3.66      3.15     0.14                             PyObject_GetItem
  2.87      3.26     0.11                             itemgetter_call
  2.61      3.36     0.10                             PyObject_GC_UnTrack
  1.70      3.43     0.07                             PyObject_GetAttr
  1.31      3.48     0.05                             repeat_next
  1.31      3.53     0.05                             repeat_traverse
  1.04      3.57     0.04                             attrgetter_new
  1.04      3.61     0.04                             property_descr_set
  0.78      3.64     0.03                             PyErr_Occurred
  0.78      3.67     0.03                             PyErr_Restore
  0.78      3.70     0.03                             PyLong_AsSsize_t
  0.78      3.73     0.03                             PyType_GetSlot
  0.52      3.75     0.02                             PyObject_GC_Track
  0.26      3.76     0.01                             PyArg_ValidateKeywordArguments
  0.26      3.77     0.01                             PyDict_GetItem
  0.26      3.78     0.01                             _PyObject_GenericGetAttrWithDict
  0.26      3.79     0.01                             _PyTuple_DebugMallocStats
  0.26      3.80     0.01                             _Py_CheckFunctionResult
  0.26      3.81     0.01                             convertitem
  0.26      3.82     0.01                             r_object
  0.26      3.83     0.01                             tuplerepr
  0.13      3.83     0.01                             _PyObject_SetAttrId

It looks like you were correct that PyObject_CallFunctionObjArgs was eating up a lot of time.
msg242103 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-04-27 07:26
Hmm, the presence of _PyTuple_DebugMallocStats, repeat_traverse, and visit_decref suggests that the profile may have been run with debugging code enabled and the GC enabled.

The property patch looks good.  Depending on how far you want to go with this, you could save your tuple of length-1 statically and reuse it on successive calls if its refcnt doesn't grow (see the code for zip() for an example of how to do this).  That would save the PyTuple_New and tupledealloc calls.

Going further, you could potentially in-line some of the code in PyObject_Call, caching the call site and its NULL check, and look at the other steps to see whether they are all necessary in the context of a property_descr_get().
msg242112 - (view) Author: Joe Jevnik (llllllllll) * Date: 2015-04-27 12:38
I switched to the static tuple.
msg242114 - (view) Author: Joe Jevnik (llllllllll) * Date: 2015-04-27 13:13
I don't think that I can cache the __call__ of the fget object, because it might be an instance of a heap type, and if someone changed the __class__ of the object between calls to another heap type that had a different __call__, you would still get the __call__ from the first type. I also don't know if this is supported behavior or just something that works by accident.

I read through PyObject_Call, and all the code is needed assuming we are not caching the tp_call value.
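[The __class__-reassignment hazard described above is easy to demonstrate from Python — a toy example with two compatible heap types, not from the patch:]

```python
class A:
    def __call__(self):
        return 'A'

class B:
    def __call__(self):
        return 'B'

obj = A()
print(obj())        # A
obj.__class__ = B   # allowed for heap types with compatible layouts
print(obj())        # B -- a tp_call cached from A's type would now be stale
```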
msg242123 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-04-27 15:42
What kind of speed improvement have you gotten?
msg242124 - (view) Author: Joe Jevnik (llllllllll) * Date: 2015-04-27 15:53
I am currently on a different machine so these numbers are not relative to the others posted earlier.

* default
./python -m timeit -s "from collections import namedtuple as n;a = n('n', 'a b c')(1, 2, 3)" "a.a"
10000000 loops, best of 3: 0.0699 usec per loop

* patch
./python -m timeit -s "from collections import namedtuple as n;a = n('n', 'a b c')(1, 2, 3)" "a.a"
10000000 loops, best of 3: 0.0468 usec per loop
msg242234 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-04-29 16:26
>  0.0699 usec per loop --> 0.0468 

That's pretty good for a small patch :-)


For the pre-computed 1-tuple, I think you need to check for a refcnt of 1 and fall back to PyTuple_New if the tuple is in use (i.e. a property that calls another property).  See Objects/enumobject.c::105 for an example.

Also, consider providing a way to clean up that tuple on shutdown.  For example, look at what is done with the various freelists.  An easier way is to make the premade tuple part of the property object struct so that it gets freed when the property is deallocated.

Adding Serhiy to the nosy list, he can help with cleaning-up the patch so that it is ready to apply.
msg242235 - (view) Author: Joe Jevnik (llllllllll) * Date: 2015-04-29 17:18
I don't think that we need to worry about reusing the single argument tuple in a recursive situation, because we never need the value after we start the call. We also just write our new value and then clear the slot to NULL afterwards, to make sure that we don't blow up when we dealloc the tuple. For example:

>>> class C(object):
...     @property
...     def f(self):
...         return D().f
... 
>>> class D(object):
...     @property
...     def f(self):
...         return 1
... 
>>> C().f
1


This works with recursive properties.
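[For illustration, the reuse scheme can be sketched in pure Python — all names here are invented; the real patch caches a C-level 1-tuple and checks Py_REFCNT rather than using a module global:]

```python
_cached_args = None  # stand-in for the C-level cached 1-tuple

def call_getter(fget, obj):
    """Call fget(obj), reusing a cached argument container when it is free."""
    global _cached_args
    args = _cached_args
    _cached_args = None          # take ownership: a recursive call sees None
    if args is None:
        args = [None]            # stand-in for PyTuple_New(1)
    args[0] = obj
    try:
        return fget(*args)
    finally:
        args[0] = None           # clear the slot so obj is not kept alive
        if _cached_args is None:
            _cached_args = args  # stash for the next call

print(call_getter(lambda x: x * 2, 21))   # 42
# A recursive getter simply allocates a second container:
print(call_getter(lambda x: call_getter(lambda y: y + 1, x), 10))   # 11
```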
I also think it is getting cleaned up on shutdown: if I put a pointer to garbage in the tuple, the interpreter blows up on shutdown, which makes me think that tuple_dealloc is being called somewhere. About putting the tuple on the property instance: that would be nice for memory management; however, it increases the memory overhead of each property. It also means we would only get the faster lookup after a given property has been accessed once; this is fine, but the current implementation makes all properties faster after any of them is looked up once. I could be wrong about the cleanup, though.

I am also updating the title and headers because this issue is no longer about namedtuple.
msg242273 - (view) Author: Roundup Robot (python-dev) Date: 2015-04-30 15:08
New changeset 661cdbd617b8 by Raymond Hettinger in branch 'default':
Issue #23910: Optimize property() getter calls.  Patch by Joe Jevnik
https://hg.python.org/cpython/rev/661cdbd617b8
msg243981 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-24 13:12
This patch caused crash with C implementation of lru_cache (issue14373). Opened issue24276 for fixing this.
History
Date User Action Args
2015-05-24 13:12:40serhiy.storchakasetmessages: + msg243981
2015-04-30 15:20:53rhettingersetstatus: open -> closed
resolution: fixed
2015-04-30 15:08:23python-devsetnosy: + python-dev
messages: + msg242273
2015-04-29 17:18:54llllllllllsetmessages: + msg242235
components: + Interpreter Core, - Extension Modules
title: C implementation of namedtuple (WIP) -> property_descr_get reuse argument tuple
2015-04-29 16:26:07rhettingersetnosy: + serhiy.storchaka
messages: + msg242234
2015-04-27 15:53:12llllllllllsetmessages: + msg242124
2015-04-27 15:42:33rhettingersetmessages: + msg242123
2015-04-27 13:13:23llllllllllsetmessages: + msg242114
2015-04-27 12:38:17llllllllllsetfiles: + with-static-tuple.patch

messages: + msg242112
2015-04-27 07:26:48rhettingersetmessages: + msg242103
2015-04-27 02:19:03llllllllllsetmessages: + msg242093
2015-04-27 01:30:57rhettingersetmessages: + msg242091
2015-04-27 00:44:08llllllllllsetfiles: + property.patch

messages: + msg242089
2015-04-26 23:04:43rhettingersetmessages: + msg242085
2015-04-26 21:33:37rhettingersetpriority: normal -> low

messages: + msg242082
2015-04-14 16:25:09llllllllllsetfiles: + namedtuple-indexer-with-news.patch

messages: + msg240934
2015-04-12 20:56:31llllllllllsetmessages: + msg240568
2015-04-12 18:48:04llllllllllsetfiles: + namedtuple-indexer-update.patch

messages: + msg240565
2015-04-12 04:25:54rhettingersetmessages: + msg240535
2015-04-11 07:11:10llllllllllsetfiles: + namedtuple-indexer.patch

messages: + msg240465
2015-04-11 05:59:50rhettingersetassignee: rhettinger
2015-04-11 04:32:57serhiy.storchakasetnosy: + rhettinger
2015-04-10 23:44:57eric.snowsetnosy: + eric.snow
2015-04-10 22:25:34eric.smithsetmessages: + msg240452
2015-04-10 21:54:44llllllllllsetmessages: + msg240451
2015-04-10 21:48:37barrysetnosy: + barry
2015-04-10 21:29:37eric.smithsetmessages: + msg240450
2015-04-10 21:20:33llllllllllsetmessages: + msg240449
2015-04-10 20:44:10eric.smithsetnosy: + eric.smith
messages: + msg240448
2015-04-10 20:10:37llllllllllcreate