I tried re-inlining the fast path from _PyType_Lookup() in object.c and found no measurable improvement on the simple benchmarks I ran.  I also stress-tested the patch by disabling the fast-path return, so that every lookup took the slow path, and asserting that any cached result matched the slow-path result.  I then ran that modified interpreter on the Python test suite, various benchmarks, and a range of my own applications.  This is not a formal proof of correctness, but it was encouraging that the cache remained consistent.
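
For anyone who wants to see the shape of that stress test, here is a minimal sketch in Python.  It is a toy model, not the patch itself: the names `cached_lookup`, `slow_lookup`, `invalidate`, and `VERIFY` are all hypothetical, and the real cache lives in C and is keyed differently.  The only point it illustrates is the technique: with verification on, a cache hit is never returned directly; the slow path always runs and the cached value is asserted to match it.

```python
# Toy model of the stress test described above; NOT CPython's implementation.
# Every name here is hypothetical and chosen only for illustration.

_cache = {}          # (type, attribute name) -> cached lookup result
_MISSING = object()  # sentinel: distinguishes "no entry" from a cached None

VERIFY = True        # True: never trust a hit, cross-check it against the slow path


def slow_lookup(tp, name):
    """Walk the MRO and return the first matching class attribute, or None."""
    for base in tp.__mro__:
        namespace = vars(base)
        if name in namespace:
            return namespace[name]
    return None


def invalidate(tp):
    """Drop cached entries for tp and all of its subclasses."""
    for key in [k for k in _cache if issubclass(k[0], tp)]:
        del _cache[key]


def cached_lookup(tp, name):
    cached = _cache.get((tp, name), _MISSING)
    if cached is not _MISSING and not VERIFY:
        return cached  # the normal fast-path return, disabled when VERIFY is set
    result = slow_lookup(tp, name)
    if cached is not _MISSING:
        # The consistency check: a hit must agree with the slow path.
        assert cached is result, f"stale cache entry for {tp.__name__}.{name}"
    _cache[(tp, name)] = result
    return result
```

In this sketch, changing a class attribute without calling `invalidate()` leaves a stale entry behind, and the next `cached_lookup()` for that name trips the assertion.  That is the kind of inconsistency the modified interpreter was built to catch.  Note that the check relies on `assert`, so it disappears if Python is run with `-O`.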
