"Theoretically, an object type that consistently allocates more than the small object threshold would perform a bit slower because it would first jump to the small object allocator, do the size comparison and then jump to malloc."

I expect the cost of the extra check to be *very* cheap (completely negligible) compared to the cost of a call to malloc().
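As a rough illustration of what that extra check amounts to, here is a minimal Python model of the dispatch. The real logic is C code in CPython's Objects/obmalloc.c; the function names below are hypothetical stand-ins, and the 512-byte threshold is an assumption about the build being discussed.

```python
# Illustrative model only: the real dispatch is C code in Objects/obmalloc.c.
# The threshold value is an assumption; the function names are hypothetical.
SMALL_REQUEST_THRESHOLD = 512  # bytes


def small_object_alloc(nbytes):
    """Hypothetical stand-in for the small object allocator's own path."""
    return ("pymalloc", nbytes)


def system_malloc(nbytes):
    """Hypothetical stand-in for the C library's malloc()."""
    return ("malloc", nbytes)


def object_alloc(nbytes):
    """Serve small requests locally, hand everything else to malloc()."""
    # The "extra check" is this single integer comparison.
    if 0 < nbytes <= SMALL_REQUEST_THRESHOLD:
        return small_object_alloc(nbytes)
    # Large requests pay for the comparison, then for malloc() itself.
    return system_malloc(nbytes)
```

In this model, a type that always asks for more than the threshold pays one comparison and one extra call before reaching malloc(), which is the overhead being discussed.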

To get an idea of the cost of the Python code around system allocators, you can take a look at the Performance section of my PEP 445, which added an indirection to all Python allocators:

I was unable to measure an overhead on macro benchmarks. The overhead on microbenchmarks was really hard to measure because it was so low that the benchmarks were very unstable.
