> The patch looks fine, but it looks a bit like benchmark chasing. Is the speed of builtin sum() of a sequence of integers important enough to do this bit of inlining?

Given that we already accepted essentially separate loops for the int, float, and everything-else cases, I think the answer is that this inlining doesn't add much to the existing triplication.
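
For readers who don't have the code in front of them, here is a rough Python model of that three-loop structure. It is only a sketch: the real implementation is C code in `builtin_sum`, the name `sum_model` is made up for illustration, and the overflow handling of the C int loop is left out.

```python
def sum_model(iterable, start=0):
    """Illustrative model of sum() with int and float fast paths.

    Hypothetical helper, not CPython's actual code.
    """
    it = iter(iterable)
    result = start

    # Loop 1: exact ints only. The C version accumulates in a machine
    # word here, which is where direct digit access would pay off.
    if type(result) is int:
        for item in it:
            if type(item) is not int:
                # Leave the fast path, carrying the running total along.
                result = result + item
                break
            result += item
        else:
            return result

    # Loop 2: floats, with ints mixed in.
    if type(result) is float:
        for item in it:
            if type(item) is not float and type(item) is not int:
                result = result + item
                break
            result += item
        else:
            return result

    # Loop 3: everything else goes through the generic add.
    for item in it:
        result = result + item
    return result
```

The point is that the int loop is already a separate special case, so speeding up its body doesn't create a new code path.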

> It may break if we change the internals of Py_Long, as Mark Shannon has been wanting to do for a while

I would assume that such a structural change would come with suitable macros to unpack the special 0-2 digit integers. Those would then apply here, too. As it stands, there are already several files scattered across the source tree that use direct digit access: ceval.c, _decimal.c, marshal.c. They are easy to find with grep, and my PR just adds one more.
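
To make "unpacking the 0-2 digit integers" concrete, here is a small Python model of the idea. It assumes 30-bit digits (check `sys.int_info.bits_per_digit` for a given build) and a signed size field whose sign carries the sign of the number. The helpers `to_digits` and `unpack_small` are hypothetical names for this sketch, not existing CPython macros.

```python
SHIFT = 30                  # assumption: 30-bit digits
MASK = (1 << SHIFT) - 1


def to_digits(n):
    """Model the internal layout: (signed size, little-endian digits)."""
    digits = []
    magnitude = abs(n)
    while magnitude:
        digits.append(magnitude & MASK)
        magnitude >>= SHIFT
    size = len(digits)
    return (-size if n < 0 else size), digits


def unpack_small(size, digits):
    """Return the value for ints of at most two digits, else None.

    Hypothetical stand-in for the kind of macro a layout change
    would be expected to provide.
    """
    if size == 0:
        return 0
    if abs(size) == 1:
        value = digits[0]
    elif abs(size) == 2:
        value = digits[0] | (digits[1] << SHIFT)
    else:
        return None  # too large for the fast path; use the generic code
    return value if size > 0 else -value
```

If the layout changed, only a helper like `unpack_small` would need updating, and every caller that goes through it would keep working.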
