FWIW, the inefficiency is only in the loop setup: the time to call
reversed() and __reversed__().  The inner loop runs at the same speed
because xrange provides a __reversed__ iterator.
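
A rough way to check this is to time both spellings at a tiny and a
large loop size.  This is only a sketch: it assumes Python 3, where
range stands in for xrange, and the helper names (same_elements,
time_loop, compare) are made up for illustration.  No timings are
quoted here since they vary by machine and build.

```python
import timeit

def same_elements(n):
    # Both spellings visit n-1, n-2, ..., 0 (assumes Python 3 range).
    return list(reversed(range(n))) == list(range(n - 1, -1, -1))

def time_loop(stmt, number=1000):
    # Best of five repeats, in seconds, for `number` executions of stmt.
    return min(timeit.repeat(stmt, number=number, repeat=5))

def compare(n, number=1000):
    # Hypothetical helper: time the reversed() spelling against the
    # explicit negative-step spelling for a loop of n iterations.
    via_reversed = time_loop("for i in reversed(range(%d)): pass" % n, number)
    via_step = time_loop("for i in range(%d - 1, -1, -1): pass" % n, number)
    return via_reversed, via_step

if __name__ == "__main__":
    for n in (1, 10000):
        print(n, compare(n))
```

If the point above holds, any gap should show up at n = 1, where setup
dominates, and shrink to noise at n = 10000, where the inner loop
dominates.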

Please do not go through the standard library making these minor tweaks
without first confirming a significant, measured speed-up, and do
consider the readability cost.
