Fixed in trunk with r73564.

I ran performance tests: the differences under pybench were negligible
(<1%), but a specially crafted case like:
   kw = dict(a=1, b=2, c=3)
   for x in xrange(self.rounds):
showed an improvement of 21%!

Will backport to various branches after 3.1 is out.
