Message 241179
Matt Mackall brought this over to me today.
If list "l" has one million pairs and you do
dict(l)
it creates the dict, then resizes it to hold a million elements, then starts adding key/value pairs. But since it's sized for exactly a million elements, the insertions push it past its "fill ratio" comfort level of 666k and it resizes again. And when you're up to a million elements, that resize is quite slow. It should have presized the dict correctly from the beginning so that no resize was necessary.
The same is true for dict.fromkeys: passed a list of 1m elements, it uses an initial size of 1m, not 1.5m.
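To make that concrete, here's a quick, throwaway reproducer (not part of the attached patch) that builds a million-element list and times both constructors; the exact numbers and the 666k threshold depend on the build, so treat it only as a way to observe the cost of the extra resize:

# Rough reproducer for the behavior described above (hypothetical,
# not from the attached patch).  Any internal resize of the dict
# shows up as extra time inside these two calls.
import time

n = 1000000
pairs = [(i, i) for i in range(n)]
keys = list(range(n))

start = time.time()
d1 = dict(pairs)             # presized for ~1m, so it resizes once it passes ~666k entries
print("dict(pairs):         %.3f s" % (time.time() - start))

start = time.time()
d2 = dict.fromkeys(keys)     # same presize concern for fromkeys
print("dict.fromkeys(keys): %.3f s" % (time.time() - start))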
Attached is a sample patch that changes the behavior of both these operations so that no resizes are needed during the insertions. I haven't verified that it actually fixes the problem; I just made the change and ran the regression tests. (I'm oversubscribed here at the sprints, and am kind of rushing this diff out just to get the conversation rolling.)
Benjamin: what do you think? Would it be appropriate to make a change like this in 2.7?
Python 3 (or at least 3.5 trunk) is fancy here. It starts with the minimum size of a dict, then keeps doubling until the new size is greater than 1.5x the number of elements. This means the dicts it creates are the same size they would be if we'd started with a minimum-size dict and inserted one element at a time. That might help minimize wear and tear on the small-block allocator for smaller dicts.
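For reference, here's a rough Python model of that sizing rule; the real logic lives in C in Objects/dictobject.c, and the minimum table size (assumed to be 8 below) is an implementation detail, so this is only a sketch:

# Sketch of the Python 3 presize rule described above: start at the
# minimum table size and keep doubling until the table is more than
# 1.5x the expected number of elements, so the 2/3 fill ratio is never
# exceeded during the initial insertions.  The minimum size of 8 is an
# assumption about the CPython build.
def presized_table_size(expected):
    newsize = 8
    while newsize <= expected * 1.5:
        newsize <<= 1
    return newsize

print(presized_table_size(1000000))   # 2097152; 2/3 of that is ~1.4m, so 1m inserts never trigger a resize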
BTW, the internal function _PyDict_NewPresized should really have taken the fill ratio into account too. But, even though it's an internal-only function, it's probably too late to change its behavior.
Date                | User  | Action | Args
--------------------+-------+--------+-----------------------------------------------------------------------
2015-04-15 23:35:44 | larry | set    | recipients: + larry, benjamin.peterson, mpm, marmoute
2015-04-15 23:35:44 | larry | set    | messageid: <1429140944.27.0.450320519696.issue23971@psf.upfronthosting.co.za>
2015-04-15 23:35:44 | larry | link   | issue23971 messages
2015-04-15 23:35:44 | larry | create |