
Author spiv
Recipients belopolsky, rhettinger, spiv
Date 2010-05-12.13:14:54
Message-id <1273670097.33.0.157851393088.issue8685@psf.upfronthosting.co.za>
Content
Regarding memory, good question... but this patch turns out to be an improvement there too.

This optimisation only applies when len(x) > len(y) * 4, so the result is guaranteed to contain at least 3/4 of the elements of x (and may well end up being a full copy of x anyway).

So, if you like, this optimisation simply takes advantage of the fact that we are going to be copying almost all of these elements anyway.  We could make it less aggressive, but large sets are already tuned to be between 1/2 and 1/3 empty internally, so at most 1/4 of wasted entries seems a reasonable overhead.  See the sketch below.
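
In Python terms the idea is roughly the following. This is only a minimal sketch, assuming x is the large set and y the small one; the actual patch is C code in Objects/setobject.c, and the function name here is illustrative, not anything from the patch.

    def difference_sketch(x, y):
        if len(x) > len(y) * 4:
            # Copy the large set in one step so the result is allocated
            # at roughly the right size, then discard y's few elements.
            result = set(x)
            result.difference_update(y)
            return result
        # Otherwise build the result one element at a time, as before.
        return {item for item in x if item not in y}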

Also, because this code immediately sizes the result set to be about right, rather than growing it one element at a time, memory consumption is actually *better*.  I'll attach a script that demonstrates this; for me it shows that large_set.difference(small_set) [where large_set has 4M elements and small_set has 100] peaks at 50MB of memory consumption without my patch, but only 18MB with it (after discounting the memory required for large_set itself, etc.).
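
A measurement along these lines could be done roughly as follows. This is not the attached script, just a hedged sketch; it uses tracemalloc, which only exists in later Python versions, and the numbers 4_000_000 and 100 match the sizes quoted above.

    import tracemalloc

    def peak_difference_memory(large_size=4_000_000, small_size=100):
        large_set = set(range(large_size))
        small_set = set(range(small_size))

        # Start tracing only after large_set exists, so its own storage is
        # discounted and only the cost of building the result is counted.
        tracemalloc.start()
        result = large_set.difference(small_set)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        return peak

    if __name__ == "__main__":
        print("peak bytes while building the result:", peak_difference_memory())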