
classification
Title: why is (default)dict so slow on tuples???
Type: performance
Stage:
Components:
Versions: Python 2.5

process
Status: closed
Resolution: works for me
Dependencies:
Superseder:
Assigned To:
Nosy List: amaury.forgeotdarc, belopolsky, eisele, loewis, rhettinger
Priority: normal
Keywords:

Created on 2008-04-10 16:21 by eisele, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name    Uploaded                     Description
a.py         loewis, 2008-04-10 19:15
Messages (10)
msg65293 - (view) Author: Andreas Eisele (eisele) Date: 2008-04-10 16:21
I need to count pairs of strings, and I use 
a defaultdict in a construct like

count[a,b] += 1

I am able to count 50K items per second on a very fast machine,
which is way too slow for my application.

If I count complete strings like

count[ab] += 1

it can count 500K items/second, which is more reasonable.

I don't see why there is a performance penalty of a factor
of 10 for such a simple construct.

Do I have to switch to Perl or C to get this done???

Thanks a lot for any insight on this.

Best regards,
Andreas

PS: The problem seems to exist for ordinary
dicts as well; it is not related to the fact that
I use a defaultdict.

PPS: I also tried nested defaultdicts
count[a][b] += 1
and get the same slow speed (and 50% more memory consumption).
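
A minimal sketch of the two constructs being compared, written for modern
Python 3; the list of pairs and the "\x00" separator in the string variant
are made up purely for illustration:

from collections import defaultdict
from time import perf_counter

pairs = [(str(i), str(i % 1000)) for i in range(500000)]  # made-up test data

# Tuple keys: count[a, b] += 1
count_tuple = defaultdict(int)
t0 = perf_counter()
for a, b in pairs:
    count_tuple[a, b] += 1
print("tuple keys:  %.2fs" % (perf_counter() - t0))

# Concatenated-string keys: count[ab] += 1 (separator keeps keys unambiguous)
count_str = defaultdict(int)
t0 = perf_counter()
for a, b in pairs:
    count_str[a + "\x00" + b] += 1
print("string keys: %.2fs" % (perf_counter() - t0))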
msg65296 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2008-04-10 16:37
My guess is that this is because strings cache their hash
values while tuples don't.  IIRC, caching tuples' hash values has been
proposed and rejected some time ago.  Furthermore, dict lookups are
heavily optimized for the case where all keys are strings.
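
A rough way to see the hash-caching point (not from the original report; the
object sizes and repeat count are arbitrary): hashing the same str repeatedly
is nearly free after the first call, because the result is cached on the
object, while a tuple recomputes its hash on every call.

from timeit import timeit

s = "x" * 10000
t = tuple(s)  # a 10000-element tuple

# str caches its hash, so repeated hash() calls are cheap
print(timeit("hash(s)", globals={"s": s}, number=100000))
# tuple recomputes its hash each time, so this is much slower
print(timeit("hash(t)", globals={"t": t}, number=100000))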
msg65309 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-04-10 19:15
I cannot reproduce this. For me, the attached script completes 1M
iterations in a total of 0.8s on a 3.2GHz Pentium 4.
msg65313 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-04-10 19:49
Questions about relative performance should be directed to 
comp.lang.python.

Since no bug or suspect behavior was reported, closing as not-a-bug.
msg65332 - (view) Author: Andreas Eisele (eisele) Date: 2008-04-11 02:46
Sorry for not giving a good example in the first place.
The problem seems to appear only in the presence of
sufficiently many distinct tuples. Then I see performance
that looks rather like O(n*n).
Here is an example that shows the problem:

>>> from time import clock
>>> d = {}
>>> t0 = clock()
>>> for i in range(5):
...     for j in range(i*1000000, (i+1)*1000000):
...         d[str(j), str(j)] = j
...     print clock() - t0
... 
13.04
39.51
81.86
134.18
206.66
>>>

The same example with str(j)+str(j) works fine.
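
For anyone re-running this test today: time.clock() is gone in Python 3, so a
rough equivalent of the session above would use perf_counter() instead
(everything else unchanged):

from time import perf_counter

d = {}
t0 = perf_counter()
for i in range(5):
    for j in range(i * 1000000, (i + 1) * 1000000):
        d[str(j), str(j)] = j
    print(perf_counter() - t0)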

Sorry if this should be a non-issue. For me it is a
reason to implement in C or Perl functionality that
I would really love to write in Python.
I would call such a thing a performance bug, but
maybe I'm just too demanding...

Best regards,
Andreas
msg65341 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-04-11 08:44
The slowdown is because of the garbage collector, which has more and
more objects to traverse (the tuples).
If I add "import gc; gc.disable()" at the beginning of your script, it
runs much faster, and the timings look linear.

Martin's sample is not affected, because there are very few
deallocations, and the gc collection is not triggered.

Disabling the gc may not be a good idea in a real application; I suggest
you play with the gc.set_threshold function and set larger values, at
least while building the dictionary. (700, 1000, 10) seems to yield good
results.
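
A sketch of that workaround: raise the collection threshold while the big dict
of tuple keys is built, then restore the previous settings (the loop is just
the test case from msg65332, and the threshold values are the ones suggested
above):

import gc

old_thresholds = gc.get_threshold()
gc.set_threshold(700, 1000, 10)  # raise the gen-1 threshold while building
try:
    d = {}
    for j in range(5000000):
        d[str(j), str(j)] = j
finally:
    gc.set_threshold(*old_thresholds)  # restore the usual thresholds afterwards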
msg65343 - (view) Author: Andreas Eisele (eisele) Date: 2008-04-11 09:18
Great, that really solves my problem.

Thank you so much, Amaury!

As you say, the problem is unrelated to dicts,
and I observe it also when adding the tuples to
a set or keeping them in lists.
Perhaps your GC thresholds would be better default
values than what is currently in force.
msg65374 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-04-11 21:27
> Perhaps your GC thresholds would be better default
> values than what is currently in force.

No, the defaults are correct for typical applications.
msg65388 - (view) Author: Andreas Eisele (eisele) Date: 2008-04-12 04:01
Even if they mean that creation
of a huge number N of objects
requires O(N*N) effort?
msg65390 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-04-12 04:20
This discussion ought to be moved to comp.lang.python.  The timing 
script needs work.  It doesn't isolate any one issue for discussion.  
It is affected by GC, dict resizing, details of creating and hashing 
string objects, the need to periodically call the system malloc, etc. 

You get different results if:
* a new dictionary is created on each pass
* if any of the objects get reused
* if range(5) gets replaced with [0]*5
* if you use longs instead of ints
* if the strings are of fixed length:  str(j)[:5]

This is the wrong forum to explain every nuance of what affects
performance and whether you think Perl is better.
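
For illustration, a sketch of the first variation listed above (a fresh dict
on each pass), so the number of live key tuples no longer grows across passes:

from time import perf_counter

t0 = perf_counter()
for i in range(5):
    d = {}  # variation: a new dictionary is created on each pass
    for j in range(i * 1000000, (i + 1) * 1000000):
        d[str(j), str(j)] = j
    print("pass %d: %.2fs" % (i, perf_counter() - t0))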
History
Date                 User                Action  Args
2022-04-11 14:56:33  admin               set     github: 46859
2008-04-12 04:20:19  rhettinger          set     messages: + msg65390
2008-04-12 04:01:43  eisele              set     messages: + msg65388
2008-04-11 21:27:28  loewis              set     messages: + msg65374
2008-04-11 09:18:05  eisele              set     messages: + msg65343
2008-04-11 08:44:36  amaury.forgeotdarc  set     nosy: + amaury.forgeotdarc; messages: + msg65341
2008-04-11 02:46:14  eisele              set     messages: + msg65332
2008-04-10 19:49:49  rhettinger          set     status: open -> closed; nosy: + rhettinger; messages: + msg65313
2008-04-10 19:15:47  loewis              set     files: + a.py
2008-04-10 19:15:23  loewis              set     resolution: works for me; messages: + msg65309; nosy: + loewis
2008-04-10 16:21:40  eisele              create