Issue 32846: Deletion of large sets of strings is extra slow

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/77027

classification

Title:	Deletion of large sets of strings is extra slow
Type:	performance	Stage:	resolved
Components:		Versions:	Python 3.8

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	lemburg, luis@luispedro.org, methane, rhettinger, serhiy.storchaka, terry.reedy, tim.peters, twouters, vstinner
Priority:	normal	Keywords:

Created on 2018-02-15 00:23 by terry.reedy, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
bench_del.py	serhiy.storchaka, 2018-02-15 15:24
time-set.py	luis@luispedro.org, 2018-02-16 15:47

Messages (20)
msg312188 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-15 00:23
https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters/, a blog post on cpython speed, clains "deleting a set of 1 billion strings takes >12 hours". (No other details provided.) I don't have the 100+ gigabytes of ram needed to confirm this, but with installed 64 bit 3.7.0b1 with Win10 and 12 gigabyes, I confirmed that there is a pronounced super-linear growth in string set deletion (unlike with an integer set). At least half of ram was available. Seconds to create and delete sets millions integers strings of items create delete create delete 1 .08 .02 .36 .08 2 .15 .03 .75 .17 4 .30 .06 1.55 .36 8 .61 .12 3.18 .76 16 1.22 .24 6.48 1.80 < slightly more than double 32 2.4 .50 13.6 5.56 < more than triple 64 4.9 1.04 28 19 < nearly quadruple 128 10.9 2.25 <too large> 100 56 80 < quadruple with 1.5 x size For 100 million strings, I got about the same 56 and 80 seconds when timing with a clock, without the timeit gc suppression. I interrupted the 128M string run after several minutes. Even if there is swapping to disk during creation, I would not expect it during deletion. The timeit code: import timeit for i in (1,2,4,8,16,32,64,128): print(i, 'int') print(timeit.Timer(f's = {{n for n in range({i}1000000)}}') .timeit(number=1)) print(timeit.Timer('del s', f's = {{n for n in range({i}1000000)}}') .timeit(number=1)) for i in (1,2,4,8,16,32,64,100): print(i, 'str') print(timeit.Timer(f's = {{str(n) for n in range({i}1000000)}}') .timeit(number=1)) print(timeit.Timer('del s', f's = {{str(n) for n in range({i}1000000)}}') .timeit(number=1)) Raymond, I believe you monitor the set implementation, and I know Victor is interested in timing and performance.
msg312209 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2018-02-15 14:25
For lists and dicts the time of deletion is virtually linear. 1 int 5.2 9.8 2 int 5.8 9.7 4 int 5.6 9.8 8 int 5.8 10.0 16 int 5.5 9.7 32 int 5.4 9.6 64 int 5.6 9.0 128 int 5.6 8.7 1 str 7.6 13.3 2 str 8.0 13.8 4 str 7.4 14.0 8 str 7.5 14.3 16 str 7.6 14.4 32 str 8.0 14.7 64 str 7.9 14.3 100 str 8.1 14.0 1 int 60.4 10.5 2 int 61.0 11.1 4 int 61.1 10.4 8 int 60.1 11.1 16 int 61.1 10.1 32 int 61.5 9.9 64 int 60.7 9.6 128 int 60.8 9.4 1 str 204.9 15.4 2 str 247.4 15.8 4 str 275.3 16.1 8 str 299.3 14.8 16 str 312.8 14.8 32 str 329.2 14.2 64 str 344.5 14.1 100 str 386.2 14.4 (third and forth columns are time in nanoseconds per item, for creation and deletion) But when shuffle the input data before creating a collection, the deletion time becomes superlinear (and much slower). 1 int 17.4 38.9 2 int 19.1 41.3 4 int 24.4 46.3 8 int 28.0 49.5 16 int 28.1 53.4 32 int 29.8 67.2 64 int 35.2 90.6 128 int 42.6 143.1 1 str 21.1 56.3 2 str 24.3 58.2 4 str 27.1 63.6 8 str 27.3 73.9 16 str 29.4 99.2 32 str 34.3 144.8 64 str 41.8 229.5 100 str 46.3 338.4 In former examples (list, ordered dict) objects are iterated and deleted in order of creation. But in the set of strings items are deleted in non-predicable random order. I suppose the non-linear effect is related to memory management.
msg312215 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2018-02-15 16:53
> In former examples (list, ordered dict) objects are iterated and deleted >in order of creation. But in the set of strings items are deleted in > non-predicable random order. I suppose the non-linear effect is > related to memory management. That makes sense. We're likely seeing an artifact of a favorable correspondence between the container pointer order and the order that the referred-to objects were created in memory in this somewhat non-representative test case. That is why lists lose the favorable timing when the data is shuffled. If the test bench creates all the referred-to objects in consecutive memory locations, then any container accessing those objects consecutively will benefit from spatial cache locality. (i.e. If there is more than one datum per cache line, the second read is virtually free. Likewise, memory controller read-ahead makes a long series of consecutive accesses cheaper.) It would be nice is the timing were run on strings instead of integers. Ints are somewhat weird in that the hashes of consecutive integers are themselves consecutive, you can fit two ints in one cache line. I'm not sure whether it applies here, but there may also be a branch-prediction effect as well: https://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array FWIW, the deallocation routine for sets isn't doing anything special. It just calls Py_DECREF in a loop: static void set_dealloc(PySetObject so) { setentry entry; Py_ssize_t used = so->used; /* bpo-31095: UnTrack is needed before calling any callbacks / PyObject_GC_UnTrack(so); Py_TRASHCAN_SAFE_BEGIN(so) if (so->weakreflist != NULL) PyObject_ClearWeakRefs((PyObject ) so); for (entry = so->table; used > 0; entry++) { if (entry->key && entry->key != dummy) { used--; Py_DECREF(entry->key); } } if (so->table != so->smalltable) PyMem_DEL(so->table); Py_TYPE(so)->tp_free(so); Py_TRASHCAN_SAFE_END(so) }
msg312216 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2018-02-15 17:34
The 4th graph in this article may be showing the reason for the growth in random access times as the size gets bigger: https://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips
msg312226 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2018-02-16 05:02
Random idea: For some range of array sizes (bigger than L3 cache), there might be a net benefit to sorting L1-sized clumps of pointers before making the Py_DECREF calls. Possibly, the cost of sorting would be offset by improved memory access patterns. On the other hand, this might just optimize an artificial benchmark that isn't representative of real code where the data is actually being used. In that case, the program is already going to have to hop all over memory just to access the referred-to objects.
msg312236 - (view)	Author: Luis Pedro Coelho (luis@luispedro.org)	Date: 2018-02-16 15:47
Original poster here. The benchmark is artificial, but the problem setting is not. I did have a problem that is roughly: interesting = set(line.strip() for line in open(...)) for line in open(...): key,rest = line.split('\t', 1) if key in interesting: process(rest) Deleting the set (when it goes out of scope) was a significant chunk of the time. Surprisingly, deleting a very large set takes much longer than creating it. Here are my controlled measurements (created with the attached script, which itself uses Jug http://jug.rtfd.io and assumes a file `input.txt` is present). N create (s) delete (s) 1 0.00 0.00 10 0.00 0.00 100 0.00 0.00 1000 0.00 0.00 10000 0.01 0.00 100000 0.15 0.01 1000000 1.14 0.12 10000000 11.44 2.24 100000000 126.41 70.34 200000000 198.04 258.44 300000000 341.27 646.81 400000000 522.70 1044.97 500000000 564.07 1744.54 600000000 1380.04 3364.06 700000000 1797.82 3300.20 800000000 1294.20 4410.22 900000000 3045.38 7646.59 1000000000 3467.31 11042.97 1100000000 5162.35 13750.22 1200000000 6581.90 12544.67 1300000000 1612.60 17204.67 1400000000 1788.13 23772.84 1500000000 6922.16 25068.49
msg312249 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2018-02-16 19:12
Thanks for the info. Here is what I surmise from the data. * The shape of the curve matches the expected memory latency curve for random memory accesses for a given working set size. The graph of your measurements looks similar in shape to this one: https://www.extremetech.com/wp-content/uploads/2014/08/latency.png * The last line of the measurements has N=1,500,000,000. That would give an approx 70GB setobject that refers to another 100Gb (or more) of strings. Unless you are working with an extraordinary machine, most of that data will have been swapped to disk. * Dividing the 1,500,000,000 elements by the time of 25,068 seconds gives about 60,000 deletions per second. This number coincides nicely with the random access performance of a typical SSD rated at 50,000 IOPs. If this reasoning is sound, all it proves is that random accesses to virtual memory are limited by the speed of the hardware for that virtual memory.
msg312250 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-02-16 19:57
Luís, what you really need for the problem you outline is an immutable set with one operation, 'in', aside from create and delete. If specialized to strings only, it could internally stores only character sequences, rather than Python objects. I suspect someone has written such a class (in C), but did not find anything on pypi.python.org. Lacking that, the following experiment suggests that you might be able to restore near O(n) time by partitioning the set of keys into a collection of sets. My 16M set took 6.48 and 1.80 seconds. Times 4 is 25.9 and 7.2. My 64M set took 28 and 19 seconds, but 4 16M sets take 26.3 and 7.5 seconds, only about 5% more than the x4 target. print(timeit.Timer( 's = [{str(n) for n in range(i1000000, (i+16)1000000)}' ' for i in (0, 16, 32, 48)]') .timeit(number=1)) print(timeit.Timer( 'del s', 's = [{str(n) for n in range(i1000000, (i+16)1000000)}' ' for i in (0, 16, 32, 48)]') .timeit(number=1))
msg312263 - (view)	Author: Tim Peters (tim.peters) *	Date: 2018-02-17 02:22
> Surprisingly, deleting a very large set takes much longer than creating it. Luis, that's not surprising ;-) When you create it, it's mostly the case that there's a vast chunk of raw memory from which many pieces are passed out in address order (to hold all the newly created Python objects). Memory access is thus mostly sequential. But when you delete it, that vast chunk of once-raw memory is visited in essentially random order (string hashes impose a pseudo-random order on where (pointers to) string objects are stored within a set's vector), defeating all the hardware features that greatly benefit from sequential access. More precisely, the set's internal vector is visited sequentially during deletion, but the string objects the pointers point _at_ are all over the place. Even if nothing is swapped to disk, it's likely that visiting a string object during deletion will miss on all cache levels, falling back to (much slower) RAM. Note that all the string objects must be visited during set deletion, in order to decrement their reference counts.
msg312264 - (view)	Author: Inada Naoki (methane) *	Date: 2018-02-17 02:59
One possible advice; try dict instead of set. Dict is (surprisingly) little smaller than set. And dict cleans items in insertion order when the dict is deleted.
msg312265 - (view)	Author: Inada Naoki (methane) *	Date: 2018-02-17 03:15
> Dict is (surprisingly) little smaller than set. I'm sorry, I was wrong. dict is smaller than set only when len(d) is small. (typical case for namespace dict) In case of massive keys, dict is larger than set.
msg312278 - (view)	Author: Luis Pedro Coelho (luis@luispedro.org)	Date: 2018-02-17 12:09
I think some of this conversation is going off-topic, but there is no disk-swapping in my case. I realize ours is not a typical setup, but our normal machines have 256GB of RAM and the "big memory" compute nodes are >=1TB. Normally, swap is outright disabled. This really is an impressive case study on how much difference cache-locality can make.
msg312280 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2018-02-17 13:44
Reminds of some experiments someone did a while ago as part of the GIL removal attempts where the ref count integers are all kept in a separate array. The intent there was to be able to do locking on a single array rather than on individual decref cells. This would solve the issue with having to jump around in memory to decref all objects, but I'm not sure whether the overall win would be a lot, since deallocation of the memory blocks typically requires accessing the block itself as well (to update the block chain list pointers), unless the memory allocator uses some smart cache local block management as well (I believe that pymalloc does, but could be wrong). In any case, this sounds like a fun experiment for a GSoC student. Perhaps the PSF could donate an AWS EC2 instance with enough RAM to do the experiments.
msg312350 - (view)	Author: Inada Naoki (methane) *	Date: 2018-02-19 13:28
@Luis, would you try dict instead of set? It's little larger than set, but delete elements by insertion order. But I don't think builtin data structure can be optimized for such workload. Maybe, LMBD[1] or some other KVS can help you. [1]: https://lmdb.readthedocs.io/en/release/
msg337655 - (view)	Author: Inada Naoki (methane) *	Date: 2019-03-11 10:54
I thought compact set implementation similar to dict sicne Python 3.6 may fix this issue. But as discussion with Raymond on Python-Dev ML, I conclude the merits of compact implementation is not significant enough. I abandoned compact set branch. Now I don't have any idea to fix this issue in foreseeable future. Please use dict instead.
msg345652 - (view)	Author: Tim Peters (tim.peters) *	Date: 2019-06-15 01:52
Looks likely that the _major_ cause of the quadratic-time delete behavior was due to that obmalloc used a linear-time method to keep its linked list of usable arenas sorted in order of number of free pools. When a pool became unused, its arena's count of free pools increased by one, and then order was restored by moving the arena "to the right" in the linked list, one slot at a time. When there were only a few hundred arenas, nobody noticed. But programs with thousands of arenas could suffer very noticeable sloth, and programs with hundreds of thousands could appear to hang (could require days to complete deleting). This was fixed in issue #37029, using a constant-time method to restore sorted order, scheduled for release in Python 3.8.
msg345653 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-06-15 02:01
Can we close this now? ISTM the issue has less to do with sets and more to do with memory allocation quirks and that on modern CPUs random memory accesses are slower than sequential memory accesses. It is not a bug that sets are unordered collections that iterate over elements in a difference order than they were inserted.
msg345654 - (view)	Author: Tim Peters (tim.peters) *	Date: 2019-06-15 02:12
Raymond, please read my very recent comment one above yours. A (overall) quadratic-time algorithm (O(A**2) where A is the number of arenas) in obmalloc.c is (to my eyes) probably the _primary_ cause of the sloth here. That's been fixed for 3.8, but I don't have enough RAM even to run Terry's test code to confirm it. I can confirm that there's no quadratic-time behavior anymore deleting large sets of strings, but only until I run out of RAM. Neither is the behavior linear, but it's much closer to linear than quadratic now. If someone with more RAM can confirm this for larger sets, then fine by me if this is closed again.
msg345657 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2019-06-15 04:25
I reran the code in msg312188. Ints as before, string deletion +- linear up to 100 million, much better than before. millions old strings new strings of items create delete create delete 4 1.55 .36 1.7 .38 8 3.18 .76 3.6 .78 16 6.48 1.80 7.3 1.71 32 13.6 5.56 14.8 3.8 64 28 19 30 8.4 100 56 80 50 14.3
msg345720 - (view)	Author: Tim Peters (tim.peters) *	Date: 2019-06-16 00:38
Thanks, Terry! Based on your latest results, "quadratic time" isn't plausible here anymore, so I'm closing this. Nasty cache effects certainly played a role, but they were just a flea on the dog ;-)

History
Date	User	Action	Args
2022-04-11 14:58:57	admin	set	github: 77027
2019-06-16 00:38:45	tim.peters	set	status: open -> closed resolution: fixed messages: + msg345720 stage: commit review -> resolved
2019-06-15 04:25:07	terry.reedy	set	messages: + msg345657
2019-06-15 02:12:43	tim.peters	set	messages: + msg345654
2019-06-15 02:01:45	rhettinger	set	messages: + msg345653
2019-06-15 01:53:54	tim.peters	set	stage: resolved -> commit review
2019-06-15 01:52:51	tim.peters	set	stage: resolved messages: + msg345652 versions: + Python 3.8, - Python 3.9
2019-06-15 00:08:37	methane	set	status: closed -> open stage: resolved -> (no value) resolution: wont fix -> (no value) versions: + Python 3.9, - Python 3.7, Python 3.8
2019-03-11 10:54:14	methane	set	status: open -> closed resolution: wont fix messages: + msg337655 stage: needs patch -> resolved
2018-02-19 13:28:39	methane	set	messages: + msg312350
2018-02-17 13:44:57	lemburg	set	messages: + msg312280
2018-02-17 12:09:45	luis@luispedro.org	set	messages: + msg312278
2018-02-17 03:15:59	methane	set	messages: + msg312265
2018-02-17 02:59:24	methane	set	messages: + msg312264
2018-02-17 02:22:26	tim.peters	set	messages: + msg312263
2018-02-16 19:57:30	terry.reedy	set	messages: + msg312250
2018-02-16 19:12:39	rhettinger	set	messages: + msg312249
2018-02-16 15:47:44	luis@luispedro.org	set	files: + time-set.py nosy: + luis@luispedro.org messages: + msg312236
2018-02-16 05:02:15	rhettinger	set	messages: + msg312226
2018-02-15 17:34:48	rhettinger	set	messages: + msg312216
2018-02-15 16:53:46	rhettinger	set	messages: + msg312215
2018-02-15 15:24:36	serhiy.storchaka	set	files: + bench_del.py
2018-02-15 14:57:27	methane	set	nosy: + methane
2018-02-15 14:25:51	serhiy.storchaka	set	nosy: + lemburg, tim.peters, twouters messages: + msg312209
2018-02-15 12:28:41	serhiy.storchaka	set	nosy: + serhiy.storchaka
2018-02-15 00:23:33	terry.reedy	create