classification
Title: Deletion of large sets of strings is extra slow
Type: performance Stage: resolved
Components: Versions: Python 3.8, Python 3.7
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: inada.naoki, lemburg, luis@luispedro.org, rhettinger, serhiy.storchaka, terry.reedy, tim.peters, twouters, vstinner
Priority: normal Keywords:

Created on 2018-02-15 00:23 by terry.reedy, last changed 2019-03-11 10:54 by inada.naoki. This issue is now closed.

Files
File name Uploaded Description Edit
bench_del.py serhiy.storchaka, 2018-02-15 15:24
time-set.py luis@luispedro.org, 2018-02-16 15:47
Messages (15)
msg312188 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-02-15 00:23
https://metarabbit.wordpress.com/2018/02/05/pythons-weak-performance-matters/, a blog post on cpython speed, clains "deleting a set of 1 billion strings takes >12 hours".  (No other details provided.)

I don't have the 100+ gigabytes of ram needed to confirm this, but with installed 64 bit 3.7.0b1 with Win10 and 12 gigabyes, I confirmed that there is a pronounced super-linear growth in string set deletion (unlike with an integer set).  At least half of ram was available.

      Seconds to create and delete sets
millions    integers        strings
of items  create delete  create delete
   1         .08    .02     .36    .08
   2         .15    .03     .75    .17
   4         .30    .06    1.55    .36
   8         .61    .12    3.18    .76
  16        1.22    .24    6.48   1.80  < slightly more than double
  32        2.4     .50   13.6    5.56  < more than triple
  64        4.9    1.04   28     19     < nearly quadruple
 128       10.9    2.25    <too large>
 100                      56     80     < quadruple with 1.5 x size

For 100 million strings, I got about the same 56 and 80 seconds when timing with a clock, without the timeit gc suppression.  I interrupted the 128M string run after several minutes.  Even if there is swapping to disk during creation, I would not expect it during deletion.

The timeit code:

import timeit

for i in (1,2,4,8,16,32,64,128):
    print(i, 'int')
    print(timeit.Timer(f's = {{n for n in range({i}*1000000)}}')
          .timeit(number=1))
    print(timeit.Timer('del s', f's = {{n for n in range({i}*1000000)}}')
          .timeit(number=1))

for i in (1,2,4,8,16,32,64,100):
    print(i, 'str')
    print(timeit.Timer(f's = {{str(n) for n in range({i}*1000000)}}')
          .timeit(number=1))
    print(timeit.Timer('del s', f's = {{str(n) for n in range({i}*1000000)}}')
          .timeit(number=1))

Raymond, I believe you monitor the set implementation, and I know Victor is interested in timing and performance.
msg312209 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-02-15 14:25
For lists and dicts the time of deletion is virtually linear.

1 int   5.2   9.8
2 int   5.8   9.7
4 int   5.6   9.8
8 int   5.8   10.0
16 int   5.5   9.7
32 int   5.4   9.6
64 int   5.6   9.0
128 int   5.6   8.7
1 str   7.6   13.3
2 str   8.0   13.8
4 str   7.4   14.0
8 str   7.5   14.3
16 str   7.6   14.4
32 str   8.0   14.7
64 str   7.9   14.3
100 str   8.1   14.0

1 int   60.4   10.5
2 int   61.0   11.1
4 int   61.1   10.4
8 int   60.1   11.1
16 int   61.1   10.1
32 int   61.5   9.9
64 int   60.7   9.6
128 int   60.8   9.4
1 str   204.9   15.4
2 str   247.4   15.8
4 str   275.3   16.1
8 str   299.3   14.8
16 str   312.8   14.8
32 str   329.2   14.2
64 str   344.5   14.1
100 str   386.2   14.4

(third and forth columns are time in nanoseconds per item, for creation and deletion)

But when shuffle the input data before creating a collection, the deletion time becomes superlinear (and much slower).

1 int   17.4   38.9
2 int   19.1   41.3
4 int   24.4   46.3
8 int   28.0   49.5
16 int   28.1   53.4
32 int   29.8   67.2
64 int   35.2   90.6
128 int   42.6   143.1
1 str   21.1   56.3
2 str   24.3   58.2
4 str   27.1   63.6
8 str   27.3   73.9
16 str   29.4   99.2
32 str   34.3   144.8
64 str   41.8   229.5
100 str   46.3   338.4

In former examples (list, ordered dict) objects are iterated and deleted in order of creation. But in the set of strings items are deleted in non-predicable random order. I suppose the non-linear effect is related to memory management.
msg312215 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-02-15 16:53
> In former examples (list, ordered dict) objects are iterated and deleted 
>in order of creation. But in the set of strings items are deleted in
> non-predicable random order. I suppose the non-linear effect is 
> related to memory management.

That makes sense.  We're likely seeing an artifact of a favorable correspondence between the container pointer order and the order that the referred-to objects were created in memory in this somewhat non-representative test case.  That is why lists lose the favorable timing when the data is shuffled.

If the test bench creates all the referred-to objects in consecutive memory locations, then any container accessing those objects consecutively will benefit from spatial cache locality.  (i.e. If there is more than one datum per cache line, the second read is virtually free.  Likewise, memory controller read-ahead makes a long series of consecutive accesses cheaper.)

It would be nice is the timing were run on strings instead of integers.  Ints are somewhat weird in that the hashes of consecutive integers are themselves consecutive, you can fit two ints in one cache line. 

I'm not sure whether it applies here, but there may also be a branch-prediction effect as well:  https://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array

FWIW, the deallocation routine for sets isn't doing anything special.  It just calls Py_DECREF in a loop:

static void
set_dealloc(PySetObject *so)
{
    setentry *entry;
    Py_ssize_t used = so->used;

    /* bpo-31095: UnTrack is needed before calling any callbacks */
    PyObject_GC_UnTrack(so);
    Py_TRASHCAN_SAFE_BEGIN(so)
    if (so->weakreflist != NULL)
        PyObject_ClearWeakRefs((PyObject *) so);

    for (entry = so->table; used > 0; entry++) {
        if (entry->key && entry->key != dummy) {
                used--;
                Py_DECREF(entry->key);
        }
    }
    if (so->table != so->smalltable)
        PyMem_DEL(so->table);
    Py_TYPE(so)->tp_free(so);
    Py_TRASHCAN_SAFE_END(so)
}
msg312216 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-02-15 17:34
The 4th graph in this article may be showing the reason for the growth in random access times as the size gets bigger:

https://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu-caches-work-and-why-theyre-an-essential-part-of-modern-chips
msg312226 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-02-16 05:02
Random idea:  For some range of array sizes (bigger than L3 cache), there might be a net benefit to sorting L1-sized clumps of pointers before making the Py_DECREF calls.  Possibly, the cost of sorting would be offset by improved memory access patterns.

On the other hand, this might just optimize an artificial benchmark that isn't representative of real code where the data is actually being used.  In that case, the program is already going to have to hop all over memory just to access the referred-to objects.
msg312236 - (view) Author: Luis Pedro Coelho (luis@luispedro.org) Date: 2018-02-16 15:47
Original poster here.

The benchmark is artificial, but the problem setting is not. I did have a problem that is roughly:

    interesting = set(line.strip() for line in open(...))
    for line in open(...):
        key,rest = line.split('\t', 1)
        if key in interesting:
             process(rest)

Deleting the set (when it goes out of scope) was a significant chunk of the time. Surprisingly, deleting a very large set takes much longer than creating it.

Here are my controlled measurements (created with the attached script, which itself uses Jug http://jug.rtfd.io and assumes a file `input.txt` is present).


N                create (s)     delete (s)
           1         0.00         0.00
          10         0.00         0.00
         100         0.00         0.00
        1000         0.00         0.00
       10000         0.01         0.00
      100000         0.15         0.01
     1000000         1.14         0.12
    10000000        11.44         2.24
   100000000       126.41        70.34
   200000000       198.04       258.44
   300000000       341.27       646.81
   400000000       522.70      1044.97
   500000000       564.07      1744.54
   600000000      1380.04      3364.06
   700000000      1797.82      3300.20
   800000000      1294.20      4410.22
   900000000      3045.38      7646.59
  1000000000      3467.31     11042.97
  1100000000      5162.35     13750.22
  1200000000      6581.90     12544.67
  1300000000      1612.60     17204.67
  1400000000      1788.13     23772.84
  1500000000      6922.16     25068.49
msg312249 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-02-16 19:12
Thanks for the info.  Here is what I surmise from the data.  

* The shape of the curve matches the expected memory latency curve for random memory accesses for a given working set size.  The graph of your measurements looks similar in shape to this one:  https://www.extremetech.com/wp-content/uploads/2014/08/latency.png

*  The last line of the measurements has N=1,500,000,000.   That would give an approx 70GB setobject that refers to another 100Gb (or more) of strings.  Unless you are working with an extraordinary machine, most of that data will have been swapped to disk.

* Dividing the 1,500,000,000 elements by the time of 25,068 seconds gives about 60,000 deletions per second.  This number coincides nicely with the random access performance of a typical SSD rated at  50,000 IOPs.

If this reasoning is sound, all it proves is that random accesses to virtual memory are limited by the speed of the hardware for that virtual memory.
msg312250 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-02-16 19:57
Luís, what you really need for the problem you outline is an immutable set with one operation, 'in', aside from create and delete.  If specialized to strings only, it could internally stores only character sequences, rather than Python objects.  I suspect someone has written such a class (in C), but did not find anything on pypi.python.org.

Lacking that, the following experiment suggests that you might be able to restore near O(n) time by partitioning the set of keys into a collection of sets.  My 16M set took 6.48 and 1.80 seconds.  Times 4 is 25.9 and 7.2.  My 64M set took 28 and 19 seconds, but 4 16M sets take 26.3 and 7.5 seconds, only about 5% more than the x4 target.

print(timeit.Timer(
    's = [{str(n) for n in range(i*1000000, (i+16)*1000000)}'
	        ' for i in (0, 16, 32, 48)]')
      .timeit(number=1))
print(timeit.Timer(
    'del s',
    's = [{str(n) for n in range(i*1000000, (i+16)*1000000)}'
	        ' for i in (0, 16, 32, 48)]')
      .timeit(number=1))
msg312263 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-02-17 02:22
> Surprisingly, deleting a very large set takes much longer than creating it.

Luis, that's not surprising ;-)  When you create it, it's mostly the case that there's a vast chunk of raw memory from which many pieces are passed out in address order (to hold all the newly created Python objects).  Memory access is thus mostly sequential.  But when you delete it, that vast chunk of once-raw memory is visited in essentially random order (string hashes impose a pseudo-random order on where (pointers to) string objects are stored within a set's vector), defeating all the hardware features that greatly benefit from sequential access.

More precisely, the set's internal vector is visited sequentially during deletion, but the string objects the pointers point _at_ are all over the place.  Even if nothing is swapped to disk, it's likely that visiting a string object during deletion will miss on all cache levels, falling back to (much slower) RAM.  Note that all the string objects must be visited during set deletion, in order to decrement their reference counts.
msg312264 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2018-02-17 02:59
One possible advice; try dict instead of set.
Dict is (surprisingly) little smaller than set.
And dict cleans items in insertion order when the dict is deleted.
msg312265 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2018-02-17 03:15
> Dict is (surprisingly) little smaller than set.

I'm sorry, I was wrong.
dict is smaller than set only when len(d) is small. (typical case for namespace dict)

In case of massive keys, dict is larger than set.
msg312278 - (view) Author: Luis Pedro Coelho (luis@luispedro.org) Date: 2018-02-17 12:09
I think some of this conversation is going off-topic, but there is no disk-swapping in my case.

I realize ours is not a typical setup, but our normal machines have 256GB of RAM and the "big memory" compute nodes are >=1TB. Normally, swap is outright disabled.

This really is an impressive case study on how much difference cache-locality can make.
msg312280 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2018-02-17 13:44
Reminds of some experiments someone did a while ago as part of the
GIL removal attempts where the ref count integers are all kept in a
separate array. The intent there was to be able to do locking on
a single array rather than on individual decref cells.

This would solve the issue with having to jump around in memory
to decref all objects, but I'm not sure whether the overall win
would be a lot, since deallocation of the memory blocks typically
requires accessing the block itself as well (to update the block
chain list pointers), unless the memory allocator uses some
smart cache local block management as well (I believe that pymalloc
does, but could be wrong).

In any case, this sounds like a fun experiment for a GSoC student.
Perhaps the PSF could donate an AWS EC2 instance with enough RAM to
do the experiments.
msg312350 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2018-02-19 13:28
@Luis, would you try dict instead of set?  It's little larger than set, but delete elements by insertion order.

But I don't think builtin data structure can be optimized for such workload.
Maybe, LMBD[1] or some other KVS can help you.

[1]: https://lmdb.readthedocs.io/en/release/
msg337655 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-03-11 10:54
I thought compact set implementation similar to dict sicne Python 3.6 may fix this issue.
But as discussion with Raymond on Python-Dev ML, I conclude the merits of compact implementation is not significant enough.
I abandoned compact set branch.

Now I don't have any idea to fix this issue in foreseeable future.
Please use dict instead.
History
Date User Action Args
2019-03-11 10:54:14inada.naokisetstatus: open -> closed
resolution: wont fix
messages: + msg337655

stage: needs patch -> resolved
2018-02-19 13:28:39inada.naokisetmessages: + msg312350
2018-02-17 13:44:57lemburgsetmessages: + msg312280
2018-02-17 12:09:45luis@luispedro.orgsetmessages: + msg312278
2018-02-17 03:15:59inada.naokisetmessages: + msg312265
2018-02-17 02:59:24inada.naokisetmessages: + msg312264
2018-02-17 02:22:26tim.peterssetmessages: + msg312263
2018-02-16 19:57:30terry.reedysetmessages: + msg312250
2018-02-16 19:12:39rhettingersetmessages: + msg312249
2018-02-16 15:47:44luis@luispedro.orgsetfiles: + time-set.py
nosy: + luis@luispedro.org
messages: + msg312236

2018-02-16 05:02:15rhettingersetmessages: + msg312226
2018-02-15 17:34:48rhettingersetmessages: + msg312216
2018-02-15 16:53:46rhettingersetmessages: + msg312215
2018-02-15 15:24:36serhiy.storchakasetfiles: + bench_del.py
2018-02-15 14:57:27inada.naokisetnosy: + inada.naoki
2018-02-15 14:25:51serhiy.storchakasetnosy: + lemburg, tim.peters, twouters
messages: + msg312209
2018-02-15 12:28:41serhiy.storchakasetnosy: + serhiy.storchaka
2018-02-15 00:23:33terry.reedycreate