Drop the GIL during large bytes.join operations? #80232

bmerry · 2019-02-20T13:00:55Z

BPO	36051
Nosy	@smontanaro, @pitrou, @methane, @serhiy-storchaka, @MojoVampire, @remilapeyre, @isidentical, @bmerry
PRs	bpo-36051: drop GIL during large b''.join operations #17757 bpo-36051: Fix compiler warning. #18325
Files	old.csv: Benchmark results for trunk new.csv: Benchmark results when unconditionally dropping the GIL benchjoin.py: Benchmark script for join benchjoin2.py join.diff

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2020-02-03.10:04:15.010>
created_at = <Date 2019-02-20.13:00:54.784>
labels = ['interpreter-core', '3.9', 'performance']
title = 'Drop the GIL during large bytes.join operations?'
updated_at = <Date 2020-02-03.10:04:15.010>
user = 'https://github.com/bmerry'

bugs.python.org fields:

activity = <Date 2020-02-03.10:04:15.010>
actor = 'methane'
assignee = 'none'
closed = True
closed_date = <Date 2020-02-03.10:04:15.010>
closer = 'methane'
components = ['Interpreter Core']
creation = <Date 2019-02-20.13:00:54.784>
creator = 'bmerry'
dependencies = []
files = ['48811', '48812', '48813', '48816', '48873']
hgrepos = []
issue_num = 36051
keywords = ['patch']
message_count = 21.0
messages = ['336082', '358783', '358784', '358800', '358803', '359084', '359097', '359102', '359105', '359108', '359110', '359111', '359182', '359193', '359225', '359328', '359347', '360108', '360938', '361075', '361276']
nosy_count = 8.0
nosy_names = ['skip.montanaro', 'pitrou', 'methane', 'serhiy.storchaka', 'josh.r', 'remi.lapeyre', 'BTaskaya', 'bmerry']
pr_nums = ['17757', '18325']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue36051'
versions = ['Python 3.9']

bmerry · 2019-02-20T13:00:55Z

A common pattern in libraries doing I/O is to receive data in chunks, put them in a list, then join them all together using b"".join(chunks). For example, see http.client.HTTPResponse._safe_read. When the output is large, the memory copies can block the interpreter for a non-trivial amount of time, and prevent multi-threaded scaling. If the GIL could be dropped during the memcpys it could improve parallel I/O performance in some high-bandwidth scenarios (36050 mentions a case where I've run into this serialisation bottleneck in practice).

Obviously it could hurt performance to drop the GIL for small cases. As far as I know numpy uses thresholds to decide when it's worth dropping the GIL and it seems to work fairly well.

serhiy-storchaka · 2019-12-21T20:21:54Z

Could you show evidences that dropping the GIL can help you? bytes.join() needs to perform operations which needs to hold the GIL (allocating the memory, iterating the list, getting the data of bytes-like objects). I afraid that the cost of memcpy() is a minor part of this.

pitrou · 2019-12-21T20:31:43Z

If there is a large chunk (e.g. several MBs), dropping the GIL during the memcpy of that chunk may be beneficial. This kind of optimization may be applicable in other similar cases (such as extending a bytearray or a BytesIO object).

methane · 2019-12-23T00:30:21Z

cpython/Objects/stringlib/join.h

Lines 105 to 126 in 068768f

    
           if (!seplen) { 
        
               /* fast path */ 
        
               for (i = 0; i < nbufs; i++) { 
        
                   Py_ssize_t n = buffers[i].len; 
        
                   char *q = buffers[i].buf; 
        
                   memcpy(p, q, n); 
        
                   p += n; 
        
               } 
        
               goto done; 
        
           } 
        
           for (i = 0; i < nbufs; i++) { 
        
               Py_ssize_t n; 
        
               char *q; 
        
               if (i) { 
        
                   memcpy(p, sepstr, seplen); 
        
                   p += seplen; 
        
               } 
        
               n = buffers[i].len; 
        
               q = buffers[i].buf; 
        
               memcpy(p, q, n); 
        
               p += n; 
        
           }

It seems we can release GIL during iterating the buffer array.
Even though there is no one big chunk, it would be beneficial if the output size is large enough (8MB+?).

bmerry · 2019-12-23T05:12:29Z

It seems we can release GIL during iterating the buffer array.

That's what I had in mind. Naturally it would require a bit of benchmarking to pick a threshold such that the small case doesn't lose performance due to locking overheads. If no one else is working on it, I can give that a try early next year.

MojoVampire · 2019-12-31T01:45:05Z

This will introduce a risk of data races that didn't previously exist. If you do:

    ba1 = bytearray(b'\x00') * 50000
    ba2 = bytearray(b'\x00') * 50000
    ... pass references to thread that mutates them ...
    ba3 = b''.join((ba1, ba2))

then two things will change from the existing behavior:

If the thread in question attempts to write to the bytearrays in place, then it could conceivably write data that is only partially picked up (ba1[0], ba1[40000] = 2, 3 could end up copying the results of the second write without the first; at present, it could only copy the first without the second)
If the thread tries to change the size of the bytearrays during the join (ba1 += b'123'), it'll die with a BufferError that wasn't previously possible

#1 isn't terrible (as noted, data races in that case already existed, this just lets them happen in more ways), but #2 is a little unpleasant; code that previously had simple data races (the data might be inconsistent, but the code ran and produced some valid output) can now fail hard, nowhere near the actual call to join that introduced the behavioral change.

I don't think this sinks the patch (loudly breaking code that was silently broken before isn't awful), but I feel like a warning of some kind in the documentation (if only a simple compatibility note in What's New) might be appropriate.

methane · 2019-12-31T04:42:58Z

If the thread tries to change the size of the bytearrays during the join (ba1 += b'123'), it'll die with a BufferError that wasn't previously possible

Makes sense. We shouldn't drop GIL while having buffer of arbitrary objects.

bmerry · 2019-12-31T06:02:54Z

If we want to be conservative, we could only drop the GIL if all the buffers pass the PyBytes_CheckExact test. Presumably that won't encounter any of these problems because bytes objects are immutable?

serhiy-storchaka · 2019-12-31T07:46:15Z

Please provide benchmarks that demonstrate the benefit of this change.

We also need to add a test for join() which covers the new code.

bmerry · 2019-12-31T09:24:43Z

I've attached a benchmark script and CSV results for master (whichever version that was at the point I forked) and with unconditional dropping of the GIL. It shows up to 3x performance improvement when using 4 threads. That's on my home desktop, which is quite old (Sandy Bridge). I'm expecting more significant gains on server CPUs, whose memory systems are optimised for multi-threaded workloads. The columns are chunk size, number of chunks, number of threads, and per-thread throughput.

There are also cases where using multiple threads is a slowdown, but I think that's an artifact of the benchmark. It repeatedly joins the same strings, so performance is higher when they all fit in the cache; when using 4 threads that execute in parallel, the working set is 4x larger and may cease to fit in cache. In real-world usage one is unlikely to be joining the same strings again and again.

In the single-threaded case, the benchmark seems to show that for 64K+, performance is improved by dropping the GIL (which I'm guessing must be statistical noise, since there shouldn't be anything contending for it), which is my reasoning behind the 65536 threshold.

I'll take a look at extra unit tests soon. Do you know off the top of your head where to look for existing join tests to add to?

bmerry · 2019-12-31T09:53:49Z

I'll take a look at extra unit tests soon. Do you know off the top of your head where to look for existing join tests to add to?

Never mind, I found it:

cpython/Lib/test/test_bytes.py

Lines 535 to 559 in 92709a2

    
           def test_join(self): 
        
               self.assertEqual(self.type2test(b"").join([]), b"") 
        
               self.assertEqual(self.type2test(b"").join([b""]), b"") 
        
               for lst in [[b"abc"], [b"a", b"bc"], [b"ab", b"c"], [b"a", b"b", b"c"]]: 
        
                   lst = list(map(self.type2test, lst)) 
        
                   self.assertEqual(self.type2test(b"").join(lst), b"abc") 
        
                   self.assertEqual(self.type2test(b"").join(tuple(lst)), b"abc") 
        
                   self.assertEqual(self.type2test(b"").join(iter(lst)), b"abc") 
        
               dot_join = self.type2test(b".:").join 
        
               self.assertEqual(dot_join([b"ab", b"cd"]), b"ab.:cd") 
        
               self.assertEqual(dot_join([memoryview(b"ab"), b"cd"]), b"ab.:cd") 
        
               self.assertEqual(dot_join([b"ab", memoryview(b"cd")]), b"ab.:cd") 
        
               self.assertEqual(dot_join([bytearray(b"ab"), b"cd"]), b"ab.:cd") 
        
               self.assertEqual(dot_join([b"ab", bytearray(b"cd")]), b"ab.:cd") 
        
               # Stress it with many items 
        
               seq = [b"abc"] * 1000 
        
               expected = b"abc" + b".:abc" * 999 
        
               self.assertEqual(dot_join(seq), expected) 
        
               self.assertRaises(TypeError, self.type2test(b" ").join, None) 
        
               # Error handling and cleanup when some item in the middle of the 
        
               # sequence has the wrong type. 
        
               with self.assertRaises(TypeError): 
        
                   dot_join([bytearray(b"ab"), "cd", b"ef"]) 
        
               with self.assertRaises(TypeError): 
        
                   dot_join([memoryview(b"ab"), "cd", b"ef"])

Do you think it would be sufficient to change the stress test from joining 1000 items to joining 100000 items?

bmerry · 2019-12-31T11:07:34Z

Do you think it would be sufficient to change the stress test from joining 1000 items to joining 100000 items?

Actually that won't work, because the existing stress test is using a non-empty separator. I'll add another version of that stress test that uses an empty separator.

methane · 2020-01-02T05:35:13Z

In the single-threaded case, the benchmark seems to show that for 64K+, performance is improved by dropping the GIL (which I'm guessing must be statistical noise, since there shouldn't be anything contending for it), which is my reasoning behind the 65536 threshold.

Yes, single-threaded case shouldn't affected by this change. Dropping GIL is just a very little overhead.

So you need to focus on multi-threaded case when considering threshold.
As far as seeing your result, 65536 threshold seems too small.
See 512*8192 case. It shows 1479->980 regression.
But I don't think your script shows real performance.

I updated your patch, and I saw small improvement on 512K and significant improvement on 1M.

bmerry · 2020-01-02T12:17:18Z

I'm realising that the benchmark makes it difficult to see what's going on because it doesn't separate overhead costs (slowdowns because releasing/acquiring the GIL is not free, particularly when contended) from cache effects (slowdowns due to parallel threads creating more cache pressure than threads that take turns). inada.naoki's version of the benchmark is better here because it uses the same input data for all the threads, but the output data will still be different in each thread.

For example, on my system I see a big drop in speedup (although I still get speedup) with the new benchmark once the buffer size gets to 2MB per thread, which is not surprising with an 8MB L3 cache.

My feeling is that we should try to ignore cache effects when picking a threshold, because we can't predict them generically (they'll vary by workload, thread count, CPU etc) whereas users can benchmark specific use cases to decide whether multithreading gives them a benefit. If the threshold is too low then users can always choose not to use multi-threading (and in general one doesn't expect much from it in Python) but if the threshold is too high then users have no recourse. That being said, 65536 does still seem a bit low based on the results available.

I'll try to write a variant of the benchmark in which other threads just spin in Python without creating memory pressure to see if that gives a different picture. I'll also run the benchmark on a server CPU when I'm back at work.

methane · 2020-01-03T02:42:33Z

(slowdowns because releasing/acquiring the GIL is not free, particularly when contended)

Yes, it's relatively high. We shouldn't release the GIL only for ~0.5ms. That's why 1MB~ seems nice threshold.

If the threshold is too low then users can always choose not to use multi-threading (and in general one doesn't expect much from it in Python)

I don't like this idea. We shouldn't force users to change their program from multi threaded to single threaded. So we need to avoid large performance regression.

but if the threshold is too high then users have no recourse. That being said, 65536 does still seem a bit low based on the results available.

But they can get the benefit when input is too large. 65536 seems "too low" to me.

bmerry · 2020-01-05T10:33:38Z

I've written a variant of the benchmark in which one thread does joins and the other does unrelated CPU-bound work that doesn't touch memory much. It also didn't show much benefit to thresholds below 512KB. I still want to test things on a server-class CPU, but based on the evidence so far I'm okay with a 1MB threshold.

bmerry · 2020-01-05T14:49:42Z

I ran the test on a Xeon machine (Skylake-XP) and it also looks like performance is only improved from 1MB up (somewhat to my surprise, given how poor single-threaded memcpy performance is on that machine). So I've updated the pull request with that threshold.

bmerry · 2020-01-16T09:38:24Z

I think I've addressed the concerns that were raised in this bug, but let me know if I've missed any.

methane · 2020-01-29T07:09:34Z

New changeset d07d9f4 by Bruce Merry in branch 'master':
bpo-36051: Drop GIL during large bytes.join() (GH-17757)
d07d9f4

smontanaro · 2020-01-30T17:04:16Z

I think to avoid compiler warnings about 'save' perhaps being used uninitialized, it should be initialized to NULL when declared on line 21 of Objects/stringlib/join.h.

methane · 2020-02-03T10:03:43Z

New changeset 869c0c9 by Inada Naoki in branch 'master':
bpo-36051: Fix compiler warning. (GH-18325)
869c0c9

bmerry mannequin added 3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Feb 20, 2019

SilentGhost mannequin added 3.8 only security fixes and removed 3.7 (EOL) end of life labels Feb 20, 2019

SilentGhost mannequin changed the title ~~(Performance) Drop the GIL during large bytes.join operations?~~ Drop the GIL during large bytes.join operations? Feb 20, 2019

SilentGhost mannequin added the performance Performance or resource usage label Feb 20, 2019

isidentical added 3.9 only security fixes and removed 3.8 only security fixes labels Dec 21, 2019

methane closed this as completed Jan 29, 2020

smontanaro reopened this Jan 30, 2020

methane closed this as completed Feb 3, 2020

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Drop the GIL during large bytes.join operations? #80232

Drop the GIL during large bytes.join operations? #80232

bmerry mannequin commented Feb 20, 2019

bmerry mannequin commented Feb 20, 2019

serhiy-storchaka commented Dec 21, 2019

pitrou commented Dec 21, 2019

methane commented Dec 23, 2019

bmerry mannequin commented Dec 23, 2019

MojoVampire mannequin commented Dec 31, 2019

methane commented Dec 31, 2019

bmerry mannequin commented Dec 31, 2019

serhiy-storchaka commented Dec 31, 2019

bmerry mannequin commented Dec 31, 2019

bmerry mannequin commented Dec 31, 2019

bmerry mannequin commented Dec 31, 2019

methane commented Jan 2, 2020

bmerry mannequin commented Jan 2, 2020

methane commented Jan 3, 2020

bmerry mannequin commented Jan 5, 2020

bmerry mannequin commented Jan 5, 2020

bmerry mannequin commented Jan 16, 2020

methane commented Jan 29, 2020

smontanaro commented Jan 30, 2020

methane commented Feb 3, 2020

Drop the GIL during large bytes.join operations? #80232

Drop the GIL during large bytes.join operations? #80232

Comments

bmerry mannequin commented Feb 20, 2019

bmerry mannequin commented Feb 20, 2019

serhiy-storchaka commented Dec 21, 2019

pitrou commented Dec 21, 2019

methane commented Dec 23, 2019

bmerry mannequin commented Dec 23, 2019

MojoVampire mannequin commented Dec 31, 2019

methane commented Dec 31, 2019

bmerry mannequin commented Dec 31, 2019

serhiy-storchaka commented Dec 31, 2019

bmerry mannequin commented Dec 31, 2019

bmerry mannequin commented Dec 31, 2019

bmerry mannequin commented Dec 31, 2019

methane commented Jan 2, 2020

bmerry mannequin commented Jan 2, 2020

methane commented Jan 3, 2020

bmerry mannequin commented Jan 5, 2020

bmerry mannequin commented Jan 5, 2020

bmerry mannequin commented Jan 16, 2020

methane commented Jan 29, 2020

smontanaro commented Jan 30, 2020

methane commented Feb 3, 2020