JSON loads performance improvement for long strings #81768
Comments
I analysed the performance of json.loads in one production workload we have. py-spy tells me the majority of the time is spent in the C _json module (see events.svg). Digging deeper, linux perf tells me the hottest loop (where 20%+ of the time is spent) in _json.scanstring_unicode is this one:

189: movzx  eax,BYTE PTR [rbp+rbx*1+0x0]

which corresponds to this code in Modules/_json.c:

/* Find the end of the string or the next escape */
Py_UCS4 c = 0;
for (next = end; next < len; next++) {
c = PyUnicode_READ(kind, buf, next);
if (c == '"' || c == '\\') {
break;
}
else if (strict && c <= 0x1f) {
raise_errmsg("Invalid control character at", pystr, next);
goto bail;
}
}

Two optimisations can be done: switch the order of the strict check so the loop-invariant "strict" flag is tested first, and give the character variable a tighter scope so the compiler can keep it in a register instead of storing it to the stack on each iteration (both sketched below).
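A rough sketch of the loop with both changes applied (illustrative only; the inner block and the variable name "d" follow the later discussion in this thread, not necessarily the exact patch):

{
    /* (2) Tightly-scoped variable: the compiler can keep "d" in a
       register instead of spilling it to the stack on each pass. */
    Py_UCS4 d = 0;
    for (next = end; next < len; next++) {
        d = PyUnicode_READ(kind, buf, next);
        if (d == '"' || d == '\\') {
            break;
        }
        /* (1) Test the loop-invariant "strict" flag first, so the
           compiler can treat the check as invariant and unswitch it
           out of the hot loop. */
        if (strict && d <= 0x1f) {
            raise_errmsg("Invalid control character at", pystr, next);
            goto bail;
        }
    }
    c = d;
}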
Running the pyperformance json_loads benchmark I get:

[results table not preserved]

A micro benchmark on a 1MB long json string gives better results:

python -m pyperf timeit -s "import json; x = json.dumps({'k': '1' * 2 ** 20})" "json.loads(x)"

[results table not preserved] |
Also on my real workload (loading a 60GB jsonl file containing mostly strings) I measured a 10% improvement. |
Here's the real world example:

$ ls -hs events-100k.json
84M events-100k.json

[results table not preserved] |
How can we avoid that "mov DWORD PTR [rsp+0x44],eax" store? |
Some compilers produce inefficient code for PR-14752.

$ perf record ./python -m pyperf timeit -s "import json; x = json.dumps({'k': '1' * 2 ** 20})" "json.loads(x)"

# PR-14752
gcc-7 (Ubuntu 7.4.0-8ubuntu1) 7.4.0
gcc-8 (Ubuntu 8.3.0-6ubuntu1) 8.3.0
clang version 7.0.1-8 (tags/RELEASE_701/final)
clang version 8.0.0-3 (tags/RELEASE_800/final)

# modified
gcc-7 (Ubuntu 7.4.0-8ubuntu1) 7.4.0
gcc-8 (Ubuntu 8.3.0-6ubuntu1) 8.3.0
clang version 7.0.1-8 (tags/RELEASE_701/final)
clang version 8.0.0-3 (tags/RELEASE_800/final)

[per-compiler timings not preserved] |
Marco has a newer patch with better performance that we came up with at the sprints, but apparently it hasn't been pushed yet. Hopefully he'll get that up soon and we can review it instead - the current PR wasn't as reliably good as initial testing suggested. |
On gcc, running the tests above, the only change that is relevant for the speedup is switching around the strict check. Removing the extra MOV related to the outer "c" variable is not significant (at least on gcc and in the few tests I did). Unfortunately I had to change the patch we did together during the sprint because it was breaking the strict-check logic... I updated my PR accordingly and kept only the bare minimum. |
I am also working on a different patch that uses the "pcmpestri" SSE4 processor instruction; it looks like this for now. While at it I realized there is (maybe) another potential speedup: avoiding the ucs4lib_find_max_char we do for each chunk of the string (that entails scanning the string in memory one more time)... anyway, that's another (much longer) story, probably for another issue?
|
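The pcmpestri patch itself was not preserved in this thread. As a hypothetical sketch of the idea (the function name, needle layout, and tail handling here are my own, assuming 1-byte/UCS1 string data):

#include <nmmintrin.h>  /* SSE4.2 intrinsics; compile with -msse4.2 */
#include <stddef.h>

/* Return the index of the first '"', '\\' or control character (<= 0x1f)
   in buf, or len if there is none. */
static size_t
find_special_sse42(const unsigned char *buf, size_t len)
{
    /* PCMPESTRI in "ranges" mode: the needle holds (lo, hi) byte pairs.
       Three ranges: 0x00-0x1f, '"'-'"', '\\'-'\\' -> 6 needle bytes. */
    const __m128i needles = _mm_setr_epi8(
        0x00, 0x1f, '"', '"', '\\', '\\',
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    size_t i = 0;
    /* Scan 16 bytes per instruction; the result is 16 when nothing matches. */
    while (i + 16 <= len) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        int idx = _mm_cmpestri(needles, 6, chunk, 16,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_RANGES
                               | _SIDD_LEAST_SIGNIFICANT);
        if (idx < 16) {
            return i + idx;
        }
        i += 16;
    }
    /* Scalar tail so we never read past the end of the buffer. */
    for (; i < len; i++) {
        unsigned char c = buf[i];
        if (c == '"' || c == '\\' || c <= 0x1f) {
            return i;
        }
    }
    return len;
}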
I forgot to mention, I was inspired by @christian.heimes's talk at EuroPython 2019 https://ep2019.europython.eu/talks/es2pZ6C-introduction-to-low-level-profiling-and-tracing/ (thanks!) |
I went ahead and merged the minimal PR and flagged it for backporting to 3.8 - it's an obviously beneficial change that clearly does less work on each pass through the loop. Even if you are doing non-strict parsing of a string that consists entirely of invalid characters, you'll get the same level of performance that the previous code offered for all strict parsing. |
Wait... there is no benchmark for the "minimum change". I tested 4 compilers and provided a much better patch in https://bugs.python.org/issue37587#msg348114 |
While you're testing patches, can you try this version too?

Py_UCS4 c = 0, minc = 0x20;
for (next = end; next < len; next++) {
c = PyUnicode_READ(kind, buf, next);
if (c == '"' || c == '\\') {
break;
}
minc = c < minc ? c : minc;
}
if (strict && minc <= 0x1f) {
raise_errmsg("Invalid control character at", pystr, next);
goto bail;
}

When we tried this, the conditional expression became a "cmovl" instruction, which removed 3-4 branches from within the loop entirely, and it was 18% better than the baseline (which has now moved...) |
Oh, we also need to capture "next"... but then again, since the success case is far more common, I'd be okay with scanning again to find it. |
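A sketch of what that rescan-on-error path could look like, applied to the minc variant above (illustrative; this exact code was not posted in the thread):

/* Control characters are the rare case, so only on failure do we rescan
   to recover the exact offset for the error message. */
if (strict && minc <= 0x1f) {
    for (next = end; next < len; next++) {
        if (PyUnicode_READ(kind, buf, next) <= 0x1f) {
            break;
        }
    }
    raise_errmsg("Invalid control character at", pystr, next);
    goto bail;
}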
@steve.dower yes, that's what made me discard that experiment we did during the sprint. OK, I will test your new patch soon. |
Since the scope of "c" is very wide, and there is even another c = PyUnicode_READ(kind, buf, next); later on, that store is the bottleneck. My patch used a new variable with a tight scope so the compiler can avoid storing it to the stack. |
The disassembly we looked at didn't do this, so it may just be certain compilers. Perhaps we can actually use the register keyword to help them out? :) Here's a slightly altered one that doesn't require rescanning for the sake of the error message:

Py_UCS4 c = 0, minc = strict ? 0x20 : 0x00;
for (next = end; next < len; next++) {
c = PyUnicode_READ(kind, buf, next);
if (c == '"' || c == '\\' || c < minc) {
break;
}
}
if (c < minc) {
raise_errmsg("Invalid control character at", pystr, next);
goto bail;
} |
I tested before, after, Steve's patch, and my patch with gcc 8.3.0 and PGO+LTO. Surprisingly, there is no difference. Even my patch didn't help register allocation when PGO is enabled. I will run the same test with other compilers, with PGO enabled and disabled. |
I'm sorry, I was wrong. PGO did a very nice job in all cases. |
This issue is very compiler-sensitive. Now I'm facing strange behavior: pyperf reports a slower time (1 ms) for PGO builds, although the disasm looks good. |
I tried without PGO and confirmed performance improved on GCC 7.2.0.

$ ./python -m pyperf timeit -s "import json; x = json.dumps({'k': '1' * 2 ** 20})" "json.loads(x)"

old: 9211e2

gcc (Ubuntu 8.3.0-6ubuntu1) 8.3.0
old: Mean +- std dev: 721 us +- 0 us

gcc-7 (Ubuntu 7.4.0-8ubuntu1) 7.4.0
old: Mean +- std dev: 1.03 ms +- 0.00 ms

clang version 7.0.1-8 (tags/RELEASE_701/final)
old: Mean +- std dev: 721 us +- 1 us

clang version 8.0.0-3 (tags/RELEASE_800/final)
old: Mean +- std dev: 721 us +- 0 us

[the patched timings were not preserved] |
And I confirmed the performance improvement from my patch (GH-15134) on all 4 compilers.

$ ./python -m pyperf timeit -s "import json; x = json.dumps({'k': '1' * 2 ** 20})" "json.loads(x)"

[per-compiler timings not preserved] |
I also confirm Inada's patch further improves performance! All my previous benchmarks were done with gcc and with PGO optimizations performed only with the test_json task... maybe this explains the weird results? I tested the performance of the new master 69f37bc with *all* PGO tasks, reverting my original patch:

diff --git a/Modules/_json.c b/Modules/_json.c
index 112903ea57..9b63167276 100644
--- a/Modules/_json.c
+++ b/Modules/_json.c
@@ -442,7 +442,7 @@ scanstring_unicode(PyObject *pystr, Py_ssize_t end, int strict, Py_ssize_t *next
if (d == '"' || d == '\\') {
break;
}
- if (d <= 0x1f && strict) {
+ if (strict && d <= 0x1f) {
raise_errmsg("Invalid control character at", pystr, next);
goto bail;
}

... and surprise...

Mean +- std dev: [69f37bc] 5.29 us +- 0.07 us -> [69f37bc-patched] 5.11 us +- 0.03 us: 1.04x faster (-4%)

Should we revert my original patch entirely now? Or am I missing something? |
Also worth noting: escape sequences for non-ascii characters are slower, even when the encoded length is the same.

python -m pyperf timeit -s 'import json;' -s 'c = "€"; s = json.dumps(c * (2**10 // len(json.dumps(c)) - 2))' 'json.loads(s)' -o nonascii2k.json
python -m pyperf timeit -s 'import json;' -s 'c = "a"; s = json.dumps(c * (2**10 // len(json.dumps(c)) - 2))' 'json.loads(s)' -o ascii2k.json

Mean +- std dev: [ascii2k] 2.59 us +- 0.04 us -> [nonascii2k] 9.98 us +- 0.12 us: 3.86x slower (+286%) |
Oops, sorry, here are the right commands:

python -m pyperf timeit -s 'import json;' -s 'c = "a"; s = json.dumps(c * (2**10 // (len(json.dumps(c)) - 2)))' 'json.loads(s)' -o ascii2k.json

Mean +- std dev: [ascii2k] 3.69 us +- 0.05 us -> [nonascii2k] 12.4 us +- 0.1 us: 3.35x slower (+235%) |
ujson (https://github.com/esnme/ultrajson), by contrast, is faster when decoding non-ascii input in the same example above, so there is likely room for improvement... |
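For context on where the extra time goes: with the default ensure_ascii=True, json.dumps writes "€" as the six-byte escape \u20ac, so the scan loop stops at every backslash and the decoder then parses four hex digits per character. A rough sketch of that per-escape work (illustrative, not the exact _json.c code):

/* Decoding one \uXXXX escape: four reads and several branches per
   decoded character, versus a single read per character in the fast
   scan loop for plain ASCII text. */
Py_UCS4 c = 0;
for (int i = 0; i < 4; i++) {
    Py_UCS4 digit = PyUnicode_READ(kind, buf, next + i);
    c <<= 4;
    if (digit >= '0' && digit <= '9') {
        c |= digit - '0';
    }
    else if (digit >= 'a' && digit <= 'f') {
        c |= digit - 'a' + 10;
    }
    else if (digit >= 'A' && digit <= 'F') {
        c |= digit - 'A' + 10;
    }
    else {
        raise_errmsg("Invalid \\uXXXX escape", pystr, next - 1);
        goto bail;
    }
}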