Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON loads performance improvement for long strings #81768

Closed
mpaolini mannequin opened this issue Jul 13, 2019 · 30 comments
Closed

JSON loads performance improvement for long strings #81768

mpaolini mannequin opened this issue Jul 13, 2019 · 30 comments
Labels
3.8 only security fixes 3.9 only security fixes performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@mpaolini
Copy link
Mannequin

mpaolini mannequin commented Jul 13, 2019

BPO 37587
Nosy @rhettinger, @ncoghlan, @tiran, @ezio-melotti, @methane, @serhiy-storchaka, @zooba, @mpaolini, @pablogsal, @miss-islington
PRs
  • bpo-37587: Make json.loads faster for long strings #14752
  • [3.8] bpo-37587: Make json.loads faster for long strings (GH-14752) #15022
  • bpo-37587: optimize json.loads for large string #15134
  • bpo-37587: _json: use _PyUnicodeWriter when scanning string #15591
  • Files
  • events.svg: spy-py flamegraph
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-10-17.08:30:11.441>
    created_at = <Date 2019-07-13.14:42:02.357>
    labels = ['3.8', 'library', '3.9', 'performance']
    title = 'JSON loads performance improvement for long strings'
    updated_at = <Date 2019-10-17.08:30:11.441>
    user = 'https://github.com/mpaolini'

    bugs.python.org fields:

    activity = <Date 2019-10-17.08:30:11.441>
    actor = 'methane'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-10-17.08:30:11.441>
    closer = 'methane'
    components = ['Library (Lib)']
    creation = <Date 2019-07-13.14:42:02.357>
    creator = 'mpaolini'
    dependencies = []
    files = ['48476']
    hgrepos = []
    issue_num = 37587
    keywords = ['patch']
    message_count = 30.0
    messages = ['347832', '347839', '347854', '348112', '348114', '348336', '348707', '348709', '348710', '348743', '348744', '348745', '348746', '348748', '348749', '348750', '348751', '348753', '348758', '348759', '348762', '349040', '349042', '349226', '349823', '349825', '349826', '349830', '350761', '354833']
    nosy_count = 10.0
    nosy_names = ['rhettinger', 'ncoghlan', 'christian.heimes', 'ezio.melotti', 'methane', 'serhiy.storchaka', 'steve.dower', 'mpaolini', 'pablogsal', 'miss-islington']
    pr_nums = ['14752', '15022', '15134', '15591']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue37587'
    versions = ['Python 3.8', 'Python 3.9']

    @mpaolini
    Copy link
    Mannequin Author

    mpaolini mannequin commented Jul 13, 2019

    I analysed the performance of json.loads in one production workload we have.

    Spy-py tells me the majority of time is spent into C json module (see events.svg)

    Digging deeper, linux perf tells me hottest loop (where 20%+ of the time is spent) in _json.scanstring_unicode, in this loop:

    189: movzx eax,BYTE PTR [rbp+rbx*1+0x0]
    mov DWORD PTR [rsp+0x44],eax
    cmp eax,0x22
    je 22f
    cmp eax,0x5c
    je 22f
    test r13d,r13d
    je 180
    cmp eax,0x1f

    which is related to this code in Modules/_json.c

            /* Find the end of the string or the next escape */
            Py_UCS4 c = 0;
            for (next = end; next < len; next++) {
                c = PyUnicode_READ(kind, buf, next);
                if (c == '"' || c == '\\') {
                    break;
                }
                else if (strict && c <= 0x1f) {
                    raise_errmsg("Invalid control character at", pystr, next);
                    goto bail;
                }
            }

    Two optimisations can be done:

    1. remove the mov entirely. It is not needed inside the loop and it is only needed later, outside the loop to access the variable
    2. switch around the strict check (in the second if) because strict defaults to 1 so it will likely pass the test, while the likelyness of finding an invalid character is lower

    Running the pyperformance json_loads benchmark I get:

    +------------+----------------------+-----------------------------+
    | Benchmark | vanilla-pyperf-pgo38 | patched-pyperf-pgo38 |
    +============+======================+=============================+
    | json_loads | 54.9 us | 53.9 us: 1.02x faster (-2%) |
    +------------+----------------------+-----------------------------+

    A micro benchmark on a 1MB long json string gives better results:

    python -m pyperf timeit -s "import json; x = json.dumps({'k': '1' * 2 ** 20})" "json.loads(x)"

    +-----------+------------+-----------------------------+
    | Benchmark | vanilla-1m | patched-1m |
    +===========+============+=============================+
    | timeit | 2.62 ms | 2.39 ms: 1.10x faster (-9%) |
    +-----------+------------+-----------------------------+

    @mpaolini mpaolini mannequin added 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir performance Performance or resource usage labels Jul 13, 2019
    @mpaolini
    Copy link
    Mannequin Author

    mpaolini mannequin commented Jul 13, 2019

    Also on my real workload (loading 60GB jsonl file containing mostly strings) I measured a 10% improvement

    @mpaolini
    Copy link
    Mannequin Author

    mpaolini mannequin commented Jul 13, 2019

    Here's the real world example

    $ ls -hs events-100k.json 
    84M events-100k.json

    +-----------+-------------------------+-----------------------------+
    | Benchmark | vanilla-bpo-events-100k | patched-bpo-events-100k |
    +===========+=========================+=============================+
    | timeit | 985 ms | 871 ms: 1.13x faster (-12%) |
    +-----------+-------------------------+-----------------------------+

    @methane
    Copy link
    Member

    methane commented Jul 18, 2019

    1. remove the mov entirely. It is not needed inside the loop and it is only needed later, outside the loop to access the variable

    How can we lazy "mov DWORD PTR [rsp+0x44],eax"?

    @methane
    Copy link
    Member

    methane commented Jul 18, 2019

    Some compilers produce inefficient code for PR-14752.
    I wrote another patch which is friendly to more compilers.

    $ perf record ./python -m pyperf timeit  -s "import json; x = json.dumps({'k': '1' * 2 ** 20})" "json.loads(x)"

    # PR-14752

    gcc-7 (Ubuntu 7.4.0-8ubuntu1) 7.4.0
    Mean +- std dev: 1.11 ms +- 0.01 ms

           │     scanstring_unicode():
           │                 c = PyUnicode_READ(kind, buf, next);
     11.92 │270:   movzbl (%r15,%r8,1),%eax
           │                 if (c == '"' || c == '\\') {
     27.97 │       cmp    $0x22,%eax
           │                 c = PyUnicode_READ(kind, buf, next);
     29.22 │       mov    %eax,0x34(%rsp)
           │                 if (c == '"' || c == '\\') {
      0.46 │     ↑ je     ef
      0.02 │       cmp    $0x5c,%eax
           │     ↑ je     ef
           │                 if (c <= 0x1f && invalid < 0) {
           │       cmp    $0x1f,%eax
      0.00 │     ↓ ja     297
           │       test   %rdx,%rdx
           │       cmovs  %r8,%rdx
           │             for (next = end; next < len; next++) {
     29.49 │297:   add    $0x1,%r8
           │       cmp    %r8,%r12
      0.92 │     ↑ jne    270
    

    gcc-8 (Ubuntu 8.3.0-6ubuntu1) 8.3.0
    Mean +- std dev: 712 us +- 1 us

           │                 c = PyUnicode_READ(kind, buf, next);
           │188:   movzbl 0x0(%rbp,%rbx,1),%eax
           │       mov    %eax,0x34(%rsp)
           │                 if (c == '"' || c == '\\') {
           │       cmp    $0x22,%eax
           │     ↓ je     1d0
           │       nop
      0.00 │1a0:   cmp    $0x5c,%eax
           │     ↓ je     1d0
           │                 if (c <= 0x1f && invalid < 0) {
           │       cmp    $0x1f,%eax
     49.84 │     ↓ ja     1b1
           │       test   %rdx,%rdx
           │       cmovs  %rbx,%rdx
           │             for (next = end; next < len; next++) {
           │1b1:   add    $0x1,%rbx
      0.00 │       cmp    %rbx,%r15
           │     ↑ je     ff
           │                 c = PyUnicode_READ(kind, buf, next);
      0.61 │       movzbl 0x0(%rbp,%rbx,1),%eax
     49.53 │       mov    %eax,0x34(%rsp)
           │                 if (c == '"' || c == '\\') {
      0.01 │       cmp    $0x22,%eax
           │     ↑ jne    1a0
      0.00 │       nop
    

    clang version 7.0.1-8 (tags/RELEASE_701/final)
    Mean +- std dev: 951 us +- 1 us

           │                 c = PyUnicode_READ(kind, buf, next);
      9.76 │110:   movzbl (%r12,%r13,1),%eax
      9.47 │       mov    %eax,0xc(%rsp)
      8.85 │       cmp    $0x22,%eax
           │                 if (c == '"' || c == '\\') {
           │     ↓ je     170
      8.78 │       cmp    $0x5c,%al
           │     ↓ je     170
           │                 if (c <= 0x1f && invalid < 0) {
      9.16 │       cmp    $0x20,%al
      9.09 │       mov    %rdx,%rcx
      9.16 │       cmovb  %r13,%rcx
      9.00 │       test   %rdx,%rdx
      8.78 │       cmovs  %rcx,%rdx
           │             for (next = end; next < len; next++) {
      9.09 │       add    $0x1,%r13
           │       cmp    %r15,%r13
      8.86 │     ↑ jl     110
           │     ↓ jmp    170
           │       nop
    

    clang version 8.0.0-3 (tags/RELEASE_800/final)
    Mean +- std dev: 953 us +- 0 us

           │                 c = PyUnicode_READ(kind, buf, next);
     10.04 │100:   movzbl (%r15,%r14,1),%eax
      9.27 │       mov    %eax,0x4(%rsp)
      8.87 │       cmp    $0x22,%eax
           │                 if (c == '"' || c == '\\') {
           │     ↓ je     160
      8.78 │       cmp    $0x5c,%al
           │     ↓ je     160
           │                 if (c <= 0x1f && invalid < 0) {
      8.97 │       cmp    $0x20,%al
      8.97 │       mov    %rdx,%rcx
      8.89 │       cmovb  %r14,%rcx
      8.81 │       test   %rdx,%rdx
      9.14 │       cmovs  %rcx,%rdx
           │             for (next = end; next < len; next++) {
      9.25 │       add    $0x1,%r14
           │       cmp    %rdi,%r14
      8.99 │     ↑ jl     100
           │     ↓ jmp    160
           │       nop
    

    # modified

            /* Find the end of the string or the next escape */
            Py_UCS4 c;
            {
                Py_UCS4 d = 0;
                for (next = end; next < len; next++) {
                    d = PyUnicode_READ(kind, buf, next);
                    if (d == '"' || d == '\\') {
                        break;
                    }
                    if (d <= 0x1f && strict) {
                        raise_errmsg("Invalid control character at", pystr, next);
                        goto bail;
                    }
                }
                c = d;
            }
    

    gcc-7 (Ubuntu 7.4.0-8ubuntu1) 7.4.0
    Mean +- std dev: 708 us +- 1 us

           │                 for (next = end; next < len; next++) {
     20.29 │170:   add    $0x1,%rbx
      0.31 │       cmp    %rbx,%r12
           │     ↓ je     1b0
           │                     d = PyUnicode_READ(kind, buf, next);
     44.48 │179:   movzbl 0x0(%rbp,%rbx,1),%eax
           │                     if (d == '"' || d == '\\') {
      5.38 │       cmp    $0x22,%eax
           │     ↓ je     2c0
     23.82 │       cmp    $0x5c,%eax
           │     ↓ je     2c0
           │                     if (d <= 0x1f && strict) {
           │       cmp    $0x1f,%eax
      5.68 │     ↑ ja     170
           │       test   %r13d,%r13d
           │     ↑ jne    ed
    

    gcc-8 (Ubuntu 8.3.0-6ubuntu1) 8.3.0
    Mean +- std dev: 708 us +- 1 us

           │                 for (next = end; next < len; next++) {
      6.54 │170:   add    $0x1,%rbx
     19.25 │       cmp    %rbx,%r12
           │     ↓ jle    341
           │                     d = PyUnicode_READ(kind, buf, next);
     13.89 │17d:   movzbl 0x0(%rbp,%rbx,1),%eax
           │                     if (d == '"' || d == '\\') {
     34.26 │       cmp    $0x22,%eax
           │     ↓ je     1e8
      6.88 │       cmp    $0x5c,%eax
           │     ↓ je     1e8
           │                     if (d <= 0x1f && strict) {
           │       cmp    $0x1f,%eax
     19.17 │     ↑ ja     170
           │       test   %r14d,%r14d
           │     ↑ je     170
           │     ↑ jmpq   ed
    

    clang version 7.0.1-8 (tags/RELEASE_701/final)
    Mean +- std dev: 722 us +- 10 us

           │                     d = PyUnicode_READ(kind, buf, next);
     11.62 │ c0:┌─→movzbl (%r12,%r13,1),%eax
     11.99 │    │  cmp    $0x22,%eax
           │    │                if (d == '"' || d == '\\') {
           │    │↓ je     1f0
      9.61 │    │  cmp    $0x5c,%al
     22.56 │    │↓ je     1f0
           │    │                    break;
           │    │                }
           │    │                if (d <= 0x1f && strict) {
      8.94 │    │  cmp    $0x20,%al
           │    │↓ jb     b4a
           │    │            for (next = end; next < len; next++) {
     12.53 │    │  add    $0x1,%r13
           │    ├──cmp    %r15,%r13
     22.72 │    └──jl     c0
           │     ↓ jmpq   1f0
    

    clang version 8.0.0-3 (tags/RELEASE_800/final)
    Mean +- std dev: 707 us +- 1 us

           │                     d = PyUnicode_READ(kind, buf, next);
      0.01 │ b0:   movzbl (%r12,%r13,1),%eax
     23.84 │       cmp    $0x22,%eax
           │                     if (d == '"' || d == '\\') {
      0.00 │     ↓ je     1c0
      0.01 │       cmp    $0x5c,%al
           │     ↓ je     1c0
           │                         break;
           │                     }
           │                     if (d <= 0x1f && strict) {
     26.23 │       cmp    $0x20,%al
           │     ↓ jb     b1e
           │                 for (next = end; next < len; next++) {
           │       add    $0x1,%r13
           │       cmp    %r15,%r13
     49.91 │     ↑ jl     b0
           │     ↓ jmpq   1c0
    

    @methane methane removed 3.7 (EOL) end of life 3.8 only security fixes labels Jul 19, 2019
    @zooba
    Copy link
    Member

    zooba commented Jul 23, 2019

    Marco has a newer patch with better performance that we came up with at the sprints, but apparently it hasn't been pushed yet. Hopefully he'll get that up soon and we can review it instead - the current PR wasn't as reliably good as initial testing suggested.

    @zooba zooba added the 3.8 only security fixes label Jul 23, 2019
    @mpaolini
    Copy link
    Mannequin Author

    mpaolini mannequin commented Jul 30, 2019

    On gcc, running the tests above, the only change that is relevant for speedup is switching around the strict check. Removing the extra MOV related to the outer "c" variable is not significant (at least on gcc and the few tests I did)

    Unfortunately I had to change the patch we did together during the sprint because it was breaking the strict check logic...

    I updated my PR accordingly, kept only the bare minimum.

    @mpaolini
    Copy link
    Mannequin Author

    mpaolini mannequin commented Jul 30, 2019

    I am also working on a different patch that uses the "pcmpestri" SSE4 processor instruction, it looks like this for now.

    While at it I realized there is (maybe) another potential speedup: avoiding the ucs4lib_find_max_char we do for each chunk of the string ( that entails scanning the string in memory one more time)... anyways that's another (much longer) story, probably for another issue?

    diff --git a/Modules/_json.c b/Modules/_json.c
    index 38beb6f50d..25b1cf4a99 100644
    --- a/Modules/_json.c
    +++ b/Modules/_json.c
    @@ -400,6 +400,38 @@ _build_rval_index_tuple(PyObject *rval, Py_ssize_t idx) {
             Py_CLEAR(chunk); \
         }
     
    +
    +inline unsigned int
    +_fast_search(const void *needle, unsigned int needle_len, const void *haystack, unsigned int haystack_len)
    +{
    +  unsigned int pos;
    +  __asm__ __volatile__("movq (%1), %%xmm1;\n"
    +                       "mov %2, %%eax;\n"
    +                       "movq %3, %%r8;\n"
    +                       "mov %4, %%edx;\n"
    +                       ".intel_syntax noprefix;\n"
    +                       "loop: pcmpestri xmm1, [r8], 0;\n" /* 0 = equal any */
    +                       /* "pcmpestri %%mm1, (%%r8), $0;\n" /\* 0 = equal any *\/ */
    +                       ".att_syntax prefix;\n"
    +                       "cmp $15, %%ecx;\n"
    +                       "jbe found;\n"
    +                       "sub $16, %%edx;\n"
    +                       "jnge notfound;\n"
    +                       "add $16, %%r8;\n"
    +                       "jmp loop;\n"
    +                       "notfound: movl %4, %%ecx;\n"
    +                       "jmp exit;\n"
    +                       "found: mov %4, %%eax;\n"
    +                       "sub %%edx, %%eax;\n"
    +                       "add %%eax, %%ecx;\n"
    +                       "exit: mov %%ecx, %0;\n"
    +                       :"=m"(pos)
    +                       :"r"(needle), "r"(needle_len), "r"(haystack), "r"(haystack_len)
    +                       :"%eax", "%edx", "%ecx", "%r8", "%xmm1");
    +  return pos;
    +}
    +
    +
     static PyObject *
     scanstring_unicode(PyObject *pystr, Py_ssize_t end, int strict, Py_ssize_t *next_end_ptr)
     {
    @@ -431,17 +463,26 @@ scanstring_unicode(PyObject *pystr, Py_ssize_t end, int strict, Py_ssize_t *next
             PyErr_SetString(PyExc_ValueError, "end is out of bounds");
             goto bail;
         }
    +    char needle[2];
    +    needle[0] = '"';
    +    needle[1] = '\\';
         while (1) {
             /* Find the end of the string or the next escape */
             Py_UCS4 c = 0;
    -        for (next = end; next < len; next++) {
    +        if (kind == PyUnicode_1BYTE_KIND) {
    +          next = _fast_search(needle, 2, buf+end, len-end) + end;
    +          if (next < len)
                 c = PyUnicode_READ(kind, buf, next);
    -            if (c == '"' || c == '\\') {
    -                break;
    -            }
    -            else if (strict && c <= 0x1f) {
    -                raise_errmsg("Invalid control character at", pystr, next);
    -                goto bail;
    +        } else {
    +            for (next = end; next < len; next++) {
    +                c = PyUnicode_READ(kind, buf, next);
    +                if (c == '"' || c == '\\') {
    +                    break;
    +                }
    +                else if (strict && c <= 0x1f) {
    +                    raise_errmsg("Invalid control character at", pystr, next);
    +                    goto bail;
    +                }
                 }
             }
             if (!(c == '"' || c == '\\')) {
    

    @mpaolini
    Copy link
    Mannequin Author

    mpaolini mannequin commented Jul 30, 2019

    I forgot to mention, I was inspired by @christian.heimes 's talk at EuroPython 2019 https://ep2019.europython.eu/talks/es2pZ6C-introduction-to-low-level-profiling-and-tracing/ (thanks!)

    @ncoghlan
    Copy link
    Contributor

    New changeset 8a758f5 by Nick Coghlan (Marco Paolini) in branch 'master':
    bpo-37587: Make json.loads faster for long strings (GH-14752)
    8a758f5

    @ncoghlan
    Copy link
    Contributor

    I went ahead and merged the minimal PR and flagged it for backporting to 3.8 - it's an obviously beneficial change, that clearly does less work on each pass through the loop.

    Even if you are doing non-strict parsing of a string that consists entirely of invalid characters, you'll get the same level of performance that the previous code offered for all strict parsing.

    @miss-islington
    Copy link
    Contributor

    New changeset 9265a87 by Miss Islington (bot) in branch '3.8':
    bpo-37587: Make json.loads faster for long strings (GH-14752)
    9265a87

    @methane
    Copy link
    Member

    methane commented Jul 30, 2019

    Wait... there is no benchmark for the "minimum change".

    I tested 4 compilers, and provide much better patch in https://bugs.python.org/issue37587#msg348114

    @zooba
    Copy link
    Member

    zooba commented Jul 30, 2019

    While you're testing patches, can you try this version too?

            Py_UCS4 c = 0, minc = 0x20;
            for (next = end; next < len; next++) {
                c = PyUnicode_READ(kind, buf, next);
                if (c == '"' || c == '\\') {
                    break;
                }
                minc = c < minc ? c : minc;
            }
            if (strict && minc <= 0x1f) {
                raise_errmsg("Invalid control character at", pystr, next);
                goto bail;
            }

    When we tried this, the conditional expression became a "cmovl" operator which removed 3-4 branches from within the loop entirely, and it was 18% better than the baseline (which has now moved...)

    @zooba
    Copy link
    Member

    zooba commented Jul 30, 2019

    Oh, we also need to capture "next"... but then again, since the success case is far more common, I'd be okay with scanning again to find it.

    @mpaolini
    Copy link
    Mannequin Author

    mpaolini mannequin commented Jul 30, 2019

    @steve.dower yes, that's what made me discard that experiment we did during the sprint.

    Ok will test your new patch soon

    @methane
    Copy link
    Member

    methane commented Jul 30, 2019

    Since scope of "c" is very wide, and there is even &c in the scope, compiler stores the c to stack every time on:

      c = PyUnicode_READ(kind, buf, next);

    That is the bottleneck. if (strict && ...) is not the bottleneck.

    My patch used a new variable with tight scope so compiler can bypass store it to the stack.

    @zooba
    Copy link
    Member

    zooba commented Jul 30, 2019

    compiler stores the c to stack every time

    The disassembly we looked at didn't do this, so it may just be certain compilers. Perhaps we can actually use the register keyword to help them out? :)

    Here's a slightly altered one that doesn't require rescanning for the sake of the error message:

            Py_UCS4 c = 0, minc = strict ? 0x20 : 0x00;
            for (next = end; next < len; next++) {
                c = PyUnicode_READ(kind, buf, next);
                if (c == '"' || c == '\\' || c < minc) {
                    break;
                }
            }
            if (c < minc) {
                raise_errmsg("Invalid control character at", pystr, next);
                goto bail;
            }

    @methane
    Copy link
    Member

    methane commented Jul 30, 2019

    I tested before, after, Steve's patch, and my patch with gcc 8.3.0 and PGO+LTO.
    https://gist.github.com/methane/f6077bd1b0b04d40a9c790d9ed670a44#file-gcc-8-3-0-pgo-md

    Surprisingly, there is no difference. Even my patch didn't help register allocation when PGO is enabled.

    I will run same test with other compilers & PGO (enabled|disabled).

    @methane
    Copy link
    Member

    methane commented Jul 30, 2019

    I'm sorry, I was wrong. PGO did very nice job on all cases.
    gcc allocates c to register in the hot loop.

    @methane
    Copy link
    Member

    methane commented Jul 30, 2019

    This issue is very compiler sensitive.
    Please don't report performance without compiler version and PGO option.

    Now I'm facing strange behavior. pyperf reports slower time (1ms) for PGO builds, although disasm looks good.
    But it's 2:30am already... Please wait for a few days.

    @methane
    Copy link
    Member

    methane commented Aug 5, 2019

    I tried without PGO and confirmed performance improved on GCC 7.2.0.
    No change on other compiler versions.

    $ ./python -m pyperf timeit  -s "import json; x = json.dumps({'k': '1' * 2 ** 20})" "json.loads(x)"

    old: 9211e2
    new: 8a758f

    gcc (Ubuntu 8.3.0-6ubuntu1) 8.3.0

    old: Mean +- std dev: 721 us +- 0 us
    new: Mean +- std dev: 722 us +- 0 us

    gcc-7 (Ubuntu 7.4.0-8ubuntu1) 7.4.0

    old: Mean +- std dev: 1.03 ms +- 0.00 ms
    new: Mean +- std dev: 726 us +- 0 us

    clang version 7.0.1-8 (tags/RELEASE_701/final)

    old: Mean +- std dev: 721 us +- 1 us
    new: Mean +- std dev: 722 us +- 0 us

    clang version 8.0.0-3 (tags/RELEASE_800/final)

    old: Mean +- std dev: 721 us +- 0 us
    new: Mean +- std dev: 721 us +- 1 us

    @methane
    Copy link
    Member

    methane commented Aug 5, 2019

    And I confirmed performance improvement by my patch (GH-15134) on all of 4 compilers.

    $ ./python -m pyperf timeit  -s "import json; x = json.dumps({'k': '1' * 2 ** 20})" "json.loads(x)"

    old: 9211e2
    new: 8a758f
    opt2: 284e47

    gcc (Ubuntu 8.3.0-6ubuntu1) 8.3.0

    old: Mean +- std dev: 721 us +- 0 us
    new: Mean +- std dev: 722 us +- 0 us
    opt2: Mean +- std dev: 715 us +- 1 us

    gcc-7 (Ubuntu 7.4.0-8ubuntu1) 7.4.0

    old: Mean +- std dev: 1.03 ms +- 0.00 ms
    new: Mean +- std dev: 726 us +- 0 us
    opt2: Mean +- std dev: 715 us +- 0 us

    clang version 7.0.1-8 (tags/RELEASE_701/final)

    old: Mean +- std dev: 721 us +- 1 us
    new: Mean +- std dev: 722 us +- 0 us
    opt2: Mean +- std dev: 715 us +- 1 us

    clang version 8.0.0-3 (tags/RELEASE_800/final)

    old: Mean +- std dev: 721 us +- 0 us
    new: Mean +- std dev: 721 us +- 1 us
    opt2: Mean +- std dev: 715 us +- 0 us

    @methane
    Copy link
    Member

    methane commented Aug 8, 2019

    New changeset 2a570af by Inada Naoki in branch 'master':
    bpo-37587: optimize json.loads (GH-15134)
    2a570af

    @mpaolini
    Copy link
    Mannequin Author

    mpaolini mannequin commented Aug 15, 2019

    I also confirm Inada's patch further improves performance!

    All my previous benchmarks were done with gcc and PGO optimizations performed only with test_json task... maybe this explains the weird results?

    I tested the performance of new master 69f37bc with *all* PGO task reverting my original patch:

    iff --git a/Modules/_json.c b/Modules/_json.c
    index 112903ea57..9b63167276 100644
    --- a/Modules/_json.c
    +++ b/Modules/_json.c
    @@ -442,7 +442,7 @@ scanstring_unicode(PyObject *pystr, Py_ssize_t end, int strict, Py_ssize_t *next
                     if (d == '"' || d == '\\') {
                         break;
                     }
    -                if (d <= 0x1f && strict) {
    +                if (strict && d <= 0x1f) {
                         raise_errmsg("Invalid control character at", pystr, next);
                         goto bail;
                     }

    ... and surprise...

    Mean +- std dev: [69f37bc] 5.29 us +- 0.07 us -> [69f37bc-patched] 5.11 us +- 0.03 us: 1.04x faster (-4%)

    should we revert my original patch entirely now? Or am I missing something?

    @mpaolini
    Copy link
    Mannequin Author

    mpaolini mannequin commented Aug 15, 2019

    also worth noting escape sequences for non-ascii characters are slower, even when encoded length is the same.

    python -m pyperf timeit -s 'import json;' -s 'c = "€"; s = json.dumps(c * (2**10 // len(json.dumps(c)) - 2))' 'json.loads(s)' -o nonascii2k.json

    python -m pyperf timeit -s 'import json;' -s 'c = "a"; s = json.dumps(c * (2**10 // len(json.dumps(c)) - 2))' 'json.loads(s)' -o ascii2k.json

    Mean +- std dev: [ascii2k] 2.59 us +- 0.04 us -> [nonascii2k] 9.98 us +- 0.12 us: 3.86x slower (+286%)

    @mpaolini
    Copy link
    Mannequin Author

    mpaolini mannequin commented Aug 15, 2019

    ops sorry here's the right commands

    python -m pyperf timeit -s 'import json;' -s 'c = "a"; s = json.dumps(c * (2**10 // (len(json.dumps(c)) - 2)))' 'json.loads(s)' -o ascii2k.json
    python -m pyperf timeit -s 'import json;' -s 'c = "€"; s = json.dumps(c * (2**10 // (len(json.dumps(c)) - 2)))' 'json.loads(s)' -o nonascii2k.json

    Mean +- std dev: [ascii2k] 3.69 us +- 0.05 us -> [nonascii2k] 12.4 us +- 0.1 us: 3.35x slower (+235%)

    @mpaolini
    Copy link
    Mannequin Author

    mpaolini mannequin commented Aug 15, 2019

    ujson (https://github.com/esnme/ultrajson) instead is faster when decoding non-ascii in the same example above, so it is likely there is room for improvement...

    @methane
    Copy link
    Member

    methane commented Aug 29, 2019

    @mpaolini I don't have enough time in these weeks. Would you try PR-15591?

    I confirmed up to 4x speedup. But I'm afraid about there is performance regression in simple cases.

    @methane
    Copy link
    Member

    methane commented Oct 17, 2019

    New changeset 9c11029 by Inada Naoki in branch 'master':
    bpo-37587: json: Use _PyUnicodeWriter when scanning string. (GH-15591)
    9c11029

    @methane methane closed this as completed Oct 17, 2019
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.8 only security fixes 3.9 only security fixes performance Performance or resource usage stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants