classification
Title: json C serializer performance tied to structure depth on some systems
Type: performance Stage: needs patch
Components: Library (Lib) Versions: Python 3.1, Python 3.2, Python 2.7
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: pitrou Nosy List: bob.ippolito, pitrou, rhettinger, swalker, vkuznet
Priority: low Keywords: patch

Created on 2009-07-28 20:19 by swalker, last changed 2010-12-01 02:47 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
catalog.dependency.C.gz swalker, 2009-12-07 21:01 Sample Data File
Messages (27)
msg91015 - (view) Author: Shawn (swalker) Date: 2009-07-28 20:19
The json serializer's performance (when using the C speedups) appears to
be tied to the depth of the structure being serialized on some systems.
 In particular, dict structure that are more than a few levels deep,
especially when they content mixed values (lists, strings, and other
dicts) causes severe serialization penalties (in the neighborhood of an
extra 20-30 seconds) on some systems.

On SPARC systems, this is very likely because of the recursive call
structure that the C speedups serializer uses which doesn't work well
with SPARC due to register windows.

On x86 systems, recursive call structures are cheap, so this doesn't
appear to be an issue there.

SPARC systems with higher amounts of memory bandwidth don't suffer as
badly from this (such as Niagra T1000, T2000, etc. systems), but older
UltraSparc systems are especially prone to performance issues.
msg91308 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-08-05 10:43
I'm not sure there's anything we should do about this. Some
architectures are unreasonably slow at some things, and the old SPARC
implementations are a niche nowadays. I suppose you may witness the same
kinds of slowdowns if you use cPickle rather than json.

(I'm looking for the "abysmally low" priority setting :-))
msg91326 - (view) Author: Shawn (swalker) Date: 2009-08-05 18:51
As I mentioned, there's also noticeable performance penalties on recent
SPARC systems, such as Niagra T1000, T2000, etc.  The degradation is
just less obvious (a 10-15 second penalty instead of a 20 or 30 second
penalty).  While x86 enjoys no penalty at all (in my testing so far).

Here's an example of the data structure:

    {    
      "packages":{
        "package-name-1":{
          "publisher":"publisher-name-1",
          "versions":[
            [    
              "0.5.11,5.11-0.86:20080422T230436Z",
              {    
                "depend":{
                  "require":[
                    {    
                      "fmri":"foo"
                    },   
                    {    
                      "fmri":"bar"
                    }    
                  ],   
                  "optional":[
                    {    
                      "fmri":"baz"
                    },   
                    {    
                      "fmri":"quux"
                    }    
                  ]    
                }    
              }    
            ],   
          ]    
        }    
      }    
    }    

Now imagine that there are 45,000 package-name-x entries in the
structure above, and that basically replicates what I'm writing.

If I turn the above structure into a list of lists instead, the penalty
is significantly reduced (halved at least).  If I flatten the stack even
farther, the penalty is essentially gone.  The greater the depth of the
data structure, the greater the penalty.

As for priority, I wouldn't call this "end of the world", but it does
create some unfortunate surprises for developers that rely on the json
module.  Given the long service lifetimes of SPARC systems (due to cost
:)), I suspect this would be of benefit for a long time to come.
msg91328 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-08-05 19:35
Are you sure that recursion depth is the issue?  Have you tried the same
number and kind of objects listed serially (unnested)?  This would help
rule-out memory allocation issues and would instead confirm that it has
something to do with the C stack.

It would be helpful if you uploaded your test data strings and timing
suite.  Are you able to run a C profile so we can tell where the hotspot
is?  Can you run PyYAML over the same data to see if it is similarly
afflicted (yaml is a superset of json).

Also, try timing a repr() serialization of the same data,
x=repr(rootobj).  The repr code also uses recursion and it has to build
a big string in memory.  It has to visit every node, so it will reveal
whether memory cache misses are the culprit.  

Try your timings with GC turned-off so that we can rule that out.

Do you have some option to compile with an alternate memory allocator
(such as dlmalloc).  A crummy memory allocator may be the issue since
serialization entails creating many small strings, then joining and
resizing them.

Also, try serializing to /dev/null so that we can exclude fileio issues
(buffering and whatnot).

Sorry for all the requests, but there are many possible culprits and I
think it unlikely that recursion is the cause (much of the code in
Python works recursively -- everything from repr to gc -- so if that
were the problem, everything would run slower, not just json serialization).
msg91377 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-08-06 18:12
As Raymond said, and besides, when you talk about "penalty", please
explain what the baseline is. Otherwise it's a bit hard to follow.

(and I stress again that SPARC is a nich platform, even Niagara :-);
moreover, Niagara is throughput-oriented rather than latency-oriented,
so poor single-threaded performance shouldn't be much of a surprise)
msg91379 - (view) Author: Shawn (swalker) Date: 2009-08-06 18:46
First, I want to apologise for not providing more detail initially. 
Notably, one thing you may want to be aware of is that I'm using python
2.4.4 with the latest version of simplejson.  So my timings and
assumptions here are based on the fact that simplejson was adopted as
the 'json' module in python, and I filed the bug here as it appeared
that is where bugs are being tracked for the json module.

To answer your questions though, no, I can't say with certainty that
recursion depth is the issue.  That's just a theory proposed by a
developer intimately familiar with SPARC architecture, who said register
windows on SPARC tend to cause recursive call structures to execute
poorly.  It also seemed to play itself out empirically throughout
testing I performed where any reduction in the depth of the structure
would shave seconds off the write times on the SPARC systems I tested.

I'm also willing to try many of the other things you listed, but I will
have to get back to you on that as I have a project coming due soon.

With that said, I can provide sample data soon, and will do so.  I'll
attach the resulting gzip'd JSON file to make it easy to read and dump.

I would also note that:

* I have tried serialising using cStringIO, which made no significant
difference in performance.

* I have tried different memory allocators, which only seemed to make
things slower, or made little difference.

* Writing roughly the same amount of data (in terms of megabytes), but
in a flatter structure, also increased the performance of the serializer.

* In my testing, it seemed dict serialisation in particular was
problematic from a performance standpoint.

* If I recall correctly from the profile I did, iterencode_dict was
where most of the time was eaten, but I can redo the profile for a more
accurate analysis.

As for Antoine's comments:

I'd like to believe Python is very useful software, and any platform it
it runs on means that the respective market capitalization of the
platform is irrelevant; better performing Python is always good.
msg95523 - (view) Author: Valentin Kuznetsov (vkuznet) Date: 2009-11-20 02:22
Hi,
I just found this bug and would like to add my experience with 
performance of large JSON docs. I have a few JSON docs about 180MB in 
size which I read from data-services. I use python2.6, run on Linux, 64-
bit node w/ 16GB of RAM and 8 core CPU, Intel Xeon 2.33GHz each. I used 
both json and cjson modules to parse my documents. My observation that 
the amount of RAM used to parse such docs is about 2GB, which is a way 
too much. The total time spent about 30 seconds (using cjson). The 
content of my docs are very mixed, lists, strings, other dicts. I can 
provide them if it will be required, but it's 200MB :)

For comparison, I got the same data in XML and using 
cElementTree.iterparse I stay w/ 300MB RAM usage per doc, which is 
really reasonable to me.

I can provide some benchmarks and perform such tests if it will be 
required.
msg95524 - (view) Author: Bob Ippolito (bob.ippolito) * (Python committer) Date: 2009-11-20 02:35
Did you try the trunk of simplejson? It doesn't work quite the same way as 
the current json module in Python 2.6+.

Without the data or a tool to produce data that causes the problem, there 
isn't much I can do to help.
msg95915 - (view) Author: Valentin Kuznetsov (vkuznet) Date: 2009-12-02 18:47
Hi,
I'm sorry for delay, I was busy. Here is a test data file:
http://www.lns.cornell.edu/~vk/files/mangled.json

Its size is 150 MB, 50MB less of original, due to scrambled values I was 
forced to do.

The tests with stock json module in python 2.6.2 is 2GB
source = open('mangled.json', 'r')
data = json.load(source)

Using simplejson 2.0.9 from PyPi I saw the same performance, please note 
_speedups.so C module was compiled.

Using cjson module, I observed 180MB of RAM utilization
source = open('mangled.json', 'r')
data = cjson.encode(source.read())

cjson is about 10 times faster!

I re-factor code which deals with XML version of the same data and I was 
able to process it using cElementTree only using 20MB (!) of RAM.
msg95919 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-02 22:19
> Using cjson module, I observed 180MB of RAM utilization
> source = open('mangled.json', 'r')
> data = cjson.encode(source.read())
> 
> cjson is about 10 times faster!

This is simply wrong. You should be using cjson.decode(), not
cjson.encode().
If you do so, you will see that cjson tajes as much as memory as
simplejson and is actually a bit slower.

Looking at your json file, I would have a couple of suggestions:
- don't quote integers and floats, so that they are decoded as Python
ints and floats rather than strings
- if the same structure is used a large number of times, don't use
objects, use lists instead
msg95920 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-02 22:32
That said, it is possible to further improve json by reducing the number
of memory allocations and temporary copies. Here is an experimental
(meaning: not polished) patch which gains 40% in decoding speed in your
example (9 seconds versus 15).

We could also add an option to intern object keys when decoding (which
wins 400MB in your example); or, alternatively, have an internal "memo"
remembering already seen keys and avoiding duplicates.
msg95921 - (view) Author: Valentin Kuznetsov (vkuznet) Date: 2009-12-02 22:53
Oops, that's explain why I saw such small memory usage with cjson. I 
constructed tests on a fly.

Regarding the data structure. Unfortunately it's out of my hands. The 
data comes from data-service. So, I can't do much and can only report to 
developers. 

I'll try your patch tomorrow. Obviously it's a huge gain, both in memory 
footprint and CPU usage.

Thanks.
Valentin.
msg95977 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-05 01:00
Here is a new patch with an internal memo dict to reuse equal keys, and
some tests.
msg96053 - (view) Author: Valentin Kuznetsov (vkuznet) Date: 2009-12-07 15:02
Antoine,
indeed, both patches improved time and memory foot print. The latest 
patch shows only 1.1GB RAM usage and is very fast. What's worry me 
though, that memory is not released back to the system. Is this is the 
case? I just added time.sleep after json.load and saw that once decoding 
is done, the resident size still remain the same.
msg96055 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-07 15:25
> Antoine,
> indeed, both patches improved time and memory foot print. The latest 
> patch shows only 1.1GB RAM usage and is very fast. What's worry me 
> though, that memory is not released back to the system. Is this is the 
> case? I just added time.sleep after json.load and saw that once decoding 
> is done, the resident size still remain the same.

Interesting. Does it release memory without the patch?
msg96056 - (view) Author: Valentin Kuznetsov (vkuznet) Date: 2009-12-07 15:38
Nope, all three json's implementation do not release the memory. I used 
your patched one, the one shipped with 2.6 and cjson. The one which comes 
with 2.6, reach 2GB, then release 200MB and stays with 1.8GB during 
sleep. The cjson reaches 1.5GB mark and stays there. But all three 
release another 100-200MB just before the exit (one top cycle before 
process disappear). I used sleep of 20 seconds, so I'm pretty sure memory 
was not released during that time, since I watched the process with idle 
CPU.
msg96057 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-07 15:43
> Nope, all three json's implementation do not release the memory. I used 
> your patched one, the one shipped with 2.6 and cjson. The one which comes 
> with 2.6, reach 2GB, then release 200MB and stays with 1.8GB during 
> sleep. The cjson reaches 1.5GB mark and stays there. But all three 
> release another 100-200MB just before the exit (one top cycle before 
> process disappear). I used sleep of 20 seconds, so I'm pretty sure memory 
> was not released during that time, since I watched the process with idle 
> CPU.

Do you destroy the decoded data, though? If you keep it in memory
there's no chance that a lot of memory will be released.
msg96059 - (view) Author: Valentin Kuznetsov (vkuznet) Date: 2009-12-07 15:51
I made data local, but adding del shows the same behavior.
This is the test

def test():
    source = open('mangled.json', 'r')
    data = json.load(source)
    source.close()
    del data
test()
time.sleep(20)
msg96068 - (view) Author: Shawn (swalker) Date: 2009-12-07 20:40
The attached patch doubles write times for my particular case when
applied to simplejson trunk using python 2.6.2.  Not good.
msg96070 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-07 20:43
> The attached patch doubles write times for my particular case when
> applied to simplejson trunk using python 2.6.2.  Not good.

What do you mean by "write times"? The patch only affects decoding.
msg96071 - (view) Author: Shawn (swalker) Date: 2009-12-07 20:55
You are right, an environment anomaly let me to falsely believe that
this had somehow affected encoding performance.

I had repeated the test many times with and without the patch using
simplejson trunk and wrongly concluded that the patch was to blame.

After correcting the environment, write performance returned to normal.

This patch seems to perform roughly the same for my decode cases, but
uses about 10-20MB less memory.  My needs are far less than that of the
other poster.

However, this bug is about the serializer (encoder).  So perhaps the
decode performance patch should be a separate bug?
msg96073 - (view) Author: Shawn (swalker) Date: 2009-12-07 21:01
I've attached a sample JSON file that is much slower to write out on
some systems as described in the initial comment.

If you were to restructure the contents of this file into more of a tree
structure instead of the flat array structure it uses now, you will
notice that as the depth increases, serializer performance decreases
significantly.
msg96083 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-07 23:09
> However, this bug is about the serializer (encoder).  So perhaps the
> decode performance patch should be a separate bug?

You're right, I've filed a separate bug for it: issue7451.
msg96088 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-07 23:41
Your example takes 0.5s to dump here.
msg122956 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-11-30 23:37
Antoine, what do you want to do with the one?  Without a good test case the OP's original issue is undiagnosable.
msg122958 - (view) Author: Shawn (swalker) Date: 2010-11-30 23:49
I specifically mentioned *SPARC* as the performance problem area, but the reply about "0.5s to dump" fails to mention on what platform they tested

My problem is not "undiagnosable".  I'll be happy to provide you with even more data files.  But I believe that there is a problem here on some architectures for reasons other than those of simple differences in single-threaded performance that could be accounted to processor architecture.

As an example of something that makes a noticeable difference on SPARC systems I've checked:

178     # could accelerate with writelines in some versions of Python, at
179     # a debuggability cost
180     for chunk in iterable:
181         fp.write(chunk)

Changing that to use writelines() is a significant win.  You go from over a million calls to write (for small bits as simple as a single character such as '{') to one single call to writelines() with an iterable.

The recursive call structure of the json code also adds significant overhead on some architectures.

What's "undiagnosable" here is the response to the issue I reported -- it provides no information about the platform that was tested or how the testing was done.

My testing was done by reading the attached file using json, and then timing the results of writing it back out (to /tmp mind you, which is memory-backed on my OS platform, so no disk I/O was involved, I've also checked writing to a cStringIO object).
msg122972 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-12-01 02:47
Raymond, I'll follow up in private with Shawn. All the recent performance improvements done on JSON (in 3.2) mean the issue can be closed IMO.
History
Date User Action Args
2010-12-01 02:47:13pitrousetstatus: open -> closed
resolution: out of date
messages: + msg122972
2010-11-30 23:49:12swalkersetmessages: + msg122958
2010-11-30 23:37:25rhettingersetassignee: rhettinger -> pitrou
messages: + msg122956
versions: + Python 3.1
2009-12-07 23:41:48pitrousetmessages: + msg96088
2009-12-07 23:09:27pitrousetfiles: - json-opts2.patch
2009-12-07 23:09:17pitrousetmessages: + msg96083
stage: patch review -> needs patch
2009-12-07 21:01:03swalkersetfiles: + catalog.dependency.C.gz

messages: + msg96073
2009-12-07 20:55:22swalkersetmessages: + msg96071
2009-12-07 20:43:08pitrousetmessages: + msg96070
2009-12-07 20:40:46swalkersetmessages: + msg96068
2009-12-07 15:51:03vkuznetsetmessages: + msg96059
2009-12-07 15:43:10pitrousetmessages: + msg96057
2009-12-07 15:38:32vkuznetsetmessages: + msg96056
2009-12-07 15:25:41pitrousetmessages: + msg96055
2009-12-07 15:02:51vkuznetsetmessages: + msg96053
2009-12-05 01:03:30pitrousetfiles: - json-opts.patch
2009-12-05 01:00:50pitrousetfiles: + json-opts2.patch

stage: patch review
messages: + msg95977
versions: + Python 3.2
2009-12-02 22:53:17vkuznetsetmessages: + msg95921
2009-12-02 22:32:21pitrousetfiles: + json-opts.patch
keywords: + patch
messages: + msg95920
2009-12-02 22:19:15pitrousetmessages: + msg95919
2009-12-02 18:47:06vkuznetsetmessages: + msg95915
2009-11-20 02:35:35bob.ippolitosetmessages: + msg95524
2009-11-20 02:22:59vkuznetsetnosy: + vkuznet
messages: + msg95523
2009-08-06 18:46:49swalkersetmessages: + msg91379
versions: - Python 3.2
2009-08-06 18:12:59pitrousetmessages: + msg91377
2009-08-05 19:35:44rhettingersetassignee: bob.ippolito -> rhettinger

messages: + msg91328
nosy: + rhettinger
2009-08-05 18:51:46swalkersetmessages: + msg91326
2009-08-05 10:43:20pitrousetversions: + Python 2.7, Python 3.2, - Python 2.6
nosy: + bob.ippolito, pitrou

messages: + msg91308

assignee: bob.ippolito
2009-07-28 21:15:24brett.cannonsetpriority: low
2009-07-28 20:19:20swalkercreate