Issue 6594: json C serializer performance tied to structure depth on some systems

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/50843

classification

Title:	json C serializer performance tied to structure depth on some systems
Type:	performance	Stage:	needs patch
Components:	Library (Lib)	Versions:	Python 3.1, Python 3.2, Python 2.7

process

Status:	closed	Resolution:	out of date
Dependencies:		Superseder:
Assigned To:	pitrou	Nosy List:	bob.ippolito, pitrou, rhettinger, swalker, vkuznet
Priority:	low	Keywords:	patch

Created on 2009-07-28 20:19 by swalker, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
catalog.dependency.C.gz	swalker, 2009-12-07 21:01	Sample Data File

Messages (27)
msg91015 - (view)	Author: Shawn (swalker)	Date: 2009-07-28 20:19
The json serializer's performance (when using the C speedups) appears to be tied to the depth of the structure being serialized on some systems. In particular, dict structure that are more than a few levels deep, especially when they content mixed values (lists, strings, and other dicts) causes severe serialization penalties (in the neighborhood of an extra 20-30 seconds) on some systems. On SPARC systems, this is very likely because of the recursive call structure that the C speedups serializer uses which doesn't work well with SPARC due to register windows. On x86 systems, recursive call structures are cheap, so this doesn't appear to be an issue there. SPARC systems with higher amounts of memory bandwidth don't suffer as badly from this (such as Niagra T1000, T2000, etc. systems), but older UltraSparc systems are especially prone to performance issues.
msg91308 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-08-05 10:43
I'm not sure there's anything we should do about this. Some architectures are unreasonably slow at some things, and the old SPARC implementations are a niche nowadays. I suppose you may witness the same kinds of slowdowns if you use cPickle rather than json. (I'm looking for the "abysmally low" priority setting :-))
msg91326 - (view)	Author: Shawn (swalker)	Date: 2009-08-05 18:51
As I mentioned, there's also noticeable performance penalties on recent SPARC systems, such as Niagra T1000, T2000, etc. The degradation is just less obvious (a 10-15 second penalty instead of a 20 or 30 second penalty). While x86 enjoys no penalty at all (in my testing so far). Here's an example of the data structure: { "packages":{ "package-name-1":{ "publisher":"publisher-name-1", "versions":[ [ "0.5.11,5.11-0.86:20080422T230436Z", { "depend":{ "require":[ { "fmri":"foo" }, { "fmri":"bar" } ], "optional":[ { "fmri":"baz" }, { "fmri":"quux" } ] } } ], ] } } } Now imagine that there are 45,000 package-name-x entries in the structure above, and that basically replicates what I'm writing. If I turn the above structure into a list of lists instead, the penalty is significantly reduced (halved at least). If I flatten the stack even farther, the penalty is essentially gone. The greater the depth of the data structure, the greater the penalty. As for priority, I wouldn't call this "end of the world", but it does create some unfortunate surprises for developers that rely on the json module. Given the long service lifetimes of SPARC systems (due to cost :)), I suspect this would be of benefit for a long time to come.
msg91328 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-08-05 19:35
Are you sure that recursion depth is the issue? Have you tried the same number and kind of objects listed serially (unnested)? This would help rule-out memory allocation issues and would instead confirm that it has something to do with the C stack. It would be helpful if you uploaded your test data strings and timing suite. Are you able to run a C profile so we can tell where the hotspot is? Can you run PyYAML over the same data to see if it is similarly afflicted (yaml is a superset of json). Also, try timing a repr() serialization of the same data, x=repr(rootobj). The repr code also uses recursion and it has to build a big string in memory. It has to visit every node, so it will reveal whether memory cache misses are the culprit. Try your timings with GC turned-off so that we can rule that out. Do you have some option to compile with an alternate memory allocator (such as dlmalloc). A crummy memory allocator may be the issue since serialization entails creating many small strings, then joining and resizing them. Also, try serializing to /dev/null so that we can exclude fileio issues (buffering and whatnot). Sorry for all the requests, but there are many possible culprits and I think it unlikely that recursion is the cause (much of the code in Python works recursively -- everything from repr to gc -- so if that were the problem, everything would run slower, not just json serialization).
msg91377 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-08-06 18:12
As Raymond said, and besides, when you talk about "penalty", please explain what the baseline is. Otherwise it's a bit hard to follow. (and I stress again that SPARC is a nich platform, even Niagara :-); moreover, Niagara is throughput-oriented rather than latency-oriented, so poor single-threaded performance shouldn't be much of a surprise)
msg91379 - (view)	Author: Shawn (swalker)	Date: 2009-08-06 18:46
First, I want to apologise for not providing more detail initially. Notably, one thing you may want to be aware of is that I'm using python 2.4.4 with the latest version of simplejson. So my timings and assumptions here are based on the fact that simplejson was adopted as the 'json' module in python, and I filed the bug here as it appeared that is where bugs are being tracked for the json module. To answer your questions though, no, I can't say with certainty that recursion depth is the issue. That's just a theory proposed by a developer intimately familiar with SPARC architecture, who said register windows on SPARC tend to cause recursive call structures to execute poorly. It also seemed to play itself out empirically throughout testing I performed where any reduction in the depth of the structure would shave seconds off the write times on the SPARC systems I tested. I'm also willing to try many of the other things you listed, but I will have to get back to you on that as I have a project coming due soon. With that said, I can provide sample data soon, and will do so. I'll attach the resulting gzip'd JSON file to make it easy to read and dump. I would also note that: * I have tried serialising using cStringIO, which made no significant difference in performance. * I have tried different memory allocators, which only seemed to make things slower, or made little difference. * Writing roughly the same amount of data (in terms of megabytes), but in a flatter structure, also increased the performance of the serializer. * In my testing, it seemed dict serialisation in particular was problematic from a performance standpoint. * If I recall correctly from the profile I did, iterencode_dict was where most of the time was eaten, but I can redo the profile for a more accurate analysis. As for Antoine's comments: I'd like to believe Python is very useful software, and any platform it it runs on means that the respective market capitalization of the platform is irrelevant; better performing Python is always good.
msg95523 - (view)	Author: Valentin Kuznetsov (vkuznet)	Date: 2009-11-20 02:22
Hi, I just found this bug and would like to add my experience with performance of large JSON docs. I have a few JSON docs about 180MB in size which I read from data-services. I use python2.6, run on Linux, 64- bit node w/ 16GB of RAM and 8 core CPU, Intel Xeon 2.33GHz each. I used both json and cjson modules to parse my documents. My observation that the amount of RAM used to parse such docs is about 2GB, which is a way too much. The total time spent about 30 seconds (using cjson). The content of my docs are very mixed, lists, strings, other dicts. I can provide them if it will be required, but it's 200MB :) For comparison, I got the same data in XML and using cElementTree.iterparse I stay w/ 300MB RAM usage per doc, which is really reasonable to me. I can provide some benchmarks and perform such tests if it will be required.
msg95524 - (view)	Author: Bob Ippolito (bob.ippolito) *	Date: 2009-11-20 02:35
Did you try the trunk of simplejson? It doesn't work quite the same way as the current json module in Python 2.6+. Without the data or a tool to produce data that causes the problem, there isn't much I can do to help.
msg95915 - (view)	Author: Valentin Kuznetsov (vkuznet)	Date: 2009-12-02 18:47
Hi, I'm sorry for delay, I was busy. Here is a test data file: http://www.lns.cornell.edu/~vk/files/mangled.json Its size is 150 MB, 50MB less of original, due to scrambled values I was forced to do. The tests with stock json module in python 2.6.2 is 2GB source = open('mangled.json', 'r') data = json.load(source) Using simplejson 2.0.9 from PyPi I saw the same performance, please note _speedups.so C module was compiled. Using cjson module, I observed 180MB of RAM utilization source = open('mangled.json', 'r') data = cjson.encode(source.read()) cjson is about 10 times faster! I re-factor code which deals with XML version of the same data and I was able to process it using cElementTree only using 20MB (!) of RAM.
msg95919 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-12-02 22:19
> Using cjson module, I observed 180MB of RAM utilization > source = open('mangled.json', 'r') > data = cjson.encode(source.read()) > > cjson is about 10 times faster! This is simply wrong. You should be using cjson.decode(), not cjson.encode(). If you do so, you will see that cjson tajes as much as memory as simplejson and is actually a bit slower. Looking at your json file, I would have a couple of suggestions: - don't quote integers and floats, so that they are decoded as Python ints and floats rather than strings - if the same structure is used a large number of times, don't use objects, use lists instead
msg95920 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-12-02 22:32
That said, it is possible to further improve json by reducing the number of memory allocations and temporary copies. Here is an experimental (meaning: not polished) patch which gains 40% in decoding speed in your example (9 seconds versus 15). We could also add an option to intern object keys when decoding (which wins 400MB in your example); or, alternatively, have an internal "memo" remembering already seen keys and avoiding duplicates.
msg95921 - (view)	Author: Valentin Kuznetsov (vkuznet)	Date: 2009-12-02 22:53
Oops, that's explain why I saw such small memory usage with cjson. I constructed tests on a fly. Regarding the data structure. Unfortunately it's out of my hands. The data comes from data-service. So, I can't do much and can only report to developers. I'll try your patch tomorrow. Obviously it's a huge gain, both in memory footprint and CPU usage. Thanks. Valentin.
msg95977 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-12-05 01:00
Here is a new patch with an internal memo dict to reuse equal keys, and some tests.
msg96053 - (view)	Author: Valentin Kuznetsov (vkuznet)	Date: 2009-12-07 15:02
Antoine, indeed, both patches improved time and memory foot print. The latest patch shows only 1.1GB RAM usage and is very fast. What's worry me though, that memory is not released back to the system. Is this is the case? I just added time.sleep after json.load and saw that once decoding is done, the resident size still remain the same.
msg96055 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-12-07 15:25
> Antoine, > indeed, both patches improved time and memory foot print. The latest > patch shows only 1.1GB RAM usage and is very fast. What's worry me > though, that memory is not released back to the system. Is this is the > case? I just added time.sleep after json.load and saw that once decoding > is done, the resident size still remain the same. Interesting. Does it release memory without the patch?
msg96056 - (view)	Author: Valentin Kuznetsov (vkuznet)	Date: 2009-12-07 15:38
Nope, all three json's implementation do not release the memory. I used your patched one, the one shipped with 2.6 and cjson. The one which comes with 2.6, reach 2GB, then release 200MB and stays with 1.8GB during sleep. The cjson reaches 1.5GB mark and stays there. But all three release another 100-200MB just before the exit (one top cycle before process disappear). I used sleep of 20 seconds, so I'm pretty sure memory was not released during that time, since I watched the process with idle CPU.
msg96057 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-12-07 15:43
> Nope, all three json's implementation do not release the memory. I used > your patched one, the one shipped with 2.6 and cjson. The one which comes > with 2.6, reach 2GB, then release 200MB and stays with 1.8GB during > sleep. The cjson reaches 1.5GB mark and stays there. But all three > release another 100-200MB just before the exit (one top cycle before > process disappear). I used sleep of 20 seconds, so I'm pretty sure memory > was not released during that time, since I watched the process with idle > CPU. Do you destroy the decoded data, though? If you keep it in memory there's no chance that a lot of memory will be released.
msg96059 - (view)	Author: Valentin Kuznetsov (vkuznet)	Date: 2009-12-07 15:51
I made data local, but adding del shows the same behavior. This is the test def test(): source = open('mangled.json', 'r') data = json.load(source) source.close() del data test() time.sleep(20)
msg96068 - (view)	Author: Shawn (swalker)	Date: 2009-12-07 20:40
The attached patch doubles write times for my particular case when applied to simplejson trunk using python 2.6.2. Not good.
msg96070 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-12-07 20:43
> The attached patch doubles write times for my particular case when > applied to simplejson trunk using python 2.6.2. Not good. What do you mean by "write times"? The patch only affects decoding.
msg96071 - (view)	Author: Shawn (swalker)	Date: 2009-12-07 20:55
You are right, an environment anomaly let me to falsely believe that this had somehow affected encoding performance. I had repeated the test many times with and without the patch using simplejson trunk and wrongly concluded that the patch was to blame. After correcting the environment, write performance returned to normal. This patch seems to perform roughly the same for my decode cases, but uses about 10-20MB less memory. My needs are far less than that of the other poster. However, this bug is about the serializer (encoder). So perhaps the decode performance patch should be a separate bug?
msg96073 - (view)	Author: Shawn (swalker)	Date: 2009-12-07 21:01
I've attached a sample JSON file that is much slower to write out on some systems as described in the initial comment. If you were to restructure the contents of this file into more of a tree structure instead of the flat array structure it uses now, you will notice that as the depth increases, serializer performance decreases significantly.
msg96083 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-12-07 23:09
> However, this bug is about the serializer (encoder). So perhaps the > decode performance patch should be a separate bug? You're right, I've filed a separate bug for it: issue7451.
msg96088 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-12-07 23:41
Your example takes 0.5s to dump here.
msg122956 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-11-30 23:37
Antoine, what do you want to do with the one? Without a good test case the OP's original issue is undiagnosable.
msg122958 - (view)	Author: Shawn (swalker)	Date: 2010-11-30 23:49
I specifically mentioned SPARC as the performance problem area, but the reply about "0.5s to dump" fails to mention on what platform they tested My problem is not "undiagnosable". I'll be happy to provide you with even more data files. But I believe that there is a problem here on some architectures for reasons other than those of simple differences in single-threaded performance that could be accounted to processor architecture. As an example of something that makes a noticeable difference on SPARC systems I've checked: 178 # could accelerate with writelines in some versions of Python, at 179 # a debuggability cost 180 for chunk in iterable: 181 fp.write(chunk) Changing that to use writelines() is a significant win. You go from over a million calls to write (for small bits as simple as a single character such as '{') to one single call to writelines() with an iterable. The recursive call structure of the json code also adds significant overhead on some architectures. What's "undiagnosable" here is the response to the issue I reported -- it provides no information about the platform that was tested or how the testing was done. My testing was done by reading the attached file using json, and then timing the results of writing it back out (to /tmp mind you, which is memory-backed on my OS platform, so no disk I/O was involved, I've also checked writing to a cStringIO object).
msg122972 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-12-01 02:47
Raymond, I'll follow up in private with Shawn. All the recent performance improvements done on JSON (in 3.2) mean the issue can be closed IMO.

History
Date	User	Action	Args
2022-04-11 14:56:51	admin	set	github: 50843
2010-12-01 02:47:13	pitrou	set	status: open -> closed resolution: out of date messages: + msg122972
2010-11-30 23:49:12	swalker	set	messages: + msg122958
2010-11-30 23:37:25	rhettinger	set	assignee: rhettinger -> pitrou messages: + msg122956 versions: + Python 3.1
2009-12-07 23:41:48	pitrou	set	messages: + msg96088
2009-12-07 23:09:27	pitrou	set	files: - json-opts2.patch
2009-12-07 23:09:17	pitrou	set	messages: + msg96083 stage: patch review -> needs patch
2009-12-07 21:01:03	swalker	set	files: + catalog.dependency.C.gz messages: + msg96073
2009-12-07 20:55:22	swalker	set	messages: + msg96071
2009-12-07 20:43:08	pitrou	set	messages: + msg96070
2009-12-07 20:40:46	swalker	set	messages: + msg96068
2009-12-07 15:51:03	vkuznet	set	messages: + msg96059
2009-12-07 15:43:10	pitrou	set	messages: + msg96057
2009-12-07 15:38:32	vkuznet	set	messages: + msg96056
2009-12-07 15:25:41	pitrou	set	messages: + msg96055
2009-12-07 15:02:51	vkuznet	set	messages: + msg96053
2009-12-05 01:03:30	pitrou	set	files: - json-opts.patch
2009-12-05 01:00:50	pitrou	set	files: + json-opts2.patch stage: patch review messages: + msg95977 versions: + Python 3.2
2009-12-02 22:53:17	vkuznet	set	messages: + msg95921
2009-12-02 22:32:21	pitrou	set	files: + json-opts.patch keywords: + patch messages: + msg95920
2009-12-02 22:19:15	pitrou	set	messages: + msg95919
2009-12-02 18:47:06	vkuznet	set	messages: + msg95915
2009-11-20 02:35:35	bob.ippolito	set	messages: + msg95524
2009-11-20 02:22:59	vkuznet	set	nosy: + vkuznet messages: + msg95523
2009-08-06 18:46:49	swalker	set	messages: + msg91379 versions: - Python 3.2
2009-08-06 18:12:59	pitrou	set	messages: + msg91377
2009-08-05 19:35:44	rhettinger	set	assignee: bob.ippolito -> rhettinger messages: + msg91328 nosy: + rhettinger
2009-08-05 18:51:46	swalker	set	messages: + msg91326
2009-08-05 10:43:20	pitrou	set	versions: + Python 2.7, Python 3.2, - Python 2.6 nosy: + bob.ippolito, pitrou messages: + msg91308 assignee: bob.ippolito
2009-07-28 21:15:24	brett.cannon	set	priority: low
2009-07-28 20:19:20	swalker	create