classification
Title: JSON-serializing a large container takes too much memory
Type: resource usage
Stage: resolved
Components:
Versions: Python 3.3
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, pitrou, poq, python-dev, rhettinger
Priority: normal
Keywords: patch

Created on 2011-08-18 15:23 by pitrou, last changed 2011-08-19 19:08 by poq. This issue is now closed.

Files
File name Uploaded Description
jsonacc.patch pitrou, 2011-08-18 18:14
Messages (8)
msg142338 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-08-18 15:23
On an 8GB RAM box (more than 6GB free), serializing many small objects can eat all memory, while the end result would take around 600MB on a UCS2 build:

$ LANG=C time opt/python -c "import json; l = [1] * (100*1024*1024); encoded = json.dumps(l)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/antoine/cpython/opt/Lib/json/__init__.py", line 224, in dumps
    return _default_encoder.encode(obj)
  File "/home/antoine/cpython/opt/Lib/json/encoder.py", line 188, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/home/antoine/cpython/opt/Lib/json/encoder.py", line 246, in iterencode
    return _iterencode(o, 0)
MemoryError
Command exited with non-zero status 1
11.25user 2.43system 0:13.72elapsed 99%CPU (0avgtext+0avgdata 27820320maxresident)k
2920inputs+0outputs (12major+1261388minor)pagefaults 0swaps


I suppose the encoder internally builds a large list of very small unicode objects, and only joins them at the end. We could probably join them in chunks to avoid this behaviour.
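
To illustrate the "join in chunks" idea, here is a minimal pure-Python sketch. It is not taken from jsonacc.patch and may differ from what the patch does; `ChunkedAccumulator`, `encode_int_list` and the `limit` parameter are hypothetical names used only for this example. The point is that the number of live small strings stays bounded by `limit`, because they are merged into a few larger strings as encoding proceeds.

```python
import json


class ChunkedAccumulator:
    """Hypothetical helper: collects small strings and merges them periodically."""

    def __init__(self, limit=100000):
        self._limit = limit   # max number of small pieces kept before merging
        self._small = []      # recently appended small strings
        self._large = []      # larger strings produced by earlier merges

    def append(self, piece):
        self._small.append(piece)
        if len(self._small) >= self._limit:
            self._flush()

    def _flush(self):
        # Replace many tiny strings with a single larger one.
        if self._small:
            self._large.append("".join(self._small))
            self._small.clear()

    def finish(self):
        self._flush()
        return "".join(self._large)


def encode_int_list(values, limit=100000):
    """Encode an iterable of ints as a JSON array (illustration only)."""
    acc = ChunkedAccumulator(limit)
    acc.append("[")
    first = True
    for v in values:
        if not first:
            acc.append(", ")
        acc.append(str(v))
        first = False
    acc.append("]")
    return acc.finish()


# The result matches what json.dumps produces for the same list.
print(encode_int_list([1, 2, 3], limit=2) == json.dumps([1, 2, 3]))
```

A smaller `limit` trades a little extra joining work for a lower peak count of small string objects.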
msg142360 - Author: (poq) Date: 2011-08-18 16:35
I think this is because dumps() uses the C encoder. Making the C encoder incremental (i.e. iterator-based) like the Python encoder would solve this.

I actually looked into doing this for issue #12134, but it didn't seem so simple; since C has no yield, I think the iterator would need to maintain its own stack to keep track of where it is in the object tree it's encoding...

If there is interest though, I may be able to write a patch when I have some time off again...
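
As a rough illustration of what "maintain its own stack" could mean, the sketch below writes an iterator without `yield`, keeping an explicit stack of frames. It is written in Python for readability and only handles nested lists of ints; `ListEncoderIterator` is a hypothetical name, and a real C implementation would have to cover dicts, strings, floats and error handling as well.

```python
import json


class ListEncoderIterator:
    """Hypothetical iterator producing JSON chunks for nested lists of ints.

    No generators are used: the position in the object tree is tracked by an
    explicit stack, roughly as a C iterator would need to do.
    """

    def __init__(self, obj):
        self._root = obj
        self._started = False
        self._stack = []  # each frame is [iterator over a list, is_first_item]

    def __iter__(self):
        return self

    def _open(self, value):
        # Start emitting a value; lists push a new frame onto the stack.
        if isinstance(value, list):
            self._stack.append([iter(value), True])
            return "["
        return str(value)

    def __next__(self):
        if not self._started:
            self._started = True
            return self._open(self._root)
        if not self._stack:
            raise StopIteration
        frame = self._stack[-1]
        try:
            item = next(frame[0])
        except StopIteration:
            # Current list is exhausted: close it and resume its parent.
            self._stack.pop()
            return "]"
        prefix = "" if frame[1] else ", "
        frame[1] = False
        return prefix + self._open(item)


data = [1, [2, 3], []]
print("".join(ListEncoderIterator(data)) == json.dumps(data))
```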
msg142390 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-08-18 18:14
This patch does the trick.
msg142455 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-08-19 14:09
> I actually looked into doing this for issue #12134, but it didn't seem 
> so simple; since C has no yield, I think the iterator would need to
> maintain its own stack to keep track of where it is in the object tree 
> it's encoding...

The encoder doesn't have to be turned into an iterator. It would just need to call a given callable (fp.write) at regular intervals and that would be enough to C-accelerate dump().

My patch actually provides a good foundation for this.
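
The callback approach described above could look roughly like the following sketch. It is an assumption-laden illustration in Python rather than the C code being discussed: `dump_int_list` and `flush_every` are hypothetical names, and only flat sequences of ints are handled. The encoder buffers small pieces and hands them to a callable (for example `fp.write`) at regular intervals, so it never needs to become an iterator.

```python
import io
import json


def dump_int_list(values, write, flush_every=1000):
    """Encode ints as a JSON array, calling write() every flush_every pieces."""
    buf = []

    def emit(piece):
        buf.append(piece)
        if len(buf) >= flush_every:
            write("".join(buf))  # hand a chunk to the callback
            buf.clear()

    emit("[")
    for i, v in enumerate(values):
        if i:
            emit(", ")
        emit(str(v))
    emit("]")
    if buf:
        write("".join(buf))      # flush whatever is left


fp = io.StringIO()
dump_int_list(range(10), fp.write, flush_every=4)
print(fp.getvalue() == json.dumps(list(range(10))))
```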
msg142471 - Author: Roundup Robot (python-dev) Date: 2011-08-19 16:05
New changeset 47176e8d7060 by Antoine Pitrou in branch 'default':
Issue #12778: Reduce memory consumption when JSON-encoding a large container of many small objects.
http://hg.python.org/cpython/rev/47176e8d7060
msg142484 - Author: (poq) Date: 2011-08-19 18:14
> It would just need to call a given callable (fp.write) at regular intervals and that would be enough to C-accelerate dump().

True, but that would just special case dump(), just like dumps() is special-cased now. Ideally JSONEncoder.iterencode() would be accelerated, so you wouldn't need any special cases. Or deprecate iterencode() and replace it with a callback interface...
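
For readers unfamiliar with the API under discussion, this is a small usage sketch of `JSONEncoder.iterencode()` writing chunks to a file-like object as they are produced; `dump_incrementally` is a hypothetical helper name. `json.dump()` works along similar lines, which is why it avoids building the whole result in memory but, as noted in this thread, does not go through the same C-accelerated path as `dumps()`.

```python
import io
import json


def dump_incrementally(obj, fp):
    """Write obj to fp as JSON, one encoder chunk at a time."""
    encoder = json.JSONEncoder()
    for chunk in encoder.iterencode(obj):
        fp.write(chunk)


fp = io.StringIO()
dump_incrementally({"a": [1, 2]}, fp)
print(fp.getvalue() == json.dumps({"a": [1, 2]}))
```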
msg142486 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-08-19 18:28
> > It would just need to call a given callable (fp.write) at regular
> > intervals and that would be enough to C-accelerate dump().
> 
> True, but that would just special case dump(), just like dumps() is
> special-cased now. Ideally JSONEncoder.iterencode() would be
> accelerated, so you wouldn't need any special cases. Or deprecate
> iterencode() and replace it with a callback interface...

Is iterencode() used much? I would think dump() and dumps() see the most
use.
msg142489 - Author: (poq) Date: 2011-08-19 19:08
> Is iterencode() used much? I would think dump() and dumps() see the most use.

Of course. I'd just prefer an elegant & complete solution. But I agree accelerating just dump() would already be much better than the current situation.
History
Date User Action Args
2011-08-19 19:08:19 poq set messages: + msg142489
2011-08-19 18:28:30 pitrou set messages: + msg142486
2011-08-19 18:14:17 poq set messages: + msg142484
2011-08-19 16:09:39 pitrou set status: open -> closed
resolution: fixed
stage: resolved
2011-08-19 16:05:55 python-dev set nosy: + python-dev
messages: + msg142471
2011-08-19 14:09:52 pitrou set messages: + msg142455
2011-08-18 18:14:48 pitrou set files: + jsonacc.patch
keywords: + patch
messages: + msg142390
2011-08-18 16:35:01 poq set nosy: + poq
messages: + msg142360
2011-08-18 15:23:15 pitrou create