Issue 12778: JSON-serializing a large container takes too much memory


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/56987

classification

Title:	JSON-serializing a large container takes too much memory
Type:	resource usage	Stage:	resolved
Components:		Versions:	Python 3.3

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, pitrou, poq, python-dev, rhettinger
Priority:	normal	Keywords:	patch

Created on 2011-08-18 15:23 by pitrou, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
jsonacc.patch	pitrou, 2011-08-18 18:14

Messages (8)
msg142338 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2011-08-18 15:23
On an 8GB RAM box (more than 6GB free), serializing many small objects can eat all memory, while the end result would take around 600MB on a UCS2 build:

$ LANG=C time opt/python -c "import json; l = [1] * (100*1024*1024); encoded = json.dumps(l)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/antoine/cpython/opt/Lib/json/__init__.py", line 224, in dumps
    return _default_encoder.encode(obj)
  File "/home/antoine/cpython/opt/Lib/json/encoder.py", line 188, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/home/antoine/cpython/opt/Lib/json/encoder.py", line 246, in iterencode
    return _iterencode(o, 0)
MemoryError
Command exited with non-zero status 1
11.25user 2.43system 0:13.72elapsed 99%CPU (0avgtext+0avgdata 27820320maxresident)k
2920inputs+0outputs (12major+1261388minor)pagefaults 0swaps

I suppose the encoder internally builds a large list of very small unicode objects, and only joins them at the end. We could probably join it in chunks to avoid this behaviour.
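The "join by chunks" idea can be sketched in pure Python. This is only an illustration of the technique, not the code in jsonacc.patch; the names `ChunkedAccumulator`, `encode_int_list` and the `batch_size` default are hypothetical.

```python
class ChunkedAccumulator:
    """Collect many small strings, joining them in batches.

    Hypothetical helper: instead of keeping one list entry per tiny
    fragment until the very end, it periodically collapses the pending
    fragments into a single larger string.
    """

    def __init__(self, batch_size=100_000):
        self.batch_size = batch_size
        self._small = []  # pending small fragments
        self._large = []  # already-joined batches

    def append(self, fragment):
        self._small.append(fragment)
        if len(self._small) >= self.batch_size:
            self._flush()

    def _flush(self):
        # Replace many small string objects with one joined string.
        if self._small:
            self._large.append("".join(self._small))
            self._small.clear()

    def finish(self):
        self._flush()
        result = "".join(self._large)
        self._large.clear()
        return result


def encode_int_list(values, batch_size=100_000):
    """Encode a flat list of ints as JSON text using the accumulator."""
    acc = ChunkedAccumulator(batch_size)
    acc.append("[")
    first = True
    for v in values:
        if not first:
            acc.append(", ")
        acc.append(str(v))
        first = False
    acc.append("]")
    return acc.finish()
```

With a batch size of N, at most N small fragments are alive at once; the rest have already been merged into a few larger strings.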
msg142360 - (view)	Author: (poq)	Date: 2011-08-18 16:35
I think this is because dumps() uses the C encoder. Making the C encoder incremental (i.e. iterator-based) like the Python encoder would solve this. I actually looked into doing this for issue #12134, but it didn't seem so simple; since C has no yield, I think the iterator would need to maintain its own stack to keep track of where it is in the object tree it's encoding... If there is interest though, I may be able to write a patch when I have some time off again...
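To make the "maintain its own stack" point concrete, here is a rough Python sketch of an iterator that walks nested lists without recursion or `yield`, roughly the shape a C implementation would need. The class name `StackEncoder` is hypothetical, and only lists are walked incrementally; every other value is handed to `json.dumps()` as a leaf.

```python
import json


class StackEncoder:
    """Iterator producing JSON fragments for nested lists.

    Hypothetical sketch: position in the object tree is kept in an
    explicit stack of frames rather than in the call stack.
    """

    def __init__(self, obj):
        self._root = obj
        self._started = False
        self._stack = []  # each frame: [iterator over a list, emitted_any]

    def __iter__(self):
        return self

    def _open(self, obj):
        # Return the first fragment for obj, pushing a frame for lists.
        if isinstance(obj, list):
            self._stack.append([iter(obj), False])
            return "["
        return json.dumps(obj)

    def __next__(self):
        if not self._started:
            self._started = True
            return self._open(self._root)
        if not self._stack:
            raise StopIteration
        frame = self._stack[-1]
        try:
            item = next(frame[0])
        except StopIteration:
            # Current list is exhausted: close it and resume the parent.
            self._stack.pop()
            return "]"
        prefix = ", " if frame[1] else ""
        frame[1] = True
        return prefix + self._open(item)
```

Joining the fragments, e.g. `"".join(StackEncoder([1, [2, 3], []]))`, gives the same text as `json.dumps()` for that input.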
msg142390 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2011-08-18 18:14
This patch does the trick.
msg142455 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2011-08-19 14:09
> I actually looked into doing this for issue #12134, but it didn't seem
> so simple; Since C has no yield, I think the iterator would need to
> maintain its own stack to keep track of where it is in the object tree
> it's encoding...

The encoder doesn't have to be turned into an iterator. It would just need to call a given callable (fp.write) at regular intervals and that would be enough to C-accelerate dump(). My patch actually provides a good foundation for this.
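The callback approach might look like the following in Python. It is a sketch under assumptions, not the proposed C change: `dump_int_list` and `flush_every` are hypothetical names, and the encoder handles only a flat sequence of ints.

```python
import io


def dump_int_list(values, write, flush_every=10_000):
    """Encode ints as a JSON array, calling write() at regular intervals.

    `write` is any callable taking a string, such as fp.write.
    """
    buf = []

    def emit(fragment):
        buf.append(fragment)
        if len(buf) >= flush_every:
            # Hand the accumulated text to the caller and start over.
            write("".join(buf))
            buf.clear()

    emit("[")
    for i, v in enumerate(values):
        if i:
            emit(", ")
        emit(str(v))
    emit("]")
    if buf:
        write("".join(buf))


out = io.StringIO()
dump_int_list([1, 2, 3], out.write, flush_every=2)
print(out.getvalue())  # [1, 2, 3]
```

The encoder stays an ordinary function with no iterator state to preserve between calls, which is why no explicit stack is needed here.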
msg142471 - (view)	Author: Roundup Robot (python-dev)	Date: 2011-08-19 16:05
New changeset 47176e8d7060 by Antoine Pitrou in branch 'default':
Issue #12778: Reduce memory consumption when JSON-encoding a large container of many small objects.
http://hg.python.org/cpython/rev/47176e8d7060
msg142484 - (view)	Author: (poq)	Date: 2011-08-19 18:14
> It would just need to call a given callable (fp.write) at regular intervals and that would be enough to C-accelerate dump().

True, but that would just special case dump(), just like dumps() is special-cased now. Ideally JSONEncoder.iterencode() would be accelerated, so you wouldn't need any special cases. Or deprecate iterencode() and replace it with a callback interface...
msg142486 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2011-08-19 18:28
> > It would just need to call a given callable (fp.write) at regular
> > intervals and that would be enough to C-accelerate dump().
>
> True, but that would just special case dump(), just like dumps() is
> special-cased now. Ideally JSONEncoder.iterencode() would be
> accelerated, so you wouldn't need any special cases. Or deprecate
> iterencode() and replace it with a callback interface...

Is iterencode() used much? I would think dump() and dumps() see the most use.
msg142489 - (view)	Author: (poq)	Date: 2011-08-19 19:08
> Is iterencode() used much? I would think dump() and dumps() see the most use.

Of course. I'd just prefer an elegant & complete solution. But I agree accelerating just dump() would already be much better than the current situation.
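For reference, the three entry points discussed in this thread can be compared with a short snippet. It shows only that they produce the same text; which of them is C-accelerated depends on the Python version and is not demonstrated here. The sample `data` value is made up.

```python
import io
import json

data = {"values": [1, 2, 3], "ok": True}

# dumps(): returns the whole document as one string.
text = json.dumps(data)

# dump(): writes to a file-like object through its write() method.
buf = io.StringIO()
json.dump(data, buf)

# iterencode(): yields the document as a sequence of string fragments.
fragments = list(json.JSONEncoder().iterencode(data))

print(text)
print(buf.getvalue() == text)
print("".join(fragments) == text)
```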

History
Date	User	Action	Args
2022-04-11 14:57:20	admin	set	github: 56987
2011-08-19 19:08:19	poq	set	messages: + msg142489
2011-08-19 18:28:30	pitrou	set	messages: + msg142486
2011-08-19 18:14:17	poq	set	messages: + msg142484
2011-08-19 16:09:39	pitrou	set	status: open -> closed resolution: fixed stage: resolved
2011-08-19 16:05:55	python-dev	set	nosy: + python-dev messages: + msg142471
2011-08-19 14:09:52	pitrou	set	messages: + msg142455
2011-08-18 18:14:48	pitrou	set	files: + jsonacc.patch keywords: + patch messages: + msg142390
2011-08-18 16:35:01	poq	set	nosy: + poq messages: + msg142360
2011-08-18 15:23:15	pitrou	create