classification
Title: pickle.dump allocates unnecessary temporary bytes / str
Type: performance Stage: patch review
Components: Library (Lib) Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Olivier.Grisel, pitrou, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2017-11-09 18:11 by Olivier.Grisel, last changed 2017-11-12 16:48 by pitrou.

Pull Requests
URL Status Linked Edit
PR 4353 open python-dev, 2017-11-09 18:14
Messages (28)
msg305975 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-09 18:11
I noticed that both pickle.Pickler (C version) and pickle._Pickler (Python version) make unnecessary memory copies when dumping large str, bytes and bytearray objects.

This is caused by unnecessary concatenation of the opcode and size header with the large bytes payload prior to calling self.write.

For protocol 4, an additional copy is caused by the framing mechanism.

I will submit a pull request to fix the issue for the Python version. I am not sure how to test this properly: the BigmemPickleTests seem to be skipped on my 16 GB laptop.
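To make the pattern concrete, here is a simplified sketch (hypothetical code, not the actual Lib/pickle.py source; only the `BINBYTES` opcode `b'B'` and its 4-byte little-endian size header are taken from the real protocol):

```python
import io

BINBYTES = b'B'  # protocol >= 3 opcode for a bytes object with a 4-byte size

def dump_bytes_with_copy(obj, write):
    # concatenation materializes a temporary of len(obj) + 5 bytes
    write(BINBYTES + len(obj).to_bytes(4, 'little') + obj)

def dump_bytes_no_copy(obj, write):
    # two writes: only the small 5-byte header is newly allocated
    write(BINBYTES + len(obj).to_bytes(4, 'little'))
    write(obj)

buf_copy, buf_no_copy = io.BytesIO(), io.BytesIO()
payload = b'x' * 1000
dump_bytes_with_copy(payload, buf_copy.write)
dump_bytes_no_copy(payload, buf_no_copy.write)
```

Both variants produce identical pickled output; the second one just never allocates the `len(obj) + 5` byte temporary.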
msg305976 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-11-09 18:12
You don't need to add a test for a performance enhancement.
msg305977 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-11-09 18:13
Of course, +1 for fixing this.
msg305978 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-11-09 18:29
As for the C pickler, currently it dumps the whole pickle into an internal buffer before calling write() at the end.  You may want to make writing more incremental.  See Modules/_pickler.c (especially _Pickler_Write()).
msg305979 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-09 18:34
Would be nice to see benchmarks.

And what about C version?
msg305990 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-09 22:15
I wrote a script to monitor the memory when dumping 2GB of data with python master (C pickler and Python pickler):

```
(py37) ogrisel@ici:~/code/cpython$ python ~/tmp/large_pickle_dump.py
Allocating source data...
=> peak memory usage: 2.014 GB
Dumping to disk...
done in 5.141s
=> peak memory usage: 4.014 GB
(py37) ogrisel@ici:~/code/cpython$ python ~/tmp/large_pickle_dump.py --use-pypickle
Allocating source data...
=> peak memory usage: 2.014 GB
Dumping to disk...
done in 5.046s
=> peak memory usage: 5.955 GB
```

This is using protocol 4. Note that the C pickler is only making 1 useless memory copy instead of 2 for the Python pickler (one for the concatenation and the other because of the framing mechanism of protocol 4).

Here the output with the Python pickler fixed in python/cpython#4353:

```
(py37) ogrisel@ici:~/code/cpython$ python ~/tmp/large_pickle_dump.py --use-pypickle
Allocating source data...
=> peak memory usage: 2.014 GB
Dumping to disk...
done in 6.138s
=> peak memory usage: 2.014 GB
```


Basically the 2 spurious memory copies of the Python pickler with protocol 4 are gone.

Here is the script: https://gist.github.com/ogrisel/0e7b3282c84ae4a581f3b9ec1d84b45a
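For reference, a minimal stand-in for such a script could look like the following (an illustrative sketch, not the gist itself; `resource.getrusage` is POSIX-only, and `ru_maxrss` is in kilobytes on Linux but bytes on macOS, so the GB conversion assumes Linux):

```python
import pickle
import resource
import tempfile

def peak_rss_gb():
    # ru_maxrss: kilobytes on Linux, bytes on macOS
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6

data = b'x' * (50 * 1024 * 1024)  # 50 MB stand-in for the 2 GB payload
print('Allocating source data...')
print(f'=> peak memory usage: {peak_rss_gb():.3f} GB')
print('Dumping to disk...')
with tempfile.TemporaryFile() as f:
    pickle.dump(data, f, protocol=4)
print(f'=> peak memory usage: {peak_rss_gb():.3f} GB')
```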
msg305991 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-11-09 22:17
But the total runtime is higher? (6 s. vs. 5 s.)  Can you post the CPU time?  (as measured by `time`, for example)
msg305992 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-09 22:17
Note that the time difference is not significant. When I reran the last command I got:

```
(py37) ogrisel@ici:~/code/cpython$ python ~/tmp/large_pickle_dump.py --use-pypickle
Allocating source data...
=> peak memory usage: 2.014 GB
Dumping to disk...
done in 4.187s
=> peak memory usage: 2.014 GB
```
msg305993 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-09 22:21
More benchmarks with the Unix `time` command:

```

(py37) ogrisel@ici:~/code/cpython$ git checkout master
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
(py37) ogrisel@ici:~/code/cpython$ time python ~/tmp/large_pickle_dump.py --use-pypickle
Allocating source data...
=> peak memory usage: 2.014 GB
Dumping to disk...
done in 10.677s
=> peak memory usage: 5.936 GB

real	0m11.068s
user	0m0.940s
sys	0m5.204s
(py37) ogrisel@ici:~/code/cpython$ time python ~/tmp/large_pickle_dump.py --use-pypickle
Allocating source data...
=> peak memory usage: 2.014 GB
Dumping to disk...
done in 5.089s
=> peak memory usage: 5.978 GB

real	0m5.367s
user	0m0.840s
sys	0m4.660s
(py37) ogrisel@ici:~/code/cpython$ git checkout issue-31993-pypickle-dump-mem-optim 
Switched to branch 'issue-31993-pypickle-dump-mem-optim'
(py37) ogrisel@ici:~/code/cpython$ time python ~/tmp/large_pickle_dump.py --use-pypickle
Allocating source data...
=> peak memory usage: 2.014 GB
Dumping to disk...
done in 6.974s
=> peak memory usage: 2.014 GB

real	0m7.300s
user	0m0.368s
sys	0m4.640s
(py37) ogrisel@ici:~/code/cpython$ time python ~/tmp/large_pickle_dump.py --use-pypickle
Allocating source data...
=> peak memory usage: 2.014 GB
Dumping to disk...
done in 10.873s
=> peak memory usage: 2.014 GB

real	0m11.178s
user	0m0.324s
sys	0m5.100s
(py37) ogrisel@ici:~/code/cpython$ time python ~/tmp/large_pickle_dump.py --use-pypickle
Allocating source data...
=> peak memory usage: 2.014 GB
Dumping to disk...
done in 4.233s
=> peak memory usage: 2.014 GB

real	0m4.574s
user	0m0.396s
sys	0m4.368s
```

User time is always lower with the PR than on master, but it is also much smaller than system time (disk access) in any case. System time is much less deterministic.
msg305994 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-11-09 22:26
So we're saving memory and CPU time.  Cool!
msg306024 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-10 11:25
Actually the time varies too much between runs. 1.641s ... 8.475s ... 12.645s
msg306025 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-10 12:00
In my last comment, I also reported the user times (time not spent in OS-level disk access): the code of the PR is on the order of 300-400 ms while master is around 800 ms or more.
msg306026 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-10 12:15
I'll try to write the C implementation. Maybe it will use a different heuristic.
msg306029 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-10 13:13
This speeds up pickling large bytes objects.

$ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*1000000 for i in range(256)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
Unpatched:  10 loops, best of 5: 20.7 msec per loop
Patched:    200 loops, best of 5: 1.12 msec per loop

But it slows down the pickling of short bytes objects longer than 256 bytes (by up to 40%).

$ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*1000 for i in range(25600)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
Unpatched:  5 loops, best of 5: 77.8 msec per loop
Patched:    2 loops, best of 5: 98.5 msec per loop

$ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*256 for i in range(100000)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
Unpatched:  1 loop, best of 5: 278 msec per loop
Patched:    1 loop, best of 5: 382 msec per loop

Compare with:

$ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*255 for i in range(100000)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
Unpatched:  1 loop, best of 5: 277 msec per loop
Patched:    1 loop, best of 5: 273 msec per loop

I think the code should be optimized to decrease the overhead of _write_many().
msg306031 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-10 14:38
I have pushed a new version of the code that now has a 10% overhead for small bytes (instead of 40% previously).

It could be possible to optimize further but I think that would render the code much less readable so I would be tempted to keep it this way.

Please let me know what you think.
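The overall approach can be sketched as a size-based dispatch (illustrative code only; `SketchPickler`, `_write_large` and the use of the 64 kB frame size as the cutoff are stand-ins, not the PR's actual names):

```python
import io

FRAME_SIZE_TARGET = 64 * 1024  # protocol 4 frame size, used here as the cutoff

class SketchPickler:
    """Illustrative only: route large payloads around the frame buffer."""

    def __init__(self, file):
        self._file_write = file.write
        self._frame = io.BytesIO()  # current frame buffer

    def _write(self, data):
        self._frame.write(data)  # small data is accumulated in the frame

    def _write_large(self, header, payload):
        # flush the pending frame, then write header and payload directly,
        # avoiding both the concatenation copy and the framing copy
        self._file_write(self._frame.getvalue())
        self._frame = io.BytesIO()
        self._file_write(header)
        self._file_write(payload)

    def save_bytes(self, obj):
        header = b'B' + len(obj).to_bytes(4, 'little')
        if len(obj) >= FRAME_SIZE_TARGET:
            self._write_large(header, obj)
        else:
            self._write(header + obj)

    def flush(self):
        self._file_write(self._frame.getvalue())

out = io.BytesIO()
p = SketchPickler(out)
p.save_bytes(b'small')             # buffered in the frame
p.save_bytes(b'y' * (128 * 1024))  # bypasses the frame buffer
p.flush()
```

Small objects still pay only the cheap frame-buffer path, which is why the remaining overhead shows up only for payloads near the cutoff.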
msg306032 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-10 14:42
Actually, I think this can still be improved while keeping it readable. Let me try again :)
msg306033 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-10 14:57
Alright, the last version has now ~4% overhead for small bytes.
msg306035 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-10 15:09
Nice! I had virtually the same code as your intermediate variant, but your final variant is even better!
msg306042 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-10 18:32
BTW, I am looking at the C implementation at the moment. I think I can do it.
msg306062 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-10 23:37
I have tried to implement the direct-write bypass for the C version of the pickler, but I get a segfault in a Py_INCREF on obj during the call to memo_put(self, obj), after the call to _Pickler_write_large_bytes.

Here is the diff of my current version of the patch:

https://github.com/ogrisel/cpython/commit/4e093ad6993616a9f16e863b72bf2d2e37bc27b4

I am new to the Python C-API so I would appreciate some help on this one.
msg306088 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-11 15:39
Alright, I found the source of my refcounting bug. I updated the PR to include the C version of the dump for PyBytes.

I ran Serhiy's microbenchmarks on the C version and I could not detect any overhead on small bytes objects, while I get a ~20x speedup (and no memory copy) on large bytes objects as expected.

I would like to update the `write_utf8` function but I would need to find a way to wrap `const char* data` as a PyBytes instance without making a memory copy to be able to pass it to my `_Pickle_write_large_bytes`. I browsed the C-API documentation but I could not understand how to do that.

Also I would appreciate any feedback on the code style or things that could be improved in my PR.
msg306092 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-11-11 18:32
> I would like to update the `write_utf8` function but I would need to find a way to wrap `const char* data` as a PyBytes instance without making a memory copy to be able to pass it to my `_Pickle_write_large_bytes`. 

You should pass it as a memoryview instead:
https://docs.python.org/3/c-api/memoryview.html#c.PyMemoryView_FromMemory
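At the Python level, the same zero-copy idea can be illustrated with the memoryview builtin (sketch only; the C code would use PyMemoryView_FromMemory as suggested):

```python
import io

data = b'x' * (10 * 1024 * 1024)
view = memoryview(data)        # wraps the buffer without copying it
chunk = view[: 1024 * 1024]    # slicing a memoryview is also copy-free

f = io.BytesIO()
f.write(chunk)                 # write() accepts any buffer-protocol object
```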
msg306111 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-12 12:08
While we are here, wouldn't it be worth flushing the buffer in the C implementation to disk after every frame commit? This would save memory when dumping a lot of small objects.
msg306112 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-12 12:43
Thanks Antoine, I updated my code to what you suggested.
msg306116 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-12 15:48
> While we are here, wouldn't it be worth flushing the buffer in the C implementation to disk after every frame commit? This would save memory when dumping a lot of small objects.

I think it's a good idea. The C pickler would behave more like the Python pickler. I think framing was intended this way initially. Antoine what do you think?
msg306117 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-11-12 15:50
Framing was originally intended to improve unpickling (since you don't have to issue lots of tiny file reads anymore).  No objection to also improve pickling, though :-)
msg306121 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-12 16:45
Flushing the buffer at each frame commit will cause a medium-sized write every 64kB on average (instead of one big write at the end). So that might actually cause a performance regression for some users if the individual file-object writes induce significant overhead.

In practice, though, latency-inducing file objects such as filesystem-backed ones are likely to derive from the [BufferedWriter](https://docs.python.org/3/library/io.html#io.BufferedWriter) base class, so the only latency we should really care about is the overhead of the write call itself, in which case the 64 kB frame / buffer size should be enough.
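The point about BufferedWriter can be checked with a small counting sketch (the `CountingRaw` class is hypothetical; it just counts how many raw writes actually reach the OS layer):

```python
import io

class CountingRaw(io.RawIOBase):
    """Raw stream that counts the write() calls reaching the 'OS' layer."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        return len(b)

raw = CountingRaw()
buffered = io.BufferedWriter(raw, buffer_size=64 * 1024)
for _ in range(1000):
    buffered.write(b'x' * 100)  # 1000 small application-level writes
buffered.flush()
```

The 1000 small application-level writes are coalesced into only a handful of raw writes, which is why flushing at 64 kB frame boundaries should stay cheap for buffered file objects.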
msg306122 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-11-12 16:48
Agreed.  We shouldn't issue very small writes, but 64 kB is generally considered a reasonable buffer size for many kinds of I/O.

Besides, it wouldn't be difficult to make the target frame size configurable if a use case arose for it, but I don't think we've ever had such a request.
History
Date User Action Args
2017-11-12 16:48:35  pitrou  set  messages: + msg306122
2017-11-12 16:45:49  Olivier.Grisel  set  messages: + msg306121
2017-11-12 15:50:45  pitrou  set  messages: + msg306117
2017-11-12 15:48:14  Olivier.Grisel  set  messages: + msg306116
2017-11-12 12:43:24  Olivier.Grisel  set  messages: + msg306112
2017-11-12 12:08:09  serhiy.storchaka  set  messages: + msg306111
2017-11-11 18:32:13  pitrou  set  messages: + msg306092
2017-11-11 15:39:07  Olivier.Grisel  set  messages: + msg306088
2017-11-10 23:37:38  Olivier.Grisel  set  messages: + msg306062
2017-11-10 18:32:15  Olivier.Grisel  set  messages: + msg306042
2017-11-10 15:09:43  serhiy.storchaka  set  messages: + msg306035
2017-11-10 14:57:13  Olivier.Grisel  set  messages: + msg306033
2017-11-10 14:42:16  Olivier.Grisel  set  messages: + msg306032
2017-11-10 14:38:40  Olivier.Grisel  set  messages: + msg306031
2017-11-10 13:13:27  serhiy.storchaka  set  messages: + msg306029
2017-11-10 12:15:12  serhiy.storchaka  set  messages: + msg306026
2017-11-10 12:00:18  Olivier.Grisel  set  messages: + msg306025
2017-11-10 11:25:23  serhiy.storchaka  set  messages: + msg306024
2017-11-09 22:26:55  pitrou  set  messages: + msg305994
2017-11-09 22:21:24  Olivier.Grisel  set  messages: + msg305993
2017-11-09 22:17:41  Olivier.Grisel  set  messages: + msg305992
2017-11-09 22:17:19  pitrou  set  messages: + msg305991
2017-11-09 22:15:49  Olivier.Grisel  set  messages: + msg305990
2017-11-09 18:34:08  serhiy.storchaka  set  messages: + msg305979
2017-11-09 18:30:10  pitrou  set  stage: needs patch -> patch review
2017-11-09 18:29:54  pitrou  set  messages: + msg305978
                                  stage: patch review -> needs patch
2017-11-09 18:25:20  serhiy.storchaka  set  nosy: + serhiy.storchaka
2017-11-09 18:14:39  python-dev  set  keywords: + patch
                                      stage: needs patch -> patch review
                                      pull_requests: + pull_request4309
2017-11-09 18:13:32  pitrou  set  messages: + msg305977
2017-11-09 18:12:32  pitrou  set  type: resource usage -> performance
                                  stage: needs patch
                                  messages: + msg305976
                                  versions: - Python 3.4, Python 3.5, Python 3.6, Python 3.8
2017-11-09 18:11:50  Olivier.Grisel  create