classification
Title: pickle.dump allocates unnecessary temporary bytes / str
Type: performance Stage: patch review
Components: Library (Lib) Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Olivier.Grisel, pitrou, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2017-11-09 18:11 by Olivier.Grisel, last changed 2017-11-12 16:48 by pitrou.

Pull Requests
URL Status Linked Edit
PR 4353 open python-dev, 2017-11-09 18:14
Messages (28)
msg305975 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-09 18:11
I noticed that both pickle.Pickler (C version) and pickle._Pickler (Python version) make unnecessary memory copies when dumping large str, bytes and bytearray objects.

This is caused by unnecessary concatenation of the opcode and size header with the large bytes payload prior to calling self.write.

For protocol 4, an additional copy is caused by the framing mechanism.

I will submit a pull request to fix the issue for the Python version. I am not sure how to test this properly: the BigmemPickleTests seem to be skipped on my 16 GB laptop.
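To make the pattern concrete, here is a simplified sketch (hypothetical code, not the actual Lib/pickle.py source; only the `BINBYTES` opcode `b'B'` and its 4-byte little-endian size header are taken from the real protocol):

```python
import io

BINBYTES = b'B'  # protocol >= 3 opcode for a bytes object with a 4-byte size

def dump_bytes_with_copy(obj, write):
    # concatenation materializes a temporary of len(obj) + 5 bytes
    write(BINBYTES + len(obj).to_bytes(4, 'little') + obj)

def dump_bytes_no_copy(obj, write):
    # two writes: only the small 5-byte header is newly allocated
    write(BINBYTES + len(obj).to_bytes(4, 'little'))
    write(obj)

buf_copy, buf_no_copy = io.BytesIO(), io.BytesIO()
payload = b'x' * 1000
dump_bytes_with_copy(payload, buf_copy.write)
dump_bytes_no_copy(payload, buf_no_copy.write)
```

Both variants produce identical pickled output; the second one just never allocates the `len(obj) + 5` byte temporary.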
msg305976 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-11-09 18:12
You don't need to add a test for a performance enhancement.
msg305977 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-11-09 18:13
Of course, +1 for fixing this.
msg305978 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-11-09 18:29
As for the C pickler, currently it dumps the whole pickle into an internal buffer before calling write() at the end.  You may want to make writing more incremental.  See Modules/_pickler.c (especially _Pickler_Write()).
msg305979 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-09 18:34
Would be nice to see benchmarks.

And what about C version?
msg305990 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-09 22:15
I wrote a script to monitor the memory when dumping 2GB of data with python master (C pickler and Python pickler):

```
(py37) ogrisel@ici:~/code/cpython$ python ~/tmp/large_pickle_dump.py
Allocating source data...
=> peak memory usage: 2.014 GB
Dumping to disk...
done in 5.141s
=> peak memory usage: 4.014 GB
(py37) ogrisel@ici:~/code/cpython$ python ~/tmp/large_pickle_dump.py --use-pypickle
Allocating source data...
=> peak memory usage: 2.014 GB
Dumping to disk...
done in 5.046s
=> peak memory usage: 5.955 GB
```

This is using protocol 4. Note that the C pickler is only making 1 useless memory copy instead of 2 for the Python pickler (one for the concatenation and the other because of the framing mechanism of protocol 4).

Here the output with the Python pickler fixed in python/cpython#4353:

```
(py37) ogrisel@ici:~/code/cpython$ python ~/tmp/large_pickle_dump.py --use-pypickle
Allocating source data...
=> peak memory usage: 2.014 GB
Dumping to disk...
done in 6.138s
=> peak memory usage: 2.014 GB
```


Basically the 2 spurious memory copies of the Python pickler with protocol 4 are gone.

Here is the script: https://gist.github.com/ogrisel/0e7b3282c84ae4a581f3b9ec1d84b45a
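For reference, a minimal stand-in for such a script could look like the following (an illustrative sketch, not the gist itself; `resource.getrusage` is POSIX-only, and `ru_maxrss` is in kilobytes on Linux but bytes on macOS, so the GB conversion assumes Linux):

```python
import pickle
import resource
import tempfile

def peak_rss_gb():
    # ru_maxrss: kilobytes on Linux, bytes on macOS
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6

data = b'x' * (50 * 1024 * 1024)  # 50 MB stand-in for the 2 GB payload
print('Allocating source data...')
print(f'=> peak memory usage: {peak_rss_gb():.3f} GB')
print('Dumping to disk...')
with tempfile.TemporaryFile() as f:
    pickle.dump(data, f, protocol=4)
print(f'=> peak memory usage: {peak_rss_gb():.3f} GB')
```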
msg305991 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-11-09 22:17
But the total runtime is higher? (6 s. vs. 5 s.)  Can you post the CPU time?  (as measured by `time`, for example)
msg305992 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-09 22:17
Note that the time difference is not significant. When I reran the last command I got:

```
(py37) ogrisel@ici:~/code/cpython$ python ~/tmp/large_pickle_dump.py --use-pypickle
Allocating source data...
=> peak memory usage: 2.014 GB
Dumping to disk...
done in 4.187s
=> peak memory usage: 2.014 GB
```
msg305993 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-09 22:21
More benchmarks with the Unix `time` command:

```

(py37) ogrisel@ici:~/code/cpython$ git checkout master
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
(py37) ogrisel@ici:~/code/cpython$ time python ~/tmp/large_pickle_dump.py --use-pypickle
Allocating source data...
=> peak memory usage: 2.014 GB
Dumping to disk...
done in 10.677s
=> peak memory usage: 5.936 GB

real	0m11.068s
user	0m0.940s
sys	0m5.204s
(py37) ogrisel@ici:~/code/cpython$ time python ~/tmp/large_pickle_dump.py --use-pypickle
Allocating source data...
=> peak memory usage: 2.014 GB
Dumping to disk...
done in 5.089s
=> peak memory usage: 5.978 GB

real	0m5.367s
user	0m0.840s
sys	0m4.660s
(py37) ogrisel@ici:~/code/cpython$ git checkout issue-31993-pypickle-dump-mem-optim 
Switched to branch 'issue-31993-pypickle-dump-mem-optim'
(py37) ogrisel@ici:~/code/cpython$ time python ~/tmp/large_pickle_dump.py --use-pypickle
Allocating source data...
=> peak memory usage: 2.014 GB
Dumping to disk...
done in 6.974s
=> peak memory usage: 2.014 GB

real	0m7.300s
user	0m0.368s
sys	0m4.640s
(py37) ogrisel@ici:~/code/cpython$ time python ~/tmp/large_pickle_dump.py --use-pypickle
Allocating source data...
=> peak memory usage: 2.014 GB
Dumping to disk...
done in 10.873s
=> peak memory usage: 2.014 GB

real	0m11.178s
user	0m0.324s
sys	0m5.100s
(py37) ogrisel@ici:~/code/cpython$ time python ~/tmp/large_pickle_dump.py --use-pypickle
Allocating source data...
=> peak memory usage: 2.014 GB
Dumping to disk...
done in 4.233s
=> peak memory usage: 2.014 GB

real	0m4.574s
user	0m0.396s
sys	0m4.368s
```

User time is always lower with the PR than on master, but it is also much smaller than system time (disk access) in any case. System time is much less deterministic.
msg305994 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-11-09 22:26
So we're saving memory and CPU time.  Cool!
msg306024 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-10 11:25
Actually the time varies too much between runs. 1.641s ... 8.475s ... 12.645s
msg306025 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-10 12:00
In my last comment, I also reported the user times (time not spent in OS-level disk access): the code of the PR is on the order of 300-400 ms while master is around 800 ms or more.
msg306026 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-10 12:15
I'll try to write the C implementation. Maybe it will use a different heuristic.
msg306029 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-10 13:13
This speeds up pickling large bytes objects.

$ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*1000000 for i in range(256)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
Unpatched:  10 loops, best of 5: 20.7 msec per loop
Patched:    200 loops, best of 5: 1.12 msec per loop

But it slows down the pickling of short bytes objects longer than 256 bytes (by up to 40%).

$ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*1000 for i in range(25600)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
Unpatched:  5 loops, best of 5: 77.8 msec per loop
Patched:    2 loops, best of 5: 98.5 msec per loop

$ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*256 for i in range(100000)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
Unpatched:  1 loop, best of 5: 278 msec per loop
Patched:    1 loop, best of 5: 382 msec per loop

Compare with:

$ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*255 for i in range(100000)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
Unpatched:  1 loop, best of 5: 277 msec per loop
Patched:    1 loop, best of 5: 273 msec per loop

I think the code should be optimized to decrease the overhead of _write_many().
msg306031 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-10 14:38
I have pushed a new version of the code that now has a 10% overhead for small bytes (instead of 40% previously).

It could be possible to optimize further but I think that would render the code much less readable so I would be tempted to keep it this way.

Please let me know what you think.
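The overall approach can be sketched as a size-based dispatch (illustrative code only; `SketchPickler`, `_write_large` and the use of the 64 kB frame size as the cutoff are stand-ins, not the PR's actual names):

```python
import io

FRAME_SIZE_TARGET = 64 * 1024  # protocol 4 frame size, used here as the cutoff

class SketchPickler:
    """Illustrative only: route large payloads around the frame buffer."""

    def __init__(self, file):
        self._file_write = file.write
        self._frame = io.BytesIO()  # current frame buffer

    def _write(self, data):
        self._frame.write(data)  # small data is accumulated in the frame

    def _write_large(self, header, payload):
        # flush the pending frame, then write header and payload directly,
        # avoiding both the concatenation copy and the framing copy
        self._file_write(self._frame.getvalue())
        self._frame = io.BytesIO()
        self._file_write(header)
        self._file_write(payload)

    def save_bytes(self, obj):
        header = b'B' + len(obj).to_bytes(4, 'little')
        if len(obj) >= FRAME_SIZE_TARGET:
            self._write_large(header, obj)
        else:
            self._write(header + obj)

    def flush(self):
        self._file_write(self._frame.getvalue())

out = io.BytesIO()
p = SketchPickler(out)
p.save_bytes(b'small')             # buffered in the frame
p.save_bytes(b'y' * (128 * 1024))  # bypasses the frame buffer
p.flush()
```

Small objects still pay only the cheap frame-buffer path, which is why the remaining overhead shows up only for payloads near the cutoff.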
msg306032 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-10 14:42
Actually, I think this can still be improved while keeping it readable. Let me try again :)
msg306033 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-10 14:57
Alright, the last version has now ~4% overhead for small bytes.
msg306035 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-10 15:09
Nice! I had virtually the same code as your intermediate variant, but your final variant is even better!
msg306042 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-10 18:32
BTW, I am looking at the C implementation at the moment. I think I can do it.
msg306062 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-10 23:37
I have tried to implement the direct-write bypass for the C version of the pickler, but I get a segfault in a Py_INCREF on obj during the call to memo_put(self, obj), after the call to _Pickler_write_large_bytes.

Here is the diff of my current version of the patch:

https://github.com/ogrisel/cpython/commit/4e093ad6993616a9f16e863b72bf2d2e37bc27b4

I am new to the Python C-API so I would appreciate some help on this one.
msg306088 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-11 15:39
Alright, I found the source of my refcounting bug. I updated the PR to include the C version of the dump for PyBytes.

I ran Serhiy's microbenchmarks on the C version and I could not detect any overhead on small bytes objects, while I get a ~20x speedup (and no memory copy) on large bytes objects as expected.

I would like to update the `write_utf8` function but I would need to find a way to wrap `const char* data` as a PyBytes instance without making a memory copy to be able to pass it to my `_Pickle_write_large_bytes`. I browsed the C-API documentation but I could not understand how to do that.

Also I would appreciate any feedback on the code style or things that could be improved in my PR.
msg306092 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-11-11 18:32
> I would like to update the `write_utf8` function but I would need to find a way to wrap `const char* data` as a PyBytes instance without making a memory copy to be able to pass it to my `_Pickle_write_large_bytes`. 

You should pass it as a memoryview instead:
https://docs.python.org/3/c-api/memoryview.html#c.PyMemoryView_FromMemory
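At the Python level, the same zero-copy idea can be illustrated with the memoryview builtin (sketch only; the C code would use PyMemoryView_FromMemory as suggested):

```python
import io

data = b'x' * (10 * 1024 * 1024)
view = memoryview(data)        # wraps the buffer without copying it
chunk = view[: 1024 * 1024]    # slicing a memoryview is also copy-free

f = io.BytesIO()
f.write(chunk)                 # write() accepts any buffer-protocol object
```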
msg306111 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-12 12:08
While we are here, wouldn't it be worth flushing the buffer in the C implementation to disk after every frame commit? This would save memory when dumping a lot of small objects.
msg306112 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-12 12:43
Thanks Antoine, I updated my code to what you suggested.
msg306116 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-12 15:48
> While we are here, wouldn't it be worth flushing the buffer in the C implementation to disk after every frame commit? This would save memory when dumping a lot of small objects.

I think it's a good idea. The C pickler would behave more like the Python pickler. I think framing was intended this way initially. Antoine what do you think?
msg306117 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-11-12 15:50
Framing was originally intended to improve unpickling (since you don't have to issue lots of tiny file reads anymore).  No objection to also improve pickling, though :-)
msg306121 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2017-11-12 16:45
Flushing the buffer at each frame commit will cause a medium-sized write every 64kB on average (instead of one big write at the end). So that might actually cause a performance regression for some users if the individual file-object writes induce significant overhead.

In practice, though, latency-inducing file objects such as filesystem-backed ones are likely to derive from the [BufferedWriter](https://docs.python.org/3/library/io.html#io.BufferedWriter) base class, so the only latency we should really care about is the overhead of the write call itself, in which case the 64 kB frame / buffer size should be enough.
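The point about BufferedWriter can be checked with a small counting sketch (the `CountingRaw` class is hypothetical; it just counts how many raw writes actually reach the OS layer):

```python
import io

class CountingRaw(io.RawIOBase):
    """Raw stream that counts the write() calls reaching the 'OS' layer."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        return len(b)

raw = CountingRaw()
buffered = io.BufferedWriter(raw, buffer_size=64 * 1024)
for _ in range(1000):
    buffered.write(b'x' * 100)  # 1000 small application-level writes
buffered.flush()
```

The 1000 small application-level writes are coalesced into only a handful of raw writes, which is why flushing at 64 kB frame boundaries should stay cheap for buffered file objects.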
msg306122 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-11-12 16:48
Agreed.  We shouldn't issue very small writes, but 64 kB is generally considered a reasonable buffer size for many kinds of I/O.

Besides, it wouldn't be difficult to make the target frame size configurable if a use case arose for it, but I don't think we've ever had such a request.
History
Date User Action Args
2017-11-12 16:48:35  pitrou  set  messages: + msg306122
2017-11-12 16:45:49  Olivier.Grisel  set  messages: + msg306121
2017-11-12 15:50:45  pitrou  set  messages: + msg306117
2017-11-12 15:48:14  Olivier.Grisel  set  messages: + msg306116
2017-11-12 12:43:24  Olivier.Grisel  set  messages: + msg306112
2017-11-12 12:08:09  serhiy.storchaka  set  messages: + msg306111
2017-11-11 18:32:13  pitrou  set  messages: + msg306092
2017-11-11 15:39:07  Olivier.Grisel  set  messages: + msg306088
2017-11-10 23:37:38  Olivier.Grisel  set  messages: + msg306062
2017-11-10 18:32:15  Olivier.Grisel  set  messages: + msg306042
2017-11-10 15:09:43  serhiy.storchaka  set  messages: + msg306035
2017-11-10 14:57:13  Olivier.Grisel  set  messages: + msg306033
2017-11-10 14:42:16  Olivier.Grisel  set  messages: + msg306032
2017-11-10 14:38:40  Olivier.Grisel  set  messages: + msg306031
2017-11-10 13:13:27  serhiy.storchaka  set  messages: + msg306029
2017-11-10 12:15:12  serhiy.storchaka  set  messages: + msg306026
2017-11-10 12:00:18  Olivier.Grisel  set  messages: + msg306025
2017-11-10 11:25:23  serhiy.storchaka  set  messages: + msg306024
2017-11-09 22:26:55  pitrou  set  messages: + msg305994
2017-11-09 22:21:24  Olivier.Grisel  set  messages: + msg305993
2017-11-09 22:17:41  Olivier.Grisel  set  messages: + msg305992
2017-11-09 22:17:19  pitrou  set  messages: + msg305991
2017-11-09 22:15:49  Olivier.Grisel  set  messages: + msg305990
2017-11-09 18:34:08  serhiy.storchaka  set  messages: + msg305979
2017-11-09 18:30:10  pitrou  set  stage: needs patch -> patch review
2017-11-09 18:29:54  pitrou  set  messages: + msg305978
                                  stage: patch review -> needs patch
2017-11-09 18:25:20  serhiy.storchaka  set  nosy: + serhiy.storchaka
2017-11-09 18:14:39  python-dev  set  keywords: + patch
                                      stage: needs patch -> patch review
                                      pull_requests: + pull_request4309
2017-11-09 18:13:32  pitrou  set  messages: + msg305977
2017-11-09 18:12:32  pitrou  set  type: resource usage -> performance
                                  stage: needs patch
                                  messages: + msg305976
                                  versions: - Python 3.4, Python 3.5, Python 3.6, Python 3.8
2017-11-09 18:11:50  Olivier.Grisel  create