pickle.dump allocates unnecessary temporary bytes / str #76174
Comments
I noticed that both pickle.Pickler (C version) and pickle._Pickler (Python version) make unnecessary memory copies when dumping large str, bytes and bytearray objects. This is caused by the unnecessary concatenation of the opcode and size header with the large bytes payload prior to calling self.write. For protocol 4, an additional copy is caused by the framing mechanism. I will submit a pull request to fix the issue for the Python version. I am not sure how to test this properly: the BigmemPickleTests seem to be skipped on my 16 GB laptop. |
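The copy pattern being described can be sketched as follows. This is a simplified illustration, not the actual pickle source; the helper names are hypothetical, though the `B` opcode with a 4-byte little-endian size does correspond to pickle's BINBYTES encoding:

```python
# Simplified sketch of the copy issue (hypothetical helpers, not pickle's code).
# `write` stands for self.write on the underlying file object.

def save_bytes_copying(write, payload: bytes) -> None:
    # Concatenating the opcode/size header with a large payload allocates
    # a temporary bytes object roughly the size of the payload itself.
    header = b"B" + len(payload).to_bytes(4, "little")
    write(header + payload)  # extra copy happens here

def save_bytes_no_copy(write, payload: bytes) -> None:
    # Writing the header and payload separately avoids the temporary.
    header = b"B" + len(payload).to_bytes(4, "little")
    write(header)
    write(payload)  # no concatenation, no copy
```

Both variants produce byte-identical output; only the intermediate allocation differs.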
You don't need to add a test for a performance enhancement. |
Of course, +1 for fixing this. |
As for the C pickler, currently it dumps the whole pickle into an internal buffer before calling write() at the end. You may want to make writing more incremental. See Modules/_pickler.c (especially _Pickler_Write()). |
Would be nice to see benchmarks. And what about C version? |
I wrote a script to monitor the memory when dumping 2GB of data with python master (C pickler and Python pickler):
This is using protocol 4. Note that the C pickler is only making 1 useless memory copy instead of 2 for the Python pickler (one for the concatenation and the other because of the framing mechanism of protocol 4). Here is the output with the Python pickler fixed in #4353:
Basically the 2 spurious memory copies of the Python pickler with protocol 4 are gone. Here is the script: https://gist.github.com/ogrisel/0e7b3282c84ae4a581f3b9ec1d84b45a |
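Independently of the linked gist (which uses its own approach), a minimal peak-memory check can be sketched with only the standard library, assuming a Unix platform where the `resource` module is available:

```python
# Minimal sketch of a peak-RSS check around a pickle dump.
# Assumption: Unix-only `resource` module; the linked gist monitors
# memory differently (this is just an illustration).
import pickle
import resource

def dump_and_report(obj, path, protocol=4):
    """Pickle obj to path and return the process's peak RSS so far."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=protocol)
    # ru_maxrss is reported in kilobytes on Linux and bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
```

Comparing the returned value for the same object under two interpreter builds gives a rough view of the extra temporary allocations.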
But the total runtime is higher? (6 s. vs. 5 s.) Can you post the CPU time? (as measured by |
Note that the time difference is not significant. I reran the last command and got:
|
More benchmarks with the unix time command:
User time is always better in the PR than on master, but it is also much smaller than system time (disk access) in any case. System time is much less deterministic. |
So we're saving memory and CPU time. Cool! |
Actually the time varies too much between runs. 1.641s ... 8.475s ... 12.645s |
In my last comment, I also reported the user times (not spent in OS-level disk access): the code of the PR is on the order of 300-400 ms while master is around 800 ms or more. |
I'll try to write the C implementation. Maybe it will use a different heuristic. |
This speeds up pickling large bytes objects.

$ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*1000000 for i in range(256)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
Unpatched: 10 loops, best of 5: 20.7 msec per loop
Patched: 200 loops, best of 5: 1.12 msec per loop

But it slows down pickling short bytes objects longer than 256 bytes (up to 40%).

$ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*1000 for i in range(25600)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
Unpatched: 5 loops, best of 5: 77.8 msec per loop
Patched: 2 loops, best of 5: 98.5 msec per loop

$ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*256 for i in range(100000)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
Unpatched: 1 loop, best of 5: 278 msec per loop
Patched: 1 loop, best of 5: 382 msec per loop

Compare with:

$ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*255 for i in range(100000)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
Unpatched: 1 loop, best of 5: 277 msec per loop
Patched: 1 loop, best of 5: 273 msec per loop

I think the code should be optimized to decrease the overhead of _write_many(). |
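The buffering-versus-bypass trade-off under discussion can be sketched as a size-threshold heuristic: small writes accumulate in an in-memory frame buffer, while large payloads flush the buffer and go straight to the file. This is an illustrative sketch with hypothetical names, not the code of the PR:

```python
# Hypothetical sketch of the heuristic discussed in this thread:
# buffer small writes into a frame, bypass the buffer for large payloads.
import io

FRAME_SIZE_TARGET = 64 * 1024  # 64 kB, matching pickle's frame size target

class BufferedFramer:
    def __init__(self, file_write):
        self.file_write = file_write  # write method of the output file
        self.buffer = io.BytesIO()

    def write(self, data: bytes) -> None:
        # Small chunks accumulate in the frame buffer; once the target
        # frame size is reached, the frame is flushed to the file.
        self.buffer.write(data)
        if self.buffer.tell() >= FRAME_SIZE_TARGET:
            self.flush()

    def write_large(self, header: bytes, payload: bytes) -> None:
        # Large payloads: flush pending data, then write header and
        # payload directly -- no concatenation, no temporary copy.
        self.flush()
        self.file_write(header)
        self.file_write(payload)

    def flush(self) -> None:
        data = self.buffer.getvalue()
        if data:
            self.file_write(data)
        self.buffer = io.BytesIO()
```

The per-call overhead that the benchmarks above expose lives in `write` and `write_large`: for many small objects the threshold test runs on every write, which is why shaving that path matters.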
I have pushed a new version of the code that now has a 10% overhead for small bytes (instead of 40% previously). It could be possible to optimize further but I think that would render the code much less readable so I would be tempted to keep it this way. Please let me know what you think. |
Actually, I think this can still be improved while keeping it readable. Let me try again :) |
Alright, the last version has now ~4% overhead for small bytes. |
Nice! I have got virtually the same code as your intermediate variant, but your final variant is even better! |
BTW, I am looking at the C implementation at the moment. I think I can do it. |
I have tried to implement the direct write bypass for the C version of the pickler, but I get a segfault in a Py_INCREF on obj during the call to memo_put(self, obj) after the call to _Pickler_write_large_bytes. Here is the diff of my current version of the patch: I am new to the Python C API, so I would appreciate some help on this one. |
Alright, I found the source of my refcounting bug. I updated the PR to include the C version of the dump for PyBytes. I ran Serhiy's microbenchmarks on the C version and I could not detect any overhead on small bytes objects, while I get a ~20x speedup (and no memory copy) on large bytes objects, as expected. I would like to update the … Also, I would appreciate any feedback on the code style or things that could be improved in my PR. |
You should pass it as a memoryview instead: |
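The actual snippet suggested here was elided in the migration. As a general illustration of the point (not the reviewed code), slicing a `memoryview` shares the underlying buffer, whereas slicing a `bytes` object allocates a copy, so chunked writes of a large payload stay copy-free:

```python
# Why memoryview helps: slices of a memoryview share the payload's
# memory, while bytes slices would each allocate a fresh copy.
import io

payload = b"x" * (1 << 20)  # 1 MiB of data
view = memoryview(payload)

buf = io.BytesIO()
chunk_size = 64 * 1024
for start in range(0, len(view), chunk_size):
    buf.write(view[start:start + chunk_size])  # each slice shares memory

assert buf.getvalue() == payload
assert view[:10].obj is payload  # slices still reference the original buffer
```

The same idea applies when handing a large payload to a file object's write method: passing a view avoids materializing an intermediate bytes object.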
While we are here, wouldn't it be worth flushing the buffer in the C implementation to the disk every time a frame is committed? This would save memory when dumping a lot of small objects. |
Thanks Antoine, I updated my code to what you suggested. |
I think it's a good idea. The C pickler would behave more like the Python pickler. I think framing was intended this way initially. Antoine what do you think? |
Framing was originally intended to improve unpickling (since you don't have to issue lots of tiny file reads anymore). No objection to also improve pickling, though :-) |
Flushing the buffer at each frame commit will cause a medium-sized write every 64 kB on average (instead of one big write at the end). So that might actually cause a performance regression for some users if the individual file-object writes induce significant overhead. In practice, though, latency-inducing file objects like filesystem-backed ones are likely to derive from the BufferedWriter base class, and the only latency we should really care about is the one induced by the write call overhead itself, in which case the 64 kB frame / buffer size should be enough. |
Agreed. We shouldn't issue very small writes, but 64 kB is generally considered a reasonable buffer size for many kinds of I/O. Besides, it wouldn't be difficult to make the target frame size configurable if a use case arose for it, but I don't think we've ever had such a request. |
Shall we close this issue now that the PR has been merged to master? |
Definitely! Thank you for your contribution :-) |
Thanks for the very helpful feedback and guidance during the review. |
Humm, this feature doesn't work with the C implementation. |
What do you mean? |
PR 5114 implements this when serializing in C into memory.
Large bytes and strings were written inside a frame when serialized by dumps().
I'll create a separate PR for the memoryview issue after merging PR 5114. |