pickle.dump allocates unnecessary temporary bytes / str #76174
Comments
I noticed that both pickle.Pickler (C version) and pickle._Pickler (Python version) make unnecessary memory copies when dumping large str, bytes and bytearray objects. This is caused by the unnecessary concatenation of the opcode and size header with the large bytes payload prior to calling self.write. For protocol 4, an additional copy is caused by the framing mechanism. I will submit a pull request to fix the issue for the Python version. I am not sure how to test this properly: the BigmemPickleTests seem to be skipped on my 16 GB laptop. |
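The copy pattern being described can be sketched as follows. This is a simplified illustration, not the actual pickle source; the helper names are hypothetical, though the `B` opcode with a 4-byte little-endian size does correspond to pickle's BINBYTES encoding:

```python
# Simplified sketch of the copy issue (hypothetical helpers, not pickle's code).
# `write` stands for self.write on the underlying file object.

def save_bytes_copying(write, payload: bytes) -> None:
    # Concatenating the opcode/size header with a large payload allocates
    # a temporary bytes object roughly the size of the payload itself.
    header = b"B" + len(payload).to_bytes(4, "little")
    write(header + payload)  # extra copy happens here

def save_bytes_no_copy(write, payload: bytes) -> None:
    # Writing the header and payload separately avoids the temporary.
    header = b"B" + len(payload).to_bytes(4, "little")
    write(header)
    write(payload)  # no concatenation, no copy
```

Both variants produce byte-identical output; only the intermediate allocation differs.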
You don't need to add a test for a performance enhancement. |
Of course, +1 for fixing this. |
As for the C pickler, currently it dumps the whole pickle into an internal buffer before calling write() at the end. You may want to make writing more incremental. See Modules/_pickler.c (especially _Pickler_Write()). |
Would be nice to see benchmarks. And what about C version? |
I wrote a script to monitor the memory when dumping 2GB of data with python master (C pickler and Python pickler):
This is using protocol 4. Note that the C pickler is only making 1 useless memory copy instead of 2 for the Python pickler (one for the concatenation and the other because of the framing mechanism of protocol 4). Here is the output with the Python pickler fixed in #4353:
Basically the 2 spurious memory copies of the Python pickler with protocol 4 are gone. Here is the script: https://gist.github.com/ogrisel/0e7b3282c84ae4a581f3b9ec1d84b45a |
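Independently of the linked gist (which uses its own approach), a minimal peak-memory check can be sketched with only the standard library, assuming a Unix platform where the `resource` module is available:

```python
# Minimal sketch of a peak-RSS check around a pickle dump.
# Assumption: Unix-only `resource` module; the linked gist monitors
# memory differently (this is just an illustration).
import pickle
import resource

def dump_and_report(obj, path, protocol=4):
    """Pickle obj to path and return the process's peak RSS so far."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=protocol)
    # ru_maxrss is reported in kilobytes on Linux and bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
```

Comparing the returned value for the same object under two interpreter builds gives a rough view of the extra temporary allocations.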
But the total runtime is higher? (6 s. vs. 5 s.) Can you post the CPU time? (as measured by |
Note that the time difference is not significant. I reran the last command and got:
|
More benchmarks with the unix time command:
User time is always better in the PR than on master, but it is also much smaller than system time (disk access) in any case. System time is much less deterministic. |
So we're saving memory and CPU time. Cool! |
Actually the time varies too much between runs. 1.641s ... 8.475s ... 12.645s |
In my last comment, I also reported the user times (not spent in OS-level disk access): the code of the PR is on the order of 300-400 ms while master is around 800 ms or more. |
I'll try to write the C implementation. Maybe it will use a different heuristic. |
This speeds up pickling large bytes objects.

$ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*1000000 for i in range(256)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
Unpatched: 10 loops, best of 5: 20.7 msec per loop
Patched: 200 loops, best of 5: 1.12 msec per loop

But it slows down pickling short bytes objects longer than 256 bytes (up to 40%).

$ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*1000 for i in range(25600)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
Unpatched: 5 loops, best of 5: 77.8 msec per loop
Patched: 2 loops, best of 5: 98.5 msec per loop

$ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*256 for i in range(100000)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
Unpatched: 1 loop, best of 5: 278 msec per loop
Patched: 1 loop, best of 5: 382 msec per loop

Compare with:

$ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*255 for i in range(100000)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
Unpatched: 1 loop, best of 5: 277 msec per loop
Patched: 1 loop, best of 5: 273 msec per loop

I think the code should be optimized to decrease the overhead of _write_many(). |
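The buffering-versus-bypass trade-off under discussion can be sketched as a size-threshold heuristic: small writes accumulate in an in-memory frame buffer, while large payloads flush the buffer and go straight to the file. This is an illustrative sketch with hypothetical names, not the code of the PR:

```python
# Hypothetical sketch of the heuristic discussed in this thread:
# buffer small writes into a frame, bypass the buffer for large payloads.
import io

FRAME_SIZE_TARGET = 64 * 1024  # 64 kB, matching pickle's frame size target

class BufferedFramer:
    def __init__(self, file_write):
        self.file_write = file_write  # write method of the output file
        self.buffer = io.BytesIO()

    def write(self, data: bytes) -> None:
        # Small chunks accumulate in the frame buffer; once the target
        # frame size is reached, the frame is flushed to the file.
        self.buffer.write(data)
        if self.buffer.tell() >= FRAME_SIZE_TARGET:
            self.flush()

    def write_large(self, header: bytes, payload: bytes) -> None:
        # Large payloads: flush pending data, then write header and
        # payload directly -- no concatenation, no temporary copy.
        self.flush()
        self.file_write(header)
        self.file_write(payload)

    def flush(self) -> None:
        data = self.buffer.getvalue()
        if data:
            self.file_write(data)
        self.buffer = io.BytesIO()
```

The per-call overhead that the benchmarks above expose lives in `write` and `write_large`: for many small objects the threshold test runs on every write, which is why shaving that path matters.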
I have pushed a new version of the code that now has a 10% overhead for small bytes (instead of 40% previously). It could be possible to optimize further but I think that would render the code much less readable so I would be tempted to keep it this way. Please let me know what you think. |
Actually, I think this can still be improved while keeping it readable. Let me try again :) |
Alright, the last version has now ~4% overhead for small bytes. |
Nice! I have got virtually the same code as your intermediate variant, but your final variant is even better! |
BTW, I am looking at the C implementation at the moment. I think I can do it. |
I have tried to implement the direct write bypass for the C version of the pickler, but I get a segfault in a Py_INCREF on obj during the call to memo_put(self, obj) after the call to _Pickler_write_large_bytes. Here is the diff of my current version of the patch: I am new to the Python C API, so I would appreciate some help on this one. |
Alright, I found the source of my refcounting bug. I updated the PR to include the C version of the dump for PyBytes. I ran Serhiy's microbenchmarks on the C version and I could not detect any overhead on small bytes objects, while I get a ~20x speedup (and no memory copy) on large bytes objects, as expected. I would like to update the … Also, I would appreciate any feedback on the code style or things that could be improved in my PR. |
You should pass it as a memoryview instead: |
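The actual snippet suggested here was elided in the migration. As a general illustration of the point (not the reviewed code), slicing a `memoryview` shares the underlying buffer, whereas slicing a `bytes` object allocates a copy, so chunked writes of a large payload stay copy-free:

```python
# Why memoryview helps: slices of a memoryview share the payload's
# memory, while bytes slices would each allocate a fresh copy.
import io

payload = b"x" * (1 << 20)  # 1 MiB of data
view = memoryview(payload)

buf = io.BytesIO()
chunk_size = 64 * 1024
for start in range(0, len(view), chunk_size):
    buf.write(view[start:start + chunk_size])  # each slice shares memory

assert buf.getvalue() == payload
assert view[:10].obj is payload  # slices still reference the original buffer
```

The same idea applies when handing a large payload to a file object's write method: passing a view avoids materializing an intermediate bytes object.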
While we are here, wouldn't it be worth flushing the buffer in the C implementation to the disk every time a frame is committed? This would save memory when dumping a lot of small objects. |
Thanks Antoine, I updated my code to what you suggested. |
I think it's a good idea. The C pickler would behave more like the Python pickler. I think framing was intended this way initially. Antoine what do you think? |
Framing was originally intended to improve unpickling (since you don't have to issue lots of tiny file reads anymore). No objection to also improve pickling, though :-) |
Flushing the buffer at each frame commit will cause a medium-sized write every 64 kB on average (instead of one big write at the end). So that might actually cause a performance regression for some users if the individual file-object writes induce significant overhead. In practice, though, latency-inducing file objects like filesystem-backed ones are likely to derive from the BufferedWriter base class, and the only latency we should really care about is the one induced by the write call overhead itself, in which case the 64 kB frame / buffer size should be enough. |
Agreed. We shouldn't issue very small writes, but 64 kB is generally considered a reasonable buffer size for many kinds of I/O. Besides, it wouldn't be difficult to make the target frame size configurable if a use case arose for it, but I don't think we've ever had such a request. |
Shall we close this issue now that the PR has been merged to master? |
Definitely! Thank you for your contribution :-) |
Thanks for the very helpful feedback and guidance during the review. |
Humm, this feature doesn't work with the C implementation. |
What do you mean? |
PR 5114 implements this when serializing in C into memory.
Large bytes and strings were written inside a frame when serialized by dumps().
I'll create a separate PR for the memoryview issue after merging PR 5114. |