pickle.dump allocates unnecessary temporary bytes / str #76174

Closed
ogrisel mannequin opened this issue Nov 9, 2017 · 39 comments
Labels: 3.7 (EOL), performance, stdlib

Comments

ogrisel mannequin commented Nov 9, 2017

BPO 31993
Nosy @pitrou, @serhiy-storchaka, @ogrisel
PRs
  • bpo-31993: do not allocate large temporary buffers in pickle dump #4353
  • bpo-31993: Do not create frames for large bytes and str objects #5114
  • bpo-31993: Do not use memoryview when pickle large strings. #5154
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2018-01-12.22:29:53.774>
    created_at = <Date 2017-11-09.18:11:50.814>
    labels = ['3.7', 'library', 'performance']
    title = 'pickle.dump allocates unnecessary temporary bytes / str'
    updated_at = <Date 2018-01-12.22:29:53.773>
    user = 'https://github.com/ogrisel'

    bugs.python.org fields:

    activity = <Date 2018-01-12.22:29:53.773>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-01-12.22:29:53.774>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2017-11-09.18:11:50.814>
    creator = 'Olivier.Grisel'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 31993
    keywords = ['patch']
    message_count = 39.0
    messages = ['305975', '305976', '305977', '305978', '305979', '305990', '305991', '305992', '305993', '305994', '306024', '306025', '306026', '306029', '306031', '306032', '306033', '306035', '306042', '306062', '306088', '306092', '306111', '306112', '306116', '306117', '306121', '306122', '309551', '309558', '309559', '309560', '309565', '309570', '309571', '309572', '309670', '309801', '309872']
    nosy_count = 3.0
    nosy_names = ['pitrou', 'serhiy.storchaka', 'Olivier.Grisel']
    pr_nums = ['4353', '5114', '5154']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue31993'
    versions = ['Python 3.7']


    ogrisel mannequin commented Nov 9, 2017

    I noticed that both pickle.Pickler (C version) and pickle._Pickler (Python version) make unnecessary memory copies when dumping large str, bytes and bytearray objects.

    This is caused by unnecessary concatenation of the opcode and size header with the large bytes payload prior to calling self.write.

    For protocol 4, an additional copy is caused by the framing mechanism.

I will submit a pull request to fix the issue for the Python version. I am not sure how to test this properly, though: the BigmemPickleTests seem to be skipped on my 16 GB laptop.
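A minimal sketch of the two write patterns, simplified from the bytes path in Lib/pickle.py (the real method handles several size cases and the memo):

    from struct import pack

    BINBYTES = b'B'  # pickle opcode: bytes payload with a 4-byte length prefix

    def save_bytes_copying(write, obj):
        # Current behaviour (simplified): the concatenation allocates a
        # temporary of len(obj) + 5 bytes before write() ever sees it.
        write(BINBYTES + pack("<I", len(obj)) + obj)

    def save_bytes_no_copy(write, obj):
        # Proposed behaviour: only the 5-byte header is a new allocation;
        # the payload is handed to the file object as-is.
        write(BINBYTES + pack("<I", len(obj)))
        write(obj)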

ogrisel mannequin added the 3.8, performance, 3.7 (EOL), and stdlib labels Nov 9, 2017

    pitrou commented Nov 9, 2017

    You don't need to add a test for a performance enhancement.

pitrou added the performance label and removed the 3.8, performance labels Nov 9, 2017

    pitrou commented Nov 9, 2017

    Of course, +1 for fixing this.


    pitrou commented Nov 9, 2017

    As for the C pickler, currently it dumps the whole pickle into an internal buffer before calling write() at the end. You may want to make writing more incremental. See Modules/_pickler.c (especially _Pickler_Write()).

serhiy-storchaka commented:

It would be nice to see benchmarks.

And what about the C version?


    ogrisel mannequin commented Nov 9, 2017

I wrote a script to monitor memory usage when dumping 2 GB of data with python master (C pickler and Python pickler):

    (py37) ogrisel@ici:~/code/cpython$ python ~/tmp/large_pickle_dump.py
    Allocating source data...
    => peak memory usage: 2.014 GB
    Dumping to disk...
    done in 5.141s
    => peak memory usage: 4.014 GB
    (py37) ogrisel@ici:~/code/cpython$ python ~/tmp/large_pickle_dump.py --use-pypickle
    Allocating source data...
    => peak memory usage: 2.014 GB
    Dumping to disk...
    done in 5.046s
    => peak memory usage: 5.955 GB
    

This is using protocol 4. Note that the C pickler makes only one unnecessary memory copy, versus two for the Python pickler (one for the concatenation and the other because of the framing mechanism of protocol 4).

    Here the output with the Python pickler fixed in #4353:

    (py37) ogrisel@ici:~/code/cpython$ python ~/tmp/large_pickle_dump.py --use-pypickle
    Allocating source data...
    => peak memory usage: 2.014 GB
    Dumping to disk...
    done in 6.138s
    => peak memory usage: 2.014 GB
    

    Basically the 2 spurious memory copies of the Python pickler with protocol 4 are gone.

    Here is the script: https://gist.github.com/ogrisel/0e7b3282c84ae4a581f3b9ec1d84b45a
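A rough reconstruction of what such a script can look like; the linked gist is the authoritative version, and the third-party psutil dependency, file path, and 2 GB payload here are illustrative assumptions. It samples the process RSS from a background thread while dump() runs:

    import os
    import pickle
    import threading
    import time

    import psutil  # third-party dependency, used only for RSS sampling

    def peak_rss_during(func, interval=0.01):
        """Run func() while a background thread samples this process's RSS."""
        proc = psutil.Process(os.getpid())
        done = threading.Event()
        samples = [proc.memory_info().rss]

        def sampler():
            while not done.is_set():
                samples.append(proc.memory_info().rss)
                time.sleep(interval)

        t = threading.Thread(target=sampler)
        t.start()
        try:
            func()
        finally:
            done.set()
            t.join()
        return max(samples)

    print("Allocating source data...")
    data = b"\x00" * (2 * 1024 ** 3)  # 2 GB payload

    print("Dumping to disk...")
    with open("/tmp/large.pkl", "wb") as f:
        tic = time.time()
        peak = peak_rss_during(lambda: pickle.dump(data, f, protocol=4))
        print(f"done in {time.time() - tic:.3f}s")
    print(f"=> peak memory usage: {peak / 1024 ** 3:.3f} GB")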


    pitrou commented Nov 9, 2017

But the total runtime is higher? (6 s vs. 5 s) Can you post the CPU time (as measured by time, for example)?


    ogrisel mannequin commented Nov 9, 2017

Note that the time difference is not significant. I reran the last command and got:

    (py37) ogrisel@ici:~/code/cpython$ python ~/tmp/large_pickle_dump.py --use-pypickle
    Allocating source data...
    => peak memory usage: 2.014 GB
    Dumping to disk...
    done in 4.187s
    => peak memory usage: 2.014 GB
    


    ogrisel mannequin commented Nov 9, 2017

More benchmarks with the Unix time command:

    
    (py37) ogrisel@ici:~/code/cpython$ git checkout master
    Switched to branch 'master'
    Your branch is up-to-date with 'origin/master'.
    (py37) ogrisel@ici:~/code/cpython$ time python ~/tmp/large_pickle_dump.py --use-pypickle
    Allocating source data...
    => peak memory usage: 2.014 GB
    Dumping to disk...
    done in 10.677s
    => peak memory usage: 5.936 GB
    
    real	0m11.068s
    user	0m0.940s
    sys	0m5.204s
    (py37) ogrisel@ici:~/code/cpython$ time python ~/tmp/large_pickle_dump.py --use-pypickle
    Allocating source data...
    => peak memory usage: 2.014 GB
    Dumping to disk...
    done in 5.089s
    => peak memory usage: 5.978 GB
    
    real	0m5.367s
    user	0m0.840s
    sys	0m4.660s
    (py37) ogrisel@ici:~/code/cpython$ git checkout issue-31993-pypickle-dump-mem-optim 
    Switched to branch 'issue-31993-pypickle-dump-mem-optim'
    (py37) ogrisel@ici:~/code/cpython$ time python ~/tmp/large_pickle_dump.py --use-pypickle
    Allocating source data...
    => peak memory usage: 2.014 GB
    Dumping to disk...
    done in 6.974s
    => peak memory usage: 2.014 GB
    
    real	0m7.300s
    user	0m0.368s
    sys	0m4.640s
    (py37) ogrisel@ici:~/code/cpython$ time python ~/tmp/large_pickle_dump.py --use-pypickle
    Allocating source data...
    => peak memory usage: 2.014 GB
    Dumping to disk...
    done in 10.873s
    => peak memory usage: 2.014 GB
    
    real	0m11.178s
    user	0m0.324s
    sys	0m5.100s
    (py37) ogrisel@ici:~/code/cpython$ time python ~/tmp/large_pickle_dump.py --use-pypickle
    Allocating source data...
    => peak memory usage: 2.014 GB
    Dumping to disk...
    done in 4.233s
    => peak memory usage: 2.014 GB
    
    real	0m4.574s
    user	0m0.396s
    sys	0m4.368s
    

User time is consistently lower with the PR than on master, but it is also much smaller than system time (disk access) in every case, and system time is much less deterministic.


    pitrou commented Nov 9, 2017

    So we're saving memory and CPU time. Cool!

serhiy-storchaka commented:

Actually the timings vary too much between runs: 1.641s ... 8.475s ... 12.645s


    ogrisel mannequin commented Nov 10, 2017

In my last comment I also reported the user times (time not spent in OS-level disk access): the code of the PR is on the order of 300-400 ms, while master is around 800 ms or more.

serhiy-storchaka commented:

I'll try to write the C implementation. Maybe it will use a different heuristic.

serhiy-storchaka commented:

    This speeds up pickling large bytes objects.

    $ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*1000000 for i in range(256)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
    Unpatched:  10 loops, best of 5: 20.7 msec per loop
    Patched:    200 loops, best of 5: 1.12 msec per loop

But it slows down pickling of short bytes objects longer than 256 bytes (by up to 40%).

    $ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*1000 for i in range(25600)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
    Unpatched:  5 loops, best of 5: 77.8 msec per loop
    Patched:    2 loops, best of 5: 98.5 msec per loop
    
    $ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*256 for i in range(100000)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
    Unpatched:  1 loop, best of 5: 278 msec per loop
    Patched:    1 loop, best of 5: 382 msec per loop

    Compare with:

    $ ./python -m timeit -s 'import pickle; a = [bytes([i%256])*255 for i in range(100000)]' 'with open("/dev/null", "wb") as f: pickle._dump(a, f)'
    Unpatched:  1 loop, best of 5: 277 msec per loop
    Patched:    1 loop, best of 5: 273 msec per loop

I think the code should be optimized to decrease the overhead of _write_many().
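For intuition, the shape the dispatch eventually settled on (a sketch; the size cutoff and the write_large_bytes callable follow the code that was later merged, not the intermediate _write_many variant discussed here): only payloads at least as large as the frame target take the special path, so small objects keep the cheap single-write path.

    from struct import pack

    BINBYTES = b'B'                  # opcode: bytes with a 4-byte length prefix
    _FRAME_SIZE_TARGET = 64 * 1024   # same cutoff as the framing code

    def save_bytes(write, write_large_bytes, obj):
        # write: buffered write into the current frame.
        # write_large_bytes: direct write to the file, bypassing the frame.
        n = len(obj)
        header = BINBYTES + pack("<I", n)
        if n >= _FRAME_SIZE_TARGET:
            # Large payload: skip the frame buffer entirely -- no
            # concatenation copy and no frame-buffer copy.
            write_large_bytes(header, obj)
        else:
            # Small payload: one short concatenation plus one call is
            # cheaper than any extra bookkeeping.
            write(header + obj)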


    ogrisel mannequin commented Nov 10, 2017

    I have pushed a new version of the code that now has a 10% overhead for small bytes (instead of 40% previously).

It could be possible to optimize further, but I think that would render the code much less readable, so I would be tempted to keep it this way.

    Please let me know what you think.


    ogrisel mannequin commented Nov 10, 2017

    Actually, I think this can still be improved while keeping it readable. Let me try again :)


    ogrisel mannequin commented Nov 10, 2017

    Alright, the last version has now ~4% overhead for small bytes.

serhiy-storchaka commented:

Nice! I had virtually the same code as your intermediate variant, but your final variant is even better!


    ogrisel mannequin commented Nov 10, 2017

    BTW, I am looking at the C implementation at the moment. I think I can do it.


    ogrisel mannequin commented Nov 10, 2017

I have tried to implement the direct-write bypass for the C version of the pickler, but I get a segfault in a Py_INCREF on obj during the call to memo_put(self, obj), after the call to _Pickler_write_large_bytes.

    Here is the diff of my current version of the patch:

    ogrisel@4e093ad

    I am new to the Python C-API so I would appreciate some help on this one.


    ogrisel mannequin commented Nov 11, 2017

    Alright, I found the source of my refcounting bug. I updated the PR to include the C version of the dump for PyBytes.

I ran Serhiy's microbenchmarks on the C version: I could not detect any overhead on small bytes objects, while I got the expected ~20x speedup (and no memory copy) on large bytes objects.

I would like to update the write_utf8 function, but I would need to find a way to wrap const char* data as a PyBytes instance without making a memory copy, to be able to pass it to my _Pickler_write_large_bytes. I browsed the C-API documentation but I could not figure out how to do that.

    Also I would appreciate any feedback on the code style or things that could be improved in my PR.


    pitrou commented Nov 11, 2017

I would like to update the write_utf8 function, but I would need to find a way to wrap const char* data as a PyBytes instance without making a memory copy, to be able to pass it to my _Pickler_write_large_bytes.

    You should pass it as a memoryview instead:
    https://docs.python.org/3/c-api/memoryview.html#c.PyMemoryView_FromMemory
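For intuition at the Python level (the C-API call above is the C counterpart of this idea): a memoryview wraps an existing buffer without copying, and file-like write() methods accept it directly, whereas constructing a bytes object from the same memory duplicates it.

    import io

    data = bytearray(100 * 1024 * 1024)   # stand-in for a large utf-8 payload

    view = memoryview(data)               # zero-copy wrapper over the buffer
    chunk = view[:64 * 1024]              # slicing a memoryview copies nothing

    copied = bytes(data)                  # by contrast, this duplicates 100 MB

    f = io.BytesIO()
    f.write(view)                         # write() accepts any buffer object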

serhiy-storchaka commented:

While we are here, wouldn't it be worth always flushing the buffer to disk in the C implementation after committing a frame? This would save memory when dumping a lot of small objects.


    ogrisel mannequin commented Nov 12, 2017

    Thanks Antoine, I updated my code to what you suggested.


    ogrisel mannequin commented Nov 12, 2017

While we are here, wouldn't it be worth always flushing the buffer to disk in the C implementation after committing a frame? This would save memory when dumping a lot of small objects.

I think it's a good idea. The C pickler would behave more like the Python pickler; I think framing was intended this way initially. Antoine, what do you think?


    pitrou commented Nov 12, 2017

    Framing was originally intended to improve unpickling (since you don't have to issue lots of tiny file reads anymore). No objection to also improve pickling, though :-)


    ogrisel mannequin commented Nov 12, 2017

Flushing the buffer at each frame commit will cause a medium-sized write every 64 kB on average (instead of one big write at the end), so it might actually cause a performance regression for some users if the individual file-object writes carry significant overhead.

In practice, though, latency-inducing file objects such as filesystem-backed ones are likely to derive from the BufferedWriter base class, and the only latency we should really care about is the overhead of the write call itself, in which case the 64 kB frame / buffer size should be enough.


    pitrou commented Nov 12, 2017

    Agreed. We shouldn't issue very small writes, but 64 kB is generally considered a reasonable buffer size for many kinds of I/O.

    Besides, it wouldn't be difficult to make the target frame size configurable if a use case arose for it, but I don't think we've ever had such a request.
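A trimmed-down sketch of where this lands, close in spirit to the _Framer helper in Lib/pickle.py but simplified here: small opcodes accumulate in a frame buffer that is flushed once it reaches the 64 kB target, and large payloads commit the pending frame and bypass it entirely.

    import io
    from struct import pack

    FRAME = b'\x95'                  # pickle FRAME opcode (protocol 4)
    _FRAME_SIZE_TARGET = 64 * 1024

    class Framer:
        def __init__(self, file_write):
            self.file_write = file_write
            self.current_frame = io.BytesIO()

        def write(self, data):
            # Buffered path: opcodes accumulate in the current frame.
            return self.current_frame.write(data)

        def commit_frame(self, force=False):
            frame = self.current_frame
            if frame.tell() >= _FRAME_SIZE_TARGET or force:
                data = frame.getbuffer()   # zero-copy view of the buffer
                self.file_write(FRAME + pack("<Q", len(data)))
                self.file_write(data)
                self.current_frame = io.BytesIO()

        def write_large_bytes(self, header, payload):
            # Commit any pending frame, then write the big payload directly
            # so it never lands in a temporary frame buffer.
            if self.current_frame.tell():
                self.commit_frame(force=True)
            self.file_write(header)
            self.file_write(payload)

The pickler would call commit_frame() at each opcode boundary; with force=False it only flushes once the buffer reaches the target size, which is what bounds the size of each write.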

serhiy-storchaka commented:

    New changeset 3cd7c6e by Serhiy Storchaka (Olivier Grisel) in branch 'master':
bpo-31993: Do not allocate large temporary buffers in pickle dump. (#4353)
    3cd7c6e


    ogrisel mannequin commented Jan 6, 2018

    Shall we close this issue now that the PR has been merged to master?


    pitrou commented Jan 6, 2018

    Definitely! Thank you for your contribution :-)

pitrou closed this as completed Jan 6, 2018

    ogrisel mannequin commented Jan 6, 2018

    Thanks for the very helpful feedback and guidance during the review.

serhiy-storchaka commented:

Hmm, this feature doesn't work with the C implementation.


    pitrou commented Jan 6, 2018

Hmm, this feature doesn't work with the C implementation.

    What do you mean?

serhiy-storchaka commented:

PR 5114 implements this when serializing in C into memory.

serhiy-storchaka commented:

    What do you mean?

Large bytes and strings were written inside a frame when serialized by dumps(). dump() (as well as the Python implementations of dumps() and dump()) writes them outside of a frame.
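One way to observe the discrepancy being described, on an affected build (the payload size here is arbitrary): disassemble both outputs and compare where the FRAME opcode sits relative to the large BINBYTES payload.

    import io
    import pickle
    import pickletools

    big = b"x" * (2 * 1024 * 1024)   # well above the 64 kB frame target

    # dumps(): the C implementation serializing into memory.
    pickletools.dis(pickle.dumps(big, protocol=4))

    # dump(): serializing to a file object.
    buf = io.BytesIO()
    pickle.dump(big, buf, protocol=4)
    pickletools.dis(buf.getvalue())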

serhiy-storchaka commented:

I'll create a separate PR for the memoryview issue after merging PR 5114.

serhiy-storchaka commented:

    New changeset 0a2da50 by Serhiy Storchaka in branch 'master':
bpo-31993: Do not create frames for large bytes and str objects (#5114)
    0a2da50

serhiy-storchaka commented:

    New changeset 5b76bdb by Serhiy Storchaka in branch 'master':
bpo-31993: Do not use memoryview when pickle large strings. (#5154)
    5b76bdb

ezio-melotti transferred this issue from another repository Apr 10, 2022