
Unpickling is really slow #48123

Closed
hagen mannequin opened this issue Sep 15, 2008 · 38 comments
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@hagen
Mannequin

hagen mannequin commented Sep 15, 2008

BPO 3873
Nosy @jcea, @amauryfa, @abalkin, @pitrou, @vstinner, @avassalotti
Files
  • pickletst.py
  • pickle_read_buffer.patch
  • pickletest.py
  • unpickleprefetch.patch
  • unpickleprefetch2.patch
  • bench_pickle.py
  • unpickleprefetch3.patch
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2010-10-12.20:52:02.056>
    created_at = <Date 2008-09-15.16:37:16.466>
    labels = ['library', 'performance']
    title = 'Unpickling is really slow'
    updated_at = <Date 2011-03-19.03:00:12.650>
    user = 'https://bugs.python.org/hagen'

    bugs.python.org fields:

    activity = <Date 2011-03-19.03:00:12.650>
    actor = 'jcea'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-10-12.20:52:02.056>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2008-09-15.16:37:16.466>
    creator = 'hagen'
    dependencies = []
    files = ['12785', '13632', '13633', '18984', '19000', '19033', '19034']
    hgrepos = []
    issue_num = 3873
    keywords = ['patch']
    message_count = 38.0
    messages = ['73267', '73269', '73271', '73273', '73275', '74487', '76988', '80093', '80097', '80101', '80103', '80104', '85458', '85459', '85462', '85636', '85637', '85638', '85639', '85640', '85780', '86047', '110687', '111867', '111870', '111871', '111892', '111964', '117239', '117241', '117320', '117323', '117471', '117472', '117480', '117481', '117483', '118472']
    nosy_count = 9.0
    nosy_names = ['collinwinter', 'jcea', 'amaury.forgeotdarc', 'belopolsky', 'pitrou', 'vstinner', 'alexandre.vassalotti', 'hagen', 'BreamoreBoy']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue3873'
    versions = ['Python 3.2']

    @hagen
    Mannequin Author

    hagen mannequin commented Sep 15, 2008

    Unpickling e.g. a large list seems to be really slow in Python 3.0.

    The attached test script gives the following results for pickling and
    unpickling a list of 1M zeros, showing that although the C
    implementation seems to be used in Python 3.0, unpickling is even slower
    than with the "pickle" module of Python 2.6!

    [hagenf@cluster-06 hagenf]$ python pickletst.py
    2.71067500114
    2.77484893799
    [hagenf@cluster-06 hagenf]$ python3.0 pickletst.py
    0.0925059318542
    5.76121616364
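
    A minimal reconstruction of the attached pickletst.py, which is not reproduced in the migrated issue (the timing approach and the "tst" file name are assumptions based on the description and later comments; on 2.6 one would import cPickle instead):

    ```python
    import pickle
    import time

    data = [0] * 1000000          # a list of 1M zeros, as described above

    start = time.time()
    with open("tst", "wb") as f:
        pickle.dump(data, f)      # time the dump...
    print(time.time() - start)

    start = time.time()
    with open("tst", "rb") as f:
        pickle.load(f)            # ...and the load
    print(time.time() - start)
    ```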

    @hagen hagen mannequin added stdlib Python modules in the Lib dir performance Performance or resource usage labels Sep 15, 2008
    @pitrou
    Member

    pitrou commented Sep 15, 2008

    Do the numbers vary if you read the whole file at once and then unpickle
    the resulting bytes string?
    Large parts of the IO library are written in Python in 3.0, which might
    explain the discrepancy.
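
    Spelled out, the suggested experiment looks like this (the "tst" file name follows the test script):

    ```python
    import pickle

    # read the whole file into memory first, then unpickle from the bytes
    # object; this bypasses the partly-Python IO stack of 3.0
    with open("tst", "rb") as f:
        data = f.read()
    obj = pickle.loads(data)
    ```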

    @hagen
    Mannequin Author

    hagen mannequin commented Sep 15, 2008

    Yes, it gets much better, but even so (first reading file and timing
    only "loads") unpickling takes four times as long in Python 3.0 as with
    the old cPickle module:

    [hagenf@cluster-06 hagenf]$ python pickletst2.py
    0.0744678974152
    0.0514161586761
    [hagenf@cluster-06 hagenf]$ python3.0 pickletst3.py
    0.0911619663239
    0.208593845367

    But I guess this can still be blamed on the BytesIO implementation...

    @amauryfa
    Member

    Indeed. If I replace the file with
    f = io.BytesIO(open("tst", "rb").read())
    timings are divided by 20...

    After quick profiling, it seems that PyLong_New would benefit from a
    free list. len(bytearray) is called very often.
    To stay simple, it would be enough to only store longs of length 1 (<2**15).
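
    For illustration, the free-list pattern being proposed looks roughly like this, sketched in Python (the real change would be C code in Objects/longobject.c; the names and the cap below are hypothetical):

    ```python
    # recycle freed objects from a small pool instead of hitting the
    # allocator on every PyLong_New; CPython already does this for floats
    _free_list = []
    _MAX_FREE = 100                # hypothetical cap on the pool size

    def alloc_small_long():
        # reuse a pooled object when available, allocate otherwise
        return _free_list.pop() if _free_list else bytearray(2)

    def free_small_long(obj):
        if len(_free_list) < _MAX_FREE:
            _free_list.append(obj)   # keep it around for reuse
    ```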

    @pitrou
    Member

    pitrou commented Sep 15, 2008

    Gregory had patches for a freelist of long objects in bpo-2013.

    @avassalotti
    Member

    The solution is to add a read buffer to Unpickler (Pickler already has
    a write buffer, so that's why it is unaffected). I believe this would
    mitigate much of the (quite large) Python function call overhead.

    cPickle has a performance hack that makes it use cStringIO and PyFile
    directly (via C function calls). In Python 3, the hack was removed
    since cStringIO.h and fileobject.h aren't public anymore. This explains
    the 0.02s deviation you are getting for dump speed.

    @hagen
    Mannequin Author

    hagen mannequin commented Dec 5, 2008

    I think a read buffer is not possible without being able to unread bytes
    from the stream. pickle shouldn't consume bytes after the end of a
    pickled object!
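
    The constraint is easy to demonstrate: several pickles can be stored back-to-back in one stream, and each load() must stop exactly at its own STOP opcode:

    ```python
    import io
    import pickle

    buf = io.BytesIO()
    pickle.dump([1, 2, 3], buf)    # first pickle
    pickle.dump("second", buf)     # second pickle, immediately after
    buf.seek(0)

    first = pickle.load(buf)       # must leave the position right after pickle #1
    second = pickle.load(buf)      # would break if load() had consumed extra bytes
    ```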

    @pitrou
    Member

    pitrou commented Jan 18, 2009

    Making this a duplicate of bpo-4565 ("Rewrite the IO stack in C").
    If anyone disagrees, please reopen!

    @pitrou pitrou closed this as completed Jan 18, 2009
    @hagen
    Mannequin Author

    hagen mannequin commented Jan 18, 2009

    With the io-c branch I see much better unpickling performance than
    before. But it still seems to be around 2 or 3 times slower than with
    cPickle in 2.6.

    Is this expected at this point of io-c development? Otherwise perhaps
    this issue should stay open until it can be verified that nothing else
    can be done to get closer to the old cPickle performance.

    @pitrou
    Member

    pitrou commented Jan 18, 2009

    Hello,

    With the io-c branch I see much better unpickling performance than
    before. But it still seems to be around 2 or 3 times slower than with
    cPickle in 2.6.

    It's much closer here.

    With 2.7 (trunk) and cPickle:
    0.439934968948
    0.669679880142
    With the io-c branch:
    0.136367082596
    0.798221111298

    Is this expected at this point of io-c development?

    20-30% slower is expected for binary I/O (which is what I get above),
    not 2x or 3x slower.
    What is your system?

    @hagen
    Mannequin Author

    hagen mannequin commented Jan 18, 2009

    I uploaded a new pickletst.py which specifies protocol 2, otherwise
    we're comparing apples with oranges. With this I get:

    0.211881160736
    0.322369813919

    for Python 2.6 and

    0.158488035202
    1.21621990204

    on the io-c branch. Can you confirm this?

    @pitrou
    Member

    pitrou commented Jan 18, 2009

    Nice catch! I can confirm your figures with protocol=2 (and protocol=-1
    as well).

    @pitrou pitrou reopened this Jan 18, 2009
    @vstinner
    Member

    vstinner commented Apr 5, 2009

    alexandre.vassalotti wrote:

    The solution is to add a read buffer to Unpickler (...)
    would mitigate much of the (quite large) Python function
    call overhead. (...) cPickle has a performance hack to make it
    uses cStringIO and PyFile directly (via C function calls). In
    Python 3, the hack was removed (...)

    Yes, unpickler_read() calls Buffered_read() through
    PyObject_Call+PyCFunction_Call, which is expensive. And the unpickler's
    main loop starts with unpickler_read(self, &s, 1): it reads just *one*
    byte (the "opcode"). Since each Buffered_read() call is expensive, a
    solution is to avoid calling it (e.g. read more bytes at once and...
    seek backward at the end?).
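
    Schematically, the main loop in question (a sketch, not the actual _pickle.c code):

    ```python
    # each iteration pays a full read() call for a single opcode byte
    def load(read, dispatch):
        while True:
            opcode = read(1)       # unpickler_read(self, &s, 1)
            if opcode == b".":     # STOP opcode: the pickle is complete
                break
            dispatch[opcode]()     # run the handler for this opcode
    ```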

    @vstinner
    Member

    vstinner commented Apr 5, 2009

    Unladen Swallow has a project to optimize pickle. Currently, it uses 3
    benchmarks:

    • pickle — use the cPickle module to pickle a variety of datasets.
    • pickle_dict — microbenchmark; use the cPickle module to pickle a lot of dicts.
    • pickle_list — microbenchmark; use the cPickle module to pickle a lot of lists.

    @vstinner
    Member

    vstinner commented Apr 5, 2009

    gprof (--enable-profiler) results:

    Each sample counts as 0.01 seconds.

      %   cumulative   self              self     total
     time   seconds   seconds     calls  ms/call  ms/call  name
    18.18      0.16      0.16   2011055     0.00     0.00  vgetargs1
     7.95      0.23      0.07   2017033     0.00     0.00  PyObject_GenericGetAttr
     6.82      0.29      0.06   2002006     0.00     0.00  _BufferedReader_read_unlocked
     5.68      0.34      0.05   2012123     0.00     0.00  convertsimple
     4.55      0.38      0.04   2013284     0.00     0.00  PyCFunction_Call
     4.55      0.42      0.04   2004543     0.00     0.00  PyLong_AsSsize_t
     3.41      0.45      0.03   2004565     0.00     0.00  PyNumber_Index
     3.41      0.48      0.03   2002177     0.00     0.00  float_argument_error
     3.41      0.51      0.03   1000497     0.00     0.00  PyLong_AsLongAndOverflow
     2.27      0.53      0.02   2012123     0.00     0.00  convertitem

    It looks like PyArg_ParseTuple() should be optimized ;-)

    @vstinner
    Member

    vstinner commented Apr 6, 2009

    Create a read buffer (4096 bytes) in the unpickler class. Using
    [0]*10**7 or [1000]*10**7, load() is 6 to 8 times faster.

    I removed last_string attribute because it wasn't used.

    If there are "tail" bytes, seek backward.
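
    A pure-Python sketch of that strategy (the class and method names are made up for illustration): prefetch in 4096-byte chunks, then seek backward over whatever was prefetched but never consumed:

    ```python
    import os

    class PrefetchingReader:
        def __init__(self, file, chunk=4096):
            self.file = file
            self.chunk = chunk
            self.buf = b""
            self.pos = 0

        def read(self, n):
            # refill the buffer in large chunks until n bytes are available
            while len(self.buf) - self.pos < n:
                data = self.file.read(self.chunk)
                if not data:
                    break
                self.buf += data
            result = self.buf[self.pos:self.pos + n]
            self.pos += len(result)
            return result

        def rewind_tail(self):
            # give back the "tail": prefetched bytes that were never consumed
            self.file.seek(self.pos - len(self.buf), os.SEEK_CUR)
    ```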

    @vstinner
    Member

    vstinner commented Apr 6, 2009

    I don't know why, but python-trunk is *much* slower than py3k (e.g.
    with dump: 1000 ms vs 24 ms for py3k, or with load: 1500 ms vs 186 ms).

    @vstinner
    Member

    vstinner commented Apr 6, 2009

    My version of pickletest.py:

    • make sure that the file position is correct after load()
    • some benchmarks; the most interesting numbers:

    Without the patch:

    version | data       | dump ms | load ms
    py3k    | 0, 10^6    |     230 |    1500
    py3k    | 1000, 10^6 |     255 |    1780
    py3k    | 0, 10^7    |    2360 |   16044
    py3k    | 1000, 10^7 |    2615 |   19380

    With the patch:

    version | data       | dump ms | load ms | speed-up
    py3k    | 0, 10^6    |     237 |     183 | x8
    py3k    | 1000, 10^6 |     241 |     248 | x7
    py3k    | 0, 10^7    |    2420 |    1860 | x8
    py3k    | 1000, 10^7 |    2850 |    3100 | x6

    (data: 0, 10^6 means [0]*10^6)

    @vstinner
    Member

    vstinner commented Apr 6, 2009

    Note about my patch: the buffer should be truncated after
    PyBytes_Concat(&self->buffer.pybytes, data) to avoid an ever-growing
    buffer. Something like: self->buffer.pybytes += data;
    self->buffer.pybytes = self->buffer.pybytes[index:];
    self->buffer.index = 0;

    @pitrou
    Member

    pitrou commented Apr 6, 2009

    I don't know why, but python-trunk is *much* slower than py3k (e.g.
    with dump: 1000 ms vs 24 ms for py3k, or with load: 1500 ms vs 186 ms).

    Perhaps you tried with the pure Python version (pickle) rather than the
    C one (cPickle)?

    @avassalotti
    Member

    Victor, Unpickler shouldn't raise an error if the given file object
    does not support seek(); it should gracefully fall back to using only
    read() and readline(). Also, I think you could get a greater
    performance improvement by using peek() and avoiding backward seeks.
    As far as I know, a backward seek may invalidate the internal buffer of
    the file object, thus forcing a read from disk.
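
    For reference, BufferedReader already exposes this: peek() returns upcoming bytes without moving the position, so no backward seek is needed, and a getattr() check allows a graceful fallback:

    ```python
    f = open("tst", "rb")              # binary open() returns a BufferedReader

    peek = getattr(f, "peek", None)    # not every file-like object has peek()
    if peek is not None:
        head = peek(16)                # may return more or fewer bytes than asked
        assert f.tell() == 0           # the position has not moved
    else:
        pass                           # fall back to plain read()/readline()
    ```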

    Incidentally, I think it would be nice to add to Python 3 some kind of
    file protocol as this would provide a way to mitigate the function call
    overhead for I/O operations.

    @pitrou
    Member

    pitrou commented Apr 16, 2009

    By the way, the patch won't work with unseekable files, which is
    probably bad.

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Jul 18, 2010

    Has this slipped under the radar? One way or the other, I believe any performance issue should be resolved if at all possible.

    @vstinner
    Member

    New version of my patch:

    • add a "used" attribute to the UnpicklerBuffer structure: disable the read buffer for non-seekable files and for protocol 0 (at the first call to unpickle_readline)
    • check whether PyObject_GetAttrString(file, "seek") is NULL or not
    • unpickle_readline() also flushes the buffer
    • add a new test specific to the read buffer: ensure that the unpickler doesn't eat data at the end of the file

    test_pickle passes without any error.

    The read buffer is disabled at the first call to unpickle_readline(), because unpickle_readline() has to flush the buffer. It will be very difficult to optimize protocol 0, but I hope that nobody uses it nowadays.
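
    A rough Python equivalent of the seekability check (hypothetical name; the real code is C in _pickle.c):

    ```python
    def read_buffer_allowed(file):
        # buffering needs seek() so the unused tail can be given back;
        # without it, fall back to unbuffered read()/readline()
        return getattr(file, "seek", None) is not None
    ```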

    ===========

    Benchmark with [0]*10**6, Python compiled with pydebug.

    Without the patch
    -----------------

    Protocol 0:

    • dump: 598.0 ms
    • load (seekable=False): 3337.3 ms
    • load (seekable=True): 3309.6 ms

    Protocol 1:

    • dump: 217.8 ms
    • load (seekable=False): 864.2 ms
    • load (seekable=True): 873.3 ms

    Protocol 2:

    • dump: 226.5 ms
    • load (seekable=False): 867.8 ms
    • load (seekable=True): 854.6 ms

    With the patch
    --------------

    Protocol 0

    • dump: 615.5 ms
    • load (seekable=False): 3201.3 ms
    • load (seekable=True): 3223.4 ms

    Protocol 1

    • dump: 219.8 ms
    • load (seekable=False): 942.1 ms
    • load (seekable=True): 175.2 ms

    Protocol 2

    • dump: 221.1 ms
    • load (seekable=False): 943.9 ms
    • load (seekable=True): 175.5 ms

    @vstinner
    Member

    Same benchmark with Python 2.6.5+, so without the patch, but compiled with maximum compiler optimization (whereas pydebug means no optimization):

    Protocol 0

    • dump: 517.3 ms
    • load: 876.6 ms <= because of the new I/O library, python3 is much slower

    Protocol 1

    • dump: 141.8 ms
    • load: 255.2 ms <= with my patch, python3 is 1.5x faster

    Protocol 2

    • dump: 142.7 ms
    • load: 262.1 ms <= with my patch, python3 is 1.5x faster

    It would be better to redo all tests with the same compiler options, but the most important point is that Python3 is *faster* than Python2 with my patch ;-)

    @vstinner
    Member

    bench_pickle.py: script used to produce last benchmarks.

    @avassalotti
    Member

    Victor, have you tried using peek() instead of seek()? I mentioned this previously in msg85780.

    @vstinner
    Member

    Victor, have you tried using peek() instead of seek()?
    I mentioned this previously in msg85780.

    In a file encoded with protocol 0, backward seeks are needed for each
    call to unpickler_readline()... and this function is called to read a
    number, a boolean, etc. (i.e. most, or all, data values). I chose to
    disable the read buffer there because it's slower with it.

    For protocols 1 and 2, there is only *one* seek at the end (but a lot
    of reads: file size / 4096). So I don't think it really matters whether
    we use peek() or seek().

    seek() is more natural (for me) and more common than peek(), so I
    prefer to keep seek().
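
    The protocol difference is visible in the opcode streams: protocol 0 is line-oriented (hence one readline() per value), while protocol 2 is binary:

    ```python
    import pickle

    pickle.dumps(42, protocol=0)      # b'I42\n.'  -- newline-terminated text
    pickle.dumps(3.14, protocol=0)    # b'F3.14\n.'
    pickle.dumps(42, protocol=2)      # b'\x80\x02K*.'  -- fixed-size binary
    ```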

    @pitrou
    Member

    pitrou commented Sep 23, 2010

    Here is a fixed version of Victor's bench (didn't work on 2.x).

    @pitrou
    Member

    pitrou commented Sep 23, 2010

    And here is a new performance patch (Victor's patch was outdated because of heavy changes incorporated from Unladen Swallow). Results of bench_pickle.py are as follows:

    Python 2.7 (cPickle):

    Protocol 0

    • dump: 189.8 ms
    • load (seekable=False): 318.1 ms
    • load (seekable=True): 321.2 ms

    Protocol 1

    • dump: 58.2 ms
    • load (seekable=False): 89.5 ms
    • load (seekable=True): 89.4 ms

    Protocol 2

    • dump: 58.0 ms
    • load (seekable=False): 89.1 ms
    • load (seekable=True): 89.1 ms

    Python 3.2 (unpatched):

    Protocol 0

    • dump: 144.5 ms
    • load (seekable=False): 530.7 ms
    • load (seekable=True): 530.5 ms

    Protocol 1

    • dump: 27.8 ms
    • load (seekable=False): 280.4 ms
    • load (seekable=True): 279.9 ms

    Protocol 2

    • dump: 27.8 ms
    • load (seekable=False): 280.6 ms
    • load (seekable=True): 281.9 ms

    Python 3.2 + patch:

    Protocol 0

    • dump: 146.4 ms
    • load (seekable=False): 611.1 ms
    • load (seekable=True): 194.2 ms

    Protocol 1

    • dump: 27.3 ms
    • load (seekable=False): 254.7 ms
    • load (seekable=True): 31.2 ms

    Protocol 2

    • dump: 27.3 ms
    • load (seekable=False): 254.6 ms
    • load (seekable=True): 31.2 ms

    @avassalotti
    Member

    I get this error with the patch:

    python: /home/alex/src/python.org/py3k/Modules/_pickle.c:908: _Unpickler_ReadFromFile: Assertion `self->next_read_idx == 0' failed.
    Aborted

    @pitrou
    Member

    pitrou commented Sep 24, 2010

    Ah, thank you. I hadn't tested in debug mode and there was a wrong assert from the previous code.
    Here is a patch with the assert removed.

    @pitrou
    Member

    pitrou commented Sep 27, 2010

    One problem with the seek() approach is that some file-like objects have expensive seeks. One example is GzipFile, where seek(n) is O(n) (it first rewinds to the start of file, then reads n decompressed bytes). In the end, unpickling from a GzipFile becomes O(n**2).

    I will try to use peek() instead.
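
    The pathological case, for concreteness (the file name is hypothetical):

    ```python
    import gzip
    import pickle

    # GzipFile.seek() to an earlier offset rewinds to the start of the file
    # and re-decompresses up to the target, so each backward seek is O(n);
    # an unpickler that seeks back after every prefetch degrades to O(n**2)
    with gzip.open("data.pkl.gz", "rb") as f:
        obj = pickle.load(f)
    ```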

    @avassalotti
    Member

    Didn't Victor say that only one seek at the end is necessary per
    pickle? If this is the case, I don't think expensive seeks will be an
    issue.

    @pitrou
    Member

    pitrou commented Sep 27, 2010

    Didn't Victor say that only one seek at the end is necessary per
    pickle? If this is the case, I don't think expensive seeks will be an
    issue.

    If you are unpickling from a multi-megabyte gzip file and the seek at
    the end makes you uncompress all the gzip file again, it can certainly
    be a problem.

    Another issue with seeking only at the end is that it would make
    readline() quite a bit more complicated. And, apparently, readline() is
    not only used by protocol 0 but also by the binary protocols.

    @pitrou
    Member

    pitrou commented Sep 27, 2010

    Here is an updated bench_pickle which also makes the file unpeekable.

    @pitrou
    Member

    pitrou commented Sep 27, 2010

    Here is a patch using peek() rather than seek(). There are some inefficiencies around (such as using read() to skip the consumed prefetched bytes), but the benchmark results are still as good as with seek():

    Protocol 0

    • dump: 142.5 ms
    • load (seekable=False): 598.8 ms
    • load (seekable=True): 240.2 ms

    Protocol 1

    • dump: 28.9 ms
    • load (seekable=False): 253.2 ms
    • load (seekable=True): 25.7 ms

    Protocol 2

    • dump: 28.9 ms
    • load (seekable=False): 253.3 ms
    • load (seekable=True): 25.7 ms

    @pitrou
    Member

    pitrou commented Oct 12, 2010

    Patch committed in r85384.

    @pitrou pitrou closed this as completed Oct 12, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022