problem using multiprocessing with really big objects? #61760

Closed
mrjbq7 mannequin opened this issue Mar 27, 2013 · 27 comments
Labels: 3.8 (only security fixes), type-feature (A feature request or enhancement)

Comments

mrjbq7 mannequin commented Mar 27, 2013

BPO 17560
Nosy @rhettinger, @pitrou, @mrjbq7, @serhiy-storchaka, @ogrisel, @applio, @smsaladi
PRs
  • bpo-17560: Too small type for struct.pack/unpack in mutliprocessing.Connection #10305
Files
  • multi.py
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2018-11-06.19:39:00.301>
    created_at = <Date 2013-03-27.15:52:01.185>
    labels = ['type-feature', '3.8']
    title = 'problem using multiprocessing with really big objects?'
    updated_at = <Date 2018-11-06.19:39:00.299>
    user = 'https://github.com/mrjbq7'

    bugs.python.org fields:

    activity = <Date 2018-11-06.19:39:00.299>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-11-06.19:39:00.301>
    closer = 'pitrou'
    components = []
    creation = <Date 2013-03-27.15:52:01.185>
    creator = 'mrjbq7'
    dependencies = []
    files = ['29595']
    hgrepos = []
    issue_num = 17560
    keywords = ['patch']
    message_count = 27.0
    messages = ['185344', '185345', '185351', '185352', '185355', '185356', '185357', '185366', '185368', '185371', '185373', '185375', '185377', '185384', '195647', '195648', '195657', '195671', '289524', '289527', '289528', '289548', '299185', '299460', '299479', '325843', '329378']
    nosy_count = 12.0
    nosy_names = ['rhettinger', 'pitrou', 'mrjbq7', 'neologix', 'sbt', 'serhiy.storchaka', 'Olivier.Grisel', 'davin', 'artxyz', 'i3v', 'Daniel Liu', 'saladi']
    pr_nums = ['10305']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue17560'
    versions = ['Python 3.8']

    mrjbq7 mannequin commented Mar 27, 2013

    I ran into a problem using multiprocessing to create large data objects (in this case numpy float64 arrays with 90,000 columns and 5,000 rows) and return them to the original python process.

    It breaks in both Python 2.7 and 3.3, using numpy 1.7.0 (but with different error messages).

    Is it possible the array is too large to be serialized (450 million 64-bit numbers exceeds a 32-bit limit)?

    Python 2.7
    ==========

    Process PoolWorker-1:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/lib/python2.7/multiprocessing/pool.py", line 99, in worker
        put((job, i, result))
      File "/usr/lib/python2.7/multiprocessing/queues.py", line 390, in put
        return send(obj)
    SystemError: NULL result without error in PyObject_Call

    Python 3.3
    ==========

    Traceback (most recent call last):
      File "multi.py", line 18, in <module>
        results = pool.map_async(make_data, range(5)).get(9999999)
      File "/usr/lib/python3.3/multiprocessing/pool.py", line 562, in get
        raise self._value
    multiprocessing.pool.MaybeEncodingError: Error sending result: '[array([[ 0.74628629,  0.36130663, -0.65984794, ..., -0.70921838,
             0.34389663, -1.7135126 ],
           [ 0.60266867, -0.40652402, -1.31590562, ...,  1.44896246,
            -0.3922366 , -0.85012842],
           [ 0.59629641, -0.00623001, -0.12914128, ...,  0.99925511,
            -2.30418136,  1.73414009],
           ..., 
           [ 0.24246639,  0.87519509,  0.24109069, ..., -0.48870107,
            -0.20910332,  0.11749621],
           [ 0.62108937, -0.86217542, -0.47357384, ...,  1.59872243,
             0.76639995, -0.56711461],
           [ 0.90976471,  1.73566475, -0.18191821, ...,  0.19784432,
            -0.29741643, -1.46375835]])]'. Reason: 'error("'i' format requires -2147483648 <= number <= 2147483647",)'
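
    Hedged reconstruction of the attached multi.py (the attachment itself is not shown in the migrated issue): the array shape and the map_async call are taken from the report and the traceback above, the rest is guessed.

        import numpy as np
        from multiprocessing import Pool

        def make_data(_):
            # ~3.4 GiB of float64 data per array: 5,000 x 90,000 values, large
            # enough that the pickled payload overflows a signed 32-bit length.
            return np.random.randn(5000, 90000)

        if __name__ == "__main__":
            pool = Pool(4)
            results = pool.map_async(make_data, range(5)).get(9999999)
            print([r.shape for r in results])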

    pitrou commented Mar 27, 2013

    A multiprocessing queue currently uses a 32-bit signed int to encode object length (in bytes):

        def _send_bytes(self, buf):
            # For wire compatibility with 3.2 and lower
            n = len(buf)
            self._send(struct.pack("!i", n))
            # The condition is necessary to avoid "broken pipe" errors
            # when sending a 0-length buffer if the other end closed the pipe.
            if n > 0:
                self._send(buf)

    I *think* we need to keep compatibility with the wire format, but perhaps we could use a special length value (-1?) to introduce a longer (64-bit) length value.
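
    A hedged sketch of that escape-value idea (standalone helper functions, not the actual patch; write and read stand in for whatever low-level send/receive primitives the connection object uses):

        import struct

        def send_bytes(write, buf):
            n = len(buf)
            if n > 0x7fffffff:
                # -1 in the legacy signed 32-bit header announces that an
                # unsigned 64-bit length follows.
                write(struct.pack("!i", -1))
                write(struct.pack("!Q", n))
            else:
                # Old wire format: a single signed 32-bit length prefix.
                write(struct.pack("!i", n))
            if n > 0:
                # Avoid sending a 0-length buffer in case the other end
                # has already closed the pipe.
                write(buf)

        def recv_bytes(read):
            n, = struct.unpack("!i", read(4))
            if n == -1:
                n, = struct.unpack("!Q", read(8))
            return read(n) if n > 0 else b""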

    pitrou added the type-feature (A feature request or enhancement) label Mar 27, 2013
    sbt mannequin commented Mar 27, 2013

    I *think* we need to keep compatibility with the wire format, but perhaps
    we could use a special length value (-1?) to introduce a longer (64-bit)
    length value.

    Yes we could, although that would not help on Windows pipe connections (where byte oriented messages are used instead). Also, does pickle currently handle byte strings larger than 4GB?

    But I can't help feeling that multigigabyte arrays should be transferred using shared mmaps rather than serialization. numpy.frombuffer() could be used to recreate the array from the mmap.

    multiprocessing currently only allows sharing of such shared arrays using inheritance. Perhaps we need a picklable mmap type which can be sent over pipes and queues. (On Unix this would probably require fd passing.)
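
    A hedged sketch of the shared-mmap idea (assumes Linux and the default fork start method so the anonymous mapping is inherited by the worker; dimensions are scaled down from the report's 5,000 x 90,000 to keep the example small):

        import mmap
        import multiprocessing
        import numpy as np

        ROWS, COLS = 500, 900          # the report used 5,000 x 90,000
        NBYTES = ROWS * COLS * 8       # float64

        def fill(buf):
            # Reinterpret the shared buffer as an ndarray: no copy, no pickling.
            arr = np.frombuffer(buf, dtype=np.float64).reshape(ROWS, COLS)
            arr[:] = np.random.randn(ROWS, COLS)

        if __name__ == "__main__":
            # Anonymous MAP_SHARED mapping, inherited by the child across fork().
            buf = mmap.mmap(-1, NBYTES)
            p = multiprocessing.Process(target=fill, args=(buf,))
            p.start()
            p.join()
            result = np.frombuffer(buf, dtype=np.float64).reshape(ROWS, COLS)
            print(result[0, :3])       # data written by the child, never copied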

    mrjbq7 mannequin commented Mar 27, 2013

    On a machine with 256GB of RAM, it makes more sense to send arrays of this size than say on a laptop...

    sbt mannequin commented Mar 27, 2013

    On 27/03/2013 5:13pm, mrjbq7 wrote:

    On a machine with 256GB of RAM, it makes more sense to send arrays
    of this size than say on a laptop...

    I was thinking more of speed than memory consumption.

    neologix mannequin commented Mar 27, 2013

    Also, does pickle currently handle byte strings larger than 4GB?

    The 2.7 failure is indeed a pickle limitation, which should now be fixed by issue bpo-13555.

    On a machine with 256GB of RAM, it makes more sense to send arrays
    of this size than say on a laptop...

    Richard was saying that you shouldn't serialize such a large array; that's just a huge performance bottleneck. The right way would be to use shared memory.

    multiprocessing currently only allows sharing of such shared arrays
    using inheritance.

    You mean through fork() COW?

    Perhaps we need a picklable mmap type which can be sent over pipes
    and queues. (On Unix this would probably require fd passing.)

    If you use POSIX semaphores, you could pass the semaphore path and use sem_open in the other process (but that would mean you can't unlink it right after open).

    mrjbq7 mannequin commented Mar 27, 2013

    Richard was saying that you shouldn't serialize such a large array,
    that's just a huge performance bottleneck. The right way would be
    to use a shared memory.

    Gotcha, for clarification, my original use case was to *create* them
    in the other process (something which took some time since they were
    calculated and not just random as in the example) and return them to the
    original process for further computation...

    sbt mannequin commented Mar 27, 2013

    On 27/03/2013 5:47pm, Charles-François Natali wrote:

    > multiprocessing currently only allows sharing of such shared arrays
    > using inheritance.

    You mean through fork() COW?

    Through fork, yes, but "shared" rather than "copy-on-write".

    > Perhaps we need a picklable mmap type which can be sent over pipes
    > and queues. (On Unix this would probably require fd passing.)

    If you use POSIX semaphores, you could pass the semaphore path and use
    sem_open in the other process (but that would mean you can't unlink it
    right after open).

    I assume you mean "shared memory" and shm_open(), not "semaphores" and
    sem_open(). I don't think shm_open() really has any advantages over
    using mmaps backed by "proper" files (since POSIX shared memory uses up
    space in /dev/shm which is limited).

    By using fd passing you can get the operating system to do ref counting
    on the mmaps and not worry about when to unlink.
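
    A hedged sketch of the fd-passing idea (Unix only; it relies on the existing but undocumented multiprocessing.reduction.send_handle/recv_handle helpers, and an unlinked temporary file so the kernel reclaims the storage once the last descriptor and mapping go away):

        import mmap
        import multiprocessing
        import os
        import tempfile
        from multiprocessing import reduction

        SIZE = 1024 * 1024

        def child(conn):
            # Receive a duplicate of the parent's file descriptor and map the
            # same pages; writes here are visible to the parent.
            fd = reduction.recv_handle(conn)
            with mmap.mmap(fd, SIZE) as m:
                m[:5] = b"hello"
            os.close(fd)

        if __name__ == "__main__":
            fd, path = tempfile.mkstemp()
            os.unlink(path)            # the file lives on while fds/mappings exist
            os.ftruncate(fd, SIZE)

            parent_conn, child_conn = multiprocessing.Pipe()
            p = multiprocessing.Process(target=child, args=(child_conn,))
            p.start()
            reduction.send_handle(parent_conn, fd, p.pid)
            p.join()

            with mmap.mmap(fd, SIZE) as m:
                print(m[:5])           # b'hello'
            os.close(fd)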

    neologix mannequin commented Mar 27, 2013

    Through fork, yes, but "shared" rather than "copy-on-write".

    There's a subtlety: because of refcounting, just treating a COW object
    as read-only (e.g. iterating over the array) will trigger a copy
    anyway...

    I assume you mean "shared memory" and shm_open(), not "semaphores" and
    sem_open().

    Yes ;-)

    I don't think shm_open() really has any advantages over
    using mmaps backed by "proper" files (since POSIX shared memory uses up
    space in /dev/shm which is limited).

    File-backed mmap() will incur disk I/O (although some of the data will
    probably sit in the page cache), which would be much slower than
    shared memory. Also, you need corresponding disk space.
    As for the /dev/shm limit, it's normally dimensioned according to the
    amount of RAM, which is in turn dimensioned according to the working set.

    sbt mannequin commented Mar 27, 2013

    On 27/03/2013 7:27pm, Charles-François Natali wrote:

    Charles-François Natali added the comment:

    > Through fork, yes, but "shared" rather than "copy-on-write".

    There's a subtlety: because of refcounting, just treating a COW object
    as read-only (e.g. iterating over the array) will trigger a copy
    anyway...

    I mean "write-through" (as opposed to "read-only" or "copy-on-write").

    > I don't think shm_open() really has any advantages over
    > using mmaps backed by "proper" files (since POSIX shared memory uses up
    > space in /dev/shm which is limited).

    File-backed mmap() will incur disk I/O (although some of the data will
    probably sit in the page cache), which would be much slower than
    shared memory. Also, you need corresponding disk space.
    As for the /dev/shm limit, it's normally dimensioned according to the
    amount of RAM, which is in turn dimensioned according to the working set.

    Apart from creating, unlinking and resizing the file I don't think there
    should be any disk I/O.

    On Linux disk I/O only occurs when fsync() or close() are called.
    FreeBSD has a MAP_NOSYNC flag which gives Linux behaviour (otherwise
    dirty pages are flushed every 30-60 seconds).

    Once the file has been unlink()ed then any sensible operating system
    should realize it does not need to sync the file.

    neologix mannequin commented Mar 27, 2013

    Apart from creating, unlinking and resizing the file I don't think there
    should be any disk I/O.

    On Linux disk I/O only occurs when fsync() or close() are called.

    What?
    Writeback occurs depending on the memory pressure, percentage of used
    pages, page modification time, etc. Try writing a large file without
    closing it, you'll see that there's disk activity (or use
    iostat/vmstat).

    FreeBSD has a MAP_NOSYNC flag which gives Linux behaviour (otherwise
    dirty pages are flushed every 30-60 seconds).

    It's the same on Linux, depending on your mount options, data will be
    committed to disk every 5 seconds or so, when the journal is
    committed.

    Once the file has been unlink()ed then any sensible operating system
    should realize it does not need to sync the file.

    Why?
    Even if you delete the file right after open, if you write data to it,
    when the amount of data written fills your caches, the data has to go
    somewhere, even if only to make it available to the current process
    upon read()...

    sbt mannequin commented Mar 27, 2013

    On 27/03/2013 8:14pm, Charles-François Natali wrote:

    Charles-François Natali added the comment:

    > Apart from creating, unlinking and resizing the file I don't think there
    > should be any disk I/O.
    >
    > On Linux disk I/O only occurs when fsync() or close() are called.

    What?
    Writeback occurs depending on the memory pressure, percentage of used
    pages, page modification time, etc. Try writing a large file without
    closing it, you'll see that there's disk activity (or use
    iostat/vmstat).

    I meant when there is no memory pressure.

    > FreeBSD has a MAP_NOSYNC flag which gives Linux behaviour (otherwise
    > dirty pages are flushed every 30-60 seconds).

    It's the same on Linux, depending on your mount options, data will be
    committed to disk every 5 seconds or so, when the journal is
    committed.

    Googling suggests that MAP_SHARED on Linux is equivalent to MAP_SHARED
    | MAP_NOSYNC on FreeBSD. I don't think it has anything to do with mount
    options.

    The Linux man page refuses to specify

    MAP_SHARED
    Share this mapping. Updates to the mapping are visible to other
    processes that map this file, and are carried through to the
    underlying file. **The file may not actually be updated until
    msync(2) or munmap() is called.**

    > Once the file has been unlink()ed then any sensible operating system
    > should realize it does not need to sync the file.

    Why?
    Even if you delete the file right after open, if you write data to it,
    when the amount of data written fills your caches, the data has to go
    somewhere, even if only to make it available to the current process
    upon read()...

    Can you demonstrate a slowdown with a benchmark?

    neologix mannequin commented Mar 27, 2013

    I meant when there is no memory pressure.

    http://lwn.net/Articles/326552/
    """
    The kernel page cache contains in-memory copies of data blocks
    belonging to files kept in persistent storage. Pages which are written
    to by a processor, but not yet written to disk, are accumulated in
    cache and are known as "dirty" pages. The amount of dirty memory is
    listed in /proc/meminfo. Pages in the cache are flushed to disk after
    an interval of 30 seconds. Pdflush is a set of kernel threads which
    are responsible for writing the dirty pages to disk, either explicitly
    in response to a sync() call, or implicitly in cases when the page
    cache runs out of pages, if the pages have been in memory for too
    long, or there are too many dirty pages in the page cache (as
    specified by /proc/sys/vm/dirty_ratio).
    """

    >> FreeBSD has a MAP_NOSYNC flag which gives Linux behaviour (otherwise
    >> dirty pages are flushed every 30-60 seconds).
    >
    > It's the same on Linux, depending on your mount options, data will be
    > committed to disk every 5 seconds or so, when the journal is
    > committed.

    Googling suggests that MAP_SHARED on Linux is equivalent to MAP_SHARED
    | MAP_NOSYNC on FreeBSD. I don't think it has anything to do with mount
    options.

    """
    MAP_NOSYNC Causes data dirtied via this VM map to be flushed to
    physical media only when necessary (usually by the
    pager) rather than gratuitously.
    [...]
    """

    This just means that it will reduce synchronous writeback, but
    writeback will still occur (by what they call the pager).

    On Linux, writeback can be done by background kernel threads
    (pdflush), or synchronously on behalf of the process.

    The "mount option" thing is the following:
    if the file system is mounted with data=journal or data=ordered, data
    is written to disk before corresponding metadata is committed. And
    metadata is written when the journal is committed, by default every 5
    seconds:

    man mount:
    """
    ext3

           commit=nrsec       data={journal|ordered|writeback}
                  Specifies the journalling mode for file data.  Metadata is
                  always journaled.  To use modes other than ordered on the
                  root filesystem, pass the mode to the kernel as boot
                  parameter, e.g. rootflags=data=journal.

              journal
                     All data is committed into the journal prior to being
                     written into the main filesystem.

              ordered
                     This is the default mode.  All data is forced directly
                     out to the main file system prior to its metadata being
                     committed to the journal.

              writeback
                     Data ordering is not preserved - data may be written
                     into the main filesystem after its metadata has been
                     committed to the journal.  This is rumoured to be the
                     highest-throughput option.  It guarantees internal
                     filesystem integrity, however it can allow old data to
                     appear in files after a crash and journal recovery.

           commit=nrsec
                  Sync all data and metadata every nrsec seconds.  The
                  default value is 5 seconds.  Zero means default.
    """
    > The Linux man page refuses to specify
    >
    >    MAP_SHARED
    >      Share this mapping. Updates to the mapping are visible to other
    >      processes that map this file, and are carried through to the
    >      underlying file. **The file may not actually be updated until
    >      msync(2) or munmap() is called.**

    *may*: just as fsync() is required to make sure data is committed to
    disk for a file, msync() is required for a mapping. But data is
    committed asynchronously or synchronously depending on different
    criteria (ratio of dirty pages, free memory, dirty pages age, etc).

    Can you demonstrate a slowdown with a benchmark?

    I could, but I don't have to: shared memory won't incur any I/O or
    copy (except if it is swapped).
    A file-backed mmap will incur a *lot* of I/O: really, just try
    writing a 1GB file, and you'll see your disk spin, or use cat
    /proc/diskstats.

    sbt mannequin commented Mar 27, 2013

    On 27/03/13 21:09, Charles-François Natali wrote:

    I could, but I don't have to: shared memory won't incur any I/O or
    copy (except if it is swapped). A file-backed mmap will incur a *lot*
    of I/O: really, just try writing a 1GB file, and you'll see your disk
    spin, or use cat /proc/diskstats.

    You are right.

    ogrisel mannequin commented Aug 19, 2013

    I have implemented a custom subclass of the multiprocessing Pool to be able to plug a custom pickling strategy for this specific use case in joblib:

    https://github.com/joblib/joblib/blob/master/joblib/pool.py#L327

    In particular it can:

    • detect mmap-backed numpy
    • transform large memory backed numpy arrays into numpy.memmap instances prior to pickling using the /dev/shm partition when available or TMPDIR otherwise.

    Here is some doc: https://github.com/joblib/joblib/blob/master/doc/parallel_numpy.rst

    I could submit the part that makes it possible to customize the picklers of multiprocessing.pool.Pool instance to the standard library if people are interested.

    The numpy specific stuff would stay in third party projects such as joblib but at least that would make it easier for people to plug their own optimizations without having to override half of the multiprocessing class hierarchy.

    ogrisel mannequin commented Aug 19, 2013

    I forgot to end a sentence in my last comment:

    • detect mmap-backed numpy

    should read:

    • detect mmap-backed numpy arrays and pickle only the filename and other buffer metadata to reconstruct a mmap-backed array in the worker processes instead of copying the data around.
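
    A hedged sketch of that strategy (not joblib's actual code; the helper names are made up): dump the array once into a file under /dev/shm and send workers only the filename, dtype and shape, so each worker rebuilds a numpy.memmap view instead of receiving pickled array data.

        import os
        import tempfile

        import numpy as np
        from multiprocessing import Pool

        def as_shared_memmap(arr, dir="/dev/shm"):
            # One copy of the data into a RAM-backed (tmpfs) file.
            fd, path = tempfile.mkstemp(suffix=".mmap", dir=dir)
            os.close(fd)
            m = np.memmap(path, dtype=arr.dtype, mode="w+", shape=arr.shape)
            m[:] = arr
            m.flush()
            return path, arr.dtype, arr.shape

        def column_sum(meta):
            # Workers get only metadata and re-open the mapping themselves,
            # so no array data travels over the pool's pipes.
            path, dtype, shape = meta
            m = np.memmap(path, dtype=dtype, mode="r", shape=shape)
            return m.sum(axis=0)

        if __name__ == "__main__":
            big = np.random.randn(2000, 1000)   # small here; the idea scales
            meta = as_shared_memmap(big)
            with Pool(4) as pool:
                (result,) = pool.map(column_sum, [meta])
            os.unlink(meta[0])
            print(result[:3])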

    sbt mannequin commented Aug 19, 2013

    I could submit the part that makes it possible to customize the picklers
    of multiprocessing.pool.Pool instance to the standard library if people
    are interested.

    2.7 and 3.3 are in bugfix mode now, so they will not change.

    In 3.3 you can do

        from multiprocessing.forking import ForkingPickler
        ForkingPickler.register(MyType, reduce_MyType)

    Is this sufficient for your needs? This is private (and its definition has moved in 3.4) but it could be made public.
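
    For illustration (hedged: MyType and reduce_MyType are just the placeholder names from the snippet above, filled in here to show the expected reduce contract):

        from multiprocessing.forking import ForkingPickler  # 3.3 location

        class MyType:
            def __init__(self, payload):
                self.payload = payload

        def rebuild_MyType(payload):
            return MyType(payload)

        def reduce_MyType(obj):
            # Standard pickle reduce contract: (callable, args) used to
            # rebuild the object on the receiving side.
            return rebuild_MyType, (obj.payload,)

        ForkingPickler.register(MyType, reduce_MyType)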

    ogrisel mannequin commented Aug 19, 2013

    In 3.3 you can do

    from multiprocessing.forking import ForkingPickler
    ForkingPickler.register(MyType, reduce_MyType)
    

    Is this sufficient for your needs? This is private (and its definition has moved in 3.4) but it could be made public.

    Indeed, I forgot that the multiprocessing pickler was already made
    pluggable in Python 3.3. I needed backward compat for Python 2.6 in
    joblib, hence I had to rewrite a bunch of the class hierarchy.

    artxyz mannequin commented Mar 13, 2017

    This is still an issue in Python 2.7.5

    Will it be fixed?

    applio commented Mar 13, 2017

    @artxyz: The current release of 2.7 is 2.7.13 -- if you are still using 2.7.5 you might consider updating to the latest release.

    As pointed out in the text of the issue, the multiprocessing pickler has been made pluggable in 3.3 and it's been made more conveniently so in 3.6. The issue reported here arises from the constraints of working with large objects and pickle, hence the enhanced ability to take control of the multiprocessing pickler in 3.x applies.

    I'll assign this issue to myself as a reminder to create a blog post around this example and potentially include it as a motivating need for controlling the multiprocessing pickler in the documentation.

    applio self-assigned this Mar 13, 2017
    artxyz mannequin commented Mar 13, 2017

    @davin Thanks for your answer! I will update to the current version.

    serhiy-storchaka commented Mar 13, 2017

    Pickle currently handles byte strings and unicode strings larger than 4GB only with protocol 4. But multiprocessing currently uses the default protocol, which currently equals 3. There were suggestions to change the default pickle protocol (bpo-23403), the pickle protocol for multiprocessing (bpo-26507), or to customize the serialization method for multiprocessing (bpo-28053). There is also a patch that implements support for byte strings and unicode strings larger than 4GB with all protocols (bpo-25370).

    Besides this, I think that using some kind of shared memory is a better way to transfer large data between subprocesses.

    serhiy-storchaka added the 3.7 (EOL) end of life label Mar 13, 2017
    DanielLiu mannequin commented Jul 26, 2017

    There's also the other multiprocessing limitation Antoine mentioned early on, where queues/pipes used a 32-bit signed integer to encode object length.

    Is there a way or plan to get around this limitation?

    pitrou commented Jul 29, 2017

    There's also the other multiprocessing limitation Antoine mentioned early on, where queues/pipes used a 32-bit signed integer to encode object length.

    Is there a way or plan to get around this limitation?

    As I said above in https://bugs.python.org/issue17560#msg185345, it should be easy to improve the current protocol to allow for larger than 2GB data.

    rhettinger commented:

    Davin, when you write up a blog post, I think it would be helpful to mention that creating really large objects with multiprocessing is mostly an anti-pattern (the cost of pickling and interprocess communication tends to drown out the benefits of parallel processing).

    smsaladi mannequin commented Sep 20, 2018

    I apologize if this isn't the right forum to post this message, but Davin, if you have since put together the guide/blogpost mentioned in your message, would you be able to share a link to the material please?

    I am interested in gaining a better understanding of multiprocessing's pluggable pickler and, more generally, dealing with large objects (the comment about this being an anti-pattern is noted).

    Thank you.

    pitrou commented Nov 6, 2018

    New changeset bccacd1 by Antoine Pitrou (Alexander Buchkovsky) in branch 'master':
    bpo-17560: Too small type for struct.pack/unpack in mutliprocessing.Connection (GH-10305)

    pitrou added the 3.8 (only security fixes) label and removed the 3.7 (EOL) end of life label Nov 6, 2018
    pitrou closed this as completed Nov 6, 2018
    ezio-melotti transferred this issue from another repository Apr 10, 2022