
classification
Title: Multiprocessing Pool.map pickles arguments passed to workers
Type: behavior
Stage:
Components:
Versions: Python 3.4

process
Status: closed
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Socob, davin, josh.r, kieleth
Priority: normal
Keywords:

Created on 2015-04-16 23:14 by kieleth, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (7)
msg241289 - (view) Author: Luis (kieleth) Date: 2015-04-16 23:14
Hi,

I've seen an odd behavior with multiprocessing Pool on Linux/macOS:

-----------------------------
import multiprocessing as mp
from sys import getsizeof
import numpy as np


def f_test(x):
    print('process has received argument %s' % x )
    r = x[:100] # the return value is put on a queue by Pool; for objects > 4 GB, pickle complains
    return r

if __name__ == '__main__':
    # 2**28 runs ok, 2**29 or bigger breaks pickle
    big_param = np.random.random(2**29)

    # Process+big_parameter OK:
    proc = mp.Process(target=f_test, args=(big_param,))
    res = proc.start()
    proc.join()
    print('size of process result', getsizeof(res))

    # Pool+big_parameter BREAKS:
    pool = mp.Pool(1)
    res = pool.map(f_test, (big_param,))
    print('size of Pool result', getsizeof(res))

-----------------------------
$ python bug_mp.py
process has received argument [ 0.65282086  0.34977429  0.64148342 ...,  0.79902495  0.31427761
  0.02678803]
size of process result 16
Traceback (most recent call last):
  File "bug_mp.py", line 26, in <module>
    res = pool.map(f_test, (big_param,))
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/pool.py", line 599, in get
    raise self._value
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/pool.py", line 383, in _handle_tasks
    put(task)
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/connection.py", line 206, in send
    self._send_bytes(ForkingPickler.dumps(obj))
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/reduction.py", line 50, in dumps
    cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB

-----------------------------
There's another flavor of error seen in a similar scenario:
...
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

-----------------------------
Tested in:
Python 3.4.2 |Anaconda 2.1.0 (64-bit)| (default, Oct 21 2014, 17:16:37)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
And in:
Python 3.4.3 (default, Apr  9 2015, 16:03:56)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin

-----------------------------

Pool.map creates a "task Queue" to handle the workers, and I think that by doing this we are forcing any arguments passed to the workers to be pickled.
Process works OK, since no queue is created, it just forks.

My expectation would be that since we are on POSIX and forking, we shouldn't have to worry about arguments being pickled. If this is expected behavior, it should be warned about/documented (I hope I've not missed this in the docs).

For small arguments, pickling/unpickling may not be an issue, but for big ones it is (I am aware of the Array and MemShare options).
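For illustration only (this is not code from the report, and all names in it are hypothetical), the multiprocessing.Array route mentioned in that parenthesis can look roughly like this:

-----------------------------
# Sketch: place the large array in shared memory once, then have workers wrap
# the shared buffer with numpy instead of receiving the data through pickling.
import multiprocessing as mp
import numpy as np

shared = None  # set in each worker by the Pool initializer

def init_worker(shared_arr):
    global shared
    shared = shared_arr

def partial_sum(bounds):
    arr = np.frombuffer(shared, dtype=np.float64)  # no copy, no pickling
    start, stop = bounds
    return arr[start:stop].sum()

if __name__ == '__main__':
    n = 2**20
    shared_arr = mp.Array('d', n, lock=False)  # unlocked shared ctypes array
    np.frombuffer(shared_arr, dtype=np.float64)[:] = np.random.random(n)
    with mp.Pool(2, initializer=init_worker, initargs=(shared_arr,)) as pool:
        print(pool.map(partial_sum, [(0, n // 2), (n // 2, n)]))
-----------------------------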

Has anybody seen something similar? Is this perhaps a hard requirement of Pool.map, or am I completely missing the point altogether?
msg241296 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2015-04-17 00:02
The nature of a Pool precludes assumptions about the availability of specific objects in a forked worker process (particularly now that there are alternate methods of forking processes). Since the workers are spun up when the pool is created, objects created or modified after that point would have to be serialized by some mechanism anyway.

The Pool class documentation doesn't describe this explicitly, but there are multiple references to this behavior (e.g. https://docs.python.org/3/library/multiprocessing.html#all-start-methods mentions inheriting as being more efficient than pickling/unpickling; you have to develop with inheritance in mind, though, and pickling, particularly for task dispatch approaches in the "Futures" model, can't be generalized into an inheritance problem when using a producer/consumer based worker model).

Point is, this is an expected behavior. You need some means of transferring objects between processes, and pickling is the Python standard serialization method. The inability to serialize a 4+ GB bytes object is a problem, I assume (I don't know if a bug exists for that), but pickling as the mechanism is the only obvious way to do it. If you want to avoid the pickling and rely on inheritance, it's up to you to ensure the root process has created the necessary bytes object prior to creating the Pool, and to convey information to the worker about how to find it (say, a dict of int keys to your bytes object data) in its own memory.
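A minimal sketch of that suggestion (hypothetical names, and assuming the "fork" start method so the worker actually inherits the parent's memory) might look like this:

-----------------------------
# Create the big object *before* the Pool so forked workers inherit it, and
# pass only a small key through map() rather than the object itself.
import multiprocessing as mp
import numpy as np

BIG_OBJECTS = {}  # filled in the parent before the Pool is created

def f_test(key):
    big = BIG_OBJECTS[key]     # looked up in the worker's inherited memory
    return big[:100].copy()    # only this small slice is pickled back

if __name__ == '__main__':
    BIG_OBJECTS[0] = np.random.random(2**29)
    with mp.Pool(1) as pool:             # fork happens here, after the dict is filled
        res = pool.map(f_test, [0])      # only the int key 0 is pickled
    print(len(res[0]))
-----------------------------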
msg241372 - (view) Author: Luis (kieleth) Date: 2015-04-17 22:45
Thanks for the answer, although I still think I haven't made myself fully understood here; allow me to paraphrase:
"...You need some means of transferring objects between processes, and pickling is the Python standard serialization method" 

Yes, but the question that stands is why Pool has to use a multiprocessing.Queue to load and spin up the workers (therefore pickling/unpickling their arguments), when it could just rely on inheritance at that moment and only create a Queue for the workers' return values.

This applies to the "fork" start method, not to "spawn"; I'm not sure about "forkserver".

Plus, I'm not trying to avoid inheritance; I'm trying to use it with Pools and large arguments, as forking theoretically allows. Instead, at the moment I'm forced to use Processes with a Queue for the results, as shown in the code above.

"OverflowError: cannot serialize a bytes object larger than 4 GiB" is just what allows us to expose this behavior, because the Pool pickles the arguments without, in my opinion, having to do so.
msg241390 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2015-04-18 01:46
The Pool workers are created eagerly, not lazily. That is, the fork occurs before map is called, and Python can't know that the objects passed as arguments were inherited in the first place (since they could be created after the Pool was created). If you created worker tasks lazily, then sure, you could fork and use the objects that are inherited, but that's not how Pools work, and they can't work like that if the worker processes process more than one input without eagerly evaluating input sequences (which would introduce other problems).
msg241455 - (view) Author: Davin Potts (davin) * (Python committer) Date: 2015-04-18 19:48
Though it's been discussed elsewhere, issue17560 is a good one in which the matter of "really big" objects being communicated between processes via multiprocessing is covered.  In it, Richard shares some detail about the implementation in multiprocessing, its constraints and motivation.

That discussion also highlights that the pickler has been made pluggable in multiprocessing since Python 3.3.  That is, if you wish, you can use something other than Python's pickle to serialize objects and, in the extreme, potentially communicate them in a completely new way (perhaps even via out-of-band communication, though that was not the intention and would be arguably extreme).
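As a rough, hypothetical sketch of that pluggable hook (not code from this discussion, and glossing over temp-file cleanup): ForkingPickler keeps a per-type table of reduce functions, so a custom reduction could spill a large array to disk and send only its path:

-----------------------------
# Register a custom reducer with multiprocessing's pickler for a wrapper type,
# so the large payload is written to a temporary .npy file and only the file
# name travels over the connection.
import tempfile
import numpy as np
from multiprocessing.reduction import ForkingPickler

class BigArray:
    def __init__(self, arr):
        self.arr = arr

def _rebuild_big_array(path):
    # Reopen the spilled array (memory-mapped, so nothing is copied up front).
    return BigArray(np.load(path, mmap_mode='r'))

def _reduce_big_array(obj):
    f = tempfile.NamedTemporaryFile(suffix='.npy', delete=False)
    np.save(f, obj.arr)
    f.close()
    return _rebuild_big_array, (f.name,)

ForkingPickler.register(BigArray, _reduce_big_array)
-----------------------------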

I do not think Python's pickle is necessarily what we should expect multiprocessing to use to help communicate objects between processes.  Just because pickle is Python's serialization strategy does not mean it must necessarily also be used in such communications.  Thankfully, we have the convenience of using something other than pickle (or newer versions of pickle, since there have been versioned updates to pickle's format over time with some promising improvements).

@Luis:  To your specific question about the need for a Queue: the benefit of a consistent behavior and methodology, whether forking or spawning on one OS with its caveats or on another, is just that: simplicity.  The pluggable nature of the pickler opens up the opportunity for folks (perhaps such as yourself) to construct and plug in optimized solutions for scenarios or edge cases of particular interest.  The fact that we all start with a consistent behavior across platforms and process creation strategies is very appealing from a reduced-complexity point of view.
msg241757 - (view) Author: Luis (kieleth) Date: 2015-04-21 23:54
Thanks for information and explanations.

The option of writing a tweaked serialization mechanism in the Queue for Pool and implementing shared memory sounds like fun; I'm not sure whether the pure copy-on-write of forking can be achieved though, and it would be nice to know whether it is actually possible (the project mentioned in issue17560 still needs to "dump" the arrays to the filesystem).

As a quick fix for us, I've created a simple wrapper around Pool and its map: it creates a Queue for the results and uses Process to start the workers. This works just fine.
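(The reporter's wrapper is not shown in this thread; purely as an illustration, and assuming the "fork" start method, such a Process-plus-Queue wrapper might look roughly like this:)

-----------------------------
# One Process per argument: with fork, the arguments are inherited rather than
# pickled, and only the results travel back through the Queue.
import multiprocessing as mp

def _worker(func, arg, idx, queue):
    queue.put((idx, func(arg)))   # only the result is pickled

def forking_map(func, args):
    queue = mp.Queue()
    procs = [mp.Process(target=_worker, args=(func, arg, i, queue))
             for i, arg in enumerate(args)]
    for p in procs:
        p.start()
    # Drain the queue before joining, so a large result cannot deadlock join().
    results = [queue.get() for _ in procs]
    for p in procs:
        p.join()
    return [value for _, value in sorted(results, key=lambda item: item[0])]
-----------------------------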

Simplicity and consistency are great, but I still believe that Pool, on Linux-based systems, creates duplication and works inefficiently by serializing arguments, and that this could be avoided.

Obviously it's not me who takes the decisions and I don't have the time to investigate it further, so, after this petty rant, should we close this bug? :>
msg241825 - (view) Author: Davin Potts (davin) * (Python committer) Date: 2015-04-22 21:10
@Luis:  Before closing this bug, could you give a quick description of the fix you put in place?

Two reasons:
1) Someone else encountering this frustration with similar needs might well benefit from hearing what you did.
2) To provide evidence that there is demand for a more efficient means of communication, demonstrated through workarounds.

Nothing fancy required -- not even code necessarily.
History
Date                 User     Action  Args
2022-04-11 14:58:15  admin    set     github: 68167
2021-04-01 13:11:40  Socob    set     nosy: + Socob
2015-09-21 03:18:47  davin    set     status: open -> closed
2015-04-22 21:10:33  davin    set     messages: + msg241825
2015-04-21 23:54:40  kieleth  set     messages: + msg241757
2015-04-18 19:48:37  davin    set     nosy: + davin; messages: + msg241455
2015-04-18 01:47:00  josh.r   set     messages: + msg241390
2015-04-17 22:45:15  kieleth  set     messages: + msg241372
2015-04-17 00:02:29  josh.r   set     nosy: + josh.r; messages: + msg241296
2015-04-16 23:14:09  kieleth  create