
Author kieleth
Recipients kieleth
Date 2015-04-16.23:14:08
Content
Hi,

I've seen some odd behavior with multiprocessing.Pool on Linux/macOS:

-----------------------------
import multiprocessing as mp
from sys import getsizeof
import numpy as np


def f_test(x):
    print('process has received argument %s' % x)
    r = x[:100]  # the return value is put on a queue by Pool; for objects > 4 GiB pickle complains
    return r

if __name__ == '__main__':
    # 2**28 runs ok, 2**29 or bigger breaks pickle
    big_param = np.random.random(2**29)

    # Process+big_parameter OK:
    proc = mp.Process(target=f_test, args=(big_param,))
    res = proc.start()  # note: start() returns None, so getsizeof(res) below is just getsizeof(None)
    proc.join()
    print('size of process result', getsizeof(res))

    # Pool+big_parameter BREAKS:
    pool = mp.Pool(1)
    res = pool.map(f_test, (big_param,))
    print('size of Pool result', getsizeof(res))

-----------------------------
$ python bug_mp.py
process has received argument [ 0.65282086  0.34977429  0.64148342 ...,  0.79902495  0.31427761
  0.02678803]
size of process result 16
Traceback (most recent call last):
  File "bug_mp.py", line 26, in <module>
    res = pool.map(f_test, (big_param,))
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/pool.py", line 599, in get
    raise self._value
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/pool.py", line 383, in _handle_tasks
    put(task)
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/connection.py", line 206, in send
    self._send_bytes(ForkingPickler.dumps(obj))
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/reduction.py", line 50, in dumps
    cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB
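
A note on where the 4 GiB figure comes from (my reading, not verified in depth): ForkingPickler.dumps() in 3.4 passes protocol=None, so the data is pickled with pickle.DEFAULT_PROTOCOL (3), whose bytes opcode stores the length in a 4-byte field; protocol 4 lifts that limit but is not used here. A quick stdlib-only check:

-----------------------------
import pickle

# Protocols up to 3 frame a bytes object with a 4-byte length field
# (the BINBYTES opcode), hence "larger than 4 GiB" above; protocol 4
# added BINBYTES8 with an 8-byte length field.
print(pickle.DEFAULT_PROTOCOL)   # 3 on Python 3.4
print(pickle.HIGHEST_PROTOCOL)   # 4 on Python 3.4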

-----------------------------
There's another flavor of error seen in a similar scenario:
...
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
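
That second error is consistent with the 32-bit length header that the connection layer packs in 3.4 (a minimal illustration using only the stdlib):

-----------------------------
import struct

# Connection._send_bytes() in CPython 3.4 frames every message with a
# signed 32-bit length header, struct.pack("!i", n), so any payload of
# 2**31 bytes or more overflows the header field:
try:
    struct.pack("!i", 2**31)
except struct.error as e:
    print(e)  # 'i' format requires -2147483648 <= number <= 2147483647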

-----------------------------
Tested in:
Python 3.4.2 |Anaconda 2.1.0 (64-bit)| (default, Oct 21 2014, 17:16:37)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
And in:
Python 3.4.3 (default, Apr  9 2015, 16:03:56)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin

-----------------------------

Pool.map creates a task queue to feed the workers, and I think that by doing so we force any arguments passed to the workers to be pickled.
Process works OK since no queue is created: it just forks, and the child inherits the argument directly (a workaround along these lines is sketched below).
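
The workaround I have in mind, sketched under the assumption of the POSIX fork start method (f_slice and the global-variable approach are illustrative, not an official recipe): let the forked workers inherit the big array as a module-level global and pass only small indices through Pool.map.

-----------------------------
import multiprocessing as mp
import numpy as np

big_param = None  # rebound in the parent before the Pool forks

def f_slice(i):
    # only the small index i and the 100-element result cross the queues
    return big_param[i:i + 100]

if __name__ == '__main__':
    big_param = np.random.random(2**29)
    with mp.Pool(1) as pool:
        res = pool.map(f_slice, [0])
    print('got slice of length', len(res[0]))
-----------------------------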

My expectation would be that, since we are on POSIX and forking, we shouldn't have to worry about arguments being pickled. If this is expected behavior, it should be warned about or documented (I hope I haven't missed this in the docs).

For small arguments, pickling and unpickling may not be an issue, but for big ones it is (I am aware of the multiprocessing.Array and shared-memory options).
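
For the record, the shared-memory route looks roughly like this (again a sketch assuming fork; f_sum and the smaller array size are made up for illustration):

-----------------------------
import multiprocessing as mp
import numpy as np

def f_sum(i):
    # shared_arr is inherited through the fork; the data itself is never pickled
    arr = np.frombuffer(shared_arr.get_obj())
    return arr[i:i + 100].sum()

if __name__ == '__main__':
    shared_arr = mp.Array('d', 2**20)  # 'd' = C double
    np.frombuffer(shared_arr.get_obj())[:] = np.random.random(2**20)
    with mp.Pool(1) as pool:
        print(pool.map(f_sum, [0]))
-----------------------------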

Has anybody seen something similar? Is this perhaps a hard requirement of Pool.map, or am I missing the point altogether?