Classification
Title: Shared Array Memory Allocation Regression
Type: performance
Stage: resolved
Components:
Versions: Python 3.7

Process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: davin, dtasev, gdr@garethrees.org, pitrou, sbt, serhiy.storchaka
Priority: normal
Keywords:

Created on 2017-07-13 14:03 by dtasev, last changed 2017-07-23 11:05 by pitrou. This issue is now closed.

Files
shared_array_alloc.py - uploaded by dtasev, 2017-07-13 14:03 - Simple timeit call to allocate 4GB memory

Pull Requests
PR 2708 (merged) - pitrou, 2017-07-14 19:00
Messages (22)
msg298285 - (view) Author: Dimitar Tasev (dtasev) * Date: 2017-07-13 14:03
Hello, I have noticed a significant performance regression when allocating a large shared array in Python 3.x versus Python 2.7. The affected module seems to be `multiprocessing`.

The function I used for benchmarking:

    from timeit import timeit
    timeit('sharedctypes.Array(ctypes.c_float, 500*2048*2048)',
           'from multiprocessing import sharedctypes; import ctypes',
           number=1)

And the results from executing it:

Python 3.5.2
Out[2]: 182.68500420999771

-------------------

Python 2.7.12
Out[6]: 2.124835968017578

I will try to provide any information you need. Right now I am looking at callgrind/cachegrind output without debug symbols, and can post that; in the meantime I am building Python with debug symbols and will re-run callgrind/cachegrind.

Allocating the same-size array with numpy doesn't show a difference between Python versions. The numpy command used was `numpy.full((500,2048,2048), 5.0)`. Allocating the same number of list members doesn't show a difference either - `arr = [5.0]*(500*2048*2048)`
msg298289 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-07-13 15:23
This is because instantiating a new Arena is much slower in Python 3.

./python -m timeit -r1 -s 'from multiprocessing.heap import Arena' 'Arena(2**26)'

Python 3.7: 1 loop, best of 1: 1.18 sec per loop
Python 2.7: 100000 loops, best of 1: 9.19 usec per loop

Maybe the regression was introduced by issue8713.
msg298291 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-07-13 15:43
It can be significantly sped up if we replace the writing loop with

    if size:
        os.lseek(self.fd, size - 1, 0)
        os.write(self.fd, b'\0')
        os.lseek(self.fd, 0, 0)

or just with

    os.truncate(self.fd, size)

(not sure the latter always works).

But instantiating a file-backed Arena will still be slower than in 2.7.
msg298294 - (view) Author: Gareth Rees (gdr@garethrees.org) * Date: 2017-07-13 15:59
In Python 2.7, multiprocessing.heap.Arena uses an anonymous memory mapping on Unix. Anonymous memory mappings can be shared between processes but only via fork().

But Python 3 supports other ways of starting subprocesses (see issue 8713 [1]) and so an anonymous memory mapping no longer works. So instead a temporary file is created, filled with zeros to the given size, and mapped into memory (see changeset 3b82e0d83bf9 [2]). It is the zero-filling of the temporary file that takes the time, because this forces the operating system to allocate space on the disk.

But why not use ftruncate() (instead of write()) to quickly create a file with holes? POSIX says [3], "If the file size is increased, the extended area shall appear as if it were zero-filled" which would seem to satisfy the requirement.

[1] https://bugs.python.org/issue8713
[2] https://hg.python.org/cpython/rev/3b82e0d83bf9
[3] http://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html
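The ftruncate() behaviour cited above can be checked in isolation with a short sketch (a plain temporary file, not the actual Arena code):

```python
import mmap
import os
import tempfile

size = 1 << 20  # 1 MiB

fd, path = tempfile.mkstemp()
try:
    # POSIX: extending a file with ftruncate() makes the extended
    # region read back as zeros; filesystems with sparse-file support
    # create a hole instead of allocating blocks on disk.
    os.ftruncate(fd, size)
    assert os.fstat(fd).st_size == size

    # Map it and confirm the extension is zero-filled.
    with mmap.mmap(fd, size) as buf:
        assert buf[:16] == b'\x00' * 16
finally:
    os.close(fd)
    os.unlink(path)
```

On a sparse-capable filesystem the call returns almost instantly regardless of `size`, which is the whole point of the proposal.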
msg298305 - (view) Author: Gareth Rees (gdr@garethrees.org) * Date: 2017-07-13 18:13
Note that some filesystems (e.g. HFS+) don't support sparse files, so creating a large Arena will still be slow on these filesystems even if the file is created using ftruncate().

(This could be fixed, for the "fork" start method only, by using anonymous maps in that case.)
msg298343 - (view) Author: Dimitar Tasev (dtasev) * Date: 2017-07-14 08:50
If I understand correctly, there is no way to force the old behaviour in Python 3.5, i.e. to use an anonymous memory mapping in multiprocessing.heap.Arena so that the memory can be shared between the processes instead of writing to a shared file?

The data sizes usually are, on average, 20 GB, and writing it out to a file is not desirable. As I understand from Gareth Rees' comment, ftruncate() will speed up initialisation, however any processing afterwards would be IO capped.

To shed more light to the processing going on, the data is handled as a 3D array, so each process gets a 2D array to operate on, and no information needs to be shared between processes.

If the anonymous memory mapping cannot be forced, then the multiprocessing module with a shared array becomes unusable for me. Are you aware of any way to use the multiprocessing module to run execution in parallel, that somehow doesn't use a shared array?
msg298344 - (view) Author: Gareth Rees (gdr@garethrees.org) * Date: 2017-07-14 09:53
If you need the 2.7 behaviour (anonymous mappings) in 3.5 then you can still do it, with some effort. I think the approach that requires the smallest amount of work would be to ensure that subprocesses are started using fork(), by calling multiprocessing.set_start_method('fork'), and then monkey-patch multiprocessing.heap.Arena.__init__ so that it creates anonymous mappings using mmap.mmap(-1, size).

(I suggested above that Python could be modified to create anonymous mappings in the 'fork' case, but now that I look at the code in detail, I see that it would be tricky, because the Arena class has no idea about the Context in which it is going to be used -- at the moment you can create one shared object and then pass it to subprocesses under different Contexts, so the shared objects have to support the lowest common denominator.)
msg298345 - (view) Author: Gareth Rees (gdr@garethrees.org) * Date: 2017-07-14 10:00
Nonetheless this is bound to be a nasty performance regression for many people doing big data processing with NumPy/SciPy/Pandas and multiprocessing who are moving from 2 to 3, so even if it can't be fixed, the documentation ought to warn about the problem and explain how to work around it.
msg298346 - (view) Author: Dimitar Tasev (dtasev) * Date: 2017-07-14 11:30
I have looked into your advice of changing multiprocessing.heap.Arena.__init__, I have removed the code that allocated the file and reverted to the old behaviour.

I have done some brief runs and it seems to bring back the old behaviour, which allocates the space in RAM rather than on disk. I am not sure what this might break, and it might make other usages of multiprocessing unstable!

Can anyone think of anything this change might break? The Arena.__init__ code is the one from Python 2.7:

    class Arena(object):
        def __init__(self, size, fd=-1):
            self.size = size
            self.fd = fd  # still kept but is not used!
            self.buffer = mmap.mmap(-1, self.size)

There does not seem to be a difference regardless of whether the start method is set to 'fork' via multiprocessing.set_start_method('fork').
msg298348 - (view) Author: Gareth Rees (gdr@garethrees.org) * Date: 2017-07-14 11:49
I see now that the default start method is 'fork' (except on Windows), so calling set_start_method is unnecessary.

Note that you don't have to edit multiprocessing/heap.py, you can "monkey-patch" it in the program that needs the anonymous mapping:

    import mmap

    from multiprocessing.heap import Arena

    def anonymous_arena_init(self, size, fd=-1):
        "Create Arena using an anonymous memory mapping."
        self.size = size
        self.fd = fd  # still kept but is not used!
        self.buffer = mmap.mmap(-1, self.size)

    Arena.__init__ = anonymous_arena_init

As for what it will break — any code that uses the 'spawn' or 'forkserver' start methods.
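Before applying such a patch, a program can check which start method it will actually use; a minimal sketch (the guard is an editorial addition, not from the thread):

```python
import multiprocessing as mp

# An anonymous-mmap Arena is only shareable when child processes
# inherit memory via fork(); under 'spawn' or 'forkserver' the
# children would not see the mapping at all.
method = mp.get_start_method(allow_none=True)  # None until first use
safe_to_patch = method in (None, 'fork')
print('start method:', method, '- safe to patch:', safe_to_patch)
```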
msg298352 - (view) Author: Dimitar Tasev (dtasev) * Date: 2017-07-14 12:55
Thank you, that is indeed the solution I ended up with, along with a large explanation of why it was necessary. 

Do you think that it's worth updating the `multiprocessing` documentation to make users aware of that behaviour?
msg298354 - (view) Author: Gareth Rees (gdr@garethrees.org) * Date: 2017-07-14 13:35
I propose:

1. Ask Richard Oudkerk why in changeset 3b82e0d83bf9 the temporary file is zero-filled and not truncated. Perhaps there's some file system where this is necessary? (I tested HFS+ which doesn't support sparse files, and zero-filling seems not to be necessary, but maybe there's some other file system where it is?)

2. If there's no good reason for zero-filling the temporary file, replace it with a call to os.ftruncate(fd, size).

3. Update the documentation to mention the performance issue when porting multiprocessing code from 2 to 3. Unfortunately, I don't think there's any advice that the documentation can give that will help work around it -- monkey-patching works but is not supported.

4. Consider writing a fix, or at least a supported workaround. Here's a suggestion: update multiprocessing.sharedctypes and multiprocessing.heap so that they use anonymous maps in the 'fork' context. The idea is to update the RawArray and RawValue functions so that they take the context, and then pass the context down to _new_value, BufferWrapper.__init__ and thence to Heap.malloc where it can be used to determine what kind of Arena (file-backed or anonymous) should be used to satisfy the allocation request. The Heap class would have to segregate its blocks according to what kind of Arena they come from.
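To make point 2 concrete, here is a standalone comparison of the two strategies on a plain temporary file (the function names are illustrative; this is not the actual Arena code):

```python
import os
import tempfile
import time

size = 64 * 1024 * 1024  # 64 MiB

def zero_fill(fd, size, bs=1024 * 1024):
    # The 3.x Arena approach at the time: write zeros block by block,
    # forcing the filesystem to allocate every page on disk.
    block = b'\0' * bs
    remaining = size
    while remaining > 0:
        n = os.write(fd, block[:min(bs, remaining)])
        remaining -= n

def truncate_fill(fd, size):
    # The proposed approach: extend the file in one call and let the
    # filesystem create a hole where it can.
    os.ftruncate(fd, size)

for fill in (zero_fill, truncate_fill):
    fd, path = tempfile.mkstemp()
    try:
        t0 = time.perf_counter()
        fill(fd, size)
        elapsed = time.perf_counter() - t0
        assert os.fstat(fd).st_size == size
        print(f'{fill.__name__}: {elapsed:.4f}s')
    finally:
        os.close(fd)
        os.unlink(path)
```

Both leave the file at the requested size; only the allocation cost differs, which matches the gap Serhiy measured between the Arena constructors.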
msg298367 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-07-14 18:24
The original changeset in Richard's repository is https://hg.python.org/sandbox/sbt/rev/6c8554a7d068.  Unless Richard answers otherwise, I think it's likely the performance degradation was an oversight.

Given the code we're talking about is POSIX-specific, truncate() (or its sibling os.ftruncate()) is specified to always zero-fill.  So it should be safe to use.

Another related point is that the arena file is created inside get_temp_dir(), which allocates something in /tmp.  On modern Linux systems at least, /tmp is usually backed by persistent storage.  Instead, we could use /run/user/{os.getuid()}, which uses a tmpfs and would therefore save on I/O when later accessing the shared array.
msg298374 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-07-14 20:58
Gareth, Dimitar, you may want to take a look at https://github.com/python/cpython/pull/2708
msg298424 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-07-16 08:01
The size of the file system mounted on "/run/user/${uid}" is limited (only 200 MiB on my computer). get_temp_dir() would succeed, but this may not be enough for allocating a large shared array. And it is harder for the user to control, because the location is not specified by an environment variable.
msg298425 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-07-16 08:35
> The size of file system mounted on "/run/user/${uid}" is limited (only 200 MiB on my computer)

Hmm, you're right.  Is /dev/shm ok on your computer?
msg298429 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-07-16 09:11
Yes, /dev/shm is ok on my computer.

But there is one more drawback to this patch. Currently the size of a shared array is limited by the free space on the file system for temporary files. If that isn't enough, you can easily set TMPDIR to another path. In 2.7, and with a hardcoded /dev/shm, it is limited by the size of physical memory (maybe plus swap), which is not easy to increase, especially for an unprivileged user.

The effect of hardcoding /dev/shm can be achieved by setting TMPDIR to /dev/shm. Using os.ftruncate() adds about another 20% speedup. This looks mainly like a documentation issue to me: the regression and the workaround should be documented.
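The TMPDIR mechanism referred to here can be demonstrated with a sketch (a throwaway directory stands in for /dev/shm; the point is only that Python's temporary-file machinery honours the variable, and it must be set before the first temporary file is created):

```python
import os
import subprocess
import sys
import tempfile

# tempfile caches its directory choice on first use, so set TMPDIR in
# a fresh child process to show the effect cleanly.
with tempfile.TemporaryDirectory() as shm_stand_in:
    out = subprocess.run(
        [sys.executable, '-c',
         'import tempfile; print(tempfile.gettempdir())'],
        env={**os.environ, 'TMPDIR': shm_stand_in},
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    assert out == shm_stand_in
```

Running a script as `TMPDIR=/dev/shm python script.py` applies the same mechanism to multiprocessing's get_temp_dir().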
msg298430 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-07-16 09:15
The main issue here is to restore the performance of the 2.7 version. Setting TMPDIR sounds too general and might have an impact on other libraries. I think it's ok if the shared array size is limited by the virtual memory size of the computer, since the whole point is to expose a fast communication channel between processes.
msg298431 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-07-16 09:24
This can break Python 3 applications that work with shared arrays larger than the size of physical memory. If we make this change, it should be optional.

Setting TMPDIR=/dev/shm restores the performance of the 2.7 version even on versions that we decide not to patch.
msg298432 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-07-16 09:37
Again, setting TMPDIR is a poor workaround as it will impact any library creating temporary files.

I haven't found anyone complaining about limited shared array size on 2.7, so likely it's not a common concern.  Right now, anyone creating a huge shared array on 3.x will be facing huge performance issues, as the array is backed by disk, so it doesn't sound realistic at all to assume people are doing that.
msg298549 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-07-17 17:45
Serhiy, I've changed the strategy in PR 2708: the partition's free space is now tested before deciding whether to locate the file there.
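The strategy described here amounts to checking free space before choosing the memory-backed location; a hedged sketch of the idea (the function name and fallback policy are assumptions, and the actual PR 2708 logic may differ):

```python
import os

def pick_temp_dir(required_size, candidate='/dev/shm', fallback='/tmp'):
    """Prefer the memory-backed candidate if it exists and has enough
    free space for the arena file; otherwise fall back.  Illustrative
    only: not the actual CPython implementation."""
    try:
        st = os.statvfs(candidate)
    except OSError:
        return fallback
    free = st.f_bavail * st.f_frsize  # bytes available to this user
    return candidate if free >= required_size else fallback

print(pick_temp_dir(1024))
```

This keeps the fast path for typical arrays while avoiding ENOSPC failures for arrays larger than the tmpfs, addressing Serhiy's objection.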
msg298897 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-07-23 11:05
New changeset 3051f0b78e53d1b771b49375dc139ca13f9fd76e by Antoine Pitrou in branch 'master':
bpo-30919: shared memory allocation performance regression in multiprocessing (#2708)
https://github.com/python/cpython/commit/3051f0b78e53d1b771b49375dc139ca13f9fd76e
History
Date | User | Action | Args
2017-07-23 11:05:43 | pitrou | set | status: open -> closed; resolution: fixed; stage: patch review -> resolved
2017-07-23 11:05:29 | pitrou | set | messages: + msg298897
2017-07-17 17:45:00 | pitrou | set | messages: + msg298549
2017-07-16 09:37:31 | pitrou | set | messages: + msg298432
2017-07-16 09:24:37 | serhiy.storchaka | set | messages: + msg298431
2017-07-16 09:15:26 | pitrou | set | messages: + msg298430
2017-07-16 09:11:49 | serhiy.storchaka | set | messages: + msg298429
2017-07-16 08:35:52 | pitrou | set | messages: + msg298425
2017-07-16 08:01:06 | serhiy.storchaka | set | messages: + msg298424
2017-07-14 20:58:07 | pitrou | set | messages: + msg298374; stage: needs patch -> patch review
2017-07-14 19:00:47 | pitrou | set | pull_requests: + pull_request2774
2017-07-14 18:24:28 | pitrou | set | versions: + Python 3.7, - Python 2.7, Python 3.5, Python 3.6; nosy: + pitrou; messages: + msg298367; stage: needs patch
2017-07-14 13:35:49 | gdr@garethrees.org | set | messages: + msg298354
2017-07-14 12:55:58 | dtasev | set | messages: + msg298352
2017-07-14 11:49:41 | gdr@garethrees.org | set | messages: + msg298348
2017-07-14 11:30:33 | dtasev | set | messages: + msg298346
2017-07-14 10:00:34 | gdr@garethrees.org | set | messages: + msg298345
2017-07-14 09:53:31 | gdr@garethrees.org | set | messages: + msg298344
2017-07-14 08:50:58 | dtasev | set | messages: + msg298343
2017-07-13 18:13:14 | gdr@garethrees.org | set | messages: + msg298305
2017-07-13 15:59:56 | gdr@garethrees.org | set | nosy: + gdr@garethrees.org; messages: + msg298294
2017-07-13 15:43:27 | serhiy.storchaka | set | messages: + msg298291
2017-07-13 15:23:34 | serhiy.storchaka | set | nosy: + serhiy.storchaka, davin, sbt; messages: + msg298289
2017-07-13 14:03:22 | dtasev | create |