Shared Array Memory Allocation Regression #75102

Closed

DTasev mannequin opened this issue Jul 13, 2017 · 22 comments

Labels
3.7 (EOL) end of life, performance (Performance or resource usage)

Comments

DTasev mannequin commented Jul 13, 2017

BPO 30919
Nosy @pitrou, @gareth-rees, @serhiy-storchaka, @applio, @dtasev
PRs
  • bpo-30919: shared memory allocation performance regression in multiprocessing #2708

Files
  • shared_array_alloc.py: Simple timeit call to allocate 4GB memory

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2017-07-23.11:05:43.640>
    created_at = <Date 2017-07-13.14:03:22.072>
    labels = ['3.7', 'performance']
    title = 'Shared Array Memory Allocation Regression'
    updated_at = <Date 2017-07-23.11:05:43.639>
    user = 'https://github.com/DTasev'

    bugs.python.org fields:

    activity = <Date 2017-07-23.11:05:43.639>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-07-23.11:05:43.640>
    closer = 'pitrou'
    components = []
    creation = <Date 2017-07-13.14:03:22.072>
    creator = 'dtasev'
    dependencies = []
    files = ['47009']
    hgrepos = []
    issue_num = 30919
    keywords = []
    message_count = 22.0
    messages = ['298285', '298289', '298291', '298294', '298305', '298343', '298344', '298345', '298346', '298348', '298352', '298354', '298367', '298374', '298424', '298425', '298429', '298430', '298431', '298432', '298549', '298897']
    nosy_count = 6.0
    nosy_names = ['pitrou', 'sbt', 'gdr@garethrees.org', 'serhiy.storchaka', 'davin', 'dtasev']
    pr_nums = ['2708']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue30919'
    versions = ['Python 3.7']

    DTasev mannequin commented Jul 13, 2017

    Hello, I have noticed a significant performance regression when allocating a large shared array in Python 3.x versus Python 2.7. The affected module seems to be multiprocessing.

    The function I used for benchmarking:

    from timeit import timeit
    timeit('sharedctypes.Array(ctypes.c_float, 500*2048*2048)', 'from multiprocessing import sharedctypes; import ctypes', number=1)

    And the results from executing it:

    Python 3.5.2
    Out[2]: 182.68500420999771

    -------------------

    Python 2.7.12
    Out[6]: 2.124835968017578

    I will try to provide any information you need. Right now I am looking at callgrind/cachegrind output without debug symbols and can post that; in the meantime I am building Python with debug symbols and will re-run callgrind/cachegrind.

    Allocating the same-size array with numpy does not show a difference between Python versions; the numpy command used was numpy.full((500, 2048, 2048), 5.0). Allocating a list with the same number of elements also shows no difference: arr = [5.0]*(500*2048*2048).

    DTasev mannequin added the performance (Performance or resource usage) label Jul 13, 2017
    serhiy-storchaka (Member) commented:

    This is because instantiating a new Arena is much slower in Python 3.

    ./python -m timeit -r1 -s 'from multiprocessing.heap import Arena' 'Arena(2**26)'

    Python 3.7: 1 loop, best of 1: 1.18 sec per loop
    Python 2.7: 100000 loops, best of 1: 9.19 usec per loop

    Maybe the regression was introduced by bpo-8713.

    serhiy-storchaka (Member) commented:

    It can be significantly sped up if we replace the writing loop with

        if size:
            os.lseek(self.fd, size, 0)
            os.write(self.fd, b'')
            os.lseek(self.fd, 0, 0)

    or just with

        os.truncate(self.fd, size)

    (not sure the latter always works).

    But instantiating an open Arena will still be slower than in 2.7.
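
    A minimal, illustrative timing sketch of the two approaches, written against a throwaway temporary file (the helper names here are invented for the example; the exact numbers will depend on whether the filesystem supports sparse files):

        import os
        import tempfile
        from timeit import timeit

        SIZE = 2**26  # 64 MiB, the same size as the Arena benchmark above

        def zero_fill(fd, size, bs=1024 * 1024):
            # Write explicit zero blocks, roughly what the current writing loop does.
            for _ in range(size // bs):
                os.write(fd, b'\0' * bs)
            os.write(fd, b'\0' * (size % bs))

        def truncate_fill(fd, size):
            # Extend with ftruncate(); POSIX says the new area reads back as zeros.
            os.ftruncate(fd, size)

        for fill in (zero_fill, truncate_fill):
            fd, name = tempfile.mkstemp()
            os.unlink(name)
            print(fill.__name__, timeit(lambda: fill(fd, SIZE), number=1))
            os.close(fd)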

    gareth-rees mannequin commented Jul 13, 2017

    In Python 2.7, multiprocessing.heap.Arena uses an anonymous memory mapping on Unix. Anonymous memory mappings can be shared between processes but only via fork().

    But Python 3 supports other ways of starting subprocesses (see bpo-8713 [1]) and so an anonymous memory mapping no longer works. So instead a temporary file is created, filled with zeros to the given size, and mapped into memory (see changeset 3b82e0d83bf9 [2]). It is the zero-filling of the temporary file that takes the time, because this forces the operating system to allocate space on the disk.

    But why not use ftruncate() (instead of write()) to quickly create a file with holes? POSIX says [3], "If the file size is increased, the extended area shall appear as if it were zero-filled" which would seem to satisfy the requirement.

    [1] https://bugs.python.org/issue8713
    [2] https://hg.python.org/cpython/rev/3b82e0d83bf9
    [3] http://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html
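
    A sketch of the approach proposed here, assuming a POSIX system (the helper name is illustrative, not the actual multiprocessing.heap.Arena code): create the backing file, extend it with ftruncate() instead of writing zeros, and map it into memory.

        import mmap
        import os
        import tempfile

        def file_backed_buffer(size, dir=None):
            # Illustrative sketch of a file-backed shared buffer.
            fd, name = tempfile.mkstemp(dir=dir)
            os.unlink(name)             # keep the fd, drop the directory entry
            os.ftruncate(fd, size)      # creates a "hole"; reads back as zeros
            return mmap.mmap(fd, size)  # shared, file-backed mapping

        buf = file_backed_buffer(2**26)  # 64 MiB, allocated near-instantly on sparse-file filesystems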

    gareth-rees mannequin commented Jul 13, 2017

    Note that some filesystems (e.g. HFS+) don't support sparse files, so creating a large Arena will still be slow on these filesystems even if the file is created using ftruncate().

    (This could be fixed, for the "fork" start method only, by using anonymous maps in that case.)

    DTasev mannequin commented Jul 14, 2017

    If I understand correctly, there is no way to force the old behaviour in Python 3.5, i.e. to use an anonymous memory mapping in multiprocessing.heap.Arena so that the memory can be shared between the processes instead of writing to a shared file?

    The data sizes are usually around 20 GB on average, and writing them out to a file is not desirable. As I understand from Gareth Rees' comment, ftruncate() will speed up initialisation, however any processing afterwards would be I/O capped.

    To shed more light on the processing going on: the data is handled as a 3D array, so each process gets a 2D array to operate on, and no information needs to be shared between processes.

    If the anonymous memory mapping cannot be forced, then the multiprocessing module with a shared array becomes unusable for me. Are you aware of any way to use the multiprocessing module to run execution in parallel, that somehow doesn't use a shared array?

    gareth-rees mannequin commented Jul 14, 2017

    If you need the 2.7 behaviour (anonymous mappings) in 3.5 then you can still do it, with some effort. I think the approach that requires the smallest amount of work would be to ensure that subprocesses are started using fork(), by calling multiprocessing.set_start_method('fork'), and then monkey-patch multiprocessing.heap.Arena.__init__ so that it creates anonymous mappings using mmap.mmap(-1, size).

    (I suggested above that Python could be modified to create anonymous mappings in the 'fork' case, but now that I look at the code in detail, I see that it would be tricky, because the Arena class has no idea about the Context in which it is going to be used -- at the moment you can create one shared object and then pass it to subprocesses under different Contexts, so the shared objects have to support the lowest common denominator.)

    gareth-rees mannequin commented Jul 14, 2017

    Nonetheless this is bound to be a nasty performance regression for many people doing big data processing with NumPy/SciPy/Pandas and multiprocessing who move from 2 to 3, so even if it can't be fixed, the documentation ought to warn about the problem and explain how to work around it.

    DTasev mannequin commented Jul 14, 2017

    I have looked into your advice of changing multiprocessing.heap.Arena.__init__: I have removed the code that allocates the file and reverted to the old behaviour.

    I have done some brief runs and it seems to bring back the old behaviour of allocating the space in RAM rather than via I/O. I am not sure what this might break, and it might make other usages of multiprocessing unstable!

    Can anyone think of anything this change might break? The Arena.__init__ code is the one from Python 2.7:

        class Arena(object):
            def __init__(self, size, fd=-1):
                self.size = size
                self.fd = fd  # still kept but is not used!
                self.buffer = mmap.mmap(-1, self.size)

    There does not seem to be a difference whether or not the start method is set to 'fork' via multiprocessing.set_start_method('fork').

    gareth-rees mannequin commented Jul 14, 2017

    I see now that the default start method is 'fork' (except on Windows), so calling set_start_method is unnecessary.

    Note that you don't have to edit multiprocessing/heap.py, you can "monkey-patch" it in the program that needs the anonymous mapping:

        import mmap
        from multiprocessing.heap import Arena

        def anonymous_arena_init(self, size, fd=-1):
            "Create Arena using an anonymous memory mapping."
            self.size = size
            self.fd = fd  # still kept but is not used !
            self.buffer = mmap.mmap(-1, self.size)

        Arena.__init__ = anonymous_arena_init

    As for what it will break — any code that uses the 'spawn' or 'forkserver' start methods.
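
    For completeness, a hypothetical usage sketch of the patched Arena (the array size is simply the one from the original report):

        import ctypes
        import multiprocessing
        from multiprocessing import sharedctypes

        if __name__ == '__main__':
            multiprocessing.set_start_method('fork')  # the patch only works with fork()
            # With the monkey-patch applied above, the backing storage is an
            # anonymous mapping, so allocation is fast again, as in Python 2.7.
            arr = sharedctypes.Array(ctypes.c_float, 500 * 2048 * 2048)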

    DTasev mannequin commented Jul 14, 2017

    Thank you, that is indeed the solution I ended up with, along with a large explanation of why it was necessary.

    Do you think that it's worth updating the multiprocessing documentation to make users aware of that behaviour?

    gareth-rees mannequin commented Jul 14, 2017

    I propose:

    1. Ask Richard Oudkerk why in changeset 3b82e0d83bf9 the temporary file is zero-filled and not truncated. Perhaps there's some file system where this is necessary? (I tested HFS+ which doesn't support sparse files, and zero-filling seems not to be necessary, but maybe there's some other file system where it is?)

    2. If there's no good reason for zero-filling the temporary file, replace it with a call to os.ftruncate(fd, size).

    3. Update the documentation to mention the performance issue when porting multiprocessing code from 2 to 3. Unfortunately, I don't think there's any advice that the documentation can give that will help work around it -- monkey-patching works but is not supported.

    4. Consider writing a fix, or at least a supported workaround. Here's a suggestion: update multiprocessing.sharedctypes and multiprocessing.heap so that they use anonymous maps in the 'fork' context. The idea is to update the RawArray and RawValue functions so that they take the context, and then pass the context down to _new_value, BufferWrapper.__init__ and thence to Heap.malloc where it can be used to determine what kind of Arena (file-backed or anonymous) should be used to satisfy the allocation request. The Heap class would have to segregate its blocks according to what kind of Arena they come from. (A rough sketch of the idea follows below.)
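
    A rough, hypothetical sketch of that idea, purely to illustrate the shape of the change (the function and parameter names below are invented for illustration and are not the actual multiprocessing API):

        import mmap
        import os
        import tempfile

        def make_arena_buffer(size, start_method):
            """Pick the mapping type based on the start method (illustrative only)."""
            if start_method == 'fork':
                # Children created by fork() inherit anonymous mappings directly.
                return mmap.mmap(-1, size)
            # 'spawn' and 'forkserver' children cannot inherit an anonymous
            # mapping, so fall back to a (sparse) file-backed mapping.
            fd, name = tempfile.mkstemp()
            os.unlink(name)
            os.ftruncate(fd, size)
            return mmap.mmap(fd, size)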

    pitrou commented Jul 14, 2017

    The original changeset in Richard's repository is https://hg.python.org/sandbox/sbt/rev/6c8554a7d068. Unless Richard answers otherwise, I think it's likely the performance degradation was an oversight.

    Given the code we're talking about is POSIX-specific, truncate() (or its sibling os.ftruncate()) is specified to always zero-fill. So it should be safe to use.

    Another related point is that the arena file is created inside get_temp_dir(), which allocates something in /tmp. On modern Linux systems at least, /tmp is usually backed by persistent storage. Instead, we could use /run/user/{os.getuid()}, which uses a tmpfs and would therefore save on I/O when later accessing the shared array.

    pitrou added the 3.7 (EOL) end of life label Jul 14, 2017
    pitrou commented Jul 14, 2017

    Gareth, Dimitar, you may want to take a look at #2708

    serhiy-storchaka (Member) commented:

    The size of the file system mounted on "/run/user/${uid}" is limited (only 200 MiB on my computer). get_temp_dir() will succeed, but this may not be enough for allocating a large shared array. And it is harder for the user to control, because the location is not specified by an environment variable.

    pitrou commented Jul 16, 2017

    The size of the file system mounted on "/run/user/${uid}" is limited (only 200 MiB on my computer)

    Hmm, you're right. Is /dev/shm ok on your computer?

    serhiy-storchaka (Member) commented:

    Yes, /dev/shm is ok on my computer.

    But there is one more drawback to this patch. Currently the size of a shared array is limited by the free space on the file system used for temporary files; if that isn't enough, you can easily set TMPDIR to another path. In 2.7, and with a hardcoded /dev/shm, it is limited by the size of physical memory (plus swap, perhaps), which is not easy to increase, especially for an unprivileged user.

    The effect of hardcoding /dev/shm can be achieved by setting TMPDIR to /dev/shm. Using os.ftruncate() adds roughly another 20% speed-up. This looks mainly like a documentation issue to me: the regression and the workaround should be documented.
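
    A minimal sketch of that workaround, assuming a Linux system where /dev/shm is a tmpfs mount; the variable has to be set before tempfile/multiprocessing choose a temporary directory:

        import os
        os.environ['TMPDIR'] = '/dev/shm'  # must run before any tempfile use in this process

        import ctypes
        from multiprocessing import sharedctypes

        # The Arena backing file now lives on tmpfs, so allocation is RAM-backed.
        arr = sharedctypes.Array(ctypes.c_float, 500 * 2048 * 2048)

    Setting the variable in the shell instead (TMPDIR=/dev/shm python script.py) is equivalent and avoids any import-order concerns.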

    pitrou commented Jul 16, 2017

    The main issue here is to restore the performance of the 2.7 version. Setting TMPDIR sounds too general and might have an impact on other libraries. I think it's ok if the shared array size is limited by the virtual memory size of the computer, since the whole point is to expose a fast communication channel between processes.

    serhiy-storchaka (Member) commented:

    This can break Python 3 applications that work with shared arrays larger than the size of physical memory. If we make this change, it should be optional.

    Setting TMPDIR=/dev/shm restores the performance of the 2.7 version even on versions that we decide not to patch.

    pitrou commented Jul 16, 2017

    Again, setting TMPDIR is a poor workaround as it will impact any library creating temporary files.

    I haven't found anyone complaining about limited shared array size on 2.7, so likely it's not a common concern. Right now, anyone creating a huge shared array on 3.x will be facing huge performance issues, as the array is backed by disk, so it doesn't sound realistic at all to assume people are doing that.

    pitrou commented Jul 17, 2017

    Serhiy, I've changed the strategy in PR 2708: now the partition's free space is tested before deciding whether to locate the file there or not.
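
    Illustratively, the free space of a candidate partition can be queried with os.statvfs before placing the backing file there (a sketch of the idea only, not the code in the PR; the candidate path is arbitrary):

        import os

        def has_enough_space(path, size):
            # Bytes available to unprivileged users on the file system holding `path`.
            st = os.statvfs(path)
            return st.f_bavail * st.f_frsize >= size

        candidate = '/dev/shm'
        needed = 500 * 2048 * 2048 * 4  # byte size of the c_float array from the original report
        if os.path.isdir(candidate) and has_enough_space(candidate, needed):
            print('the arena file can be placed on', candidate)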

    pitrou commented Jul 23, 2017

    New changeset 3051f0b by Antoine Pitrou in branch 'master':
    bpo-30919: shared memory allocation performance regression in multiprocessing (bpo-2708)

    pitrou closed this as completed Jul 23, 2017
    ezio-melotti transferred this issue from another repository Apr 10, 2022