Shared Array Memory Allocation Regression #75102

Closed

DTasev mannequin opened this issue Jul 13, 2017 · 22 comments

Labels
3.7 (EOL) end of life, performance (Performance or resource usage)

Comments

DTasev mannequin commented Jul 13, 2017

BPO 30919
Nosy @pitrou, @gareth-rees, @serhiy-storchaka, @applio, @dtasev
PRs
  • bpo-30919: shared memory allocation performance regression in multiprocessing #2708

Files
  • shared_array_alloc.py: Simple timeit call to allocate 4GB memory

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2017-07-23.11:05:43.640>
    created_at = <Date 2017-07-13.14:03:22.072>
    labels = ['3.7', 'performance']
    title = 'Shared Array Memory Allocation Regression'
    updated_at = <Date 2017-07-23.11:05:43.639>
    user = 'https://github.com/DTasev'

    bugs.python.org fields:

    activity = <Date 2017-07-23.11:05:43.639>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-07-23.11:05:43.640>
    closer = 'pitrou'
    components = []
    creation = <Date 2017-07-13.14:03:22.072>
    creator = 'dtasev'
    dependencies = []
    files = ['47009']
    hgrepos = []
    issue_num = 30919
    keywords = []
    message_count = 22.0
    messages = ['298285', '298289', '298291', '298294', '298305', '298343', '298344', '298345', '298346', '298348', '298352', '298354', '298367', '298374', '298424', '298425', '298429', '298430', '298431', '298432', '298549', '298897']
    nosy_count = 6.0
    nosy_names = ['pitrou', 'sbt', 'gdr@garethrees.org', 'serhiy.storchaka', 'davin', 'dtasev']
    pr_nums = ['2708']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue30919'
    versions = ['Python 3.7']

    DTasev mannequin commented Jul 13, 2017

    Hello, I have noticed a significant performance regression when allocating a large shared array in Python 3.x versus Python 2.7. The affected module seems to be multiprocessing.

    The function I used for benchmarking:

    from timeit import timeit
    timeit('sharedctypes.Array(ctypes.c_float, 500*2048*2048)', 'from multiprocessing import sharedctypes; import ctypes', number=1)

    And the results from executing it:

    Python 3.5.2
    Out[2]: 182.68500420999771

    -------------------

    Python 2.7.12
    Out[6]: 2.124835968017578

    I will try to provide any information you need. Right now I am looking at callgrind/cachegrind output without debug symbols and can post that; in the meantime I am building Python with debug symbols and will re-run callgrind/cachegrind.

    Allocating the same-size array with numpy does not show a difference between Python versions; the numpy command used was numpy.full((500, 2048, 2048), 5.0). Allocating a list with the same number of elements also shows no difference: arr = [5.0]*(500*2048*2048).

    DTasev mannequin added the performance (Performance or resource usage) label Jul 13, 2017
    serhiy-storchaka (Member) commented:

    This is because instantiating a new Arena is much slower in Python 3.

    ./python -m timeit -r1 -s 'from multiprocessing.heap import Arena' 'Arena(2**26)'

    Python 3.7: 1 loop, best of 1: 1.18 sec per loop
    Python 2.7: 100000 loops, best of 1: 9.19 usec per loop

    Maybe the regression was introduced by bpo-8713.

    serhiy-storchaka (Member) commented:

    It can be significantly sped up if we replace the writing loop with

        if size:
            os.lseek(self.fd, size, 0)
            os.write(self.fd, b'')
            os.lseek(self.fd, 0, 0)

    or just with

        os.truncate(self.fd, size)

    (not sure the latter always works).

    But instantiating an open Arena will still be slower than in 2.7.
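
    A minimal, illustrative timing sketch of the two approaches, written against a throwaway temporary file (the helper names here are invented for the example; the exact numbers will depend on whether the filesystem supports sparse files):

        import os
        import tempfile
        from timeit import timeit

        SIZE = 2**26  # 64 MiB, the same size as the Arena benchmark above

        def zero_fill(fd, size, bs=1024 * 1024):
            # Write explicit zero blocks, roughly what the current writing loop does.
            for _ in range(size // bs):
                os.write(fd, b'\0' * bs)
            os.write(fd, b'\0' * (size % bs))

        def truncate_fill(fd, size):
            # Extend with ftruncate(); POSIX says the new area reads back as zeros.
            os.ftruncate(fd, size)

        for fill in (zero_fill, truncate_fill):
            fd, name = tempfile.mkstemp()
            os.unlink(name)
            print(fill.__name__, timeit(lambda: fill(fd, SIZE), number=1))
            os.close(fd)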

    gareth-rees mannequin commented Jul 13, 2017

    In Python 2.7, multiprocessing.heap.Arena uses an anonymous memory mapping on Unix. Anonymous memory mappings can be shared between processes but only via fork().

    But Python 3 supports other ways of starting subprocesses (see bpo-8713 [1]) and so an anonymous memory mapping no longer works. So instead a temporary file is created, filled with zeros to the given size, and mapped into memory (see changeset 3b82e0d83bf9 [2]). It is the zero-filling of the temporary file that takes the time, because this forces the operating system to allocate space on the disk.

    But why not use ftruncate() (instead of write()) to quickly create a file with holes? POSIX says [3], "If the file size is increased, the extended area shall appear as if it were zero-filled" which would seem to satisfy the requirement.

    [1] https://bugs.python.org/issue8713
    [2] https://hg.python.org/cpython/rev/3b82e0d83bf9
    [3] http://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html
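
    A sketch of the approach proposed here, assuming a POSIX system (the helper name is illustrative, not the actual multiprocessing.heap.Arena code): create the backing file, extend it with ftruncate() instead of writing zeros, and map it into memory.

        import mmap
        import os
        import tempfile

        def file_backed_buffer(size, dir=None):
            # Illustrative sketch of a file-backed shared buffer.
            fd, name = tempfile.mkstemp(dir=dir)
            os.unlink(name)             # keep the fd, drop the directory entry
            os.ftruncate(fd, size)      # creates a "hole"; reads back as zeros
            return mmap.mmap(fd, size)  # shared, file-backed mapping

        buf = file_backed_buffer(2**26)  # 64 MiB, allocated near-instantly on sparse-file filesystems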

    gareth-rees mannequin commented Jul 13, 2017

    Note that some filesystems (e.g. HFS+) don't support sparse files, so creating a large Arena will still be slow on these filesystems even if the file is created using ftruncate().

    (This could be fixed, for the "fork" start method only, by using anonymous maps in that case.)

    DTasev mannequin commented Jul 14, 2017

    If I understand correctly, there is no way to force the old behaviour in Python 3.5, i.e. to use an anonymous memory mapping in multiprocessing.heap.Arena so that the memory can be shared between the processes instead of writing to a shared file?

    The data sizes are usually around 20 GB on average, and writing them out to a file is not desirable. As I understand from Gareth Rees' comment, ftruncate() will speed up initialisation, however any processing afterwards would be I/O capped.

    To shed more light on the processing going on: the data is handled as a 3D array, so each process gets a 2D array to operate on, and no information needs to be shared between processes.

    If the anonymous memory mapping cannot be forced, then the multiprocessing module with a shared array becomes unusable for me. Are you aware of any way to use the multiprocessing module to run execution in parallel, that somehow doesn't use a shared array?

    gareth-rees mannequin commented Jul 14, 2017

    If you need the 2.7 behaviour (anonymous mappings) in 3.5 then you can still do it, with some effort. I think the approach that requires the smallest amount of work would be to ensure that subprocesses are started using fork(), by calling multiprocessing.set_start_method('fork'), and then monkey-patch multiprocessing.heap.Arena.__init__ so that it creates anonymous mappings using mmap.mmap(-1, size).

    (I suggested above that Python could be modified to create anonymous mappings in the 'fork' case, but now that I look at the code in detail, I see that it would be tricky, because the Arena class has no idea about the Context in which it is going to be used -- at the moment you can create one shared object and then pass it to subprocesses under different Contexts, so the shared objects have to support the lowest common denominator.)

    gareth-rees mannequin commented Jul 14, 2017

    Nonetheless this is bound to be a nasty performance regression for many people doing big data processing with NumPy/SciPy/Pandas and multiprocessing who move from 2 to 3, so even if it can't be fixed, the documentation ought to warn about the problem and explain how to work around it.

    DTasev mannequin commented Jul 14, 2017

    I have looked into your advice of changing multiprocessing.heap.Arena.__init__: I have removed the code that allocates the file and reverted to the old behaviour.

    I have done some brief runs and it seems to bring back the old behaviour of allocating the space in RAM rather than via I/O. I am not sure what this might break, and it might make other usages of multiprocessing unstable!

    Can anyone think of anything this change might break? The Arena.__init__ code is the one from Python 2.7:

        class Arena(object):
            def __init__(self, size, fd=-1):
                self.size = size
                self.fd = fd  # still kept but is not used!
                self.buffer = mmap.mmap(-1, self.size)

    There does not seem to be a difference whether or not the start method is set to 'fork' via multiprocessing.set_start_method('fork').

    gareth-rees mannequin commented Jul 14, 2017

    I see now that the default start method is 'fork' (except on Windows), so calling set_start_method is unnecessary.

    Note that you don't have to edit multiprocessing/heap.py, you can "monkey-patch" it in the program that needs the anonymous mapping:

        import mmap
        from multiprocessing.heap import Arena

        def anonymous_arena_init(self, size, fd=-1):
            "Create Arena using an anonymous memory mapping."
            self.size = size
            self.fd = fd  # still kept but is not used !
            self.buffer = mmap.mmap(-1, self.size)

        Arena.__init__ = anonymous_arena_init

    As for what it will break — any code that uses the 'spawn' or 'forkserver' start methods.
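
    For completeness, a hypothetical usage sketch of the patched Arena (the array size is simply the one from the original report):

        import ctypes
        import multiprocessing
        from multiprocessing import sharedctypes

        if __name__ == '__main__':
            multiprocessing.set_start_method('fork')  # the patch only works with fork()
            # With the monkey-patch applied above, the backing storage is an
            # anonymous mapping, so allocation is fast again, as in Python 2.7.
            arr = sharedctypes.Array(ctypes.c_float, 500 * 2048 * 2048)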

    DTasev mannequin commented Jul 14, 2017

    Thank you, that is indeed the solution I ended up with, along with a large explanation of why it was necessary.

    Do you think that it's worth updating the multiprocessing documentation to make users aware of that behaviour?

    gareth-rees mannequin commented Jul 14, 2017

    I propose:

    1. Ask Richard Oudkerk why in changeset 3b82e0d83bf9 the temporary file is zero-filled and not truncated. Perhaps there's some file system where this is necessary? (I tested HFS+ which doesn't support sparse files, and zero-filling seems not to be necessary, but maybe there's some other file system where it is?)

    2. If there's no good reason for zero-filling the temporary file, replace it with a call to os.ftruncate(fd, size).

    3. Update the documentation to mention the performance issue when porting multiprocessing code from 2 to 3. Unfortunately, I don't think there's any advice that the documentation can give that will help work around it -- monkey-patching works but is not supported.

    4. Consider writing a fix, or at least a supported workaround. Here's a suggestion: update multiprocessing.sharedctypes and multiprocessing.heap so that they use anonymous maps in the 'fork' context. The idea is to update the RawArray and RawValue functions so that they take the context, and then pass the context down to _new_value, BufferWrapper.__init__ and thence to Heap.malloc where it can be used to determine what kind of Arena (file-backed or anonymous) should be used to satisfy the allocation request. The Heap class would have to segregate its blocks according to what kind of Arena they come from. (A rough sketch of the idea follows below.)
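
    A rough, hypothetical sketch of that idea, purely to illustrate the shape of the change (the function and parameter names below are invented for illustration and are not the actual multiprocessing API):

        import mmap
        import os
        import tempfile

        def make_arena_buffer(size, start_method):
            """Pick the mapping type based on the start method (illustrative only)."""
            if start_method == 'fork':
                # Children created by fork() inherit anonymous mappings directly.
                return mmap.mmap(-1, size)
            # 'spawn' and 'forkserver' children cannot inherit an anonymous
            # mapping, so fall back to a (sparse) file-backed mapping.
            fd, name = tempfile.mkstemp()
            os.unlink(name)
            os.ftruncate(fd, size)
            return mmap.mmap(fd, size)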

    pitrou commented Jul 14, 2017

    The original changeset in Richard's repository is https://hg.python.org/sandbox/sbt/rev/6c8554a7d068. Unless Richard answers otherwise, I think it's likely the performance degradation was an oversight.

    Given the code we're talking about is POSIX-specific, truncate() (or its sibling os.ftruncate()) is specified to always zero-fill. So it should be safe to use.

    Another related point is that the arena file is created inside get_temp_dir(), which allocates something in /tmp. On modern Linux systems at least, /tmp is usually backed by persistent storage. Instead, we could use /run/user/{os.getuid()}, which uses a tmpfs and would therefore save on I/O when later accessing the shared array.

    pitrou added the 3.7 (EOL) end of life label Jul 14, 2017
    pitrou commented Jul 14, 2017

    Gareth, Dimitar, you may want to take a look at #2708

    serhiy-storchaka (Member) commented:

    The size of the file system mounted on "/run/user/${uid}" is limited (only 200 MiB on my computer). get_temp_dir() will succeed, but this may not be enough for allocating a large shared array. And it is harder for the user to control, because the location is not specified by an environment variable.

    pitrou commented Jul 16, 2017

    The size of the file system mounted on "/run/user/${uid}" is limited (only 200 MiB on my computer)

    Hmm, you're right. Is /dev/shm ok on your computer?

    serhiy-storchaka (Member) commented:

    Yes, /dev/shm is ok on my computer.

    But there is one more drawback to this patch. Currently the size of a shared array is limited by the free space on the file system used for temporary files; if that isn't enough, you can easily set TMPDIR to another path. In 2.7, and with a hardcoded /dev/shm, it is limited by the size of physical memory (plus swap, perhaps), which is not easy to increase, especially for an unprivileged user.

    The effect of hardcoding /dev/shm can be achieved by setting TMPDIR to /dev/shm. Using os.ftruncate() adds roughly another 20% speed-up. This looks mainly like a documentation issue to me: the regression and the workaround should be documented.
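
    A minimal sketch of that workaround, assuming a Linux system where /dev/shm is a tmpfs mount; the variable has to be set before tempfile/multiprocessing choose a temporary directory:

        import os
        os.environ['TMPDIR'] = '/dev/shm'  # must run before any tempfile use in this process

        import ctypes
        from multiprocessing import sharedctypes

        # The Arena backing file now lives on tmpfs, so allocation is RAM-backed.
        arr = sharedctypes.Array(ctypes.c_float, 500 * 2048 * 2048)

    Setting the variable in the shell instead (TMPDIR=/dev/shm python script.py) is equivalent and avoids any import-order concerns.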

    pitrou commented Jul 16, 2017

    The main issue here is to restore the performance of the 2.7 version. Setting TMPDIR sounds too general and might have an impact on other libraries. I think it's ok if the shared array size is limited by the virtual memory size of the computer, since the whole point is to expose a fast communication channel between processes.

    serhiy-storchaka (Member) commented:

    This can break Python 3 applications that work with shared arrays larger than the size of physical memory. If we make this change, it should be optional.

    Setting TMPDIR=/dev/shm restores the performance of the 2.7 version even on versions that we decide not to patch.

    pitrou commented Jul 16, 2017

    Again, setting TMPDIR is a poor workaround as it will impact any library creating temporary files.

    I haven't found anyone complaining about limited shared array size on 2.7, so likely it's not a common concern. Right now, anyone creating a huge shared array on 3.x will be facing huge performance issues, as the array is backed by disk, so it doesn't sound realistic at all to assume people are doing that.

    pitrou commented Jul 17, 2017

    Serhiy, I've changed the strategy in PR 2708: now the partition's free space is tested before deciding whether to locate the file there or not.
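
    Illustratively, the free space of a candidate partition can be queried with os.statvfs before placing the backing file there (a sketch of the idea only, not the code in the PR; the candidate path is arbitrary):

        import os

        def has_enough_space(path, size):
            # Bytes available to unprivileged users on the file system holding `path`.
            st = os.statvfs(path)
            return st.f_bavail * st.f_frsize >= size

        candidate = '/dev/shm'
        needed = 500 * 2048 * 2048 * 4  # byte size of the c_float array from the original report
        if os.path.isdir(candidate) and has_enough_space(candidate, needed):
            print('the arena file can be placed on', candidate)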

    pitrou commented Jul 23, 2017

    New changeset 3051f0b by Antoine Pitrou in branch 'master':
    bpo-30919: shared memory allocation performance regression in multiprocessing (bpo-2708)

    pitrou closed this as completed Jul 23, 2017
    ezio-melotti transferred this issue from another repository Apr 10, 2022