
reduce multiprocessing.Queue contention #61227

Closed

neologix mannequin opened this issue Jan 24, 2013 · 20 comments
Labels
performance Performance or resource usage

Comments

@neologix
Mannequin

neologix mannequin commented Jan 24, 2013

BPO 17025
Nosy @pitrou
Files
  • queues_contention.diff
  • multi_queue.py
  • locked_send_recv.patch
  • queues_contention-1.diff
  • queues_contention-3.diff
  • queues_contention.diff
  • forkingpickler.diff
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2013-03-25.17:43:58.104>
    created_at = <Date 2013-01-24.15:49:06.682>
    labels = ['performance']
    title = 'reduce multiprocessing.Queue contention'
    updated_at = <Date 2013-03-25.17:43:58.102>
    user = 'https://bugs.python.org/neologix'

    bugs.python.org fields:

    activity = <Date 2013-03-25.17:43:58.102>
    actor = 'neologix'
    assignee = 'none'
    closed = True
    closed_date = <Date 2013-03-25.17:43:58.104>
    closer = 'neologix'
    components = []
    creation = <Date 2013-01-24.15:49:06.682>
    creator = 'neologix'
    dependencies = []
    files = ['28812', '28813', '28873', '29528', '29529', '29559', '29560']
    hgrepos = []
    issue_num = 17025
    keywords = ['patch']
    message_count = 20.0
    messages = ['180532', '180533', '180781', '180782', '181266', '183486', '183487', '183493', '183494', '183500', '183523', '184871', '184875', '185128', '185130', '185131', '185136', '185139', '185214', '185217']
    nosy_count = 4.0
    nosy_names = ['pitrou', 'neologix', 'python-dev', 'sbt']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue17025'
    versions = ['Python 3.4']

    @neologix
    Mannequin Author

    neologix mannequin commented Jan 24, 2013

    Here's an implementation of the idea posted on python-ideas (http://mail.python.org/pipermail/python-ideas/2013-January/018846.html).

    The principle is really simple: we serialize the objects before taking the locks and deserialize them after releasing them, so the locks are held only for the raw send/receive. This reduces contention.
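
    For illustration, a minimal sketch of the idea (the attribute names mirror SimpleQueue's internals, but this is a sketch, not the actual patch):

        import pickle

        class SimpleQueueSketch:
            # _reader/_writer are the two Connection ends of a Pipe;
            # _rlock/_wlock are multiprocessing.Lock objects
            def put(self, obj):
                buf = pickle.dumps(obj)      # serialize before taking the lock
                with self._wlock:            # lock held only for the raw write
                    self._writer.send_bytes(buf)

            def get(self):
                with self._rlock:            # lock held only for the raw read
                    buf = self._reader.recv_bytes()
                return pickle.loads(buf)     # deserialize after releasing it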

    Here are the results of a benchmark running from 1 reader/1 writer up to 4 readers/4 writers, on an 8-core box:

    without patch:
    $ ./python /tmp/multi_queue.py
    took 0.8340198993682861 seconds with 1 workers
    took 1.956531047821045 seconds with 2 workers
    took 3.175778865814209 seconds with 3 workers
    took 4.277260780334473 seconds with 4 workers

    with patch:
    $ ./python /tmp/multi_queue.py
    took 0.7945001125335693 seconds with 1 workers
    took 0.7428359985351562 seconds with 2 workers
    took 0.7897098064422607 seconds with 3 workers
    took 1.1860828399658203 seconds with 4 workers

    I tried Richard's suggestion of serializing the data inside put(), but this reduces performance quite notably:
    $ ./python /tmp/multi_queue.py
    took 1.412883996963501 seconds with 1 workers
    took 1.3212130069732666 seconds with 2 workers
    took 1.2271699905395508 seconds with 3 workers
    took 1.4817359447479248 seconds with 4 workers

    Although I didn't analyse it further, I guess one reason could be that if the serialization is done in put(), the feeder thread has nothing to do but wait for data to be available in the buffer, send it, and block until there's more: basically, it barely uses its time slice, and spends its time blocking and context-switching.
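
    As a rough model of the chosen design (a sketch: the attribute names follow multiprocessing/queues.py, but this is not the real code), pickling happens in the feeder thread, so it overlaps with the producers:

        import pickle
        from collections import deque
        from threading import Condition

        class QueueSketch:
            def __init__(self, writer):
                self._writer = writer        # Connection end of a Pipe
                self._buffer = deque()
                self._notempty = Condition()

            def put(self, obj):
                # cheap for the caller: just enqueue and notify;
                # the pickling is deferred to the feeder thread
                with self._notempty:
                    self._buffer.append(obj)
                    self._notempty.notify()

            def _feed(self):
                # runs in the feeder thread
                while True:
                    with self._notempty:
                        while not self._buffer:
                            self._notempty.wait()
                        obj = self._buffer.popleft()
                    data = pickle.dumps(obj)       # CPU work done here keeps
                    self._writer.send_bytes(data)  # the feeder thread busy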

    @neologix neologix mannequin added the performance Performance or resource usage label Jan 24, 2013
    @pitrou
    Member

    pitrou commented Jan 24, 2013

    > Here's an implementation of the idea posted on python-ideas
    > (http://mail.python.org/pipermail/python-ideas/2013-January/018846.html).
    >
    > The principle is really simple: we serialize the objects before taking
    > the locks and deserialize them after releasing them, so the locks are
    > held only for the raw send/receive. This reduces contention.

    I would like to suggest again my idea of doing it in Connection instead,
    with new methods (e.g. locked_send and locked_recv). Especially given
    it can be useful in user code to have a thread-safe Connection (I'm in
    this situation currently).
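
    A hypothetical shape of that suggestion (locked_send/locked_recv are not existing Connection methods; this is only a sketch of what the API might look like, as a wrapper with thread-level locks):

        import threading

        class LockedConnection:
            """Wrap a Connection so concurrent threads don't interleave."""

            def __init__(self, conn):
                self._conn = conn
                self._send_lock = threading.Lock()  # a multiprocessing.Lock
                self._recv_lock = threading.Lock()  # would be needed for
                                                    # cross-process use

            def locked_send(self, obj):
                with self._send_lock:
                    self._conn.send(obj)

            def locked_recv(self):
                with self._recv_lock:
                    return self._conn.recv()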

    @neologix
    Mannequin Author

    neologix mannequin commented Jan 27, 2013

    > I would like to suggest again my idea of doing it in Connection instead,
    > with new methods (e.g. locked_send and locked_recv). Especially given
    > it can be useful in user code to have a thread-safe Connection (I'm in
    > this situation currently).

    I intended to do this initially, but it turned out to be much more
    intrusive than I initially thought, so I opted for a simpler approach.

    While it's probably a good idea to implement it in Connection, I don't
    really like the idea of adding new distinct methods:

    • this would require allocating locks for every connection, locks
      which wouldn't be used most of the time
    • since locks are implemented atop POSIX semaphores, and some platforms
      only support a handful of them, it could trigger failures
    • it's not really just adding locked_send() and locked_recv(): you
      must implement locked versions of
      send_bytes/send/recv_bytes/recv_bytes_into/recv; also, if we want to
      support timed and non-blocking receive (which Queue.get supports),
      they should probably be handled here

    So basically, I think the right way to do it would be to define an
    abstract ConcurrentConnection or AtomicConnection which would
    override Connection methods, making them thread/process-safe.
    We'd also need to expose an AtomicPipe...

    @pitrou
    Member

    pitrou commented Jan 27, 2013

    For the record, I tried the Connection approach and here is what I ended up with.

    @neologix
    Mannequin Author

    neologix mannequin commented Feb 3, 2013

    > For the record, I tried the Connection approach and here is what I ended up with.

    I don't really like the API.
    Having to pass an external lock is IMO a bad idea; it should be a
    private instance field.
    Also, for consistency we'd probably need send_bytes/recv_bytes.

    @neologix
    Mannequin Author

    neologix mannequin commented Mar 4, 2013

    So, what do you think?
    Is the simple version, which offloads the serialization to the queue,
    enough, or should we go for a full-blown atomic Connection/Pipe/etc.?

    I find the performance gain quite appreciable (basically the queue
    didn't scale at all; now it scales with the number of cores).

    @pitrou
    Member

    pitrou commented Mar 4, 2013

    IMHO the simple version is good enough.

    @sbt
    Mannequin

    sbt mannequin commented Mar 4, 2013

    It looks like queues_contention.diff has the line

        obj = pickle.dumps(obj)

    in both _feed() and put(). Might that be why the third set of benchmarks was slower than the second?

    @neologix
    Mannequin Author

    neologix mannequin commented Mar 4, 2013

    > It looks like queues_contention.diff has the line
    >
    >     obj = pickle.dumps(obj)
    >
    > in both _feed() and put(). Might that be why the third set of benchmarks
    > was slower than the second?

    _feed() is a Queue method, put() its SimpleQueue() counterpart. Am I
    missing something?

    @sbt
    Mannequin

    sbt mannequin commented Mar 4, 2013

    On 04/03/2013 8:01pm, Charles-François Natali wrote:

    >> It looks like queues_contention.diff has the line
    >>
    >>     obj = pickle.dumps(obj)
    >>
    >> in both _feed() and put(). Might that be why the third set of benchmarks
    >> was slower than the second?
    >
    > _feed() is a Queue method, put() its SimpleQueue() counterpart. Am I
    > missing something?

    No. I only looked at the diff and assumed both changes were for Queue.

    Since you marked bpo-10886 as superseded, do you intend to do the
    pickling in put()?

    @neologix
    Mannequin Author

    neologix mannequin commented Mar 5, 2013

    > No. I only looked at the diff and assumed both changes were for Queue.

    OK, great.

    > Since you marked bpo-10886 as superseded, do you intend to do the
    > pickling in put()?

    Actually no, I'll reopen it.
    I find the performance hit significant, though.

    @pitrou
    Member

    pitrou commented Mar 21, 2013

    By the way, I forgot to mention it previously, but multiprocessing.connection uses a custom pickler (ForkingPickler). By replacing it with plain pickle.dumps() calls, you may produce regressions since some types won't be sendable anymore (although the test suite might not check for this).
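
    For reference, the custom pickler is driven roughly like this (an illustrative helper, not an existing API; at the time ForkingPickler lived in multiprocessing.forking, and it later moved to multiprocessing.reduction):

        import io
        from multiprocessing.forking import ForkingPickler

        def _dumps(obj):
            buf = io.BytesIO()
            ForkingPickler(buf).dump(obj)  # registered reducers keep types
            return buf.getvalue()          # like Connection picklable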

    @neologix
    Mannequin Author

    neologix mannequin commented Mar 21, 2013

    > By the way, I forgot to mention it previously, but
    > multiprocessing.connection uses a custom pickler (ForkingPickler).

    Thanks, I didn't know.
    Here's a patch using ForkingPickler.

    I did a bit of refactoring to move the pickling code from
    connection.py to forking.py.

    @neologix
    Mannequin Author

    neologix mannequin commented Mar 24, 2013

    I'm splitting the patches:

    • one which adds loads() and dumps() to ForkingPickler (sketched below)
    • the contention reduction patch

    I'd like to commit them soon.
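
    Roughly, the first patch adds classmethods along these lines (a sketch; changeset bedb4cbdd311 below has the real code):

        import io
        import pickle

        class ForkingPickler(pickle.Pickler):
            ...  # multiprocessing registers its custom reducers elsewhere

            @classmethod
            def dumps(cls, obj, protocol=None):
                buf = io.BytesIO()
                cls(buf, protocol).dump(obj)   # run the custom reducers
                return buf.getbuffer()

            loads = pickle.loads               # plain unpickling suffices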

    @sbt
    Mannequin

    sbt mannequin commented Mar 24, 2013

    The old code deleted the obj in the feeder thread as soon as it was sent at lines 247 and 253 -- see Issue bpo-16284. I think that should be retained.

    Apart from that LGTM.

    @neologix
    Mannequin Author

    neologix mannequin commented Mar 24, 2013

    > The old code deleted the obj in the feeder thread as soon as it was sent at lines 247 and 253 -- see Issue bpo-16284. I think that should be retained.

    The object is overwritten by the pickled data, so it's not necessary
    anymore, no?
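
    Concretely, in the patched feeder the rebinding plays the role the explicit del used to; an illustrative fragment (not the real _feed(), and the import path is the one from this era of the code):

        from multiprocessing.forking import ForkingPickler

        def _feed_step(buffer, send_bytes):
            obj = buffer.popleft()
            obj = ForkingPickler.dumps(obj)  # rebinding obj drops the last
            send_bytes(obj)                  # reference, as "del obj" did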

    @sbt
    Mannequin

    sbt mannequin commented Mar 24, 2013

    On 24/03/2013 12:16pm, Charles-François Natali wrote:

    > The object is overwritten by the pickled data, so it's not necessary
    > anymore, no?

    Yes, you are right.

    @python-dev
    Mannequin

    python-dev mannequin commented Mar 24, 2013

    New changeset bedb4cbdd311 by Charles-François Natali in branch 'default':
    Issue bpo-17025: Add dumps() and loads() to ForkingPickler.
    http://hg.python.org/cpython/rev/bedb4cbdd311

    @python-dev
    Mannequin

    python-dev mannequin commented Mar 25, 2013

    New changeset 5022ee7e13a2 by Charles-François Natali in branch 'default':
    Issue bpo-17025: multiprocessing: Reduce Queue and SimpleQueue contention.
    http://hg.python.org/cpython/rev/5022ee7e13a2

    @neologix
    Mannequin Author

    neologix mannequin commented Mar 25, 2013

    Committed, thanks!
