classification
Title: test_multiprocessing_xxx hangs under Gentoo buildbots
Type: crash Stage:
Components: Interpreter Core, Library (Lib), Tests Versions: Python 3.4, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, benjamin.peterson, christian.heimes, georg.brandl, haypo, larry, nadeem.vawda, neologix, pitrou, python-dev, r.david.murray, sbt
Priority: release blocker Keywords: buildbot, patch

Created on 2013-10-11 18:51 by pitrou, last changed 2013-11-17 08:19 by python-dev. This issue is now closed.

Files
File name Uploaded Description Edit
unlock_ssl_locks.patch christian.heimes, 2013-10-20 18:08
atfork_prepare.patch christian.heimes, 2013-10-26 23:05 review
Messages (30)
msg199497 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-11 18:51
test_multiprocessing has started to hang frequently on the Gentoo buildbots.

It seems it has started happening with the "OpenSSL re-seeding" commits:
http://hg.python.org/cpython/rev/8e1194c39beddb83337c0acb9e4c2922a02a36cf

David, could you try to investigate on the buildbot? If this security fix can crash some machines, we'd better find a workaround.
msg199509 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-11 21:41
Running test_socket test_ssl test_multiprocessing_fork test_multiprocessing_forkserver test_multiprocessing_spaw under -F on the buildbot, I got the following failure during the second loop:

[ 10] test_multiprocessing_spawn
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/rdmurray/p34/Lib/multiprocessing/spawn.py", line 96, in spawn_main
    exitcode = _main(fd)
  File "/home/rdmurray/p34/Lib/multiprocessing/spawn.py", line 105, in _main
    prepare(preparation_data)
  File "/home/rdmurray/p34/Lib/multiprocessing/spawn.py", line 210, in prepare
    import_main_path(data['main_path'])
  File "/home/rdmurray/p34/Lib/multiprocessing/spawn.py", line 256, in import_main_path
    exec(code, main_module.__dict__)
  File "/home/rdmurray/p34/Lib/test/regrtest.py", line 136, in <module>
    import shutil
  File "/home/rdmurray/p34/Lib/shutil.py", line 14, in <module>
    import tarfile
  File "<frozen importlib._bootstrap>", line 1599, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1566, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 607, in _check_name_wrapper
  File "<frozen importlib._bootstrap>", line 1056, in load_module
  File "<frozen importlib._bootstrap>", line 922, in load_module
  File "<frozen importlib._bootstrap>", line 1025, in get_code
  File "<frozen importlib._bootstrap>", line 1065, in get_data
OSError: [Errno 23] Too many open files in system: '/home/rdmurray/p34/Lib/tarfile.py'
Process Process-1410:
Traceback (most recent call last):
  File "/home/rdmurray/p34/Lib/multiprocessing/process.py", line 255, in _bootstrap
    self.run()
  File "/home/rdmurray/p34/Lib/multiprocessing/process.py", line 92, in run
    self._target(*self._args, **self._kwargs)
Process Process-1411:
Traceback (most recent call last):
Process Process-1412:
Traceback (most recent call last):
  File "/home/rdmurray/p34/Lib/multiprocessing/process.py", line 255, in _bootstrap
    self.run()
  File "/home/rdmurray/p34/Lib/multiprocessing/process.py", line 92, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 1135, in task
    self.f(*self.args)
  File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 1219, in _test_wait_return_f
    res = barrier.wait()
  File "/home/rdmurray/p34/Lib/multiprocessing/process.py", line 255, in _bootstrap
    self.run()
  File "/home/rdmurray/p34/Lib/multiprocessing/process.py", line 92, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rdmurray/p34/Lib/threading.py", line 395, in wait
    self._wait(timeout)
  File "/home/rdmurray/p34/Lib/threading.py", line 435, in _wait
    raise BrokenBarrierError
threading.BrokenBarrierError
  File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 1135, in task
    self.f(*self.args)
  File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 1219, in _test_wait_return_f
    res = barrier.wait()
  File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 1135, in task
    self.f(*self.args)
  File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 1219, in _test_wait_return_f
    res = barrier.wait()
  File "/home/rdmurray/p34/Lib/threading.py", line 395, in wait
    self._wait(timeout)
  File "/home/rdmurray/p34/Lib/threading.py", line 395, in wait
    self._wait(timeout)
  File "/home/rdmurray/p34/Lib/threading.py", line 435, in _wait
    raise BrokenBarrierError
threading.BrokenBarrierError
  File "/home/rdmurray/p34/Lib/threading.py", line 435, in _wait
    raise BrokenBarrierError
threading.BrokenBarrierError
test test_multiprocessing_spawn failed -- Traceback (most recent call last):
  File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 1227, in test_wait_return
    self.run_threads(self._test_wait_return_f, (self.barrier, queue))
  File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 1184, in run_threads
    f(*args)
  File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 1219, in _test_wait_return_f
    res = barrier.wait()
  File "/home/rdmurray/p34/Lib/threading.py", line 395, in wait
    self._wait(timeout)
  File "/home/rdmurray/p34/Lib/threading.py", line 433, in _wait
    raise BrokenBarrierError
threading.BrokenBarrierError
msg199529 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-12 01:23
Here is a more useful traceback:

[394] test_multiprocessing_spawn
Timeout (1:00:00)!
Thread 0xb76a16c0:
  File "/home/rdmurray/p34/Lib/multiprocessing/popen_fork.py", line 30 in poll
  File "/home/rdmurray/p34/Lib/multiprocessing/popen_fork.py", line 54 in wait
  File "/home/rdmurray/p34/Lib/multiprocessing/process.py", line 124 in join
  File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 144 in __call__
  File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 302 in test_terminate
  File "/home/rdmurray/p34/Lib/unittest/case.py", line 571 in run
  File "/home/rdmurray/p34/Lib/unittest/case.py", line 610 in __call__
  File "/home/rdmurray/p34/Lib/unittest/suite.py", line 117 in run
  File "/home/rdmurray/p34/Lib/unittest/suite.py", line 79 in __call__
  File "/home/rdmurray/p34/Lib/unittest/suite.py", line 117 in run
  File "/home/rdmurray/p34/Lib/unittest/suite.py", line 79 in __call__
  File "/home/rdmurray/p34/Lib/unittest/suite.py", line 117 in run
  File "/home/rdmurray/p34/Lib/unittest/suite.py", line 79 in __call__
  File "/home/rdmurray/p34/Lib/unittest/runner.py", line 168 in run
  File "/home/rdmurray/p34/Lib/test/support/__init__.py", line 1661 in _run_suite
  File "/home/rdmurray/p34/Lib/test/support/__init__.py", line 1695 in run_unittest
  File "/home/rdmurray/p34/Lib/test/regrtest.py", line 1275 in <lambda>
  File "/home/rdmurray/p34/Lib/test/regrtest.py", line 1276 in runtest_inner
  File "/home/rdmurray/p34/Lib/test/regrtest.py", line 965 in runtest
  File "/home/rdmurray/p34/Lib/test/regrtest.py", line 761 in main
  File "/home/rdmurray/p34/Lib/test/regrtest.py", line 1560 in main_in_temp_cwd
  File "/home/rdmurray/p34/Lib/test/__main__.py", line 3 in <module>
  File "/home/rdmurray/p34/Lib/runpy.py", line 73 in _run_code
  File "/home/rdmurray/p34/Lib/runpy.py", line 160 in _run_module_as_main
msg199533 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-10-12 01:30
See also issue #12413 which proposes to dump also the Python traceback of child processes to analyze multiprocessing issues.
msg199534 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-12 01:34
That's great, but without a patch it doesn't help me debug *this* failure :)

Any suggestions for getting more information out of that test (I'm not familiar with multiprocessing...I'm already confused by the fact that the test is named test_multiprocessing_spawn and the error is coming from a module named popen_fork...)
msg199539 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-12 05:55
> R. David Murray added the comment:

OSError: [Errno 23] Too many open files in system:
'/home/rdmurray/p34/Lib/tarfile.py'

Well, the buildbot is hitting ENFILE.
Try to run "lsof" as root, to see which process (more likely
processes) have so many open files.
Also, you should check the value of "/proc/sys/fs/file-max", although
the default value should be more than enough.
msg199540 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-12 05:58
> Here is a more useful traceback:

If the failures aren't linked to ENFILE, then you could use strace to
find the process on which the test is doing a waitpid(), and then
perform an strace and gdb on that process to see where it's stuck.
And send it a fatal signal that will make faulthander dump the stack.
msg199546 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-10-12 10:56
> I'm already confused by the fact that the test is named 
> test_multiprocessing_spawn and the error is coming from a module named 
> popen_fork...)

popen_spawn_posix.Popen is a subclass of popen_fork.Popen.
msg199643 - (view) Author: Roundup Robot (python-dev) Date: 2013-10-12 23:55
New changeset 87f484679a39 by Richard Oudkerk in branch '3.3':
Make test_terminate() succeed or fail quickly.
http://hg.python.org/cpython/rev/87f484679a39
msg200131 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-10-17 14:52
I can reproduce the problem on the Non-Debug Gentoo buildbot using only os.fork() and os.kill(pid, signal.SIGTERM).  See

    http://hg.python.org/cpython/file/9853d3a20849/Lib/test/_test_multiprocessing.py#l339

To investigate further I think strace and/or gdb will need to be installed on that box.

P.S. Note that the Debug Gentoo buildbot is always failing at the configure stage with "No space left on device".
msg200143 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-17 16:18
I fixed the out of space last night. (Someday I'll get around to figuring out which test it is that is leaving a bunch of data around when it fails, but I haven't yet).

I've installed strace and gdb on the bots, please send me your public key and I'll set up an ssh login for you.
msg200147 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-10-17 17:34
> I fixed the out of space last night. (Someday I'll get around to figuring 
> out which test it is that is leaving a bunch of data around when it fails, 
> but I haven't yet).

It looks like on the Debug Gentoo buildbot configure and clean are failing.

  http://buildbot.python.org/all/builders/x86%20Gentoo%203.x/builds/5090/steps/configure/logs/stdio
  http://buildbot.python.org/all/builders/x86%20Gentoo%203.x/builds/5090/steps/clean/logs/stdio

> I've installed strace and gdb on the bots, please send me your public key 
> and I'll set up an ssh login for you.

Thanks.  For now I will just try starting gdb using subprocess on the custom buildbot.
msg200152 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-17 18:02
Hmm.  Looks like the name of the temp files used by the failing test (which is in bsddb, by the way) have changed, so my delete command didn't delete them.  So now /tmp is definitely clean, and the next build on that bot should work (crosses fingers).
msg200166 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-10-17 20:42
I finally have a gdb backtrace of a stuck child (started using os.fork() not multiprocessing):

#1  0xb76194da in ?? () from /lib/libc.so.6
#2  0xb6d59755 in ?? ()
   from /var/lib/buildslave/custom.murray-gentoo/build/build/lib.linux-i686-3.4-pydebug/_ssl.cpython-34dm.so
#3  0xb6d628f0 in _fini ()
   from /var/lib/buildslave/custom.murray-gentoo/build/build/lib.linux-i686-3.4-pydebug/_ssl.cpython-34dm.so
#4  0xb770859b in ?? () from /lib/ld-linux.so.2
#5  0xb75502c7 in ?? () from /lib/libc.so.6
#6  0xb7550330 in exit () from /lib/libc.so.6
#7  0xb558f244 in ?? () from /lib/libncursesw.so.5
#8  0xb76e9f38 in fork () from /lib/libpthread.so.0
---Type <return> to continue, or q <return> to quit---#9  0x08085f89 in posix_fork (self=0xb74da374, noargs=0x0)
    at ./Modules/posixmodule.c:5315
...

It looks as though fork() is indirectly calling something in _ssl.cpython-34dm.so which is not completing.  So I guess this is pthread_atfork() related.  But the child argument passed to pthread_atfork() should be NULL, so I don't really understand this:

static int
PySSL_RAND_atfork(void)
{
    static int registered = 0;
    int retval;

    if (registered)
        return 0;

    retval = pthread_atfork(NULL,                     /* prepare */
                            PySSL_RAND_atfork_parent, /* parent */
                            NULL);                    /* child */
    if (retval != 0) {
        PyErr_SetFromErrno(PyExc_OSError);
        return -1;
    }
    registered = 1;
    return 0;
}
msg200176 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-10-17 23:54
Actually, according to strace the call which blocks is

    futex(0xb7839454, FUTEX_WAIT_PRIVATE, 1, NULL
msg200414 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-10-19 12:57
Does 2.7 also hang? The pthread_atfork() handler is also installed in 2.7.

Can you install the debugging symbols for libc and openssl on your machine?
msg200415 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-19 13:07
> It looks as though fork() is indirectly calling something in _ssl.cpython-34dm.so which is not
> completing.

The stack is strange: what is /lib/libncursesw.so.5 doing there?
Also, we can see exit() is being called. And indeed, _fini() is some
shared libraary deallocation code...
msg200601 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-10-20 18:08
I think it's more likely that my patch is triggering an existing bug. The locking code for the SSL module and OpenSSL doesn't release locks on fork. I have attached an experimental patch that unlocks all locks in the client.

Please try if it resolves the issue.
msg200771 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-21 12:50
> Christian Heimes added the comment:
>
> I think it's more likely that my patch is triggering an existing bug. The locking code for the SSL module and OpenSSL doesn't release locks on fork. I have attached an experimental patch that unlocks all locks in the client.

The problem is that this patch can make it even more deadlock-prone:
pthread_mutex_unlock() isn't async-signal safe (it's not just theory:
depending on the implementation, I've seen it deadlock on some
platforms, even on Linux IIRC).

Also, there's another problem with the approach of "unlocking locks in
the child process": locks are usually used to protect critical
sections, i.e. sections of code where an invariant must be preserved
between entry and exit: by merely unlocking the locks in the child
process, we can - maybe, see above - prevent deadlocks, but the data
protected by those locks can very-well be left in an inconsistent
state. So we might be trading a deadlock for a non-functional ssl
library in the child process, which would be - from a security POV -
even worse.

Really, adding fork-hooks is a bad idea: unfortunately, I don't have
anything better to suggest :-(.

Except maybe adding an add-hoc hook to posix_fork(), like the ones
used to reset the GIL, threading module etc: at least, it will not
make the interpreter less safe than now, and won't be able to deadlock
processes that are conform and use subprocess to create processes...
msg201368 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-10-26 15:42
I still don't get it. All arguments are about "don't execute async unsafe code after fork()". But the child process does *not* execute *any* code after fork. It's just the parent process that executes its handler.
msg201371 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2013-10-26 16:45
Your patch adds a child process atfork handler, which definitely isn't executed in the parent....
msg201401 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-10-26 23:05
I wasn't talking about unlock_ssl_locks.patch but yet another patch.

Anyway neither Python 2.7.6 nor 3.3.3 should be released with the current pthread_atfork() handler. We still don't understand the cause of the hanging buildbot but evidence points towards the pthread_atfork() code.

I'd like to give the new patch a try. If it doesn't solve the issue ASAP then I'm going to disable the feature and replace it with a warning in the SSL module and os.fork() function. The issue isn't severe enough to break Python.
msg201402 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-10-26 23:12
Won't using a prepare handler mean that the parent and child processes will use the same seed until one or other of them forks again?
msg201403 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-10-26 23:16
OpenSSL kinda takes care of that. The PRNG mixes the PID into the seed so parent and child won't see the same output. The patch fixes a slightly more tricky issue:

- parent initializes PRNG but never touches the PRNG again
- parent forks lots of child processes
- PIDs get reused -- two child processes with same PID get same pseudorandom data
msg201432 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2013-10-27 06:55
OK, I'm going to apply atfork_prepare.patch for 3.3.3rc1.  If there is any indication of a continuing problem, we'll revert the whole re-seeding code for the final.
msg201433 - (view) Author: Roundup Robot (python-dev) Date: 2013-10-27 06:57
New changeset 021ca321b26e by Georg Brandl in branch '3.3':
Issue #19227: Try to fix deadlocks caused by re-seeding then OpenSSL
http://hg.python.org/cpython/rev/021ca321b26e

New changeset d3a13a7be9e9 by Georg Brandl in branch 'default':
#19227: merge with 3.3
http://hg.python.org/cpython/rev/d3a13a7be9e9
msg201436 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-27 08:57
Note that no matter where the atfork-hook is executed (prepare,
parent, child), it will still be deadlock-prone in case of
double-fork, since the hook will effectively be called after the first
fork (and before the second one). And double-fork is common, e.g. when
creating daemon processes.
msg201662 - (view) Author: Roundup Robot (python-dev) Date: 2013-10-29 20:18
New changeset 5942eea8cf41 by Christian Heimes in branch '3.3':
Issue #19227 / Issue #18747: Remove pthread_atfork() handler to remove OpenSSL re-seeding
http://hg.python.org/cpython/rev/5942eea8cf41

New changeset 705f2addd0f0 by Christian Heimes in branch 'default':
Issue #19227 / Issue #18747: Remove pthread_atfork() handler to remove OpenSSL re-seeding
http://hg.python.org/cpython/rev/705f2addd0f0

New changeset 4d761ce0ac74 by Christian Heimes in branch '2.7':
Issue #19227 / Issue #18747: Remove pthread_atfork() handler to remove OpenSSL re-seeding
http://hg.python.org/cpython/rev/4d761ce0ac74
msg201665 - (view) Author: Roundup Robot (python-dev) Date: 2013-10-29 20:24
New changeset ad779da9e351 by Christian Heimes in branch '2.7':
Issue #19227 / Issue #18747: Remove pthread_atfork() handler to remove OpenSSL re-seeding
http://hg.python.org/cpython/rev/ad779da9e351
msg203128 - (view) Author: Roundup Robot (python-dev) Date: 2013-11-17 08:19
New changeset 4e6aa98bb11c by Christian Heimes in branch '3.3':
Issue #19227 / Issue #18747: Remove pthread_atfork() handler to remove OpenSSL re-seeding
http://hg.python.org/cpython/rev/4e6aa98bb11c
History
Date User Action Args
2013-11-17 08:19:33python-devsetmessages: + msg203128
2013-11-09 19:59:39benjamin.petersonsetstatus: open -> closed
resolution: fixed
2013-10-29 20:24:54python-devsetmessages: + msg201665
2013-10-29 20:18:47python-devsetmessages: + msg201662
2013-10-28 21:21:12nadeem.vawdasetnosy: + nadeem.vawda
2013-10-27 08:57:12neologixsetmessages: + msg201436
2013-10-27 06:57:08python-devsetmessages: + msg201433
2013-10-27 06:55:51georg.brandlsetmessages: + msg201432
2013-10-27 06:41:09Arfreversetnosy: + Arfrever
2013-10-27 05:59:44georg.brandllinkissue18747 dependencies
2013-10-26 23:16:03christian.heimessetmessages: + msg201403
2013-10-26 23:12:36sbtsetmessages: + msg201402
2013-10-26 23:05:55christian.heimessetfiles: + atfork_prepare.patch

messages: + msg201401
2013-10-26 16:45:58benjamin.petersonsetmessages: + msg201371
2013-10-26 15:42:17christian.heimessetmessages: + msg201368
2013-10-21 12:50:45neologixsetmessages: + msg200771
2013-10-20 18:08:58christian.heimessetfiles: + unlock_ssl_locks.patch
keywords: + patch
messages: + msg200601
2013-10-19 13:07:56neologixsetmessages: + msg200415
2013-10-19 12:57:08christian.heimessetpriority: critical -> release blocker
versions: + Python 2.7
nosy: + benjamin.peterson, larry, georg.brandl

messages: + msg200414
2013-10-17 23:54:00sbtsetmessages: + msg200176
2013-10-17 20:42:53sbtsetmessages: + msg200166
2013-10-17 18:02:14r.david.murraysetmessages: + msg200152
2013-10-17 17:34:03sbtsetmessages: + msg200147
2013-10-17 16:18:24r.david.murraysetmessages: + msg200143
2013-10-17 14:52:09sbtsetmessages: + msg200131
2013-10-12 23:55:08python-devsetnosy: + python-dev
messages: + msg199643
2013-10-12 10:56:32sbtsetmessages: + msg199546
2013-10-12 05:58:50neologixsetmessages: + msg199540
2013-10-12 05:55:09neologixsetmessages: + msg199539
2013-10-12 01:34:41r.david.murraysetmessages: + msg199534
2013-10-12 01:30:48hayposetnosy: + neologix
messages: + msg199533
2013-10-12 01:23:48r.david.murraysetmessages: + msg199529
2013-10-11 21:41:37r.david.murraysetmessages: + msg199509
2013-10-11 21:02:39hayposetnosy: + haypo
2013-10-11 18:51:12pitroucreate