Issue 19227
Created on 2013-10-11 18:51 by pitrou, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Files

| File name | Uploaded | Description |
|---|---|---|
| unlock_ssl_locks.patch | christian.heimes, 2013-10-20 18:08 | |
| atfork_prepare.patch | christian.heimes, 2013-10-26 23:05 | |
Messages (30)
**msg199497** - Antoine Pitrou (pitrou), 2013-10-11 18:51

test_multiprocessing has started to hang frequently on the Gentoo buildbots. It seems to have started happening with the "OpenSSL re-seeding" commit: http://hg.python.org/cpython/rev/8e1194c39beddb83337c0acb9e4c2922a02a36cf

David, could you try to investigate on the buildbot? If this security fix can crash some machines, we'd better find a workaround.
**msg199509** - R. David Murray (r.david.murray), 2013-10-11 21:41

Running test_socket test_ssl test_multiprocessing_fork test_multiprocessing_forkserver test_multiprocessing_spawn under -F on the buildbot, I got the following failure during the second loop:

    [ 10] test_multiprocessing_spawn
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/rdmurray/p34/Lib/multiprocessing/spawn.py", line 96, in spawn_main
        exitcode = _main(fd)
      File "/home/rdmurray/p34/Lib/multiprocessing/spawn.py", line 105, in _main
        prepare(preparation_data)
      File "/home/rdmurray/p34/Lib/multiprocessing/spawn.py", line 210, in prepare
        import_main_path(data['main_path'])
      File "/home/rdmurray/p34/Lib/multiprocessing/spawn.py", line 256, in import_main_path
        exec(code, main_module.__dict__)
      File "/home/rdmurray/p34/Lib/test/regrtest.py", line 136, in <module>
        import shutil
      File "/home/rdmurray/p34/Lib/shutil.py", line 14, in <module>
        import tarfile
      File "<frozen importlib._bootstrap>", line 1599, in _find_and_load
      File "<frozen importlib._bootstrap>", line 1566, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 607, in _check_name_wrapper
      File "<frozen importlib._bootstrap>", line 1056, in load_module
      File "<frozen importlib._bootstrap>", line 922, in load_module
      File "<frozen importlib._bootstrap>", line 1025, in get_code
      File "<frozen importlib._bootstrap>", line 1065, in get_data
    OSError: [Errno 23] Too many open files in system: '/home/rdmurray/p34/Lib/tarfile.py'
    Process Process-1410:
    Traceback (most recent call last):
      File "/home/rdmurray/p34/Lib/multiprocessing/process.py", line 255, in _bootstrap
        self.run()
      File "/home/rdmurray/p34/Lib/multiprocessing/process.py", line 92, in run
        self._target(*self._args, **self._kwargs)
    Process Process-1411:
    Traceback (most recent call last):
    Process Process-1412:
    Traceback (most recent call last):
      File "/home/rdmurray/p34/Lib/multiprocessing/process.py", line 255, in _bootstrap
        self.run()
      File "/home/rdmurray/p34/Lib/multiprocessing/process.py", line 92, in run
        self._target(*self._args, **self._kwargs)
      File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 1135, in task
        self.f(*self.args)
      File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 1219, in _test_wait_return_f
        res = barrier.wait()
      File "/home/rdmurray/p34/Lib/multiprocessing/process.py", line 255, in _bootstrap
        self.run()
      File "/home/rdmurray/p34/Lib/multiprocessing/process.py", line 92, in run
        self._target(*self._args, **self._kwargs)
      File "/home/rdmurray/p34/Lib/threading.py", line 395, in wait
        self._wait(timeout)
      File "/home/rdmurray/p34/Lib/threading.py", line 435, in _wait
        raise BrokenBarrierError
    threading.BrokenBarrierError
      File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 1135, in task
        self.f(*self.args)
      File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 1219, in _test_wait_return_f
        res = barrier.wait()
      File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 1135, in task
        self.f(*self.args)
      File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 1219, in _test_wait_return_f
        res = barrier.wait()
      File "/home/rdmurray/p34/Lib/threading.py", line 395, in wait
        self._wait(timeout)
      File "/home/rdmurray/p34/Lib/threading.py", line 395, in wait
        self._wait(timeout)
      File "/home/rdmurray/p34/Lib/threading.py", line 435, in _wait
        raise BrokenBarrierError
    threading.BrokenBarrierError
      File "/home/rdmurray/p34/Lib/threading.py", line 435, in _wait
        raise BrokenBarrierError
    threading.BrokenBarrierError
    test test_multiprocessing_spawn failed -- Traceback (most recent call last):
      File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 1227, in test_wait_return
        self.run_threads(self._test_wait_return_f, (self.barrier, queue))
      File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 1184, in run_threads
        f(*args)
      File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 1219, in _test_wait_return_f
        res = barrier.wait()
      File "/home/rdmurray/p34/Lib/threading.py", line 395, in wait
        self._wait(timeout)
      File "/home/rdmurray/p34/Lib/threading.py", line 433, in _wait
        raise BrokenBarrierError
    threading.BrokenBarrierError
**msg199529** - R. David Murray (r.david.murray), 2013-10-12 01:23

Here is a more useful traceback:

    [394] test_multiprocessing_spawn
    Timeout (1:00:00)!
    Thread 0xb76a16c0:
      File "/home/rdmurray/p34/Lib/multiprocessing/popen_fork.py", line 30 in poll
      File "/home/rdmurray/p34/Lib/multiprocessing/popen_fork.py", line 54 in wait
      File "/home/rdmurray/p34/Lib/multiprocessing/process.py", line 124 in join
      File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 144 in __call__
      File "/home/rdmurray/p34/Lib/test/_test_multiprocessing.py", line 302 in test_terminate
      File "/home/rdmurray/p34/Lib/unittest/case.py", line 571 in run
      File "/home/rdmurray/p34/Lib/unittest/case.py", line 610 in __call__
      File "/home/rdmurray/p34/Lib/unittest/suite.py", line 117 in run
      File "/home/rdmurray/p34/Lib/unittest/suite.py", line 79 in __call__
      File "/home/rdmurray/p34/Lib/unittest/suite.py", line 117 in run
      File "/home/rdmurray/p34/Lib/unittest/suite.py", line 79 in __call__
      File "/home/rdmurray/p34/Lib/unittest/suite.py", line 117 in run
      File "/home/rdmurray/p34/Lib/unittest/suite.py", line 79 in __call__
      File "/home/rdmurray/p34/Lib/unittest/runner.py", line 168 in run
      File "/home/rdmurray/p34/Lib/test/support/__init__.py", line 1661 in _run_suite
      File "/home/rdmurray/p34/Lib/test/support/__init__.py", line 1695 in run_unittest
      File "/home/rdmurray/p34/Lib/test/regrtest.py", line 1275 in <lambda>
      File "/home/rdmurray/p34/Lib/test/regrtest.py", line 1276 in runtest_inner
      File "/home/rdmurray/p34/Lib/test/regrtest.py", line 965 in runtest
      File "/home/rdmurray/p34/Lib/test/regrtest.py", line 761 in main
      File "/home/rdmurray/p34/Lib/test/regrtest.py", line 1560 in main_in_temp_cwd
      File "/home/rdmurray/p34/Lib/test/__main__.py", line 3 in <module>
      File "/home/rdmurray/p34/Lib/runpy.py", line 73 in _run_code
      File "/home/rdmurray/p34/Lib/runpy.py", line 160 in _run_module_as_main
**msg199533** - STINNER Victor (vstinner), 2013-10-12 01:30

See also issue #12413, which proposes to also dump the Python traceback of child processes, to help analyze multiprocessing issues.
**msg199534** - R. David Murray (r.david.murray), 2013-10-12 01:34

That's great, but without a patch it doesn't help me debug *this* failure :) Any suggestions for getting more information out of that test? (I'm not familiar with multiprocessing... I'm already confused by the fact that the test is named test_multiprocessing_spawn and the error is coming from a module named popen_fork.)
**msg199539** - Charles-François Natali (neologix), 2013-10-12 05:55

> R. David Murray added the comment:
> OSError: [Errno 23] Too many open files in system: '/home/rdmurray/p34/Lib/tarfile.py'

Well, the buildbot is hitting ENFILE. Try running "lsof" as root to see which process (more likely several processes) has so many open files. Also, check the value of "/proc/sys/fs/file-max", although the default value should be more than enough.
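For reference, the ENFILE condition above reflects the kernel's system-wide file-handle table, which on Linux can also be checked without lsof by reading /proc directly. A minimal sketch; the "allocated, unused, maximum" layout of /proc/sys/fs/file-nr is a Linux convention, not something taken from this issue:

```python
# Rough check of the system-wide file-handle usage that ENFILE reflects.
# /proc/sys/fs/file-nr contains three numbers: allocated, unused, maximum.
with open("/proc/sys/fs/file-nr") as f:
    allocated, unused, maximum = (int(x) for x in f.read().split())

print("allocated handles:", allocated)
print("system-wide limit:", maximum)
if allocated > 0.9 * maximum:
    print("nearly out of handles: something is leaking file descriptors")
```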
**msg199540** - Charles-François Natali (neologix), 2013-10-12 05:58

> Here is a more useful traceback:

If the failures aren't linked to ENFILE, then you could use strace to find the process on which the test is doing a waitpid(), then attach strace and gdb to that process to see where it's stuck, and send it a fatal signal that will make faulthandler dump the stack.
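For reference, faulthandler (the module that produced the Timeout dump quoted earlier) can be armed in a child process so its Python stacks can be dumped from outside. A minimal sketch; the choice of SIGUSR1 here is arbitrary:

```python
import faulthandler
import os
import signal

# Dump all thread stacks if the process receives a fatal signal
# (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
faulthandler.enable(all_threads=True)

# Also allow an on-demand dump: `kill -USR1 <pid>` from another shell
# prints every thread's Python stack without killing the process.
faulthandler.register(signal.SIGUSR1, all_threads=True)

print("dump my stacks with: kill -USR1", os.getpid())
```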
**msg199546** - Richard Oudkerk (sbt), 2013-10-12 10:56

> I'm already confused by the fact that the test is named
> test_multiprocessing_spawn and the error is coming from a module named
> popen_fork...

popen_spawn_posix.Popen is a subclass of popen_fork.Popen.
**msg199643** - Roundup Robot (python-dev), 2013-10-12 23:55

New changeset 87f484679a39 by Richard Oudkerk in branch '3.3':
Make test_terminate() succeed or fail quickly.
http://hg.python.org/cpython/rev/87f484679a39
**msg200131** - Richard Oudkerk (sbt), 2013-10-17 14:52

I can reproduce the problem on the Non-Debug Gentoo buildbot using only os.fork() and os.kill(pid, signal.SIGTERM). See

http://hg.python.org/cpython/file/9853d3a20849/Lib/test/_test_multiprocessing.py#l339

To investigate further I think strace and/or gdb will need to be installed on that box.

P.S. Note that the Debug Gentoo buildbot is always failing at the configure stage with "No space left on device".
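For context, a minimal sketch of the plain os.fork()/os.kill() pattern described above, as a simplification of test_terminate() rather than the exact buildbot code; the ssl import is included on the assumption that it is what activates the re-seeding hook under discussion:

```python
import os
import signal
import ssl   # assumption: importing ssl activates the OpenSSL re-seeding hook
import time

pid = os.fork()
if pid == 0:
    # Child: idle until the parent terminates us.
    time.sleep(30)
    os._exit(0)

time.sleep(0.5)                  # let the child get going
os.kill(pid, signal.SIGTERM)     # terminate it, as test_terminate() does
_, status = os.waitpid(pid, 0)   # the hang shows up here if the child never dies
print("child exit status:", status)
```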
**msg200143** - R. David Murray (r.david.murray), 2013-10-17 16:18

I fixed the out-of-space problem last night. (Someday I'll get around to figuring out which test it is that leaves a bunch of data around when it fails, but I haven't yet.)

I've installed strace and gdb on the bots; please send me your public key and I'll set up an ssh login for you.
**msg200147** - Richard Oudkerk (sbt), 2013-10-17 17:34

> I fixed the out of space last night. (Someday I'll get around to figuring
> out which test it is that is leaving a bunch of data around when it fails,
> but I haven't yet).

It looks like configure and clean are failing on the Debug Gentoo buildbot:

http://buildbot.python.org/all/builders/x86%20Gentoo%203.x/builds/5090/steps/configure/logs/stdio
http://buildbot.python.org/all/builders/x86%20Gentoo%203.x/builds/5090/steps/clean/logs/stdio

> I've installed strace and gdb on the bots, please send me your public key
> and I'll set up an ssh login for you.

Thanks. For now I will just try starting gdb using subprocess on the custom buildbot.
**msg200152** - R. David Murray (r.david.murray), 2013-10-17 18:02

Hmm. It looks like the name of the temp files used by the failing test (which is in bsddb, by the way) has changed, so my delete command didn't delete them. So now /tmp is definitely clean, and the next build on that bot should work (crosses fingers).
**msg200166** - Richard Oudkerk (sbt), 2013-10-17 20:42

I finally have a gdb backtrace of a stuck child (started using os.fork(), not multiprocessing):

    #1  0xb76194da in ?? () from /lib/libc.so.6
    #2  0xb6d59755 in ?? () from /var/lib/buildslave/custom.murray-gentoo/build/build/lib.linux-i686-3.4-pydebug/_ssl.cpython-34dm.so
    #3  0xb6d628f0 in _fini () from /var/lib/buildslave/custom.murray-gentoo/build/build/lib.linux-i686-3.4-pydebug/_ssl.cpython-34dm.so
    #4  0xb770859b in ?? () from /lib/ld-linux.so.2
    #5  0xb75502c7 in ?? () from /lib/libc.so.6
    #6  0xb7550330 in exit () from /lib/libc.so.6
    #7  0xb558f244 in ?? () from /lib/libncursesw.so.5
    #8  0xb76e9f38 in fork () from /lib/libpthread.so.0
    #9  0x08085f89 in posix_fork (self=0xb74da374, noargs=0x0) at ./Modules/posixmodule.c:5315
    ...

It looks as though fork() is indirectly calling something in _ssl.cpython-34dm.so which is not completing. So I guess this is pthread_atfork() related. But the child argument passed to pthread_atfork() should be NULL, so I don't really understand this:

    static int
    PySSL_RAND_atfork(void)
    {
        static int registered = 0;
        int retval;

        if (registered)
            return 0;

        retval = pthread_atfork(NULL,                     /* prepare */
                                PySSL_RAND_atfork_parent, /* parent */
                                NULL);                    /* child */
        if (retval != 0) {
            PyErr_SetFromErrno(PyExc_OSError);
            return -1;
        }
        registered = 1;
        return 0;
    }
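As an aside for readers of this archive: the three pthread_atfork() hook points referenced in the C snippet above (prepare / parent / child) were later exposed at the Python level as os.register_at_fork() in Python 3.7, which did not exist while this issue was open. A sketch of the hook semantics using that newer API:

```python
import os

# "before" runs in the forking process just before fork(); the other two run
# immediately after fork(), in the parent and in the new child respectively.
os.register_at_fork(
    before=lambda: print("prepare hook (caller, pre-fork)"),
    after_in_parent=lambda: print("parent hook (post-fork)"),
    after_in_child=lambda: print("child hook (post-fork)"),
)

pid = os.fork()
if pid == 0:
    os._exit(0)      # child leaves right away
os.waitpid(pid, 0)   # parent reaps it
```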
**msg200176** - Richard Oudkerk (sbt), 2013-10-17 23:54

Actually, according to strace the call which blocks is

    futex(0xb7839454, FUTEX_WAIT_PRIVATE, 1, NULL
**msg200414** - Christian Heimes (christian.heimes), 2013-10-19 12:57

Does 2.7 also hang? The pthread_atfork() handler is also installed in 2.7.

Can you install the debugging symbols for libc and openssl on your machine?
**msg200415** - Charles-François Natali (neologix), 2013-10-19 13:07

> It looks as though fork() is indirectly calling something in _ssl.cpython-34dm.so which is not
> completing.

The stack is strange: what is /lib/libncursesw.so.5 doing there?

Also, we can see exit() is being called. And indeed, _fini() is shared library deallocation code...
**msg200601** - Christian Heimes (christian.heimes), 2013-10-20 18:08

I think it's more likely that my patch is triggering an existing bug: the locking code for the SSL module and OpenSSL doesn't release locks on fork. I have attached an experimental patch that unlocks all locks in the child. Please try whether it resolves the issue.
**msg200771** - Charles-François Natali (neologix), 2013-10-21 12:50

> Christian Heimes added the comment:
>
> I think it's more likely that my patch is triggering an existing bug: the locking code for the
> SSL module and OpenSSL doesn't release locks on fork. I have attached an experimental patch
> that unlocks all locks in the child.

The problem is that this patch can make it even more deadlock-prone: pthread_mutex_unlock() isn't async-signal-safe (it's not just theory: depending on the implementation, I've seen it deadlock on some platforms, even on Linux IIRC).

Also, there's another problem with the approach of "unlocking locks in the child process": locks are usually used to protect critical sections, i.e. sections of code where an invariant must be preserved between entry and exit. By merely unlocking the locks in the child process we can - maybe, see above - prevent deadlocks, but the data protected by those locks can very well be left in an inconsistent state. So we might be trading a deadlock for a non-functional ssl library in the child process, which would be - from a security POV - even worse.

Really, adding fork hooks is a bad idea: unfortunately, I don't have anything better to suggest :-(. Except maybe adding an ad-hoc hook to posix_fork(), like the ones used to reset the GIL, the threading module, etc.: at least it would not make the interpreter less safe than now, and it would not be able to deadlock processes that are well-behaved and use subprocess to create processes...
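A minimal, self-contained illustration of the general hazard discussed in this message, using a plain threading.Lock rather than the OpenSSL locks themselves (running it hangs by design):

```python
import os
import threading
import time

lock = threading.Lock()

def hold_lock():
    # Stand-in for a thread that is inside a lock-protected critical section
    # (e.g. OpenSSL's locking callbacks) at the moment another thread forks.
    with lock:
        time.sleep(60)

threading.Thread(target=hold_lock, daemon=True).start()
time.sleep(0.1)             # make sure the helper thread owns the lock

pid = os.fork()
if pid == 0:
    # The child inherits the lock in its "held" state, but the thread that
    # held it does not exist here, so this acquire blocks forever. Forcibly
    # unlocking instead (the unlock_ssl_locks.patch idea) avoids the block,
    # but whatever the lock was protecting may be left half-updated.
    lock.acquire()
    os._exit(0)

os.waitpid(pid, 0)          # the parent appears hung too: it waits on the stuck child
```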
**msg201368** - Christian Heimes (christian.heimes), 2013-10-26 15:42

I still don't get it. All arguments are about "don't execute async-unsafe code after fork()". But the child process does *not* execute *any* code after fork. It's just the parent process that executes its handler.
**msg201371** - Benjamin Peterson (benjamin.peterson), 2013-10-26 16:45

Your patch adds a child-process atfork handler, which definitely isn't executed in the parent...
**msg201401** - Christian Heimes (christian.heimes), 2013-10-26 23:05

I wasn't talking about unlock_ssl_locks.patch but about yet another patch.

Anyway, neither Python 2.7.6 nor 3.3.3 should be released with the current pthread_atfork() handler. We still don't understand the cause of the hanging buildbot, but the evidence points towards the pthread_atfork() code. I'd like to give the new patch a try. If it doesn't solve the issue ASAP then I'm going to disable the feature and replace it with a warning in the SSL module and the os.fork() function. The issue isn't severe enough to justify breaking Python.
**msg201402** - Richard Oudkerk (sbt), 2013-10-26 23:12

Won't using a prepare handler mean that the parent and child processes will use the same seed until one or other of them forks again?
**msg201403** - Christian Heimes (christian.heimes), 2013-10-26 23:16

OpenSSL kinda takes care of that. The PRNG mixes the PID into the seed, so parent and child won't see the same output. The patch fixes a slightly more tricky issue:

- the parent initializes the PRNG but never touches it again
- the parent forks lots of child processes
- PIDs get reused, so two child processes with the same PID get the same pseudo-random data
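One user-level way to guard against the PID-reuse scenario above, independent of any atfork hook, is to mix fresh kernel entropy into OpenSSL's pool in each child. A sketch using the public ssl module API, purely illustrative and not part of any patch on this issue:

```python
import os
import ssl

pid = os.fork()
if pid == 0:
    # Mix 32 bytes of kernel entropy into OpenSSL's PRNG so the child's
    # stream cannot repeat even if its PID is later reused by a sibling.
    ssl.RAND_add(os.urandom(32), 32.0)
    # ... child work that uses SSL ...
    os._exit(0)

os.waitpid(pid, 0)
```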
**msg201432** - Georg Brandl (georg.brandl), 2013-10-27 06:55

OK, I'm going to apply atfork_prepare.patch for 3.3.3rc1. If there is any indication of a continuing problem, we'll revert the whole re-seeding code for the final.
**msg201433** - Roundup Robot (python-dev), 2013-10-27 06:57

New changeset 021ca321b26e by Georg Brandl in branch '3.3':
Issue #19227: Try to fix deadlocks caused by re-seeding then OpenSSL
http://hg.python.org/cpython/rev/021ca321b26e

New changeset d3a13a7be9e9 by Georg Brandl in branch 'default':
#19227: merge with 3.3
http://hg.python.org/cpython/rev/d3a13a7be9e9
**msg201436** - Charles-François Natali (neologix), 2013-10-27 08:57

Note that no matter where the atfork hook is executed (prepare, parent, child), it will still be deadlock-prone in the case of a double fork, since the hook will effectively be called after the first fork (and before the second one). And double forking is common, e.g. when creating daemon processes.
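For reference, the double fork mentioned here is the classic daemonization pattern; a minimal sketch, with the point being that any registered atfork hook fires on both forks:

```python
import os

pid = os.fork()
if pid > 0:
    os.waitpid(pid, 0)      # original process reaps the first child and carries on
else:
    os.setsid()             # first child starts a new session
    if os.fork() > 0:       # second fork: any atfork hook fires again here
        os._exit(0)         # first child exits immediately
    # grandchild: the actual daemon work would go here
    os._exit(0)
```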
**msg201662** - Roundup Robot (python-dev), 2013-10-29 20:18

New changeset 5942eea8cf41 by Christian Heimes in branch '3.3':
Issue #19227 / Issue #18747: Remove pthread_atfork() handler to remove OpenSSL re-seeding
http://hg.python.org/cpython/rev/5942eea8cf41

New changeset 705f2addd0f0 by Christian Heimes in branch 'default':
Issue #19227 / Issue #18747: Remove pthread_atfork() handler to remove OpenSSL re-seeding
http://hg.python.org/cpython/rev/705f2addd0f0

New changeset 4d761ce0ac74 by Christian Heimes in branch '2.7':
Issue #19227 / Issue #18747: Remove pthread_atfork() handler to remove OpenSSL re-seeding
http://hg.python.org/cpython/rev/4d761ce0ac74
**msg201665** - Roundup Robot (python-dev), 2013-10-29 20:24

New changeset ad779da9e351 by Christian Heimes in branch '2.7':
Issue #19227 / Issue #18747: Remove pthread_atfork() handler to remove OpenSSL re-seeding
http://hg.python.org/cpython/rev/ad779da9e351
**msg203128** - Roundup Robot (python-dev), 2013-11-17 08:19

New changeset 4e6aa98bb11c by Christian Heimes in branch '3.3':
Issue #19227 / Issue #18747: Remove pthread_atfork() handler to remove OpenSSL re-seeding
http://hg.python.org/cpython/rev/4e6aa98bb11c
History

| Date | User | Action | Args |
|---|---|---|---|
| 2022-04-11 14:57:51 | admin | set | github: 63426 |
| 2013-11-17 08:19:33 | python-dev | set | messages: + msg203128 |
| 2013-11-09 19:59:39 | benjamin.peterson | set | status: open -> closed; resolution: fixed |
| 2013-10-29 20:24:54 | python-dev | set | messages: + msg201665 |
| 2013-10-29 20:18:47 | python-dev | set | messages: + msg201662 |
| 2013-10-28 21:21:12 | nadeem.vawda | set | nosy: + nadeem.vawda |
| 2013-10-27 08:57:12 | neologix | set | messages: + msg201436 |
| 2013-10-27 06:57:08 | python-dev | set | messages: + msg201433 |
| 2013-10-27 06:55:51 | georg.brandl | set | messages: + msg201432 |
| 2013-10-27 06:41:09 | Arfrever | set | nosy: + Arfrever |
| 2013-10-27 05:59:44 | georg.brandl | link | issue18747 dependencies |
| 2013-10-26 23:16:03 | christian.heimes | set | messages: + msg201403 |
| 2013-10-26 23:12:36 | sbt | set | messages: + msg201402 |
| 2013-10-26 23:05:55 | christian.heimes | set | files: + atfork_prepare.patch; messages: + msg201401 |
| 2013-10-26 16:45:58 | benjamin.peterson | set | messages: + msg201371 |
| 2013-10-26 15:42:17 | christian.heimes | set | messages: + msg201368 |
| 2013-10-21 12:50:45 | neologix | set | messages: + msg200771 |
| 2013-10-20 18:08:58 | christian.heimes | set | files: + unlock_ssl_locks.patch; keywords: + patch; messages: + msg200601 |
| 2013-10-19 13:07:56 | neologix | set | messages: + msg200415 |
| 2013-10-19 12:57:08 | christian.heimes | set | priority: critical -> release blocker; versions: + Python 2.7; nosy: + benjamin.peterson, larry, georg.brandl; messages: + msg200414 |
| 2013-10-17 23:54:00 | sbt | set | messages: + msg200176 |
| 2013-10-17 20:42:53 | sbt | set | messages: + msg200166 |
| 2013-10-17 18:02:14 | r.david.murray | set | messages: + msg200152 |
| 2013-10-17 17:34:03 | sbt | set | messages: + msg200147 |
| 2013-10-17 16:18:24 | r.david.murray | set | messages: + msg200143 |
| 2013-10-17 14:52:09 | sbt | set | messages: + msg200131 |
| 2013-10-12 23:55:08 | python-dev | set | nosy: + python-dev; messages: + msg199643 |
| 2013-10-12 10:56:32 | sbt | set | messages: + msg199546 |
| 2013-10-12 05:58:50 | neologix | set | messages: + msg199540 |
| 2013-10-12 05:55:09 | neologix | set | messages: + msg199539 |
| 2013-10-12 01:34:41 | r.david.murray | set | messages: + msg199534 |
| 2013-10-12 01:30:48 | vstinner | set | nosy: + neologix; messages: + msg199533 |
| 2013-10-12 01:23:48 | r.david.murray | set | messages: + msg199529 |
| 2013-10-11 21:41:37 | r.david.murray | set | messages: + msg199509 |
| 2013-10-11 21:02:39 | vstinner | set | nosy: + vstinner |
| 2013-10-11 18:51:12 | pitrou | create | |