msg288510 - (view) |
Author: Ján Stanček (Ján Stanček) |
Date: 2017-02-24 09:03 |
Following crash is sporadically observed in RHEL7 anaconda:
(gdb) f 0
#0 PyThread_ReInitTLS () at /usr/src/debug/Python-2.7.5/Python/thread.c:411
411 if (p->id != id) {
(gdb) list
406 keymutex = PyThread_allocate_lock();
407
408 /* Delete all keys which do not match the current thread id */
409 q = &keyhead;
410 while ((p = *q) != NULL) {
411 if (p->id != id) {
412 *q = p->next;
413 free((void *)p);
414 /* NB This does *not* free p->value! */
415 }
(gdb) p p
$1 = (struct key *) 0x3333333333333333
(gdb) p keyhead
$2 = (struct key *) 0x3333333333333333
key list is protected by keymutex (except for PyThread_ReInitTLS), but there doesn't seem to be any protection against concurrent fork(). What seems to happen is fork() at the moment when key list is not consistent.
For example, if I initialize new key to 0xfe:
static struct key *find_key(int key, void *value)
// find_key with extra memset()
...
p = (struct key *)malloc(sizeof(struct key));
memset(p, 0xfe, sizeof(struct key));
if (p != NULL) {
p->id = id;
p->key = key;
p->value = value;
p->next = keyhead;
keyhead = p;
}
...
Looking at disassembly, compiler reordered last 2 writes:
0x00000000004fcb50 <+272>: callq 0x413d10 <malloc@plt>
0x00000000004fcb55 <+277>: movabs $0xfefefefefefefefe,%rcx
0x00000000004fcb5f <+287>: test %rax,%rax
0x00000000004fcb62 <+290>: mov %rcx,(%rax)
0x00000000004fcb65 <+293>: mov %rcx,0x8(%rax)
0x00000000004fcb69 <+297>: mov %rcx,0x10(%rax)
0x00000000004fcb6d <+301>: mov %rcx,0x18(%rax)
0x00000000004fcb71 <+305>: je 0x4fcaff <PyThread_set_key_value+191>
0x00000000004fcb73 <+307>: mov 0x2f1e26(%rip),%rdx # 0x7ee9a0 <keyhead>
0x00000000004fcb7a <+314>: mov 0x2f1dff(%rip),%rdi # 0x7ee980 <keymutex>
0x00000000004fcb81 <+321>: xor %r14d,%r14d
0x00000000004fcb84 <+324>: mov %rbp,0x8(%rax)
0x00000000004fcb88 <+328>: mov %r12d,0x10(%rax)
0x00000000004fcb8c <+332>: mov %r13,0x18(%rax)
0x00000000004fcb90 <+336>: mov %rax,0x2f1e09(%rip) # 0x7ee9a0 <keyhead>
0x00000000004fcb97 <+343>: mov %rdx,(%rax)
Now consider what happens, when different threads call fork() in between these 2 writes: we updated keyhead, but keyhead->next has not been updated yet.
Now when anaconda hangs, I get:
(gdb) list
407 keymutex = PyThread_allocate_lock();
408
409 /* Delete all keys which do not match the current thread id */
410 q = &keyhead;
411 while ((p = *q) != NULL) {
412 if (p->id != id) {
413 *q = p->next;
414 free((void *)p);
415 /* NB This does *not* free p->value! */
416 }
(gdb) p p
$1 = (struct key *) 0xfefefefefefefefe
(gdb) p keyhead
$2 = (struct key *) 0xfefefefefefefefe
Here's how I think we get into this state:
-------------------------> thread 1
# has GIL
Thread.start
_start_new_thread(self.__bootstrap, ())
PyThread_start_new_thread(t_bootstrap)
# spawns thread 3
-------------------------> thread 2
...
# waiting for GIL
-------------------------> thread 3
t_bootstrap
_PyThreadState_Init # does not have GIL yet at this point
_PyGILState_NoteThreadState
PyThread_set_key_value(autoTLSkey, (void *)tstate)
find_key
# key list is temporarily not consistent
# due to compiler reordering couple writes in find_key
-------------------------> thread 1
continuing Thread.start
self.__started.wait()
Event.wait()
self.__cond.wait
Condition.wait()
waiter = _allocate_lock()
waiter.acquire()
lock_PyThread_acquire_lock
Py_BEGIN_ALLOW_THREADS
PyEval_SaveThread
PyThread_release_lock(interpreter_lock);
-------------------------> thread 2
...
# acquired GIL
os.fork() # forks inconsistent list
-------------------------> child
PyOS_AfterFork()
PyThread_ReInitTLS()
SIGSEGV
Attached patch for python makes it easier to reproduce, by adding delays to couple places to make window key list is not consistent larger.
|
msg288753 - (view) |
Author: Petr Viktorin (petr.viktorin) *  |
Date: 2017-03-01 11:14 |
Here is a proof of concept patch from Jaroslav Škarvada. It fixes the problem by holding the mutex used for PyThread_create_key while forking.
To make it more than PoC it needs adding _PyThread_AcquireKeyLock and _ReleaseKeyLock (similar to _PyImport_AcquireLock() etc.) and calling those. Other than that, does this approach look reasonable?
|
msg289723 - (view) |
Author: Charalampos Stratakis (cstratak) * |
Date: 2017-03-16 17:13 |
In order to reproduce:
Apply the python.patch from bz1268226_reproducer2.tar.gz
Compile python
Run the reproduce4.py from bz1268226_reproducer2.tar.gz
As indicated by the reproducer, the status returned by os.wait() for the child is 139.
I will refine a bit the patch and work on a PR.
|
msg289993 - (view) |
Author: Charalampos Stratakis (cstratak) * |
Date: 2017-03-22 15:56 |
Patch for protecting the key list while forking.
|
msg299299 - (view) |
Author: Thomas Mortensson (Thomas Mortensson) |
Date: 2017-07-27 09:40 |
Hey, any status update on this bug? Suffered a similar issue on a Centos 6.5 kernel when spawning multiple processes in a Twisted environment. Is this PR targeted for inclusion into the source tree?
Thanks,
Tom
|
msg309700 - (view) |
Author: Petr Viktorin (petr.viktorin) *  |
Date: 2018-01-09 14:58 |
I'm a bit out of my depth here. Victor, could you chime in?
The problem with Harris' patch is that, once fork() is protected by the thread lock, acquiring that lock (by e.g. calling `PyGILState_GetThisThreadState`) from an `atfork` handler hangs deadlocks.
I thought that can be solved by doing the locking in an atfork handler, but that's not working out -- CPython's pthread_atfork (which would lock _PyThread_AcquireKeyLock for the duration of the fork) would need to be called *after* an extension's pthread_atfork (which needs the thread lock temporarily).
|
msg309705 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2018-01-09 15:58 |
There is a more general issue for any lock and fork(): bpo-6721, "Locks in the standard library should be sanitized on fork".
|
msg309712 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2018-01-09 16:21 |
Python 3 is not affected by this issue because it uses native thread locale storage (TLS):
* pthread: pthread_getspecific() / pthread_setspecific()
* Windows: TlsGetValue() / TlsSetValue()
I'm not sure that it's doable to backport such enhancement, since Python 2.7 supports many thread implementations, not only NT (Windows) and pthread:
* Python/thread_atheos.h
* Python/thread_beos.h
* Python/thread_cthread.h
* Python/thread_lwp.h
* Python/thread_nt.h
* Python/thread_os2.h
* Python/thread_pth.h
* Python/thread_pthread.h
* Python/thread_sgi.h
* Python/thread_solaris.h
* Python/thread_wince.h
Maybe it's doable for a Linux vendor, but it's going to be a large change that has to be maintained downstream :-/
|
msg309725 - (view) |
Author: Petr Viktorin (petr.viktorin) *  |
Date: 2018-01-09 19:18 |
Gah! The more I look into locks & forks ... the more I learn, to put it mildly.
Instead of piling on workarounds, I'll try my hand at using native thread-local storage for pthread, and avoid the locking altogether.
Hopefully that can make it in as a bugfix? At least for this bug, it most likely *is* the most appropriate fix -- though I can only fix it for pthread.
|
msg309731 - (view) |
Author: Petr Viktorin (petr.viktorin) *  |
Date: 2018-01-09 19:45 |
WIP pull request: https://github.com/python/cpython/pull/5141
(I'll not get back to this for a few days, sadly.)
|
msg316649 - (view) |
Author: Stéphane Wirtel (matrixise) *  |
Date: 2018-05-15 12:22 |
Hi Petr,
Do you continue this patch/issue?
Thank you
|
msg316655 - (view) |
Author: Petr Viktorin (petr.viktorin) *  |
Date: 2018-05-15 13:31 |
Not immediately, but it is on my TODO list.
If anyone wants to tackle it in the mean time, I'd be happy to answer any questions
|
msg317077 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2018-05-19 00:20 |
Oh, it seems like I was wrong in my previous comment. Python 2.7 code base is already designed to support native TLS. It's just that we only implement native TLS on Windows. So yeah, it seems doable to implement native TLS for pthread.
History of Py_HAVE_NATIVE_TLS:
commit 8d98d2cb95ac37147a4de5a119869211e8351324
Author: Mark Hammond <mhammond@skippinet.com.au>
Date: Sat Apr 19 15:41:53 2003 +0000
New PyGILState_ API - implements pep 311, from patch 684256.
commit 00f2df495a6fcb40d70243c34f296f26ccc72719
Author: Kristján Valur Jónsson <kristjan@ccpgames.com>
Date: Fri Jan 9 20:03:27 2009 +0000
Issue 3582. Improved thread support and TLS for Windows
|
msg317184 - (view) |
Author: Petr Viktorin (petr.viktorin) *  |
Date: 2018-05-20 16:28 |
pthread is not generally compatible with int, so it can't be used with Python 2's API.
This was changed in py3 with PEP 539, which however makes it so that the old TLS API is only available if PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT.
|
msg317273 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2018-05-22 13:02 |
> the old TLS API is only available if PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT.
PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT is defined on most (pthread) platforms, no? I understood that the PEP 539 is mostly designed for Cygwin, a platform which is not officially supported by Python. At least, PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT is set to 1 on my Fedora 27 (Linux).
I propose to cast pthread_key_create() result to int, but only define PyThread_create_key() in Python/thread_pthread.h if PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT is defined.
It means that the pthread implementation of Python would still have this bug (race condition) if PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT is not defined. But backporting the PEP 539 to Python 2.7 doesn't seem worth it.
What do you think?
|
msg317276 - (view) |
Author: Petr Viktorin (petr.viktorin) *  |
Date: 2018-05-22 13:18 |
> PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT is defined on most (pthread) platforms, no?
> I propose to cast pthread_key_create() result to int, but only define PyThread_create_key() in Python/thread_pthread.h if PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT is defined.
I don't think that's a good idea. Changing API, even for platforms that aren't officially supported, sounds very harsh this late in the release cycle.
But! I suppose we could fix the bug only for platforms with PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT. Other platforms would keep the current implementation -- they'd still have the bug, but the API would stay unchanged.
|
msg317291 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2018-05-22 14:02 |
> But! I suppose we could fix the bug only for platforms with PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT.
Yes, this is my proposal.
>> I propose to cast pthread_key_create() result to int, but only define PyThread_create_key() in Python/thread_pthread.h if PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT is defined.
> I don't think that's a good idea. Changing API, even for platforms that aren't officially supported, sounds very harsh this late in the release cycle.
Which API change? I don't propose to modify the existing public C API "int PyThread_create_key(void)". I only propose to change it's implementation to the native pthread API when PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT is defined.
|
msg367389 - (view) |
Author: Zachary Ware (zach.ware) *  |
Date: 2020-04-27 04:19 |
As 2.7 is now EOL, I'm closing the issue.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:58:43 | admin | set | github: 73826 |
2020-04-27 04:19:17 | zach.ware | set | status: open -> closed
nosy:
+ zach.ware messages:
+ msg367389
resolution: out of date stage: patch review -> resolved |
2018-05-22 14:02:36 | vstinner | set | messages:
+ msg317291 |
2018-05-22 13:18:02 | petr.viktorin | set | messages:
+ msg317276 |
2018-05-22 13:02:13 | vstinner | set | messages:
+ msg317273 |
2018-05-20 16:28:17 | petr.viktorin | set | messages:
+ msg317184 |
2018-05-19 00:20:50 | vstinner | set | messages:
+ msg317077 |
2018-05-15 13:31:06 | petr.viktorin | set | messages:
+ msg316655 |
2018-05-15 12:22:45 | matrixise | set | nosy:
+ matrixise messages:
+ msg316649
|
2018-01-09 19:45:42 | petr.viktorin | set | messages:
+ msg309731 |
2018-01-09 19:44:33 | petr.viktorin | set | stage: patch review pull_requests:
+ pull_request5000 |
2018-01-09 19:18:38 | petr.viktorin | set | messages:
+ msg309725 |
2018-01-09 16:21:41 | vstinner | set | messages:
+ msg309712 |
2018-01-09 15:58:32 | vstinner | set | messages:
+ msg309705 |
2018-01-09 14:58:10 | petr.viktorin | set | messages:
+ msg309700 |
2017-07-27 09:40:04 | Thomas Mortensson | set | nosy:
+ Thomas Mortensson messages:
+ msg299299
|
2017-07-19 10:03:13 | vstinner | set | nosy:
+ vstinner
|
2017-04-05 16:04:29 | pitrou | set | nosy:
+ neologix
|
2017-04-05 04:54:10 | rhettinger | set | nosy:
+ davin
|
2017-03-23 13:39:44 | cstratak | set | pull_requests:
+ pull_request687 |
2017-03-22 15:56:56 | cstratak | set | files:
+ 0001-Protect-key-list-during-fork.patch
messages:
+ msg289993 |
2017-03-16 17:13:56 | cstratak | set | files:
+ python.patch
messages:
+ msg289723 |
2017-03-01 11:14:07 | petr.viktorin | set | files:
+ fork_hold_keymutex.patch
nosy:
+ petr.viktorin messages:
+ msg288753
keywords:
+ patch |
2017-02-24 15:56:38 | cstratak | set | nosy:
+ cstratak
|
2017-02-24 09:03:58 | Ján Stanček | create | |