classification
Title: Forking in a thread raises RuntimeError
Type: behavior Stage:
Components: Extension Modules Versions: Python 3.1, Python 3.2, Python 2.7, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: gregory.p.smith Nosy List: Oren_Held, csernazs, dagobert, doughellmann, gregory.p.smith, jednaszewski, loewis, pitrou, twouters
Priority: high Keywords: patch

Created on 2009-10-30 10:58 by csernazs, last changed 2010-12-18 16:55 by Oren_Held. This issue is now closed.

Files
File name Uploaded Description Edit
thread_test.py csernazs, 2009-10-30 10:57 A test demonstrating the issue.
patch_1.diff csernazs, 2009-11-02 15:51 A possible fix for this issue created for python 2.6.4.
patch_2.diff csernazs, 2010-01-26 13:27 Patch re-initializing import_lock on all platforms after fork().
thread_unittest.py csernazs, 2010-02-26 10:18 Unittest for the patch
7242_unittest.diff jednaszewski, 2010-02-26 20:27 Unit test that tests the issue
issue7242-gps01.diff gregory.p.smith, 2010-02-28 09:01 update of patch_2 + thread_unittest
Messages (26)
msg94701 - (view) Author: Zsolt Cserna (csernazs) Date: 2009-10-30 10:57
Python 2.6.4 (r264:75706, Oct 29 2009, 12:00:12) [C] on sunos5

On my sunos5 system if I call os.fork() in a thread created by
thread.start_new_thread(), it raises RuntimeError with the message 'not
holding the import lock'. It's a vanilla python compiled on sunos5 with
suncc. In 2.6.3 it's ok, I think this issue is caused by patch located
at http://codereview.appspot.com/96125/show.

The problem can be re-produced by running the script attached.

I've looked into the source and it seems to me the following:

Based on the the change above, it calls fork1() system call between a
lock-release calls:

3635 » _PyImport_AcquireLock();
3636 » pid = fork1();
3637 » result = _PyImport_ReleaseLock();

_PyImport_ReleaseLock is called in both the child and parent. Digging
more into the code, _PyImport_ReleaseLock starts with the following:

        long me = PyThread_get_thread_ident();
        if (me == -1 || import_lock == NULL)
                return 0; /* Too bad */
        if (import_lock_thread != me)
                return -1;

In the above code, if I interpret correctly, it compares the result of
the current thread id returned by PyThread_get_thread_ident call with
the thread id of the thread holding the lock - if it's different then
the current thread cannot release the lock because it's not owned by it.

Based on my results on solaris the 'me' variable will be different in
the parent and in the child (in parent it remains the same) - resulting
that the child thinks that it's not holding the release lock.

I'm using pthreads on both linux and solaris. On linux the code above is
working fine, but on solaris it behaves differently.
msg94702 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-10-30 11:06
Well, there has been no such change between 2.6.3 and 2.6.4.
msg94704 - (view) Author: Zsolt Cserna (csernazs) Date: 2009-10-30 11:41
Sorry, the working version is not 2.6.3 (I mistyped the version), it's
2.6.1 (I've no info about 2.6.2).
msg94765 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-10-31 19:50
This only appears to happen on Solaris.  What version of Solaris are you 
using?  (i doubt that matters, i expect it happens on all versions)

I haven't look closely enough at the code yet, but reinitializing the 
import lock in the child process should make sense here.
msg94820 - (view) Author: Zsolt Cserna (csernazs) Date: 2009-11-02 08:31
I've tested it only on solaris 8, 32-bit.
msg94822 - (view) Author: Zsolt Cserna (csernazs) Date: 2009-11-02 09:44
solaris 10 x86, 32-bit, sun-studio 11 is ok (in this case the parent's
thread has thread_id=2 and the child inherits this id)
solaris 8 sparc4, 32-bit, sun-studio 11 is not working

So it seems it's independent from sun-cc but depends from the
architecture and/or the OS.
msg94832 - (view) Author: Zsolt Cserna (csernazs) Date: 2009-11-02 15:51
I've attached a patch which seems to fix this issue. It sets
import_lock_thread to the current thread id after forking in the child
process, but still I'm not quite sure that it's the correct way of
solving this issue.
msg95075 - (view) Author: Zsolt Cserna (csernazs) Date: 2009-11-09 12:38
Additional info:
I've tested it on solaris 10 / sparc 32-bit, and my test script runs
fine on that.
Based on my test it seems that this bug does not affect solaris 10.
msg98148 - (view) Author: Zsolt Cserna (csernazs) Date: 2010-01-22 15:09
Could you please provide any advise on this bug?
I've submitted my patch, would be curious if there any chance to get it merged.
msg98211 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-01-24 10:01
One relevant difference may be the change to the fork semantics in Solaris 10: fork() now means the same as fork1(). Before, fork() meant the same as forkall(), unless the applications was linked with -lpthread, in which case fork() also meant fork1(). See fork(2).

So I'd be curious if a) the test case passes on Solaris 8 if fork1 is being used, and b) whether it passes if you make sure -lpthread is being used. If so, I'd rather fix this issue by changing the linkage for Solaris 8 (or declaring that system as unsupported altogether), instead of fiddling yet again with the fork handling. In Python, posix.fork definitely needs to mean "POSIX fork".

If it's something else (i.e. the thread ID changes in Solaris 8 whether fork1 or forkall is used), then the patch is fine in principle. However, I would always reset the thread ID. In your patch, it is not clear why you apply this resetting to Solaris and AIX, but not other systems.
msg98287 - (view) Author: Zsolt Cserna (csernazs) Date: 2010-01-25 13:32
I compile it with -lpthread.
os.fork1() was not available by default, I enabled it by removing two lines from posixmodule.c (it seems it's only enabled when #if defined(__USLC__) && defined(__SCO_VERSION__) is true).

With os.fork1() I have the same results, RuntimeError.

I was not able to compile it without pthread because I haven't found any configure options for that. If it's possible I'm happy to try it.

In my patch I wanted to reduce the effect on systems where forking in thread is working (eg. linux), that's the reason why I added "(defined (__SVR4) && defined (__sun)". But it just checks for solaris, not the OS version (on solaris 10/intel my demo is working).

(btw forking in thread  actually happens in a unittest related to BaseHTTPServer, which obviously fails on my platform)
msg98299 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-01-25 21:15
> In my patch I wanted to reduce the effect on systems where forking in
> thread is working (eg. linux), that's the reason why I added
> "(defined (__SVR4) && defined (__sun)". 

I think that's inappropriate, please change that.
msg98335 - (view) Author: Zsolt Cserna (csernazs) Date: 2010-01-26 13:27
Ok, here's the new patch. I've removed the #ifdef-#endif lines. It passed the test thread_test.py on linux (and as well on solaris).
msg99202 - (view) Author: Dagobert Michelsen (dagobert) Date: 2010-02-11 08:35
I verified patch_2.diff on Solaris 10 w/SOS11 and it actually resolves a number of issues I had with Mercurial.
msg100090 - (view) Author: Greg Jednaszewski (jednaszewski) Date: 2010-02-25 16:20
I verified that patch_2.diff resolves the problem for me on Solaris 9/SPARC.
msg100100 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-02-25 19:45
The patch should really add an unit test for this.
msg100143 - (view) Author: Zsolt Cserna (csernazs) Date: 2010-02-26 10:18
There's a bundled unittest in test_httpservers which actually fails if this patch is not applied.

See the unittest attached. Unfortunatelly RuntimError is raised in the child process after fork() so I cannot re-raise it to the parent, instead I send a message from the child to the parent via a pipe if the child is ok.
msg100144 - (view) Author: Zsolt Cserna (csernazs) Date: 2010-02-26 10:20
I've run unittest with both patched and vanilla versions of python 2.6.4 and I confirm that it fails when it's run by vanilla on solaris 8, but passes on the patched version.
msg100171 - (view) Author: Greg Jednaszewski (jednaszewski) Date: 2010-02-26 20:27
I spent some time working on and testing a unit test as well.  It's the same basic idea as Zsolt Cserna's, but with a slightly different approach.  See 7242_unittest.diff.  My unittest fails pre-patch and succeeds post-patch.

However, I still have reservations about the patch.  The existing test test_threading.ThreadJoinOnShutdown.test_3_join_in_forked_from_thread hangs with the patch in place.

Vanilla 2.6.2 - test passes
Vanilla 2.6.4 - test fails
Patched 2.6.4 - test hangs

Note: the code of the test_threading test is identical in all 3 cases.  I'd feel more confident about the patch if this test didn't hang with the patch in place.
msg100200 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2010-02-28 09:01
Using issue7242-gps01.diff on release26-maint and a freshly downloaded opensolaris 2009-06 VM test_thread, test_threading and test_subprocess all pass for me both before -and- after the patch.  Nor does the original thread_test.py cause the problem for me (meaning I'm unable to reproduce the problem in this VM in the first place so... someone else needs to)

Regardless, I altered patch_2 a bit in that diff to do what I though _PyImport_ReInitLock() should really do.  I also added the thread_unittest testcase to that diff.

% uname -a
SunOS opensolaris-vm 5.11 snv 111b i86pc i386 i86pc Solaris"

Can someone who could reproduce the problem in the first place please test that patch.

The logic change makes sense to me.  I don't know why test_threading.ThreadJoinOnShutdown.test_3_join_in_forked_from_thread would be changing behavior for you.
msg100204 - (view) Author: Greg Jednaszewski (jednaszewski) Date: 2010-02-28 15:22
The problem only seems to appear on Solaris 9 and earlier.  I'll try to test the updated patch tonight or tomorrow and let you know what I find.
msg100211 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2010-02-28 17:50
If you have a chance tonight that'd be awesome.  I'd love to get this in before 2.6.5rc1 (being cut tomorrow) but as its platform specific (and a pretty-old platform at that) its not worth holding up the release.
msg100231 - (view) Author: Greg Jednaszewski (jednaszewski) Date: 2010-03-01 00:56
I tested the updated patch, and the new unit test passes on my Sol 8 sparc, but the test_threading test still hangs on my system.  However, given that the test is skipped on several platforms and it does work on more relevant versions of Solaris, it's probably OK.  It's possible that an OS bug is causing that particular hang.

Plus, the original patch fixed the 'real world' scenario I was running into, so I'd like to see it get into the release candidate if you're OK with it.
msg100233 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2010-03-01 01:23
Agreed.

Committed to trunk r78527.  I'll wait for the buildbots before merging to release26-maint.
msg100245 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2010-03-01 04:41
r78543 backports this to release26-maint, it should be in 2.6.5 assuming no show stopper issue comes up with this during the release candidates.
msg124294 - (view) Author: Oren Held (Oren_Held) Date: 2010-12-18 16:55
Just adding some info:
This bug is not Solaris-specific; I reproduced it on HP-UX 11.31.

On Python 2.6.4, thread_test.py fails with the same RunTime error exception.

On Python 2.6.6, it passes and things look good.
History
Date User Action Args
2010-12-18 16:55:12Oren_Heldsetnosy: + Oren_Held
messages: + msg124294
2010-03-01 04:41:35gregory.p.smithsetstatus: open -> closed
resolution: fixed
messages: + msg100245
2010-03-01 01:23:29gregory.p.smithsetmessages: + msg100233
2010-03-01 00:56:43jednaszewskisetmessages: + msg100231
2010-02-28 17:50:17gregory.p.smithsetmessages: + msg100211
2010-02-28 15:22:30jednaszewskisetmessages: + msg100204
2010-02-28 09:01:18gregory.p.smithsetfiles: + issue7242-gps01.diff
priority: normal -> high
nosy: loewis, twouters, gregory.p.smith, csernazs, pitrou, doughellmann, dagobert, jednaszewski
messages: + msg100200

components: + Extension Modules, - Library (Lib)
2010-02-26 20:27:33jednaszewskisetfiles: + 7242_unittest.diff

messages: + msg100171
2010-02-26 10:20:20csernazssetmessages: + msg100144
2010-02-26 10:18:17csernazssetfiles: + thread_unittest.py

messages: + msg100143
2010-02-25 19:45:03pitrousetmessages: + msg100100
2010-02-25 16:20:16jednaszewskisetnosy: + jednaszewski
messages: + msg100090
2010-02-25 00:39:15doughellmannsetnosy: + doughellmann
2010-02-11 08:35:04dagobertsetnosy: + dagobert
messages: + msg99202
2010-01-26 13:27:08csernazssetfiles: + patch_2.diff

messages: + msg98335
2010-01-25 21:15:29loewissetmessages: + msg98299
2010-01-25 13:32:41csernazssetmessages: + msg98287
2010-01-24 10:01:45loewissetmessages: + msg98211
2010-01-22 16:11:37r.david.murraysetnosy: + loewis
2010-01-22 15:09:30csernazssetmessages: + msg98148
2009-11-09 12:38:08csernazssetmessages: + msg95075
2009-11-02 15:51:54csernazssetfiles: + patch_1.diff
keywords: + patch
messages: + msg94832
2009-11-02 09:44:50csernazssetmessages: + msg94822
2009-11-02 08:31:13csernazssetmessages: + msg94820
2009-10-31 19:50:03gregory.p.smithsetpriority: normal
versions: + Python 3.1, Python 2.7, Python 3.2
nosy: + twouters

messages: + msg94765

assignee: gregory.p.smith
2009-10-30 11:41:27csernazssetmessages: + msg94704
2009-10-30 11:06:41pitrousetnosy: + gregory.p.smith, pitrou
messages: + msg94702
2009-10-30 10:58:01csernazscreate