Author neologix
Recipients DazWorrall, aconrad, alex, andrix, brian.curtin, carljm, coderanger, cool-RR, dabeaz, djc, durin42, eric.araujo, eric.smith, flox, gregory.p.smith, jcea, jhylton, karld, kevinwatters, konryd, larry, loewis, mahmoudimus, movement, neologix, nirai, pitrou, rcohen, rh0dium, salgado, tarek, thouis, ysj.ray
Date 2010-04-27.09:23:28
SpamBayes Score 9.48741e-13
Marked as misclassified No
Message-id <1272360214.21.0.902766014317.issue7946@psf.upfronthosting.co.za>
In-reply-to
Content
@dabeaz
I'm getting random segfaults with your patch (even with the last one), pretty much everywhere malloc or free is called.
Ater skimming through the code, I think the problem is due to gil_last_holder:
In drop_gil and take_gil, you dereference gil_last_holder->cpu_bound, but it might very well happen that gil_last_holder points to a thread that has been deleted (through tstate_delete_common). Dereferencing is not risky, because there's a high chance that the address is still valid, but in drop_gil, you do this:
    /* Make the thread as CPU-bound or not depending on whether it was forced off */
    gil_last_holder->cpu_bound = gil_drop_request;

Here, if the thread has been deleted in meantine, you end up writting to a random location on the heap, and probably corrupting malloc administration data, which would explain why I get segfaults sometimes later on unrelated malloc() or free() calls.
I looked at it really quickly though, so please forgive me if I missed something obvious ;-)

@nirai
I have some more remarks on your patch:
- /* Diff timestamp capping results to protect against clock differences 
 * between cores. */
_LOCAL(long double) _bfs_diff_ts(long double ts1, long double ts0) {

I'm not sure I understand. You can have problem with multiple cores when reading directly the TSC register, but that doesn't affect gettimeofday. gettimeofday should be reliable and accurate (unless the OS is broken of course), the only issue is that since it's wall clock time, if a process like ntpd is running, then you'll run into problem
- pretty much all your variables are declared as volatile, but volatile was never meant as a thread-synchronization primitive. Since your variables are protected by mutexes, you already have all necessary memory barriers and synchronization, so volatile just prevents optimization
- you use some funtions just to perform a comparison or substraction, maybe it would be better to just remove those functions and perform the substractions/comparisons inline (you declared the functions inline but there's no garantee that the compiler will honor it).
- did you experiment with the time slice ? I tried some higher values and got better results, without penalizing the latency. Maybe it could be interesting to look at it in more detail (and on various platforms).
History
Date User Action Args
2010-04-27 09:23:34neologixsetrecipients: + neologix, loewis, jhylton, gregory.p.smith, jcea, pitrou, movement, larry, eric.smith, kevinwatters, tarek, djc, karld, carljm, coderanger, durin42, eric.araujo, nirai, alex, andrix, konryd, brian.curtin, flox, DazWorrall, salgado, cool-RR, rh0dium, rcohen, dabeaz, mahmoudimus, aconrad, ysj.ray, thouis
2010-04-27 09:23:34neologixsetmessageid: <1272360214.21.0.902766014317.issue7946@psf.upfronthosting.co.za>
2010-04-27 09:23:30neologixlinkissue7946 messages
2010-04-27 09:23:28neologixcreate