[argobots-discuss] help understanding mutex_unlock logic

Phil Carns carns at mcs.anl.gov
Fri Aug 11 16:14:38 CDT 2017


Hi all,

I'm trying to debug a problem with a custom scheduler 
(https://xgitlab.cels.anl.gov/sds/abt-snoozer/issues/8).  To make a long 
story short, sometimes we can trigger a scenario where unlocking a mutex 
does not wake up any ULTs that are blocked on it.

I've traced something suspicious in the ABTI_mutex_wake_de() path 
leading up to the deadlock, but I'm not sure whether this is an Argobots 
bug or whether I just don't understand the logic.  Can someone help 
sanity check this?

In the ABTI_mutex_wake_de() invocation just before the hang, the value 
of num_elem is 1 at this line:

https://github.com/pmodels/argobots/blob/master/src/mutex.c#L769

... but then after checking the high and low priority lists it falls 
through to here, having not found anything to wake up:

https://github.com/pmodels/argobots/blob/master/src/mutex.c#L802

Is that code path supposed to be possible?

It seems counterintuitive that the count could be > 0 while there is no 
thread to wake up, but there might be a more subtle meaning to 
num_elem.  The particular scheduler that I am debugging blocks (sleeps) 
until it finds work to do, so it's important that this path ultimately 
triggers a pool push; otherwise the scheduler won't make progress.  
That may not be an issue with other schedulers.
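
To make the question concrete, here is a toy model of the path being 
asked about.  It is only a sketch and is not the Argobots 
implementation: the types wait_queue and waiter and the function 
wake_one() are hypothetical names invented for illustration, and the 
only assumption carried over from the description above is that a 
counter is checked first and then a high and a low priority list are 
searched.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT Argobots data structures. */
typedef struct waiter {
    struct waiter *next;
    bool woken;            /* set when the waiter has been handed back to a pool */
} waiter;

typedef struct {
    int num_elem;          /* number of waiters believed to be queued */
    waiter *high;          /* high priority waiters */
    waiter *low;           /* low priority waiters */
} wait_queue;

/* Detach and return the head of a list, or NULL if the list is empty. */
static waiter *pop(waiter **head)
{
    waiter *w = *head;
    if (w != NULL) {
        *head = w->next;
        w->next = NULL;
    }
    return w;
}

/* Try to wake one waiter.  Returns true only if a waiter was actually woken. */
static bool wake_one(wait_queue *q)
{
    if (q->num_elem == 0)
        return false;      /* nothing is waiting */

    waiter *w = pop(&q->high);
    if (w == NULL)
        w = pop(&q->low);

    if (w == NULL)
        return false;      /* count > 0 but both lists empty: the fall-through */

    q->num_elem--;
    w->woken = true;       /* stands in for pushing the ULT back to a pool */
    return true;
}
```

In this model, calling wake_one() on a queue whose num_elem is 1 but 
whose lists are both empty returns false without waking anything, which 
corresponds to the fall-through described above.  With a scheduler that 
sleeps until a pool push occurs, nothing would then wake it.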

thanks!
-Phil
