[argobots-discuss] help understanding mutex_unlock logic

Phil Carns carns at mcs.anl.gov
Mon Aug 14 15:49:43 CDT 2017


I have a bit more detail on what's going wrong here, but no root cause 
yet.  The ABTI_thread_queue data structure is definitely corrupted in my 
case.  I've traced a call to ABTI_thread_htable_pop() where, on entry to 
the function, p_queue->num_threads == 1, p_queue->head == NULL, and 
p_queue->tail != NULL.

There were 3 pushes and 2 pops before that point, so I think the 
num_threads and tail variables are correct, but the head variable is 
wrong.  With one element left in the queue, head and tail should both 
point to that same element.
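
To spell out the invariant I have in mind, here is a minimal sketch in C. 
It is only an illustration under stated assumptions: node_t, queue_t, 
queue_push(), queue_pop(), and queue_is_consistent() are hypothetical 
names, and the queue is a plain NULL-terminated singly linked list, which 
is not necessarily how ABTI_thread_queue is laid out internally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real Argobots structures. */
typedef struct node {
    struct node *next;
} node_t;

typedef struct {
    node_t *head;        /* oldest element, popped first */
    node_t *tail;        /* newest element, pushed last  */
    size_t  num_threads; /* number of queued elements    */
} queue_t;

/* Append n at the tail. */
static void queue_push(queue_t *q, node_t *n)
{
    n->next = NULL;
    if (q->tail == NULL)
        q->head = n;          /* queue was empty */
    else
        q->tail->next = n;
    q->tail = n;
    q->num_threads++;
}

/* Remove and return the head, or NULL if head is not set. */
static node_t *queue_pop(queue_t *q)
{
    node_t *n = q->head;
    if (n == NULL)
        return NULL;
    q->head = n->next;
    if (q->head == NULL)
        q->tail = NULL;       /* queue became empty */
    q->num_threads--;
    n->next = NULL;
    return n;
}

/* Return 1 if head, tail, and num_threads agree with each other. */
static int queue_is_consistent(const queue_t *q)
{
    const node_t *p = q->head;
    const node_t *last = NULL;
    size_t n = 0;

    if (q->num_threads == 0)
        return q->head == NULL && q->tail == NULL;
    if (q->head == NULL || q->tail == NULL)
        return 0;             /* the state described above fails here */

    /* Walk the list, bounded so that a stray cycle cannot loop forever. */
    while (p != NULL && n <= q->num_threads) {
        last = p;
        p = p->next;
        n++;
    }
    return n == q->num_threads && last == q->tail;
}
```

In this sketch, 3 pushes followed by 2 pops leave num_threads == 1 with 
head == tail, and the check passes.  Setting head to NULL afterwards 
mimics the state I observed and makes the check fail.  Calling a check 
like this after each push and pop would help narrow down when the 
corruption first appears.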

This could be memory corruption in my own code, though.  I'll keep digging.

thanks,
-Phil

On 08/11/2017 05:14 PM, Phil Carns wrote:
> Hi all,
>
> I'm trying to debug a problem with a custom scheduler 
> (https://xgitlab.cels.anl.gov/sds/abt-snoozer/issues/8).  To make a 
> long story short, sometimes we can trigger a scenario where unlocking 
> a mutex does not wake up any ULTs that are blocked on it.
>
> I've traced something suspicious in ABTI_mutex_wake_de() path leading 
> up to the deadlock, but I'm not sure if this is an Argobots bug or if 
> I just don't understand the logic.  Can someone help sanity check this?
>
> In the ABTI_mutex_wake_de() invocation just before the hang, the value 
> of num_elem is 1 at this line:
>
> https://github.com/pmodels/argobots/blob/master/src/mutex.c#L769
>
> ... but then after checking the high and low priority lists it falls 
> through to here, having not found anything to wake up:
>
> https://github.com/pmodels/argobots/blob/master/src/mutex.c#L802
>
> Is that code path supposed to be possible?
>
> It seems counterintuitive that the count could be > 0 while no thread 
> can be found to wake up, but there might be a more subtle meaning to 
> num_elem.  The particular scheduler that I am debugging will be 
> blocking/sleeping until it finds work to do, so it's important that 
> this path ultimately triggers a pool push or else it won't make 
> progress.  That may not be an issue with other schedulers.
>
> thanks!
> -Phil
