[argobots-discuss] help understanding mutex_unlock logic

Halim Amer aamer at anl.gov
Tue Aug 15 09:21:29 CDT 2017


Hi Phil,

Thank you for the detailed report and for the PR. I will review the 
problem and the PR promptly and get back to you.

Best,

Halim
www.mcs.anl.gov/~aamer

On 8/14/17 8:42 PM, Phil Carns wrote:
> I just opened a pull request that fixes the problem at
> https://github.com/pmodels/argobots/pull/22.  The issue ended up being a
> field that's not initialized properly when a thread is pushed onto a
> queue, but is needed by the pop logic later if there is more than one
> item in the queue.
>
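
To illustrate the class of bug described above, here is a minimal sketch in C.
It is not the Argobots code: the message does not say which field was left
uninitialized, so the `next` link, the `node_t` and `queue_t` types, and both
functions are hypothetical stand-ins for a generic intrusive FIFO.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical intrusive FIFO node; NOT the Argobots thread descriptor. */
typedef struct node {
    struct node *next; /* assumed link field, may be stale on a reused node */
    int id;
} node_t;

typedef struct {
    node_t *head;
    node_t *tail;
    size_t count;
} queue_t;

static void queue_push(queue_t *q, node_t *n)
{
    /* Reset the link on every push. If this line is missing, a reused node
     * keeps whatever value it had before, and pop follows a bogus pointer
     * as soon as the queue holds more than one item. */
    n->next = NULL;
    if (q->tail != NULL)
        q->tail->next = n;
    else
        q->head = n;
    q->tail = n;
    q->count++;
}

static node_t *queue_pop(queue_t *q)
{
    node_t *n = q->head;
    if (n == NULL)
        return NULL;
    q->head = n->next; /* depends on the link written by queue_push */
    if (q->head == NULL)
        q->tail = NULL;
    n->next = NULL;
    q->count--;
    return n;
}
```

With a single item in the queue the stale link is never read in a way that
matters, which would match the observation that the problem only shows up
when more than one item is queued.
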
> I still think that
> https://github.com/pmodels/argobots/blob/master/src/mutex.c#L802 should
> probably be an assertion rather than a silent return. A lock is held all
> the way from checking the element count to failing to find a thread, so
> hitting that code path looks like an inconsistent state. I did not
> address that in the pull request, though. I'd like someone to sanity
> check that part :)
>
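
A rough sketch of the assertion idea, under stated assumptions: the
`wait_lists_t` layout, the `high`/`low` list names, and `wake_one()` are
invented for illustration and do not reproduce ABTI_mutex_wake_de(). The
caller is assumed to hold the lock that protects the lists for the whole call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical waiter record and wait lists; NOT the Argobots types. */
typedef struct waiter {
    struct waiter *next;
    int id;
} waiter_t;

typedef struct {
    waiter_t *high;  /* assumed high-priority list */
    waiter_t *low;   /* assumed low-priority list */
    size_t num_elem; /* number of waiters across both lists */
} wait_lists_t;

/* Detach and return the first waiter of a list, or NULL if it is empty. */
static waiter_t *take_first(waiter_t **list)
{
    waiter_t *w = *list;
    if (w != NULL) {
        *list = w->next;
        w->next = NULL;
    }
    return w;
}

/* Caller must hold the lock protecting `w` for the whole call. */
static waiter_t *wake_one(wait_lists_t *w)
{
    waiter_t *t;

    if (w->num_elem == 0)
        return NULL; /* nothing is waiting, so a quiet return is fine */

    t = take_first(&w->high);
    if (t == NULL)
        t = take_first(&w->low);

    /* The count said a waiter exists and the lock was held throughout,
     * so an empty result means the bookkeeping is corrupted. */
    assert(t != NULL && "num_elem > 0 but no waiter found");

    w->num_elem--;
    return t;
}
```

In a debug build the assertion would stop at the point of inconsistency
instead of leaving the blocked ULTs waiting forever.
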
> thanks,
> -Phil
>
> On 08/14/2017 04:49 PM, Phil Carns wrote:
>> I have a little bit more detail, but no root cause yet, on what's
>> going wrong here.  The ABTI_thread_queue data structure is definitely
>> corrupted in my case.  I've traced a call to ABTI_thread_htable_pop()
>> where (going into the function) p_queue->num_threads == 1,
>> p_queue->head == NULL, and p_queue->tail != NULL.
>>
>> There were 3 pushes and 2 pops before that point, so I think the
>> num_threads and tail variables are correct, but the head variable is
>> wrong.  The head and tail should both point to the same thing.
>>
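
The consistency rule stated above (with one queued thread, head and tail must
point to the same element) can be written as a small debug check. This is
only a sketch: `queue_view_t` is a hypothetical stand-in that mirrors the
three fields named in the message, not the real ABTI_thread_queue, and the
rules for the empty and multi-element cases are assumptions about a
conventional head/tail queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the three fields discussed above. */
typedef struct {
    const void *head;
    const void *tail;
    size_t num_threads;
} queue_view_t;

static bool queue_view_is_consistent(const queue_view_t *q)
{
    if (q->num_threads == 0) /* assumption: an empty queue has no ends */
        return q->head == NULL && q->tail == NULL;
    if (q->head == NULL || q->tail == NULL)
        return false;        /* a non-empty queue needs both ends */
    if (q->num_threads == 1)
        return q->head == q->tail;
    return q->head != q->tail; /* assumption: distinct first and last items */
}

/* Convenience wrapper for debug builds, e.g. on entry to a pop routine. */
static void queue_view_assert_consistent(const queue_view_t *q)
{
    assert(queue_view_is_consistent(q));
}
```

The state reported above (num_threads == 1, head == NULL, tail != NULL)
fails this check at the second test.
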
>> Possibly a memory corruption in my own code, though.  I'll keep digging.
>>
>> thanks,
>> -Phil
>>
>> On 08/11/2017 05:14 PM, Phil Carns wrote:
>>> Hi all,
>>>
>>> I'm trying to debug a problem with a custom scheduler
>>> (https://xgitlab.cels.anl.gov/sds/abt-snoozer/issues/8).  To make a
>>> long story short, sometimes we can trigger a scenario where unlocking
>>> a mutex does not wake up any ULTs that are blocked on it.
>>>
>>> I've traced something suspicious in the ABTI_mutex_wake_de() path
>>> leading up to the deadlock, but I'm not sure if this is an Argobots
>>> bug or if I just don't understand the logic.  Can someone help sanity
>>> check this?
>>>
>>> In the ABTI_mutex_wake_de() invocation just before the hang, the
>>> value of num_elem is 1 at this line:
>>>
>>> https://github.com/pmodels/argobots/blob/master/src/mutex.c#L769
>>>
>>> ... but then after checking the high and low priority lists it falls
>>> through to here, having not found anything to wake up:
>>>
>>> https://github.com/pmodels/argobots/blob/master/src/mutex.c#L802
>>>
>>> Is that code path supposed to be possible?
>>>
>>> It seems counterintuitive that the count could be > 0 when no thread
>>> can be found to wake up, but there might be a more subtle meaning to
>>> num_elem.  The particular scheduler that I am debugging will be
>>> blocking/sleeping until it finds work to do, so it's important that
>>> this path ultimately triggers a pool push or else it won't make
>>> progress.  That may not be an issue with other schedulers.
>>>
>>> thanks!
>>> -Phil
>>>
>>>
>>> _______________________________________________
>>> discuss mailing list
>>> discuss at lists.argobots.org
>>> https://lists.argobots.org/mailman/listinfo/discuss