[argobots-discuss] Possible issue

Iwasaki, Shintaro siwasaki at anl.gov
Thu Nov 1 18:43:01 CDT 2018


Hello, Polykarpos,


I am Shintaro Iwasaki, and I am currently working on Argobots.

Thank you for reporting an issue!


(0. Bug)


ES1 pushes ULT1 in Pool1 <= (according to the SPMC rule, only ES1 can be a single producer of Pool1)

ES1 pops ULT1 from Pool1

ES1 suspends ULT1 to wait for something (e.g., ABT_self_suspend, ABT_thread_join, or ABT_mutex_lock)

ES2 resumes ULT1 <= In this operation, ES2 pushes ULT1 to Pool1 (breaking the SPMC rule)
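
The four steps above can be replayed with a toy pool that checks who is pushing. This is only a single-threaded sketch under stated assumptions: toy_spmc_pool, toy_pool_push, toy_pool_pop and replay_scenario are hypothetical names, not Argobots functions, and the owner check merely imitates what a producer/consumer error check could do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_CAPACITY 8

/* Toy SPMC pool (hypothetical, not an Argobots type):
 * only the execution stream with rank owner_rank may push. */
typedef struct {
    int owner_rank;           /* rank of the single allowed producer */
    int units[POOL_CAPACITY]; /* ULT ids, used as a simple stack */
    size_t count;
} toy_spmc_pool;

static void toy_pool_init(toy_spmc_pool *pool, int owner_rank)
{
    pool->owner_rank = owner_rank;
    pool->count = 0;
}

/* Returns false when the caller is not the owner (SPMC rule violated). */
static bool toy_pool_push(toy_spmc_pool *pool, int caller_rank, int ult_id)
{
    if (caller_rank != pool->owner_rank)
        return false;
    assert(pool->count < POOL_CAPACITY);
    pool->units[pool->count++] = ult_id;
    return true;
}

/* Any execution stream may pop; returns false when the pool is empty. */
static bool toy_pool_pop(toy_spmc_pool *pool, int *ult_id)
{
    if (pool->count == 0)
        return false;
    *ult_id = pool->units[--pool->count];
    return true;
}

/* Replays the scenario; returns true if the resume by ES2 was rejected. */
static bool replay_scenario(void)
{
    toy_spmc_pool pool1;
    int ult = -1;

    toy_pool_init(&pool1, 1);                  /* Pool1 belongs to ES1 */
    bool pushed = toy_pool_push(&pool1, 1, 1); /* ES1 pushes ULT1 */
    bool popped = toy_pool_pop(&pool1, &ult);  /* ES1 pops ULT1 */
    /* ULT1 suspends here; later ES2 resumes it by pushing to Pool1. */
    bool resumed = toy_pool_push(&pool1, 2, 1);

    return pushed && popped && ult == 1 && !resumed;
}
```

In this sketch replay_scenario() returns true, i.e., the push issued on behalf of ES2 is the one that trips the single-producer check.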


1. Is this behavior by design?


For now, this is a known problem, and it is the user's responsibility to deal with it (i.e., using SPMC correctly is challenging).

It can happen even without using a custom pool if the producer/consumer error check is enabled.


2. Workarounds?


2A. Use ABT_unit_set_associated_pool somewhere


There's no good place to put it, since everything happens in Argobots.

Even if you use ABT_self_suspend, don't change the associated pool of the suspended ULT (though I haven't confirmed it).


2B. Disable the suspend-resume optimization in ABT_thread_join and ABT_mutex_lock


For mutex: --enable-simple-mutex might help.

For join: no way to avoid it.

Adding "goto yield_based;" at thread.c line 426 can disable it (just FYI).


2C. Use an MPMC pool.


3. Workaround in your case


The following is based on my guess.

If you are implementing a scalable (maybe Cilk-like) work stealing queue, you can create a single MPMC pool, which internally contains multiple SPMC queues (per execution stream).

You can differentiate the caller by execution stream rank (ABT_xstream_self_rank) so that you can always push into its local queue.

I think it is the most beautiful workaround at present.
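
A minimal sketch of that layout, assuming a fixed number of execution streams and plain arrays in place of real lock-free queues. sharded_pool, local_queue and the functions below are hypothetical names; the caller_rank argument stands in for the value that ABT_xstream_self_rank would report inside a real custom pool, and the code is single-threaded, so it shows only the routing, not the synchronization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_STREAMS 4
#define QUEUE_CAPACITY 16

/* One local queue per execution stream (hypothetical type). */
typedef struct {
    int units[QUEUE_CAPACITY]; /* units[0] is the oldest entry */
    size_t count;
} local_queue;

/* What the runtime would see as one MPMC pool. */
typedef struct {
    local_queue queues[NUM_STREAMS];
} sharded_pool;

static void sharded_pool_init(sharded_pool *pool)
{
    for (int i = 0; i < NUM_STREAMS; i++)
        pool->queues[i].count = 0;
}

/* Always push into the caller's own queue, so each inner queue
 * keeps exactly one producer no matter who resumes the ULT. */
static void sharded_pool_push(sharded_pool *pool, int caller_rank, int ult_id)
{
    assert(caller_rank >= 0 && caller_rank < NUM_STREAMS);
    local_queue *q = &pool->queues[caller_rank];
    assert(q->count < QUEUE_CAPACITY);
    q->units[q->count++] = ult_id;
}

/* Pop the newest local unit; if the local queue is empty,
 * steal the oldest unit from the next non-empty queue. */
static bool sharded_pool_pop(sharded_pool *pool, int caller_rank, int *ult_id)
{
    assert(caller_rank >= 0 && caller_rank < NUM_STREAMS);
    local_queue *q = &pool->queues[caller_rank];
    if (q->count > 0) {
        *ult_id = q->units[--q->count];
        return true;
    }
    for (int i = 1; i < NUM_STREAMS; i++) {
        local_queue *victim = &pool->queues[(caller_rank + i) % NUM_STREAMS];
        if (victim->count > 0) {
            *ult_id = victim->units[0];
            for (size_t j = 1; j < victim->count; j++)
                victim->units[j - 1] = victim->units[j];
            victim->count--;
            return true;
        }
    }
    return false;
}
```

With this routing, an execution stream that resumes a ULT created elsewhere pushes it into its own queue instead of the creator's queue, which is the point of the workaround.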


4. Misc


To address this issue, a natural extension of Argobots would be to call different push/pop functions depending on the context.


Typically, work stealing queues need to differentiate them:

- push (when creating a thread), push (when suspending a thread, e.g., yield (*)), and push (when pushing back to the pool, e.g., set_ready)

- pop (locally) and pop (remotely)


(*) About push (suspend): In general, work stealing queues do not work if you limit local push and pop only to/from the bottom.

For example, ES will reschedule the same ULT after ABT_thread_yield().


Currently, Argobots does not distinguish them. We are happy to discuss it.
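
To illustrate the distinction, here is a hedged single-threaded sketch of a deque whose push takes a context argument. The push_context enum and the toy_deque functions are hypothetical and do not exist in Argobots; sending a yielded unit to the top is just one possible policy chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEQUE_CAPACITY 16

/* Why a unit is being pushed (hypothetical, not an Argobots type). */
typedef enum { PUSH_CREATE, PUSH_YIELD, PUSH_RESUME } push_context;

typedef struct {
    int units[DEQUE_CAPACITY]; /* units[0] is the top, units[count-1] the bottom */
    size_t count;
} toy_deque;

static void toy_deque_init(toy_deque *dq)
{
    dq->count = 0;
}

/* Created and resumed units go to the bottom; a yielded unit goes
 * to the top so that the next local pop picks something else. */
static void toy_deque_push(toy_deque *dq, int ult_id, push_context ctx)
{
    assert(dq->count < DEQUE_CAPACITY);
    if (ctx == PUSH_YIELD) {
        for (size_t i = dq->count; i > 0; i--)
            dq->units[i] = dq->units[i - 1];
        dq->units[0] = ult_id;
    } else {
        dq->units[dq->count] = ult_id;
    }
    dq->count++;
}

/* Local pop: the owner takes from the bottom. */
static bool toy_deque_pop_local(toy_deque *dq, int *ult_id)
{
    if (dq->count == 0)
        return false;
    *ult_id = dq->units[--dq->count];
    return true;
}

/* Remote pop: a thief takes from the top. */
static bool toy_deque_pop_remote(toy_deque *dq, int *ult_id)
{
    if (dq->count == 0)
        return false;
    *ult_id = dq->units[0];
    for (size_t i = 1; i < dq->count; i++)
        dq->units[i - 1] = dq->units[i];
    dq->count--;
    return true;
}
```

If the yielding unit were pushed with PUSH_CREATE instead, the following toy_deque_pop_local would return the same unit again, which is the rescheduling problem described in (*).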


If I am misunderstanding and/or you have any questions, please feel free to send an e-mail (or post a GitHub issue).


Best Regards,

Shintaro Iwasaki


________________________________
From: POLYKARPOS THOMADAKIS via discuss <discuss at lists.argobots.org>
Sent: Thursday, November 1, 2018 10:25:17 AM
To: discuss at argobots.org
Cc: POLYKARPOS THOMADAKIS
Subject: [argobots-discuss] Possible issue

Hello there,

I am experiencing the following issue using Argobots, and I'm not sure whether it is a bug or an implicit assumption made by Argobots.
Here is the issue:

I have a set of custom pools, 1 per Execution Stream.

In each of the pools, only the associated execution stream can push while other streams can grab a unit from other streams' pool safely.
Those are the characteristics of an SPMC pool, so that's the type I specify on their creation.

The underlying data structure is a lock-free cyclic deque where the pushing is done at the bottom, the popping (where the owner stream of the pool grabs a unit)
also at the bottom, and the stealing (where a stream grabs a unit from the pools of another stream) from the top.

In this way, when a ULT is created it is always pushed to the pool of the creating stream and can be stolen by other streams safely. The application works perfectly
when just creating and executing ULTs; the problem occurs when I need the join functionality. The workflow is as follows:

1) ULT 0 spawns its children ULTs in ES 0
2) ES 1 steals one (or more) of the ULTs from ES 0
3) ULT 0 on ES 0 joins one of its children -> Argobots suspends its execution
4) ES 1 finishes executing the stolen ULT. This causes Argobots to try to awaken the blocked ULT 0 by pushing it back to the pool of its last stream,
   which is ES 0.

And here is the problem, ES 1 awakens ULT 0, pushes it to the pool of ES 0, thus, breaking the rule that only the associated stream of a pool is allowed to push to it.
Since the user has no API to define which pool a unit shall be pushed to in such situations, I believed that by setting the pool type to SPMC Argobots would take care
of this.

The lines that produce this issue:

arch/abtd_thread.c:88 -> Terminating thread awakens blocked joiner
thread.c:2017 -> Joiner is pushed to the last pool it ran before blocking, causing one stream to push to the pools of another

My question is whether this behavior is the expected one or not. In other words, if the user is expected to take into consideration this behavior when designing his/her
custom pools.

Sorry for the long email and thank you for your time.

Best,
Polykarpos