[argobots-discuss] scheduling with priority for resumed ULTs

Thu Aug 13 13:58:41 CDT 2020

Apparently when I said "next week" what I really meant was "next month" 🙂

Anyhow, here is a slightly cleaned up version integrated into an experimental branch of margo for reference:

https://xgitlab.cels.anl.gov/sds/margo/-/merge_requests/29

I made one functional change, which was to turn the flag indicating if a unit had been executed or not into a counter.  This was because I realized that I didn't want persistent (forever running) ULTs to always get a priority boost in the Margo case.  We have at least one ULT running that drives Mercury progress, and sometimes extra ones to monitor various performance metrics.  This version hacks around this by only putting things in the high priority queue in the pool if they have been executed between 1 and THRESHOLD number of times.
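
For readers following along, the counter rule described above could be sketched roughly as follows. This is only an illustration and not the code in the merge request: `unit_desc`, `select_queue`, and `PRIO_THRESHOLD` are hypothetical names, and the threshold value is an arbitrary assumption.

```c
#include <assert.h>

/* Hypothetical per-ULT descriptor; the experimental branch may track
 * this differently. */
struct unit_desc {
    unsigned exec_count; /* how many times the unit has been scheduled */
};

enum queue_id { QUEUE_LOW = 0, QUEUE_HIGH = 1 };

/* Assumed cut-off, chosen only for illustration. */
#define PRIO_THRESHOLD 8u

/* Decide which internal queue a unit is pushed to.
 * Never-run units (count 0) and long-running units (count above the
 * threshold) go to the low-priority queue; everything else is boosted. */
static enum queue_id select_queue(const struct unit_desc *d)
{
    assert(d != 0);
    if (d->exec_count >= 1 && d->exec_count <= PRIO_THRESHOLD)
        return QUEUE_HIGH;
    return QUEUE_LOW;
}
```

With a rule of this shape, a persistent progress ULT stops receiving the boost once its count passes the threshold, while a short-lived request handler that was suspended a few times keeps it.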

The code still needs to be cleaned up a little more, but it's sufficient for testing in the above form.  Unfortunately it didn't make any measurable difference for a pet performance tuning problem I've been working on (I believe it has other limiting factors).  We have a mechanism to swap this build in and out in spack configurations now, though, so we'll be able to try it in some other scenarios in the future.

Thanks again for the example code!

thanks,
-Phil

________________________________
From: Carns, Philip H. <carns at mcs.anl.gov>
Sent: Friday, July 10, 2020 8:46 AM
To: Iwasaki, Shintaro <siwasaki at anl.gov>; discuss at lists.argobots.org <discuss at lists.argobots.org>
Subject: Re: scheduling with priority for resumed ULTs

Neat- thank you Shintaro!

I think this would accomplish what I was asking about.  I'll see if I can try it out with some data service examples next week.

thanks,
-Phil
________________________________
From: Iwasaki, Shintaro <siwasaki at anl.gov>
Sent: Thursday, July 9, 2020 6:15 PM
To: Carns, Philip H. <carns at mcs.anl.gov>; discuss at lists.argobots.org <discuss at lists.argobots.org>
Subject: Re: scheduling with priority for resumed ULTs

> This would be done at pool creation time by passing a pointer to a cond_t into the void* data for the pool, right?
Yes. In the current implementation, ABT_pool_config (actually a pointer) should be used for this purpose: it is directly passed to pool_init().

FWIW, I implemented the priority queue we discussed (this solution is classified as single execution stream + one pool + ABT_pool_pop_timedwait()).
Please find the code attached. This might be helpful.
The implementation is not very sophisticated, so you might want to improve it with some application-specific knowledge.

gcc -std=c99 example_prio.c -lpthread -labt
./a.out -c 0 (custom_pool is disabled)
=> execution is FIFO-order
[ES 1] execute tid = 0 (0th yield)
[ES 1] execute tid = 1 (0th yield)
[ES 1] execute tid = 2 (0th yield)
[ES 1] execute tid = 3 (0th yield)
[ES 1] execute tid = 0 (after 1st yield)
[ES 1] execute tid = 1 (after 1st yield)
[ES 1] execute tid = 2 (after 1st yield)
[ES 1] execute tid = 3 (after 1st yield)
[ES 1] execute tid = 0 (after 2nd yield)
[ES 1] execute tid = 1 (after 2nd yield)

./a.out -c 1 (custom_pool is enabled)
=> yielded threads (or threads that have been already scheduled once) are prioritized
[ES 1] execute tid = 0 (0th yield)
[ES 1] execute tid = 0 (after 1st yield)
[ES 1] execute tid = 0 (after 2nd yield)
[ES 1] execute tid = 1 (0th yield)
[ES 1] execute tid = 1 (after 1st yield)
[ES 1] execute tid = 1 (after 2nd yield)
[ES 1] execute tid = 2 (0th yield)
[ES 1] execute tid = 2 (after 1st yield)
[ES 1] execute tid = 2 (after 2nd yield)
...
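
The two orderings above can be reproduced without Argobots by a small simulation. The sketch below is not the attached example_prio.c; it is a plain C stand-in under the assumption that each simulated thread yields a fixed number of times, and all names in it (`fifo`, `simulate`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_UNITS 64

/* Fixed-capacity FIFO of thread ids (a stand-in for a pool's unit list). */
struct fifo {
    int items[MAX_UNITS];
    size_t head, count;
};

static void fifo_push(struct fifo *q, int tid)
{
    assert(q->count < MAX_UNITS);
    q->items[(q->head + q->count) % MAX_UNITS] = tid;
    q->count++;
}

/* Returns the oldest tid, or -1 if the queue is empty. */
static int fifo_pop(struct fifo *q)
{
    int tid;
    if (q->count == 0)
        return -1;
    tid = q->items[q->head];
    q->head = (q->head + 1) % MAX_UNITS;
    q->count--;
    return tid;
}

/* Run nthreads simulated ULTs that each yield nyields times.
 * With prioritize == 0 every push goes to one FIFO; otherwise units that
 * have already run are pushed to a second queue that is popped first.
 * The tid of each scheduling step is stored in order[]; the number of
 * steps is returned. */
static size_t simulate(int nthreads, int nyields, int prioritize,
                       int *order, size_t cap)
{
    struct fifo fresh = {{0}, 0, 0}, resumed = {{0}, 0, 0};
    int remaining[MAX_UNITS];
    size_t n = 0;
    int tid;

    assert(nthreads <= MAX_UNITS);
    for (tid = 0; tid < nthreads; tid++) {
        remaining[tid] = nyields;
        fifo_push(&fresh, tid);
    }
    for (;;) {
        tid = fifo_pop(&resumed);
        if (tid < 0)
            tid = fifo_pop(&fresh);
        if (tid < 0)
            break; /* both queues are empty */
        assert(n < cap);
        order[n++] = tid;
        if (remaining[tid]-- > 0) /* the unit yields and is pushed back */
            fifo_push(prioritize ? &resumed : &fresh, tid);
    }
    return n;
}
```

Calling `simulate(4, 2, 0, ...)` gives the round-robin order of the first log, and `simulate(4, 2, 1, ...)` gives the run-to-completion order of the second.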

If you encounter any implementation/performance issues, please let us know.

Thanks,
Shintaro Iwasaki

________________________________
From: Carns, Philip H. <carns at mcs.anl.gov>
Sent: Thursday, July 9, 2020 12:40 PM
To: Iwasaki, Shintaro <siwasaki at anl.gov>; discuss at lists.argobots.org <discuss at lists.argobots.org>
Subject: Re: scheduling with priority for resumed ULTs

Right, this makes sense, but I was referring to how you set up the old_thread_pool and new_thread_pool initially, so that they both know to signal the same condition variable.  This would be done at pool creation time by passing a pointer to a cond_t into the void* data for the pool, right?

This would actually probably be fine.  We could do this in our own components that already set non-default schedulers without the user seeing the difference.  If a data service defined its own pools for some other purpose they would make their own choice about what scheduler to use.

thanks,
-Phil
________________________________
From: Iwasaki, Shintaro <siwasaki at anl.gov>
Sent: Thursday, July 9, 2020 10:13 AM
To: Carns, Philip H. <carns at mcs.anl.gov>; discuss at lists.argobots.org <discuss at lists.argobots.org>
Subject: Re: scheduling with priority for resumed ULTs

Good morning, Phil,

The difference between ABT_key_set() and ABT_thread_set_specific()
You're right.  ABT_key_set() is for the currently running work unit (something like "ABT_self_set_specific()") while ABT_thread_set_specific() can target any work unit as an argument.  ABT_thread_set_specific() has no counterpart in Pthreads (ABT_key_set() corresponds to pthread_setspecific()).

Note that the overall performance of ABT_key (especially new entry creation) was significantly improved after the 1.0 release. Rough performance overheads of these operations are given here: https://github.com/pmodels/argobots/pull/201.

The "multiple pool one ES" approach
This can be implemented in a scheduler, so I believe the user interface will be the same.
This idea needs special user-defined pools, though. A more complete version of the idea is:

sched_run():
  /* new_thread_pool and old_thread_pool use user-defined pools that
   * share the same pthread_mutex_t and pthread_cond_t. */
  int work_count = 0;
  while (1) {
    if (unit = ABT_pool_pop_wait(old_thread_pool, 0.1[s])) {
      /* Prioritize resumed/yielded threads */
      ABT_xstream_run_unit(unit, old_thread_pool);
      /* Every 128 iterations, check new_thread_pool to avoid a deadlock
       * (see the previous mail). */
      if (work_count++ % 128 != 0) continue;
    }
    if (unit = ABT_pool_pop_wait(new_thread_pool, 0.1[s])) {
      ABT_unit_set_associated_pool(unit, old_thread_pool);
      /* Threads are moved to old_thread_pool, so if this "unit" suspends or yields, it is
       * pushed to old_thread_pool, which will be prioritized over new threads. */
      ABT_xstream_run_unit(unit, old_thread_pool);
    }
  }

/* creation */
ABT_thread_create(... , new_thread_pool, ... , &thread);
/* synchronization */
ABT_eventual_wait(...); /* no change */
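
The "share the same pthread_mutex_t and pthread_cond_t" part of the pseudocode above might look roughly like the following in plain Pthreads. This is a sketch under assumptions rather than an Argobots pool: units are plain ints, the two queues live in one hypothetical `dual_pool` struct, and the 128-iteration fairness check from the pseudocode is left out for brevity.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

#define QCAP 128

struct queue {
    int items[QCAP];
    int head, count;
};

/* Two queues guarded by one mutex and one condition variable, so a single
 * blocked consumer is woken by a push to either queue. */
struct dual_pool {
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
    struct queue old_q; /* resumed/yielded units */
    struct queue new_q; /* units that have never run */
};

static void dual_pool_init(struct dual_pool *p)
{
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->nonempty, NULL);
    p->old_q.head = p->old_q.count = 0;
    p->new_q.head = p->new_q.count = 0;
}

static void dual_pool_push(struct dual_pool *p, int is_old, int unit)
{
    struct queue *q = is_old ? &p->old_q : &p->new_q;
    pthread_mutex_lock(&p->lock);
    assert(q->count < QCAP);
    q->items[(q->head + q->count) % QCAP] = unit;
    q->count++;
    pthread_cond_signal(&p->nonempty); /* same signal for both queues */
    pthread_mutex_unlock(&p->lock);
}

/* Caller must hold the lock. Returns 1 and stores a unit, or 0 if empty. */
static int queue_take(struct queue *q, int *unit)
{
    if (q->count == 0)
        return 0;
    *unit = q->items[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    return 1;
}

/* Pop one unit, preferring old_q; wait up to timeout_ms if both queues are
 * empty. Returns 1 on success and 0 on timeout. */
static int dual_pool_pop_timedwait(struct dual_pool *p, long timeout_ms,
                                   int *unit)
{
    struct timespec deadline;
    int got, stop = 0;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&p->lock);
    for (;;) {
        got = queue_take(&p->old_q, unit) || queue_take(&p->new_q, unit);
        if (got || stop)
            break;
        /* Any non-zero return (including ETIMEDOUT) ends the wait after
         * one last check of both queues. */
        stop = pthread_cond_timedwait(&p->nonempty, &p->lock, &deadline) != 0;
    }
    pthread_mutex_unlock(&p->lock);
    return got;
}
```

Because both queues signal the same condition variable, the consumer never sleeps on one queue while the other has work, which is the problem with timed-waiting on two independent pools.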

Thanks,
Shintaro

________________________________
From: Carns, Philip H. <carns at mcs.anl.gov>
Sent: Thursday, July 9, 2020 8:00 AM
To: Iwasaki, Shintaro <siwasaki at anl.gov>; discuss at lists.argobots.org <discuss at lists.argobots.org>
Subject: Re: scheduling with priority for resumed ULTs

Thanks for the discussion Shintaro!

Just to clarify, what is the difference between ABT_key_set() and ABT_thread_set_specific()?  Do they do the same thing, except that the latter lets you target a ULT other than the currently running one?

(FWIW, we use keys in Mochi to carry tracing identifiers across processes that relay RPC messages so that we can look at service dependencies.  It's been a nice feature for us in that context, and we've been able to tolerate the key function overheads Ok so far.)

For the "multiple pool one ES" approach, would the idea be for the caller (atop Argobots) to set up the shared condition variable using the {get|set}_data functions for the pools?  We did something similar in an early prototype of the _wait pool/scheduler external to Argobots (before the "wait" variants of the pop functions were added), but it was a little bit awkward for users.

I like the design cleanliness of the "multiple ES" strategy, but we probably would not try that.  Some of the network/disk resources we interact with are sensitive to contention and context switches if we access them from multiple execution streams, so I would be hesitant to make it required in our stack.

thanks,
-Phil
________________________________
From: Iwasaki, Shintaro <siwasaki at anl.gov>
Sent: Wednesday, July 8, 2020 10:20 AM
To: Carns, Philip H. <carns at mcs.anl.gov>; discuss at lists.argobots.org <discuss at lists.argobots.org>
Subject: Re: scheduling with priority for resumed ULTs

Hello Phil,

Thank you. I understand the situation better now. In my understanding, all of the following options can be implemented in the current Argobots. Some are less invasive, while others are easier to implement.

- Single execution stream + one pool + ABT_pool_pop_timedwait()
> For that to work, though, the pool implementation would need to be able to inspect the ABT_unit at push() time and tell whether it is a newly created thread or a resumed thread so that it could track them separately.
Without https://github.com/pmodels/argobots/issues/154, one can use a flag stored in a user-created descriptor corresponding to the thread to check whether that thread has already been executed.  A hash table is a general solution, but it would be heavy. In some applications, such a descriptor can be obtained via ABT_thread_get_arg().

Another way is to use a ULT-specific value (e.g., ABT_thread_get_specific()) to manage such a flag.  A quick hack is using `ABT_thread_set_arg()` and `ABT_thread_get_arg()` to manage an execution flag, which may be faster than ABT_thread_set_specific() and ABT_thread_get_specific() in the current Argobots implementation (related to https://github.com/pmodels/argobots/issues/159).

This idea is less invasive, but implementing a correct and reasonably scalable pool with flag management might not be an easy task.

- Single execution stream + multiple pools + ABT_pool_pop_timedwait()

Even if you use two pools (for example, with the scheduler I suggested in the previous mail), it should work well if these two pools (old-thread-pool and new-thread-pool) share the same Pthreads mutex/condition variable. The required change to the pool implementation can be minimal.

- Multiple execution streams + each has one pool + ABT_pool_pop_timedwait()

The easiest way that does not change the pool implementation is to use multiple execution streams: some for newly created threads (these execution streams only check new-thread-pool) and the others for suspended threads (these execution streams only check old-thread-pool). The oversubscription cost should not be very high if these execution streams go to sleep immediately when no work is available. This does not require changing the pool implementation, but it is very invasive.

I would also like to note that there is presently no good example of a custom pool implementation in Argobots, so it is hard to show how to implement one in a reasonably scalable way.  I will add a reasonable example in one to two days, which might be helpful.

Thanks,
Shintaro

________________________________
From: Carns, Philip H. <carns at mcs.anl.gov>
Sent: Wednesday, July 8, 2020 7:59 AM
To: Iwasaki, Shintaro <siwasaki at anl.gov>; discuss at lists.argobots.org <discuss at lists.argobots.org>
Subject: Re: scheduling with priority for resumed ULTs

That's interesting.

For us the issue of how to block on two pools would be a problem.  I don't think we have any application-specific rules that would help; either pool could receive new work while the scheduler is blocked on a pop.

Something along the lines of #154 that would allow some control over what happens within the pool data structure would be helpful, but it's not a high priority.

In the meantime (since our use case is so simple) I wonder if we could do something within the confines of the current pool interface.  The linked list pointers are not exposed to the caller (right?), so nothing is stopping a pool from maintaining multiple linked lists internally if it wants to.  Multiple work unit queues within a single pool could share a single internal condition/signalling mechanism for blocking pop calls.

For that to work, though, the pool implementation would need to be able to inspect the ABT_unit at push() time and tell whether it is a newly created thread or a resumed thread so that it could track them separately.

Is there any way to do that?  It might not be a great idea from a software engineering perspective for a pool to dig too deep into the unit or thread data structures, but if there were something in there that could indicate if a thread had ever been run or not, then we could hack it as a proof of concept to see if it makes a performance difference before spending time on something more invasive.

thanks,
-Phil

________________________________
From: Iwasaki, Shintaro <siwasaki at anl.gov>
Sent: Tuesday, July 7, 2020 6:44 PM
To: discuss at lists.argobots.org <discuss at lists.argobots.org>
Cc: Carns, Philip H. <carns at mcs.anl.gov>
Subject: Re: scheduling with priority for resumed ULTs

Hello, Phil,

Thank you for your excellent question.  The current Argobots does not provide a very straightforward way to do this.

1. The simplest idea

In my opinion, the easiest way is to use two pools, new-thread-pool and old-thread-pool.
New threads/tasklets are pushed to new-thread-pool.  The user-defined scheduler looks like the following:

sched_run():
  while (1) {
    if (unit = ABT_pool_pop(old_thread_pool)) {
      /* Prioritize resumed/yielded threads */
      ABT_xstream_run_unit(unit, old_thread_pool);
      continue;
    }
    if (unit = ABT_pool_pop(new_thread_pool)) {
      ABT_unit_set_associated_pool(unit, old_thread_pool);
      /* Threads are moved to old_thread_pool, so if this "unit" suspends or yields, it is
       * pushed to old_thread_pool, which will be prioritized over new threads. */
      ABT_xstream_run_unit(unit, old_thread_pool);
    }
  }

However, this scheduler may cause a deadlock with certain dependencies.  In the example below, thread2 is never scheduled because thread1 keeps yielding back into old_thread_pool, which the scheduler always checks first.

int g_flag = 0;
void thread1(void *arg) {
  /* The newly created thread is pushed to new_thread_pool. */
  ABT_thread_create(new_thread_pool, thread2, ...);
  /* thread1 was associated with old_thread_pool when it was scheduled for
   * the first time, so every yield pushes it back to old_thread_pool. */
  while (g_flag == 0)
    ABT_thread_yield();
}
void thread2(void *arg) {
  g_flag = 1;
}

To avoid this, the scheduler can sometimes check and run threads in new_thread_pool (for example, every N iterations).
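
The effect of the "every N iterations" check can be seen in a toy simulation of the thread1/thread2 dependency. This is a plain C sketch with no Argobots calls; `run_scheduler` and its parameters are hypothetical, and the pools are reduced to two boolean flags.

```c
#include <assert.h>

/* Simulate the thread1/thread2 dependency under a two-pool scheduler.
 * thread1 sits in the old pool and yields until the flag is set;
 * thread2 sits in the new pool and sets the flag when it runs.
 * check_interval == 0 means strict priority: the new pool is never checked
 * while the old pool has work. Returns the number of scheduler iterations
 * until both threads finish, or -1 if max_iters is reached first. */
static int run_scheduler(int check_interval, int max_iters)
{
    int flag = 0;
    int old_has_thread1 = 1; /* thread1 already ran once and yielded */
    int new_has_thread2 = 1; /* thread2 was created but never ran */
    int iter;

    assert(check_interval >= 0 && max_iters > 0);
    for (iter = 1; iter <= max_iters; iter++) {
        /* Periodically give the new pool a turn even if old has work. */
        int force_new = check_interval > 0 && iter % check_interval == 0;

        if (old_has_thread1 && !(force_new && new_has_thread2)) {
            /* Run thread1: it finishes only once the flag is set;
             * otherwise it yields and stays in the old pool. */
            if (flag)
                old_has_thread1 = 0;
        } else if (new_has_thread2) {
            /* Run thread2: it sets the flag and terminates. */
            flag = 1;
            new_has_thread2 = 0;
        }
        if (!old_has_thread1 && !new_has_thread2)
            return iter;
    }
    return -1; /* thread2 was starved for the whole run */
}
```

With `check_interval` set to 0 the simulation never completes, which mirrors the deadlock described above; with any positive interval, thread2 eventually gets a turn and thread1 can finish on its next run.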

2. Does it work with ABT_pool_pop_timedwait() (i.e., ABT_POOL_FIFO_WAIT)?

ABT_pool_pop_timedwait() only takes a single pool; users cannot timed-wait on multiple pools.  Consider using ABT_pool_pop_timedwait() instead of ABT_pool_pop() in the scheduler I mentioned above.  In general, the scheduler may end up sleeping in a timed wait on either old_thread_pool or new_thread_pool even though the other pool has threads ready to run.  If there is application-specific knowledge (e.g., old_thread_pool can be empty only when new_thread_pool is empty), ABT_pool_pop_timedwait() + the scheduling strategy above is a good idea, though.

For now, there is no general solution.  One idea is to use more execution streams: some ESs are dedicated to new-thread-pool and the others to old-thread-pool.  If they sleep in ABT_pool_pop_timedwait(), the performance penalty of oversubscription should be small.

Creating a customized pool is another way (e.g., marking a thread when it is scheduled for the first time and managing newly created threads and suspended threads separately within a pool), but it is complicated.

The fundamental solution should be to allow different pool operations for yield/create/suspend/... (e.g., push to the head of the list on creation but push to the tail of the list on suspension: https://github.com/pmodels/argobots/issues/154), but this is still under development.  If this option is the most promising, I will prioritize it.

If you have any questions, please let us know.

Thanks,
Shintaro

________________________________
From: Carns, Philip H. via discuss <discuss at lists.argobots.org>
Sent: Tuesday, July 7, 2020 4:40 PM
To: discuss at lists.argobots.org <discuss at lists.argobots.org>
Cc: Carns, Philip H. <carns at mcs.anl.gov>
Subject: [argobots-discuss] scheduling with priority for resumed ULTs

Hi all,

I thought this question may be of general interest so I am asking on the mailing list.

My understanding is that the default pool/scheduler combination uses FIFO ordering.  Suppose we wanted to try a slight variation: FIFO ordering, but with resumed ULTs always taking priority over new ULTs that have not yet begun execution.

The use case for this would be for a data service to expedite requests that are already in progress (and were suspended while waiting on disk or network activity) to try to get them out of the system before starting to process new requests, assuming that there is work available in either category.  We create a new ULT for every incoming request.  Under heavy client process load it is plausible that the final step(s) of servicing an existing request could be delayed behind newly incoming requests, but we haven't confirmed this empirically yet.

What would be the easiest way to accomplish this?  I think I can find a way to do it, but it probably would not be the cleverest solution 🙂

FWIW we usually use ABT_POOL_FIFO_WAIT and ABT_SCHED_BASIC_WAIT rather than the default pool and scheduler, but I don't think that should change anything.  They are based on the default pool and scheduler and only differ in terms of their idle behavior.

thanks!
-Phil