[argobots-discuss] modifying scheduler event frequency?

Phil Carns carns at mcs.anl.gov
Wed Apr 21 13:18:29 CDT 2021


That's all fantastic, Shintaro. Thank you for the updates!


-Phil


On 4/21/21 10:39 AM, Iwasaki, Shintaro wrote:
> Hi Phil,
>
> You may already know this, but I wanted to let you know that:
> - The Argobots Spack package supports several new options, including 
> stack guard and libunwind settings (see 
> https://github.com/spack/spack/pull/23133)
> - Argobots now supports an mprotect-based stack guard, which causes SEGV 
> when a ULT smashes a stack (see 
> https://github.com/pmodels/argobots/pull/327)
> - This mprotect-based mechanism should work on x86/64, ARM, and POWER 
> machines. I tested it on Linux (Debian/RedHat), FreeBSD, and Intel-based OSX 
> (see https://github.com/pmodels/argobots/pull/328).
>
> If you have any requests, suggestions, or bug reports, please let us know.
> (I am aware of the Spack issue related to Argobots. I plan to write a 
> quick patch tomorrow: https://github.com/spack/spack/issues/23168)
>
> Best,
> Shintaro
> ------------------------------------------------------------------------
> *From:* Carns, Philip H. <carns at mcs.anl.gov>
> *Sent:* Wednesday, April 14, 2021 4:01 PM
> *To:* Iwasaki, Shintaro <siwasaki at anl.gov>; discuss at lists.argobots.org 
> <discuss at lists.argobots.org>
> *Subject:* Re: [argobots-discuss] modifying scheduler event frequency?
>
>
> On 4/14/21 4:55 PM, Iwasaki, Shintaro wrote:
>> Hi Phil,
>>
>> Thanks. I now understand the bigger picture.
>>
>> > ABT_info_print_thread_stacks_in_pool()
>> I hope it works. Note that ABT_info_print_thread_stacks_in_pool() is not 
>> async-signal safe (ABT_info_trigger_print_all_thread_stacks() is the 
>> exception), so please don't call it in a signal handler.
>
>
> Ok, no problem.  We don't do much via signals in Mochi (almost all of 
> our control capabilities are triggered via RPCs that launch ULTs to do 
> the work).
>
>
>>
>> > We use argobots almost exclusively with spack at this point.
>> Many HPC users use Spack to build dependent libraries. I will add 
>> some debug options (including libunwind, stack guard, ...) as well as 
>> other major options to the Spack Argobots package. We are also 
>> implementing an mprotect-based stack guard option (which is not in 
>> Argobots 1.1, though).
>>
>> Overall, please give us a week or so in total.
>>
>> There is a lot of room for improvement in the debugging/profiling 
>> capabilities.
>> If you have any questions, requests, or suggestions, please feel 
>> free to tell us.
>
>
> Sounds great!  Debuggability seems to be the next frontier for our 
> project, so we'll probably be experimenting with more of these 
> capabilities as time goes on.  It's taken a little while for us to 
> recognize which design/debugging patterns would be most useful.
>
>
> thanks,
>
> -Phil
>
>>
>> Thanks,
>> Shintaro
>>
>>
>> ------------------------------------------------------------------------
>> *From:* Carns, Philip H. <carns at mcs.anl.gov>
>> *Sent:* Wednesday, April 14, 2021 3:42 PM
>> *To:* Iwasaki, Shintaro <siwasaki at anl.gov>; discuss at lists.argobots.org 
>> <discuss at lists.argobots.org>
>> *Subject:* Re: [argobots-discuss] modifying scheduler event frequency?
>>
>> Ah, thanks for the thorough information as always Shintaro :)
>>
>>
>> print_all_thread_stacks() was tempting because it would potentially 
>> encompass more (in the Mochi use case, it would pick up hypothetical 
>> pools created by higher level components that we don't have a 
>> reference to).  Based on the information in this email thread, 
>> though, I think I'm better off focusing on pools under our control so 
>> that I can use print_thread_stacks_in_pool().  This should work fine; 
>> I was just over-thinking the use case.  The pools are under our own 
>> control in the vast majority of configurations.
>>
>>
>> In the big picture, I was exploring this because of a bug report we 
>> have from one of our collaborators who is getting a nonsensical hang 
>> in a complex scenario that we can't easily reproduce or attach a 
>> debugger to.  I would like to be able to send an RPC to a process at 
>> an arbitrary point in time and dump what it is up to so that we can 
>> understand why it didn't complete something it was trying to do.
>>
>>
>> libunwind sounds great :)  I probably would have been asking about 
>> that next.
>>
>>
>> I guess I'll use this as an opportunity to request/suggest that the 
>> libunwind capability be added as a variant to the argobots spack 
>> package (along with a way to enable future mprotect / stack canary 
>> checks).
>>
>>
>> We use argobots almost exclusively with spack at this point.  Not 
>> that argobots itself is hard to compile manually, but it is often one 
>> of a large number of dependencies that we need to build, so it's best 
>> to just unify them in one packaging system.  It would be 
>> straightforward for us to set up an alternative environment yaml with 
>> various argobots debugging capabilities enabled for 
>> development/debugging purposes.
>>
>>
>> thanks!
>>
>> -Phil
>>
>>
>> On 4/14/21 3:57 PM, Iwasaki, Shintaro wrote:
>>> Hi Phil,
>>>
>>> Thanks for using Argobots!  The following are my answers to your 
>>> questions, along with some tips.
>>> We would appreciate it if you could share more information about 
>>> your workload and the purpose so that we can give you more specific 
>>> suggestions. Also, we welcome any feature requests and bug reports.
>>>
>>> 1. How to change a scheduler's event frequency?
>>> 1.1. Predefined scheduler
>>> First, there is no way to dynamically change the event frequency 
>>> (even if you hack ABT_sched or a pointer you used in 
>>> ABT_sched_get_data(), since event_freq is loaded into a local variable):
>>> https://github.com/pmodels/argobots/blob/main/src/sched/basic_wait.c#L102
>>> Currently, passing a special ABT_sched_config when you create a 
>>> scheduler is the cleanest (and the only) way to change the event 
>>> frequency.
>>> ```
>>> ABT_sched_config config;
>>> // The default value is 50:
>>> // https://github.com/pmodels/argobots/blob/main/src/arch/abtd_env.c#L13
>>> int new_freq = 16;
>>> // Pass the new frequency when creating the scheduler config.
>>> ABT_sched_config_create(&config, ABT_sched_basic_freq, new_freq,
>>>                         ABT_sched_config_var_end);
>>> ```
>>> 1.2. Custom scheduler
>>> You can call ABT_xstream_check_events() more frequently after 
>>> calling ABT_info_trigger_print_all_thread_stacks() (e.g., when a 
>>> global flag is on, a scheduler calls ABT_xstream_check_events() in 
>>> every iteration).
>>>
>>> 2. ABT_info_trigger_print_all_thread_stacks()
>>> ABT_info_trigger_print_all_thread_stacks() is designed for 
>>> deadlock/livelock detection, so if your program is just (extremely) 
>>> slow, ABT_info_trigger_print_all_thread_stacks() might not be the 
>>> right routine to try.
>>>
>>> > The first example I tried appeared to essentially defer dump until 
>>> shutdown.
>>> When one of your ULTs encounters a deadlock, the scheduling loop 
>>> might not be called. You might want to set a timeout for 
>>> ABT_info_trigger_print_all_thread_stacks(). For example, the 
>>> following test forcibly prints stacks after 3.0 seconds even if 
>>> some execution streams have not reached ABT_xstream_check_events():
>>> https://github.com/pmodels/argobots/blob/main/test/basic/info_stackdump2.c#L30
>>> This is dangerous because it can dump the stack of a running ULT, so 
>>> Argobots does not guarantee anything about the output, but it can 
>>> sometimes help in understanding a deadlock.
>>>
>>> ===
>>>
>>> 3. Some tips
>>> 3.1. gdb
>>> I would use gdb, if it is available, to investigate a 
>>> deadlock/performance issue. For example, if a program appears to hang, 
>>> I would attach a debugger to that process and see what's happening.
>>> 3.2. libunwind for ABT_info_trigger_print_all_thread_stacks()
>>> Unless you are an extremely skillful low-level programmer, I would 
>>> recommend you enable libunwind for better understanding of stacks. 
>>> By default, ABT_info_trigger_print_all_thread_stacks() dumps raw hex 
>>> stack data.
>>> 3.3. "occasionally tied up in system calls"
>>> I'm not sure whether this is happening in the Argobots runtime (Argobots 
>>> now uses futex for synchronization on external threads), but if you are 
>>> calling ABT_info_trigger_print_all_thread_stacks() in a signal 
>>> handler, please be aware that blocking calls (e.g., futex, 
>>> poll, or pthread_cond_wait) can be interrupted if a signal hits the process.
>>> (The Argobots synchronization implementation is aware of this and should 
>>> not be affected by an external signal. This property is thoroughly 
>>> tested: 
>>> https://github.com/pmodels/argobots/blob/main/test/util/abttest.c#L245-L287)
>>> Note that the user can also call 
>>> ABT_info_trigger_print_all_thread_stacks() on a normal thread 
>>> without any problem. It is simply implemented in an async-signal-safe 
>>> manner.
>>> 3.4. Stack dump
>>> ABT_info_print_thread_stacks_in_pool() is a less invasive way to 
>>> print stacks, especially if you know a list of pools. It prints 
>>> stacks immediately. Basically, 
>>> ABT_info_trigger_print_all_thread_stacks() sets a flag to call 
>>> ABT_info_print_thread_stacks_in_pool() for all pools after all the 
>>> execution streams stop in ABT_xstream_check_events().
>>>
>>> Thanks,
>>> Shintaro
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Phil Carns via discuss <discuss at lists.argobots.org>
>>> *Sent:* Wednesday, April 14, 2021 2:18 PM
>>> *To:* discuss at lists.argobots.org <discuss at lists.argobots.org>
>>> *Cc:* Carns, Philip H. <carns at mcs.anl.gov>
>>> *Subject:* [argobots-discuss] modifying scheduler event frequency?
>>>
>>> Hi all,
>>>
>>> Is there a clean way to change a scheduler's event frequency on the fly?
>>>
>>> Browsing the API, I see two possibilities:
>>>
>>>   * set it when the scheduler is first created (using
>>>     ABT_sched_basic_freq?)
>>>   * set it dynamically by manipulating the ABT_sched_get_data()
>>>     pointer, but this seems especially dangerous since the sched
>>>     data struct definition isn't public (i.e. it could cause memory
>>>     corruption if the internal struct def changed)
>>>
>>> For some context (in case there is a different way to go about this 
>>> entirely), I'm trying to figure out how to get 
>>> ABT_info_trigger_print_all_thread_stacks() to print information more 
>>> quickly, which IIUC relies on getting the active schedulers to call 
>>> get_events() sooner.
>>>
>>> I'm happy to add some explicit ABT_thread_yield() shortly after the 
>>> ABT_info_trigger_print_all_thread_stacks() to at least get the 
>>> calling ES to execute its scheduler loop immediately, but I think 
>>> that won't matter much if it doesn't trip the frequency counter when 
>>> I do it.
>>>
>>> Without this (at least with the _wait scheduler and threads that are 
>>> occasionally tied up in system calls) I think the stack dump is 
>>> likely to trigger too late to display what I'm hoping to capture 
>>> when I call it. The first example I tried appeared to essentially 
>>> defer the dump until shutdown.
>>>
>>> thanks!
>>>
>>> -Phil
>>>