[argobots-discuss] modifying scheduler event frequency?
Phil Carns
carns at mcs.anl.gov
Wed Apr 21 13:18:29 CDT 2021
That's all fantastic Shintaro, thank you for the updates!
-Phil
On 4/21/21 10:39 AM, Iwasaki, Shintaro wrote:
> Hi Phil,
>
> You may already know this, but I wanted to let you know that:
> - The Argobots Spack package supports several new options including
> stack guard and libunwind settings (see
> https://github.com/spack/spack/pull/23133)
> - Argobots now supports an mprotect-based stack guard, which causes a
> SEGV when a ULT smashes its stack (see
> https://github.com/pmodels/argobots/pull/327)
> - This mprotect-based mechanism should work on x86/64, ARM, and POWER
> machines. I tested it on Linux (Debian/RedHat), FreeBSD, and Intel-based
> OSX (see https://github.com/pmodels/argobots/pull/328).
>
> If you have any requests, suggestions, or bug reports, please let us know.
> (I am aware of the Spack issue related to Argobots. I plan to write a
> quick patch tomorrow: https://github.com/spack/spack/issues/23168)
>
> Best,
> Shintaro
> ------------------------------------------------------------------------
> *From:* Carns, Philip H. <carns at mcs.anl.gov>
> *Sent:* Wednesday, April 14, 2021 4:01 PM
> *To:* Iwasaki, Shintaro <siwasaki at anl.gov>; discuss at lists.argobots.org
> *Subject:* Re: [argobots-discuss] modifying scheduler event frequency?
>
>
> On 4/14/21 4:55 PM, Iwasaki, Shintaro wrote:
>> Hi Phil,
>>
>> Thanks. I understand the bigger picture now.
>>
>> > ABT_info_print_thread_stacks_in_pool()
>> I hope it works. Note that print_thread_stacks_in_pool() is not
>> async-signal safe (ABT_info_trigger_print_all_thread_stacks() is an
>> exception), so please don't call it in a signal handler.
>
>
> Ok, no problem. We don't do much via signals in Mochi (almost all of
> our control capabilities are triggered via RPCs that launch ULTs to do
> the work).
>
>
>>
>> > We use argobots almost exclusively with spack at this point.
>> Many HPC users use Spack to build dependent libraries. I will add
>> some debug options (including libunwind, stack guard, ...) as well as
>> other major options to the Spack Argobots package. We are also
>> implementing an mprotect-based stack guard option (which is not in
>> Argobots 1.1, though).
>>
>> Overall, please give us a week or so in total.
>>
>> There is a lot of room for improvement in the debugging/profiling
>> capabilities.
>> If you have any questions, requests, or suggestions, please feel
>> free to tell us.
>
>
> Sounds great! Debuggability seems to be the next frontier for our
> project, so we'll probably be experimenting with more of these
> capabilities as time goes on. It's taken a little while for us to
> recognize which design/debugging patterns would be most useful.
>
>
> thanks,
>
> -Phil
>
>>
>> Thanks,
>> Shintaro
>>
>>
>> ------------------------------------------------------------------------
>> *From:* Carns, Philip H. <carns at mcs.anl.gov>
>> *Sent:* Wednesday, April 14, 2021 3:42 PM
>> *To:* Iwasaki, Shintaro <siwasaki at anl.gov>;
>> discuss at lists.argobots.org
>> *Subject:* Re: [argobots-discuss] modifying scheduler event frequency?
>>
>> Ah, thanks for the thorough information as always Shintaro :)
>>
>>
>> print_all_thread_stacks() was tempting because it would potentially
>> encompass more (in the Mochi use case, it would pick up hypothetical
>> pools created by higher level components that we don't have a
>> reference to). Based on the information in this email thread,
>> though, I think I'm better off focusing on pools under our control so
>> that I can use print_thread_stacks_in_pool(). This should work fine;
>> I was just over-thinking the use case. The pools are under our own
>> control in the vast majority of configurations.
>>
>>
>> In the big picture, I was exploring this because of a bug report we
>> have from one of our collaborators who is getting a nonsensical hang
>> in a complex scenario that we can't easily reproduce or attach a
>> debugger to. I would like to be able to send an RPC to a process at
>> an arbitrary point in time and dump what it is up to so that we can
>> understand why it didn't complete something it was trying to do.
>>
>>
>> libunwind sounds great :) I probably would have been asking about
>> that next.
>>
>>
>> I guess I'll use this as an opportunity to request/suggest that the
>> libunwind capability be added as a variant to the argobots spack
>> package (along with a way to enable future mprotect / stack canary
>> checks).
>>
>>
>> We use argobots almost exclusively with spack at this point. Not
>> that argobots itself is hard to compile manually, but it is often one
>> of a large number of dependencies that we need to build, so it's best
>> to just unify them in one packaging system. It would be
>> straightforward for us to set up an alternative environment yaml with
>> various argobots debugging capabilities enabled for
>> development/debugging purposes.
>>
>>
>> thanks!
>>
>> -Phil
>>
>>
>> On 4/14/21 3:57 PM, Iwasaki, Shintaro wrote:
>>> Hi Phil,
>>>
>>> Thanks for using Argobots! Below are my answers to your
>>> questions, along with some tips.
>>> We would appreciate it if you could share more information about
>>> your workload and the purpose so that we can give you more specific
>>> suggestions. Also, we welcome any feature requests and bug reports.
>>>
>>> 1. How to change a scheduler's event frequency?
>>> 1.1. Predefined scheduler
>>> First, there is no way to dynamically change the event frequency,
>>> even if you hack ABT_sched or the data pointer obtained from
>>> ABT_sched_get_data(), since event_freq is loaded into a local variable:
>>> https://github.com/pmodels/argobots/blob/main/src/sched/basic_wait.c#L102
>>> Currently, passing a special ABT_sched_config when you create a
>>> scheduler is the cleanest and the only way to change the event
>>> frequency.
>>> ```
>>> ABT_sched_config config;
>>> /* The default value is 50; see
>>>  * https://github.com/pmodels/argobots/blob/main/src/arch/abtd_env.c#L13 */
>>> int new_freq = 16;
>>> /* Pass the frequency when the scheduler config is created. */
>>> ABT_sched_config_create(&config, ABT_sched_basic_freq, new_freq,
>>>                         ABT_sched_config_var_end);
>>> ```
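
To illustrate why overwriting the stored frequency after the scheduler has started has no effect, here is a minimal, self-contained model of a loop that loads event_freq into a local variable once. It is only a sketch of the idea described above, not Argobots source code: `model_sched_data`, `model_sched_run`, and their parameters are hypothetical names invented for this example.

```c
#include <assert.h>

/* Hypothetical stand-in for a scheduler's private data
 * (NOT the real Argobots struct). */
typedef struct {
    int event_freq;
} model_sched_data;

/*
 * Simplified scheduler loop: runs `work_units` units of work and returns
 * how many times it would have checked for events. If `change_at` >= 0,
 * the shared event_freq is overwritten with `late_freq` at that iteration
 * to mimic an attempt to change the frequency on the fly.
 */
static int model_sched_run(model_sched_data *data, int work_units,
                           int change_at, int late_freq)
{
    const int event_freq = data->event_freq; /* loaded once, before the loop */
    int work_count = 0;
    int event_checks = 0;

    assert(event_freq > 0);
    for (int i = 0; i < work_units; i++) {
        if (i == change_at)
            data->event_freq = late_freq; /* the loop never re-reads this */
        if (++work_count >= event_freq) {
            event_checks++; /* where a real scheduler would check events */
            work_count = 0;
        }
    }
    return event_checks;
}
```

In this model, a scheduler started with a frequency of 50 still performs only 4 event checks over 200 work units even if the stored field is overwritten with 1 after ten units, while a scheduler started with 16 performs 12.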
>>> 1.2. Custom scheduler
>>> You can call ABT_xstream_check_events() more frequently after
>>> calling ABT_info_trigger_print_all_thread_stacks() (e.g., when a
>>> global flag is on, a scheduler calls ABT_xstream_check_events() in
>>> every iteration).
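
The global-flag idea for a custom scheduler can be sketched with a similar self-contained model. Again, this is an assumption-laden illustration rather than real scheduler code: `g_dump_requested`, `model_request_dump`, and `model_custom_sched_run` are hypothetical names, and a real custom scheduler would call ABT_xstream_check_events() where the counter is incremented here.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical global flag, set by whoever requests a stack dump. */
static atomic_int g_dump_requested;

/* Would be called right after requesting the stack dump. */
static void model_request_dump(void)
{
    atomic_store(&g_dump_requested, 1);
}

/*
 * Simplified custom-scheduler loop. Returns the number of event checks
 * performed over `work_units` units of work. `request_at` is the iteration
 * at which the dump is requested (-1 for never).
 */
static int model_custom_sched_run(int event_freq, int work_units,
                                  int request_at)
{
    int work_count = 0;
    int event_checks = 0;

    assert(event_freq > 0);
    for (int i = 0; i < work_units; i++) {
        if (i == request_at)
            model_request_dump();
        work_count++;
        /* Check every iteration once the flag is on; otherwise fall back
         * to the usual frequency counter. */
        if (atomic_load(&g_dump_requested) || work_count >= event_freq) {
            event_checks++; /* a real scheduler would check events here */
            work_count = 0;
        }
    }
    return event_checks;
}
```

The design choice in this sketch is that the flag only adds checks and never suppresses the regular counter, so the scheduler behaves as before until a dump is requested.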
>>>
>>> 2. ABT_info_trigger_print_all_thread_stacks()
>>> ABT_info_trigger_print_all_thread_stacks() is designed for
>>> deadlock/livelock detection, so if your program is just (extremely)
>>> slow, ABT_info_trigger_print_all_thread_stacks() might not be the
>>> right routine to try.
>>>
>>> > The first example I tried appeared to essentially defer dump until
>>> shutdown.
>>> When one of your ULTs encounters a deadlock, the scheduling loop
>>> might not be called. You might want to set a timeout for
>>> ABT_info_trigger_print_all_thread_stacks(). For example, the
>>> following test will forcibly print stacks after 3.0 seconds even if
>>> some execution streams have not reached ABT_xstream_check_events().
>>> https://github.com/pmodels/argobots/blob/main/test/basic/info_stackdump2.c#L30
>>> This is dangerous because it can dump the stack of a running ULT, so
>>> Argobots does not guarantee anything about the output, but it can
>>> sometimes help in understanding a deadlock.
>>>
>>> ===
>>>
>>> 3. Some tips
>>> 3.1. gdb
>>> I would use gdb, if it is available, to check a
>>> deadlock/performance issue. For example, if a program appears to hang,
>>> I attach a debugger to that process and see what is happening.
>>> 3.2. libunwind for ABT_info_trigger_print_all_thread_stacks()
>>> Unless you are an extremely skillful low-level programmer, I would
>>> recommend you enable libunwind for better understanding of stacks.
>>> By default, ABT_info_trigger_print_all_thread_stacks() dumps raw hex
>>> stack data.
>>> 3.3. "occasionally tied up in system calls"
>>> I'm not sure if it's happening in the Argobots runtime (Argobots now
>>> uses futex for synchronization on external threads), but if you are
>>> calling ABT_info_trigger_print_all_thread_stacks() in a signal
>>> handler, please be aware that blocking calls (e.g., futex, poll, or
>>> pthread_cond_wait) can return early if a signal hits the process.
>>> (The Argobots synchronization implementation is aware of this and
>>> should not be affected by an external signal. This property is
>>> thoroughly tested:
>>> https://github.com/pmodels/argobots/blob/main/test/util/abttest.c#L245-L287)
>>> Note that the user can call
>>> ABT_info_trigger_print_all_thread_stacks() on a normal thread
>>> without any problem. It is simply implemented in an async-signal-safe
>>> manner.
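
As a generic POSIX illustration of the point about signals cutting blocking calls short (unrelated to Argobots internals), the sketch below leaves a signal pending and then lets pselect() unblock it during the wait, so the call fails with EINTR and has to be retried. All function and variable names here are hypothetical, and for brevity each retry restarts the full timeout instead of tracking the remaining time.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/select.h>
#include <time.h>

static volatile sig_atomic_t g_signals_seen = 0;

static void on_signal(int signo)
{
    (void)signo;
    g_signals_seen++;
}

/*
 * Waits `timeout_ms` with pselect(), retrying when a signal interrupts the
 * call. Returns the number of EINTR interruptions, or -1 on another error.
 */
static int wait_retrying_on_eintr(long timeout_ms, const sigset_t *wait_mask)
{
    struct timespec ts;
    int interruptions = 0;

    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;
    for (;;) {
        int rc = pselect(0, NULL, NULL, NULL, &ts, wait_mask);
        if (rc == 0)
            return interruptions; /* timed out normally */
        if (rc == -1 && errno == EINTR) {
            interruptions++; /* a signal cut the wait short; try again */
            continue;
        }
        return -1;
    }
}

/* Leaves SIGUSR1 pending, then lets pselect() unblock it during the wait. */
static int demo_signal_interrupts_wait(void)
{
    struct sigaction sa;
    sigset_t block, old, wait_mask;
    int interruptions;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;
    wait_mask = old;
    sigdelset(&wait_mask, SIGUSR1);

    raise(SIGUSR1); /* stays pending while SIGUSR1 is blocked */
    interruptions = wait_retrying_on_eintr(10, &wait_mask);

    sigprocmask(SIG_SETMASK, &old, NULL); /* restore the original mask */
    return interruptions;
}
```

Calling demo_signal_interrupts_wait() should report a single interruption followed by a normal timeout, which is the kind of early return that code waiting in a blocking call has to tolerate when signals are in play.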
>>> 3.4. Stack dump
>>> ABT_info_print_thread_stacks_in_pool() is a less invasive way to
>>> print stacks, especially if you know a list of pools. It prints
>>> stacks immediately. Basically,
>>> ABT_info_trigger_print_all_thread_stacks() sets a flag to call
>>> ABT_info_print_thread_stacks_in_pool() for all pools after all the
>>> execution streams stop in ABT_xstream_check_events().
>>>
>>> Thanks,
>>> Shintaro
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Phil Carns via discuss <discuss at lists.argobots.org>
>>> *Sent:* Wednesday, April 14, 2021 2:18 PM
>>> *To:* discuss at lists.argobots.org
>>> *Cc:* Carns, Philip H. <carns at mcs.anl.gov>
>>> *Subject:* [argobots-discuss] modifying scheduler event frequency?
>>>
>>> Hi all,
>>>
>>> Is there a clean way to change a scheduler's event frequency on the fly?
>>>
>>> Browsing the API, I see two possibilities:
>>>
>>> * set it when the scheduler is first created (using
>>> ABT_sched_basic_freq?)
>>> * set it dynamically by manipulating the ABT_sched_get_data()
>>> pointer, but this seems especially dangerous since the sched
>>> data struct definition isn't public (i.e. it could cause memory
>>> corruption if the internal struct def changed)
>>>
>>> For some context (in case there is a different way to go about this
>>> entirely), I'm trying to figure out how to get
>>> ABT_info_trigger_print_all_thread_stacks() to print information more
>>> quickly, which IIUC relies on getting the active schedulers to call
>>> get_events() sooner.
>>>
>>> I'm happy to add some explicit ABT_thread_yield() shortly after the
>>> ABT_info_trigger_print_all_thread_stacks() to at least get the
>>> calling ES to execute its scheduler loop immediately, but I think
>>> that won't matter much if it doesn't trip the frequency counter when
>>> I do it.
>>>
>>> Without this (at least with the _wait scheduler and threads that are
>>> occasionally tied up in system calls) I think the stack dump is
>>> likely to trigger too late to display what I'm hoping to capture
>>> when I call it. The first example I tried appeared to essentially
>>> defer dump until shutdown.
>>>
>>> thanks!
>>>
>>> -Phil
>>>