[argobots-discuss] how to debug a stack overrun in Argobots

Thu Feb 21 16:55:22 CST 2019

Hello All,

Thank you for your reports!

As one of the developers, I would like to summarize the current status of Argobots.
As pointed out, Argobots uses a 16 KB stack for each ULT by default.
At present, the following three approaches are reasonable ways to work around, find, or solve this issue:
0. Use the knowledge that, if Argobots is used alone, the problem "typically" shows up in ABT_finalize. The stack size can also be set for an individual ULT with ABT_thread_attr_set_stacksize.
1. Use a larger stack size (e.g., ABT_THREAD_STACKSIZE=$((4 * 1024 * 1024))) and see what happens.
2. [Uncertain] Use Valgrind with an Argobots built with --enable-valgrind (although it is extremely slow, so it is not practical for large applications).
For now, I think workaround 1 (using a larger stack size by default) is the best among them.

I haven't tried other tools, but I strongly believe that Argobots-unaware tools won't detect this problem; only Valgrind can detect it.

---

There are several issues:
1. Default stack size is too small
It might be too small to drive large system libraries (e.g., in a ULT used as a progress thread).
I'm not sure how much it should be increased or, first of all, whether we should increase it at all.
Increasing it does not solve the problem of "silent stack corruption", though.
In other words, if Argobots could detect stack overflow, users could react by increasing
the default stack size or the stack size of the specific thread that requires a large amount of stack.

2. Lack of stack overflow detection
For example, the following two techniques are often used:
- Stack canaries (lazy but cheap)
- mprotect (eager but expensive)
I will create a GitHub issue for further details if detection is preferable.
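
The two detection approaches listed above can be sketched in plain C, independent of Argobots. This is only a minimal sketch under stated assumptions, not how Argobots implements or would implement detection: the helper names (`canary_plant`, `canary_intact`, `guarded_stack_alloc`, `guarded_stack_free`), the canary value, and the canary length are hypothetical, and the code assumes a stack that grows toward lower addresses and a POSIX system with `mmap` and `mprotect`.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define CANARY_WORD  UINT64_C(0xC0FFEE15DEADBEEF) /* arbitrary value (assumption) */
#define CANARY_WORDS 8                            /* arbitrary length (assumption) */

/* Canary: write a known pattern at the lowest addresses of the stack,
 * which is where a downward-growing stack overflows first. */
static void canary_plant(void *stack_base)
{
    assert(stack_base != NULL);
    uint64_t *p = stack_base;
    for (size_t i = 0; i < CANARY_WORDS; i++)
        p[i] = CANARY_WORD;
}

/* Returns 1 if the pattern is intact, 0 if something overwrote it.
 * The check is lazy: corruption is noticed only when this is called. */
static int canary_intact(const void *stack_base)
{
    const uint64_t *p = stack_base;
    for (size_t i = 0; i < CANARY_WORDS; i++)
        if (p[i] != CANARY_WORD)
            return 0;
    return 1;
}

/* Round a requested size up to a whole number of pages. */
static size_t round_up_to_page(size_t size, size_t page)
{
    return (size + page - 1) / page * page;
}

/* Guard page: map one extra page below the stack and make it inaccessible,
 * so an overflow faults immediately instead of corrupting memory silently.
 * Returns the lowest usable address, or NULL on failure. */
static void *guarded_stack_alloc(size_t size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t usable = round_up_to_page(size, page);
    char *map = mmap(NULL, usable + page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (map == MAP_FAILED)
        return NULL;
    if (mprotect(map, page, PROT_NONE) != 0) {
        munmap(map, usable + page);
        return NULL;
    }
    return map + page;
}

/* Release a stack obtained from guarded_stack_alloc with the same size. */
static void guarded_stack_free(void *stack, size_t size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    munmap((char *)stack - page, round_up_to_page(size, page));
    /* Note: the mapping length includes the guard page. */
    munmap((char *)stack - page, round_up_to_page(size, page) + page);
}
```

The canary costs a few stores when the stack is created plus a check whenever the runtime chooses to look, so it reports corruption only after the fact. The guard page traps at the moment of the overflow, at the price of an extra page and additional system calls for every stack.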

3. Check if Valgrind works for this issue
If --enable-valgrind is set, Argobots registers ULT stacks with Valgrind.
It should work, but I haven't tested it yet.
Another problem is that --enable-valgrind degrades the performance of Argobots even when it is not run under Valgrind
(see https://github.com/pmodels/argobots/issues/78).

We would appreciate any feedback.

Thank You,
Shintaro Iwasaki

________________________________
From: Lombardi, Johann via discuss <discuss at lists.argobots.org>
Sent: Thursday, February 21, 2019 4:30 PM
To: discuss at lists.argobots.org
Cc: Lombardi, Johann; Liu, Xuezhao; Wang, Di
Subject: Re: [argobots-discuss] how to debug a stack overrun in Argobots

Hi Phil,

I think we hit the same issue recently on the DAOS side and had to bump the stack size as well. Wangdi & Xuezhao should know more.

Maybe a regression in ABT?

Cheers,

Johann

From: "Carns, Philip H. via discuss" <discuss at lists.argobots.org>
Reply-To: "discuss at lists.argobots.org" <discuss at lists.argobots.org>
Date: Thursday, 21 February 2019 at 15:50
To: "discuss at lists.argobots.org" <discuss at lists.argobots.org>
Cc: "Carns, Philip H." <carns at mcs.anl.gov>
Subject: Re: [argobots-discuss] how to debug a stack overrun in Argobots

Just to follow up a little bit; I realized from looking at README.envvar just now that the default value of ABT_THREAD_STACKSIZE is 16K.  That's almost certainly too low for us because we have ULTs that make calls into a variety of system libraries (including fairly big things like libfabric) that are beyond our control.

It seems likely that we will have to run with a larger stack size, but I would still like to have a better understanding of where the problem paths are, and how much head room we really need, if anyone has suggestions.
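
One generic way to estimate how much head room a code path needs is "stack painting": fill a deliberately oversized stack with a known byte, run the code on it, and then find how far the pattern was overwritten. The sketch below does this with POSIX `ucontext` rather than with Argobots, so it only illustrates the idea; `measure_stack_usage`, `sample_workload`, and the paint byte are hypothetical names and values, and the scan assumes a downward-growing stack. The result is a lower bound, because stack space that is reserved but never written is not counted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <ucontext.h>

#define PAINT_BYTE 0xA5 /* arbitrary fill pattern (assumption) */

/* Hypothetical workload: writes an 8 KB buffer that lives on the stack. */
static void sample_workload(void)
{
    volatile unsigned char buf[8192];
    for (size_t i = 0; i < sizeof buf; i++)
        buf[i] = 1;
}

/* Runs fn on a freshly painted stack of stack_size bytes and returns the
 * number of bytes between the top of the stack and the deepest byte that
 * was overwritten, or (size_t)-1 on failure. */
static size_t measure_stack_usage(void (*fn)(void), size_t stack_size)
{
    assert(fn != NULL && stack_size > 0);

    unsigned char *stack = malloc(stack_size);
    if (stack == NULL)
        return (size_t)-1;
    memset(stack, PAINT_BYTE, stack_size);

    ucontext_t caller_ctx, probe_ctx;
    if (getcontext(&probe_ctx) != 0) {
        free(stack);
        return (size_t)-1;
    }
    probe_ctx.uc_stack.ss_sp = stack;
    probe_ctx.uc_stack.ss_size = stack_size;
    probe_ctx.uc_link = &caller_ctx; /* resume the caller when fn returns */
    makecontext(&probe_ctx, fn, 0);

    /* Switch to the painted stack; control comes back here after fn. */
    if (swapcontext(&caller_ctx, &probe_ctx) != 0) {
        free(stack);
        return (size_t)-1;
    }

    /* Scan upward from the lowest address: the first byte that no longer
     * holds the paint pattern marks the deepest point that was written. */
    size_t untouched = 0;
    while (untouched < stack_size && stack[untouched] == PAINT_BYTE)
        untouched++;

    free(stack);
    return stack_size - untouched;
}
```

For example, `measure_stack_usage(sample_workload, 64 * 1024)` reports at least 8192 bytes, because the workload writes an 8 KB buffer. The stack passed in must be large enough that the workload does not overflow it, otherwise the measurement is meaningless.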

thanks!
-Phil

On 2019-02-21 15:31:53-05:00 Carns, Philip H. via discuss wrote:

Hi all,

There is a little bit of back story on https://github.com/pmodels/argobots/issues/93 , but to make a long story short, we have realized that we have some code that is overflowing the stack in Argobots.  Many thanks to Shintaro for his help and insight, or we might never have figured this out.  We can work around the problem with `export ABT_THREAD_STACKSIZE=$((1024 * 1024))`.  This not only fixes a Power8 test case for us, but also appears to solve a different frustrating, nonsensical segmentation fault that we've been chasing with a different code permutation on x86_64.

Any suggestions on how to track down what's triggering this in our code, or how to get a better idea of how much stack we need?  We are using a considerable number of libraries, many of which are not maintained by us, so I don't even know where to start looking yet.  My usual go-to tool for this would be ASan in gcc or clang, but I don't think that will work correctly with Argobots, and maybe there is a better solution anyway.

thanks,
-Phil
