<html><head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body>

    <p>This is fantastic, Shintaro.  That sounds like exactly what I was

      hoping for :)</p>

    <p>We have a benchmark that I think will work well for isolating

      this behavior; we'll try some experiments and let you know what we

      find.  I'll modify our benchmark to track memory consumption first

      so that we have the metric ready when we do parameter sweeps.<br>

    </p>

    <p>We do actually already constrain the stack cache size (this was

      necessary early on for us because of the large stacks we use), so

      we should be all set there.  We also have a custom pool that

      prioritizes completing existing ULTs before presenting new ones to

      the scheduler.  I think that might help us get a little more

      benefit out of the lazy stack allocation as well.<br>

    </p>
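    <p>A toy model of that intuition, in case it helps frame the
      experiments.  This is not our actual pool code and it does not
      call Argobots; the names peak_stacks and prefer_started are made
      up for illustration.  It assumes lazy stack allocation, a single
      worker, a burst of ULTs created up front, and ULTs that each
      yield exactly once before finishing.</p>
<pre>
```python
from collections import deque


def peak_stacks(num_ults, prefer_started):
    """Peak number of live stacks in a toy lazy-allocation model.

    Assumptions (illustrative only): one worker, all ULTs created up
    front, and every ULT yields exactly once before it finishes.
    """
    fresh = deque(range(num_ults))  # created, no stack assigned yet
    started = deque()               # yielded once, still holding a stack
    live = 0
    peak = 0
    while fresh or started:
        # Resume a started ULT if the pool prefers them, or if nothing new is left.
        take_started = bool(started) and (prefer_started or not fresh)
        if take_started:
            started.popleft()
            live -= 1               # resumed ULT finishes; stack reclaimed
        else:
            ult = fresh.popleft()
            live += 1               # first execution: stack assigned lazily
            peak = max(peak, live)
            started.append(ult)     # yields once and keeps its stack
    return peak


if __name__ == "__main__":
    for prefer in (False, True):
        print("prefer_started =", prefer, "peak stacks =", peak_stacks(1000, prefer))
```
</pre>
    <p>In this model, a pool that always presents new ULTs first ends
      up holding one stack per ULT, while a pool that finishes started
      ULTs first holds only one stack at a time.  Real workloads will
      land somewhere in between.</p>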

    <p>If this proves to be helpful for our workload, is this something

      that could plausibly be a run-time rather than compile-time

      option?</p>

    <p>Thank you!</p>

    <p>-Phil<br>

    </p>

    <div class="moz-cite-prefix">On 6/16/22 7:43 PM, Shintaro Iwasaki

      wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:CALh=OcN5_3pU7WaQkyscsExq0dYMMSGZtbOMi09Sk2ihk7WmKQ@mail.gmail.com">

      

      <div dir="ltr">Hi Phil,

        <div><br>

        </div>

        <div>Thanks for using Argobots! I believe it's about memory

          consumption issues regarding ULT stacks.</div>

        <div><br>

        </div>

        <div>&gt; What would be ideal for me would be if

          ABT_thread_create() would defer stack allocation somehow.<br>

          I believe <a href="https://github.com/pmodels/argobots/pull/356" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/pmodels/argobots/pull/356</a>

          (merged) does exactly this.  The feature is disabled by

          default, so please pass --enable-lazy-stack-alloc at configure

          time.<br>

          <br>

          [Background]<br>

          Argobots needs to keep</div>

        <div>- "full stacks [*1]" (in this case, 2MB) per "active"

          (i.e., "executing" + "suspended") ULT</div>

        <div>Intuitively, Argobots must have a full ULT stack to save an

          intermediate ULT execution state, in addition to a stack space

          for a currently executing ULT.  This is the minimum stack

          requirement for Argobots.<br>

          [*1] There was a long discussion in <a href="https://github.com/pmodels/argobots/issues/274" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/pmodels/argobots/issues/274</a>,

          but basically it's not possible to allocate a small stack first

          and expand it later within Argobots.<br>

          <br>

          [Ideas]<br>

          A ULT stack is assigned when a ULT is first executed (not

          when it is created).  The stack is reclaimed when the ULT

          finishes (not when it is freed).  This achieves the minimum

          stack use described in [Background].  See the PR for details;

          it explains the idea with some figures.<br>

          <br>

          [Reduce More]<br>

          1. This does not include the ULT stack pool (=cache), so if

          you want to further reduce memory usage, please shrink the

          stack pool size.  The pool only adds a constant amount of

          memory, so I believe it won't affect the memory footprint

          much.  Note that shrinking it can negatively affect

          performance.<br>

          2. Even if you allocate stacks this way, you still need 2MB

          per "suspended ULT".  If most of the ULTs launch and then

          immediately yield, the "enable-lazy-stack-alloc" method does

          not reduce memory consumption.  If a ULT would otherwise

          yield right away, please have it create a new ULT for the

          continuation and exit instead; then Argobots does not need to

          keep a full ULT stack per yielded ULT. (A newly created ULT

          does not have a ULT stack since it has not started yet.)</div>
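        <div><br>
          A toy model of point 2, as a rough sketch only.  This is not
          Argobots code and does not call the Argobots API; the
          function peak_stack_mib and its parameters are made up for
          illustration.  It assumes one producer that creates every
          ULT up front, one worker draining a FIFO pool, and a 2 MiB
          stack per ULT.</div>
<pre>
```python
from collections import deque

STACK_MIB = 2  # per-ULT stack size discussed in this thread


def peak_stack_mib(num_ults, lazy, yields_per_ult):
    """Peak stack memory (MiB) in a toy model with one worker and a FIFO pool."""
    live = 0  # number of stacks currently held
    peak = 0
    pool = deque()
    # The producer creates every ULT before the worker runs any of them.
    for _ in range(num_ults):
        if not lazy:
            live += 1  # eager: stack taken at creation time
        peak = max(peak, live)
        pool.append({"started": False, "yields_left": yields_per_ult})
    # The worker pops ULTs in FIFO order until the pool is empty.
    while pool:
        ult = pool.popleft()
        if lazy and not ult["started"]:
            live += 1  # lazy: stack taken at first execution
        ult["started"] = True
        peak = max(peak, live)
        if ult["yields_left"] != 0:
            ult["yields_left"] -= 1
            pool.append(ult)  # suspended: keeps its full stack
        else:
            live -= 1  # finished: stack reclaimed
    return peak * STACK_MIB


if __name__ == "__main__":
    print("eager, no yield :", peak_stack_mib(1000, False, 0), "MiB")
    print("lazy,  no yield :", peak_stack_mib(1000, True, 0), "MiB")
    print("lazy,  one yield:", peak_stack_mib(1000, True, 1), "MiB")
```
</pre>
        <div>In this model, lazy allocation keeps the peak at a single
          stack when ULTs run to completion, but the benefit disappears
          once every ULT yields after starting, because each suspended
          ULT holds a full stack.</div>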

        <div><br>

          ---<br>

          <br>

          I might not fully understand the use case, but hopefully this

          flag helps.  Please let me know if you have any questions or

          suggestions.<br>

        </div>

        <div><br>

        </div>

        <div>Thanks,</div>

        <div>Shintaro</div>

        <div><br>

        </div>

      </div>

      <br>

      <div class="gmail_quote">

        <div dir="ltr" class="gmail_attr">On Thu, Jun 16, 2022 at 1:57

          PM Phil Carns via discuss &lt;<a href="mailto:discuss@lists.argobots.org" moz-do-not-send="true" class="moz-txt-link-freetext">discuss@lists.argobots.org</a>&gt;

          wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0px 0px 0px

          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi

          all,<br>

          <br>

          I was rummaging around in the code looking for ideas just now

          and <br>

          figured I might save myself some time by asking on the list to

          see if <br>

          anyone else has encountered this.<br>

          <br>

          A quick review of the use case: we are using large stack sizes

          (2 MiB <br>

          right now, though we could probably go lower but it will still

          be much <br>

          larger than the ABT default).  We also create, execute, and

          complete a <br>

          large number of detached ULTs.  Only a very few are

          intentionally long <br>

          lived.<br>

          <br>

          Our current strategy is that a central producer (who drives

          network <br>

          progress) creates ULTs that may be placed on other pools/ESs

          depending <br>

          on configuration.<br>

          <br>

          I had *thought* that the ULT stacks were not allocated until

          the ULT was <br>

          selected for execution by a scheduler, but I see now that's

          not the <br>

          case.  The stack is allocated up front at ABT_thread_create()

          time.  I'm <br>

          kicking myself for not understanding that sooner.  It didn't

          matter so <br>

          much when we used to use small stack sizes.<br>

          <br>

          At any rate, at this point this strategy has a few

          implications. If the <br>

          ES schedulers don't retire old ULTs fast enough (even if they

          are very <br>

          "close" to completion) then we can balloon memory consumption

          even if it <br>

          doesn't look like our actual concurrency is all that high,

          simply <br>

          because we are greedily taking more memory for stacks without

          regard to <br>

          ULT completion.  Secondly, the one producer is always paying

          the <br>

          allocation cost, and the memory is always local to that one

          core.<br>

          <br>

          What would be ideal for me would be if ABT_thread_create()

          would defer <br>

          stack allocation somehow.  Ideally not consuming so much

          memory for a <br>

          thread until a) it can really be executed and b) the scheduler

          thinks it <br>

          is a good idea to do so.  Even better if the allocation

          were in the <br>

          context of the ES that popped the thread, rather than the ES

          that <br>

          spawned the thread.<br>

          <br>

          Is this possible?<br>

          <br>

          It would be neat if this could be done internal to Argobots

          somehow for <br>

          generality for my use case, but walking through the code I

          have the <br>

          sinking feeling that we need to do this above Argobots

          (explicitly <br>

          queueing up work and letting the "worker" execution streams

          create their <br>

          own ULTs to perform that work as needed, rather than letting

          the ULT <br>

          pools within Argobots serve double duty as our work queue).<br>

          <br>

          I'm comfortable with custom pools and schedulers, but it looks

          like the <br>

          key step is already out of our hands at ULT creation time so

          there isn't <br>

          much a custom pool or scheduler could do.<br>

          <br>

          Thanks for hearing me out, and thanks in advance for feedback

          (even if <br>

          it takes the form of "that's a silly idea" :) ).<br>

          <br>

          thanks,<br>

          <br>

          -Phil<br>

          <br>

          <br>

          <br>


        </blockquote>

      </div>

    </blockquote>

  </body>

</html>