ASC Q1 2020 Meeting

Stephen Herbein edited this page Feb 14, 2020 · 11 revisions

PMIx Standard Administrative Steering Committee (ASC) Q1 2020 Meeting

Quick Links

Agenda (timeline in chart below)

Agenda Timeline

We will try to keep to this timeline as best as we can. However, discussion items may take longer/shorter than anticipated and as a result, the agenda may need to be adjusted during the meeting.

All times in US Central. (Last Update: Jan. 10, 2020)

Start End Topic
11:00 am 11:05 Gathering
* Slides
11:05 11:15 2020 Quarterly Meetings
11:15 11:25 Outreach in 2020
11:25 11:30 PMIx 4.0 release timeline
11:30 11:35 Call for new ASC Members
- Roll call for attendance/vote
11:35 11:45 Governance PRs up for a vote
* Reading and First Vote
* https://github.com/pmix/governance/pull/10
* Second Vote
* https://github.com/pmix/governance/pull/4
* https://github.com/pmix/governance/pull/7
11:45 1:00 PMIx Standard Reading:
* Chapter 1 Modifications: [PDF] https://github.com/pmix/pmix-standard/pull/192
1:00 pm 1:30 (break)
1:30 2:30 Plenary discussion items:
* ASC meeting: 1 month agenda announcement followed by 2 week window for changes
- https://github.com/pmix/governance/pull/11
* Use case presentations: link
- Business Card Exchange
- Tools and Debuggers
- Hybrid Programming Models
2:30 3:35 Working Group Updates:
* WG: Client Separation / Implementation Agnostic Document
* WG: Slicing/Grouping of functionality
* WG: Dynamic Workflows
* WG: Storage
* Open call for new Working Groups
3:35 3:45 (break)
Technical Presentations:
3:45 4:00 * Artem Polyakov (Mellanox) : PMIx_Get optimizations link
4:00 4:15 * Aurelien Bouteiller (UTK) : Runtime level failure detection and propagation link
4:15 4:30 * Geoffroy Vallee : On-node resource manager for containerized HPC workloads link
4:30 (end) Discussion Items:

Notes

  • Attendees
    • Joshua Hursey (IBM)
    • Geoffroy Vallee
    • Stephen Herbein (LLNL)
    • Brice Goglin (Inria), will leave and come back several times
    • Jai Dayal (Intel)
    • Ken Raffenetti (ANL)
    • Swaroop Pophale (ORNL)
    • Ralph Castain (Intel)
    • Thomas Naughton (ORNL)
    • Bengisu Elis (TUM)
    • David Solt (IBM)
    • John DelSignore (Perforce/TotalView)
    • Michael Karo (Altair)
    • Sourav Chakraborty (AMD)
    • Ti Leggett (ANL)
    • Shane Snyder (ANL)
    • Kathryn Mohror (LLNL)
    • Artem Polyakov (Mellanox)
    • Mahdieh (OSU)
    • Howard P (LANL)
    • Karthik Manian (OSU)
    • Aurelien Bouteiller (UTK)
  • Set dates for 2020 quarterly meetings
    • Q1: Jan 23
    • Q2: April 15
    • Q3: July 22
    • Q4: Oct 1 @ TACC in Austin
    • Details of Q4 face-to-face meeting logistics in the works
  • Outreach in 2020
    • SC'20 BoF
    • Something at ISC for easier reach by international members?
      • Need someone to lead and organize
    • Does IPDPS have BoFs? What about their workshops?
      • Will be held in the US this year
    • HPDC has workshops and will be outside the US
      • Kathryn: HPDC deadline is coming up
      • Jai: HPDC organizers like more technical talk rather than tutorial-style talks
      • Kathryn: Agreed. Would need to be more invited talks
    • Jai: Would the topic be PMIx or something more broad like Process Management?
      • Kathryn: Organizers typically like broader topics for workshops
      • Ralph: We originally discussed using these events for how to use PMIx and what you can do with it in your application
      • Kathryn: Would we solicit papers and have technical talks?
      • Ralph: May be too soon to solicit papers. Could have talks that walk through examples.
      • Kathryn: Could be presentations of already published work using PMIx
      • Josh: If we do a workshop/tutorial, would there be sufficient interest in this group and communities that we represent? Do we want to connect it to another event or do it during the Q4 meeting?
    • Stephen: Could record the presentations (subject to presenter comfort) and accept video submissions
      • Would enable people to view async if they cannot call-in or attend Q4 meeting
  • PMIx 4.0 release timeline
    • Late Q1
    • No additional API changes
    • Mostly waiting on descriptive text around tool interactions
      • Ralph: working on including some of the structure from the use-case document
  • Call for new ASC Members
    • LANL expressed intent, will be voted on next quarterly
  • Roll call for attendance/vote
    • Review voting eligibility
  • Governance PRs up for a vote:
    • Reading and First Vote
      • pmix/governance/pull/10
        • A fast track for administrative and minor typo changes
        • Michael: want to clarify if there is a straw poll taken or not for minor changes
        • Kathryn: this current proposal does not require a straw poll
    • Second Vote
      • pmix/governance/pull/4
      • pmix/governance/pull/7
  • PMIx Standard PRs up for a vote:
    • (None at this time)
  • PMIx Standard PRs up for a Reading:
    • Question from Chairs
      • In the governance document, there is a process for moving interfaces from provisional -> stable etc. There is no guidance for large textual changes, like this PR, in the governance document. If this PR were to pass reading without major suggested changes, should we hold the first vote in this meeting or the next?
      • Michael: suggestion: depending on the amount of content change, members may decide to abstain until they can digest the changes. If there are not enough “yes” votes, then the first vote should take place again at the next meeting.
      • Ralph: what happens if this happens multiple times, if people don’t take the time to read it quarter after quarter?
      • Kathryn: maybe the abstain rule only applies the first time
      • Stephen: maybe it is an indication that the PR should be broken up into smaller chunks
      • Ralph: that could be a good suggestion from the community. It has happened in the past that people cannot get their changes read.
      • Kathryn: IIUC, abstain would not count towards quorum
      • Ralph: If you abstain, you do not count in the voting, and would not be included in the 2/3rds majority calculation
      • Josh: one category of voting could be “push back voting one meeting”, a “point of order” vote so to speak
      • Ralph: what about the straw poll? Shouldn’t that be a good enough gate to proposals coming to a vote?
      • Kathryn: changes to PR could happen after a straw poll is taken
      • Stephen: is there any guidance on how frequently a straw poll should take place?
      • Josh: there isn’t, but we can leave that up to the chairs’ discretion. They have to be the ones to put the “eligible” label on it, so they can request another straw poll before placing the label.
      • Kathryn: for second votes, we are not doing a reading?
      • Josh: I don’t believe that you are required to do a reading on second vote, but you should give an idea of the changes
      • Kathryn: we may have to fix some text in the governance doc then. Page 9 says: “a full presentation of the proposal at two consecutive meetings”.
      • Ralph: let’s wordsmith a suggested change offline and submit a PR
      • Dave: we are ok with offering the “point of order” vote today
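
A minimal sketch of the tally Ralph describes above, in which abstentions are left out of the 2/3rds majority calculation. The function name is hypothetical, and treating exactly two thirds as passing is an assumption; the governance document remains the authority on the real threshold and on quorum.

```python
def passes_two_thirds(yes, no, abstain=0):
    """Return True if 'yes' reaches 2/3 of the votes actually cast.

    Abstentions are accepted as an argument only to make explicit that
    they are ignored: they count neither for nor against the proposal.
    """
    cast = yes + no  # abstentions are excluded from the denominator
    if cast == 0:
        return False  # assumption: nothing passes when nobody casts a vote
    # Integer form of yes / cast >= 2 / 3, avoiding floating point.
    return 3 * yes >= 2 * cast
```

Under these assumptions, 8 yes, 4 no, and 5 abstentions would pass, because only the 12 cast votes are counted.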
    • pmix/pmix-standard/pull/192
      • TODO: link to Dave Solt’s slides
      • Stephen and ?: the phrase “the host” is ambiguous, what is it referring to?
      • Ralph: there is an assumption that something is hosting the PMIx server
      • Ralph: people writing the host SMSs don’t want to have to write all the interoperability code to communicate with other SMS software, so PMIx provides interfaces to help enable that
      • Kathryn: “processes registered as tools do not have peers”. Did we define “peer” before this sentence? Maybe that reference/note can be left out until “peer” is formally defined.
      • Dave: we were attempting to point out that tools aren’t the same as other processes connected in
      • Stephen: could potentially add a forward reference to the tools section
      • Dave: will need to verify that the tools section adequately describes the differences
      • Michael: you introduce resource managers first and then introduce workload managers. Are these two the same thing? Might want to shorten subsequent references to RM and WLM
      • Dave: Not sure if we use RM or WLM more frequently, interchangeably, etc
      • Michael: I come across RM more frequently than WLM
      • Ralph: there is no standardized language about what these things mean. We attempted to use WLM to mean the scheduler and RM to mean the runtime underneath the WLM. Could go back to RM meaning both the scheduler and runtime underneath. Where PMIx interfaces with the scheduler, the standard mentions scheduler directly.
      • Dave: I agree. Using RM as an encompassing term sounds like it is the way to go, with a sentence explaining this choice of words.
      • Ralph: RM and other terms are defined in Chapter 2. So the problem may be a chicken/egg with using the terms before they are defined. In the community there is no consensus on what WLM, RM, SMS, or even job mean. Probably need to front load the document with what we mean by these terms
      • Dave: opening the document with a whole set of terms can be awkward
      • Michael: “which” -> “that” in line 5 of section 1.3.1
      • Stephen: MPE vs MPI on line 30
        • Could change to a more general phrase like “message-passing libraries” or “parallel runtimes”
      • John D: if we have a list of all of the interfaces/attributes that are relevant to debuggers, what do we do with that list? Do we just pass that along to every PMIx implementation? Is there a clearinghouse?
      • Stephen: the slices WG is producing use-case documents that we hope are a community curated version of those lists; it is an open question as to how that information gets integrated into the standard
      • Dave: it’s true that this section doesn’t tell you what to do with the information
      • Josh: do you want to take this feedback and hold off on the reading, or push forward with a vote?
      • Dave: there is a lot of feedback, much of which is minor changes. The bigger change is defining RM vs WLM. If we wanted to minimize that change, we could just replace WLM with RM.
      • Stephen: does the PR need to be exactly the same between vote 1 and vote 2?
      • Dave: seems like spelling and grammar changes fall under the same umbrella as the typo PR that we voted on today
      • Michael: agreed, but there are some semantic changes that were suggested today that make me feel uncomfortable voting as it is today
      • Josh: suggestion - don’t change the state of the PR (don’t push back, just leave as “read”) between today and the next Quarterly. The WG will address the feedback and post another straw poll. If the straw poll is unambiguous, then the PR will go up for a straight vote (no reading) in the next quarterly. If there are any conflicting votes in the straw poll, then the PR would go up for a reading again (followed by a vote).
      • Dave: likes that suggestion
      • No objections from the broader group.
      • Dave: there are doc files posted to the PR that show the OLD->NEW and NEW->OLD changes
        • Stephen: one of the links is missing
  • Plenary discussion items
    • ASC meeting - 1 month announcement with 2 week window for changes
      • pmix/governance/pull/11
      • JoshH: Discuss PR#11 about “when is ASC quarterly agenda final for a given meeting?”
        • Suggest defining the agenda 1 month before the ASC meeting, followed by a 2 week window for comments; the agenda is final 2 weeks before the ASC meeting, after which items are pushed to the next meeting’s agenda
      • JoshH: Reading of PR#11
      • Kathryn: Is the intent that this timeline encompasses content for governance text, PRs, reading, etc. (in addition to the agenda itself)?
      • JoshH: Yes, I believe so
      • Kathryn: Seems good to add this point
      • JoshH: OK, will work on adding wording. Intent is to freeze docs/agenda at the 2 week point before the ASC meeting.
      • JoshH: Any objections? -- none.
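
The timeline discussed for PR#11 can be sketched as simple date arithmetic. This is an illustration only: the function name is hypothetical, and counting 1 month as 30 days is an assumption that the PR text may define differently.

```python
from datetime import date, timedelta


def agenda_deadlines(meeting_day, announce_days=30, freeze_days=14):
    """Return (announcement date, freeze date) for an ASC meeting.

    announce_days: assumed length of "1 month" before the meeting.
    freeze_days:   the 2 week point at which docs and agenda are frozen.
    """
    announce = meeting_day - timedelta(days=announce_days)
    freeze = meeting_day - timedelta(days=freeze_days)
    return announce, freeze


if __name__ == "__main__":
    announce, freeze = agenda_deadlines(date(2020, 4, 15))  # Q2 date from these notes
    print("announce agenda by:", announce.isoformat())
    print("agenda final on:   ", freeze.isoformat())
```

With the Q2 date from these notes (April 15, 2020), this gives an agenda announcement by March 16 and a freeze on April 1; later items would move to the following meeting.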
    • Use case presentations
      • drive.google.com/drive/folders/1eN7aBxyzPD0a_GJFq1KH2ZHpoONj76op
      • Stephen: 3 use cases: Business Card Exchange, Tools and Debuggers, Hybrid Programming Models, Working Group Presentations
      • Stephen: Using Google Drive and stackedit/markdown to prepare these use cases
      • See the issues with label “Use Cases”; comments can be added on those issues
      • Stephen: handing over to Josh and Swaroop (authors of use cases)
      • JoshH: Business Card Exchange for Process-to-Process Wire-up
        • https://github.com/pmix/pmix-standard/issues/191
        • Artem: Question about PMIx_Fence mode being sparse. If it is synchronization only, then why sparse? (I missed a bit of his detail here)
        • JoshH: Let’s explore this a bit more outside of plenary
        • JoshH: Any questions about process or structure?
        • Artem: Is this to be part of the standard?
        • Stephen: Yes, looking to have these included in standard doc
        • Artem: OpenPMIx has some examples that might be helpful and could be referenced to help clarify with real code.
        • Kathryn: Possibly inline examples in text if fairly simple
        • Artem: May have problems with length, but possibly
        • Stephen: Please add reference in #191 to example codes. Those will be helpful to clarify things.
        • Ralph: Possibly pare down code for simplicity to include in the standard, possibly even use pseudocode.
        • Artem: Agree putting link in standard may not be great. But having a working example is useful.
        • Ralph: Yes, just need to find way to add those code links/listings
        • Stephen: Possibly use footnotes
        • Will need to decide how we want to include those working examples
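
Since the discussion above asks for a pared-down or pseudocode listing, here is one possible toy model of the put/commit/fence/get pattern behind the business card exchange. It does not call PMIx: ToyServer, ToyClient, and exchange_business_cards are hypothetical names, everything runs in one Python process, and the fence is modeled as the variant in which committed data is gathered and made visible at the fence. The PMIx_Put, PMIx_Commit, PMIx_Fence, and PMIx_Get names in the comments are the interfaces this sketch stands in for.

```python
class ToyServer:
    """In-memory stand-in for the data store behind a PMIx server."""

    def __init__(self):
        self.committed = {}  # (rank, key) -> value pushed by commit()
        self.visible = {}    # snapshot that peers may read after fence()

    def fence(self):
        # Stands in for PMIx_Fence with data collection: everything
        # committed so far becomes visible to every participant.
        self.visible = dict(self.committed)


class ToyClient:
    def __init__(self, rank, server):
        self.rank = rank
        self.server = server
        self.staged = {}

    def put(self, key, value):
        # Stands in for PMIx_Put: the value is only staged locally.
        self.staged[key] = value

    def commit(self):
        # Stands in for PMIx_Commit: staged values reach the server.
        for key, value in self.staged.items():
            self.server.committed[(self.rank, key)] = value
        self.staged.clear()

    def get(self, peer_rank, key):
        # Stands in for PMIx_Get: read a peer's published value.
        return self.server.visible[(peer_rank, key)]


def exchange_business_cards(endpoints):
    """Each rank publishes one endpoint string and reads all the others."""
    server = ToyServer()
    clients = [ToyClient(rank, server) for rank in range(len(endpoints))]
    for client, endpoint in zip(clients, endpoints):
        client.put("endpoint", endpoint)
        client.commit()
    server.fence()
    return [[c.get(peer, "endpoint") for peer in range(len(clients))]
            for c in clients]
```

After the exchange every rank holds the same list of endpoints. A working example against a real implementation would still be needed to show error handling and the fence attributes that Artem asked about.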
      • Swaroop: Debugging #216
        • https://github.com/pmix/pmix-standard/issues/216
        • Ralph: Regarding marking with the “PMIX_DEBUGGER_DAEMONS” attribute, the rationale is to let the scheduler know not to count these daemons (processes) toward your allocation. Example: if you are allowed to run 16 application processes on a node and you launch 16 + 1 debugger process, you would get denied.
        • Artem: What does “RM near Tool” and “RM on node” mean?
        • JoshH: Wherever being launched (i.e., launch node) is the “RM near tool”. “RM on node”...
        • Artem: Maybe good to have another diagram showing where each of these tools is running.
        • Ralph: May be able to grab some pictures from the presentation to help clarify
        • John D.: Can you look at a starter process and know if it’s a PMIx process?
        • Ralph: If in operation… Otherwise could look at symbols to know if using PMIx.
        • JohnD: Working on MPIR transition to PMIx and wanting to avoid having to burden the user.
        • Ralph: Could possibly just check symbols to identify
        • Ralph: Note, in lieu of PMIx_Spawn, you can also use fork/exec directly. An advantage of using PMIx_Spawn is that if the resource manager has good support (e.g., detach), you can take advantage of it; using Spawn lets you defer to PMIx the details of whether fork/exec is needed.
        • JohnD: Is there a way to find out where jobs are running, e.g., query for running jobs?
        • JoshH: You could query for all the namespaces and then walk through those
        • Would query for Proc_table
        • Swaroop: Possibly add this as a step-0 for this example (attaching) for knowing what’s running for direct attach
        • Ralph: Can launch debugger daemon, e.g., 1-per-process, or other “pattern” for how to attach. (I missed some of these remarks in notes)
        • Stephen: Possible scalability case for using Spawn and have local proc-table query instead of doing global query of all nodes
          • Ralph: you can get the hostnames of the nodes involved in a parallel application using the RESOLVE_NODES attribute with the PMIx_Query_info interface
        • Discussion of steps for exchanging proc-table info for nodes
        • Kathryn: Regarding IO Forwarding, what’s the exact use case here?
        • It’s for the stderr/stdout of the application
        • Ralph: Need to work on this b/c the PMIx_Log API is not for IOF, so need to iterate here
        • Ralph: IOF can pull stdout/stderr from any process under a PMIx server instance. Uses the RM’s out-of-band for shipping in/out to PMIx-managed procs. Caller can pull output from another process; typically intended for Tools but could be used in general.
        • Swaroop: Looks like the IOF bit may need to be moved or adjusted, also the environment bits
        • Stephen: May be useful to have a separate use case on PMIx_Spawn specifically for all the details
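
Ralph's 16 + 1 example for PMIX_DEBUGGER_DAEMONS earlier in this use case can be illustrated with a toy accounting model. This is a sketch under stated assumptions, not scheduler or PMIx code: ToyNodeScheduler and its launch method are hypothetical, and the boolean flag merely stands in for a launch request that carries the attribute.

```python
class ToyNodeScheduler:
    """Counts processes on one node against a per-node limit."""

    def __init__(self, max_app_procs):
        self.max_app_procs = max_app_procs
        self.app_procs = 0
        self.debugger_daemons = 0

    def launch(self, count, debugger_daemons=False):
        """Return True if the launch is admitted, False if it is denied."""
        if debugger_daemons:
            # Marked as debugger daemons: tracked, but not charged
            # against the application's allocation.
            self.debugger_daemons += count
            return True
        if self.app_procs + count > self.max_app_procs:
            return False  # would exceed the allocation
        self.app_procs += count
        return True


if __name__ == "__main__":
    node = ToyNodeScheduler(max_app_procs=16)
    print(node.launch(16))                        # application fills the node
    print(node.launch(1))                         # unmarked extra process
    print(node.launch(1, debugger_daemons=True))  # marked debugger daemon
```

In this model the unmarked 17th process is denied while the marked daemon is admitted, which is the behavior the attribute is meant to enable.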
      • Use-Case: Hybrid Programming Model
        • https://github.com/pmix/pmix-standard/issues/232
        • Stephen: Coordination for hybrid cases, e.g., MPI+X, via PMIx
        • See PR#232
        • Kathryn: This assumes you can call PMIx_Init multiple times from a single process, right?
        • Stephen: Yes
        • Aurelien: Can have prog model version on top of library version. Seems like you are missing a version number for the programming model itself. Could be MPI-3 implemented by OpenMPI-2.x
        • Stephen: Are you saying this already exists in standard?
        • Aurelien: Not sure, but saying it would be good to have
        • Aurelien: processor-mask (placement on hardware) info is useful for coordinating runtimes to know what processors are in use/free
        • Kathryn: Martin Schulz might have some thoughts on interoperability of OpenMP + MPI
        • Ralph: Thought there were supposed to be a few other events, e.g., the processor-mask mentioned above, to help with coordination. Also, things like MPI progress events.
        • Aurelien: Think Geoffroy had a paper on that. Will look for things.
        • Geoffroy: PGAS+MPI use case design. Also, about OpenMP+MPI coordination events
        • Stephen: Would be great to get a short example for that use case
    • Client Separation / Implementation Agnostic Document Working Group
      • David: Been working on Ch1 (reading earlier in agenda) and starting on Ch2. (See slides for Ch2 change summary)
      • Looking at PDF of Ch2 changes, see also: https://github.com/pmix/pmix-standard/pull/235
      • JoshH: Comment about removing use of “job-steps” (pg. 15, line 27)
      • Kathryn: Looks like some terms, e.g., Work Load Manager (WLM), need to make sure to be consistent with usage between Ch1 & Ch2.
    • Slicing/Grouping of functionality Working Group
    • Dynamic Workflows Working Group
      • Jai: See slides on Dynamic Workflows
      • How PMIx can support applications w/ non-static environments
      • Focused on what it means to be “dynamic” (e.g., elastic, reconfigurable)
      • See slides for how to join working-group
    • Storage Working Group
      • Shane: See slides on Storage WG
      • Targeting few items to start: Orchestrating data movement across storage tiers, PMIx event notification use w/ storage systems
      • See slides for meeting details
  • Use case presentations
    • (None at this time)
  • Open call for new Working Groups
    • Kathryn: Any new WG calls?
    • Ralph: how about a Tools WG?
    • JoshH: Maybe a more focused Tools WG, a Debuggers WG?
    • JohnD: I’d be interested. Is this a carrot or stick?
    • Aurelien: Don’t think we need Resilience WG
    • Geoffroy: Possibly a Container WG, another level of indirection and may be useful
    • Stephen: That also hits the multi-version issue
    • Geoffroy: Also hits the client/server duality. Hierarchy of pmix servers/clients.
  • Technical presentations
    • Artem Polyakov (Mellanox) : PMIx_Get optimizations link
      • Stephen: What’s the next critical path target?
      • Artem: working on the exchange. We did optimization of the protocol, incorporating the Bruck algo to do allgatherv (see EuroMPI’19 paper)
      • Will present that next time
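
For readers unfamiliar with the Bruck algorithm Artem mentions, below is a single-process simulation of the textbook Bruck allgather. It is not the modified protocol from the presentation or the EuroMPI'19 paper; the function and variable names are hypothetical, and blocks are treated as opaque values so they may differ in size, as in allgatherv.

```python
def bruck_allgather(blocks):
    """Simulate a Bruck allgather; blocks[i] is rank i's contribution.

    Returns (result, rounds), where result[i] is the full ordered list
    of blocks as seen by rank i and rounds is the number of exchange steps.
    """
    p = len(blocks)
    # buf[i][j] is the block that originated at rank (i + j) % p.
    buf = [[block] for block in blocks]
    dist = 1
    rounds = 0
    while dist < p:
        # Each rank already holds `dist` blocks. It receives from rank
        # (i + dist) % p only as many blocks as it is still missing.
        count = min(dist, p - dist)
        buf = [buf[i] + buf[(i + dist) % p][:count] for i in range(p)]
        dist *= 2
        rounds += 1
    # Undo the per-rank rotation so every rank sees blocks in rank order.
    result = []
    for i in range(p):
        ordered = [None] * p
        for j, block in enumerate(buf[i]):
            ordered[(i + j) % p] = block
        result.append(ordered)
    return result, rounds
```

The point of the pattern is that every rank ends up with all contributions after about log2(P) exchange rounds rather than P - 1, at the cost of sending larger messages in later rounds.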
    • Aurelien Bouteiller (UTK) : Runtime level failure detection and propagation link
      • Ralph: Timetable for bringing changes into PRRTE?
      • Aurelien: In the next 3 months
      • Josh: If someone wants to try it today, is it in the ULFM distro?
      • Aurelien: No, but contact us and we can share with you.
    • Geoffroy Vallee : On-node resource manager for containerized HPC workloads link
      • Stephen: Very interesting. In the client/server architecture slide diagram, how are the requests delivered/forwarded in the hierarchy?
      • Geoffroy: Depends, may have ability to support locally. Working to improve integration and more investigations underway
      • Stephen: Dealing with similar things with Flux with hierarchies regarding nesting and which level you are querying
      • Ralph: May want to use attributes to set a depth that limits how far beyond the local level a query goes to get info (potentially all the way up the hierarchy)
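
Ralph's idea of bounding how far a query climbs a hierarchy of servers could look roughly like the toy model below. It is purely illustrative: ToyPmixServer and the max_depth parameter are hypothetical, and no existing PMIx attribute with these semantics is implied.

```python
class ToyPmixServer:
    """One level in a nested hierarchy of servers, each with local data."""

    def __init__(self, name, parent=None, data=None):
        self.name = name
        self.parent = parent
        self.data = dict(data or {})

    def query(self, key, max_depth=None):
        """Look for key locally, then up to max_depth levels above.

        max_depth=0 means local only; None means no limit.
        Returns (value, answering server name), or None if not found.
        """
        server, depth = self, 0
        while server is not None:
            if key in server.data:
                return server.data[key], server.name
            if max_depth is not None and depth >= max_depth:
                return None  # depth limit reached, stop climbing
            server, depth = server.parent, depth + 1
        return None  # ran past the top of the hierarchy
```

A container-level server could then answer what it knows locally and forward only as far as the caller allows, which is the trade-off raised for both the containerized resource manager and nested Flux instances.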
    • Artem Polyakov (Mellanox) : Modified Bruck Collective for PMIx_Fence link
      • May push to ASC Q2'2020
  • Discussion items:
    • (None at this time)