
2017 03 14

Aurelien Bouteiller edited this page Mar 14, 2017 · 1 revision

MPI Error management WG March 14, 2017

Attendees

  • Aurelien Bouteiller (UTK)
  • Murali Emani (LBNL)
  • Keita Teranishi (Sandia)
  • Nawrin Sultana (Auburn)
  • Alexander Calvert (Auburn)

Presentation material

https://docs.google.com/presentation/d/1YOccLbrHd42vUtgt0KZWymXME0HtlVg8BndnIe3n6jc/edit?usp=sharing

Topics

Rescheduling the WG meeting

The doodle poll has yielded results. Proposed slot: biweekly at 2pm CST, starting March 29.

tickets #1 and #28

Wesley was absent, so this topic was left dormant.

Local+global recovery

auto jumping error handlers

  • As discussed in the WG f2f, auto jumping is problematic. Keita notes that one can jump only into a parent of the current stack frame, which limits where the setjmp can be placed (basically not inside a function call that has since returned; in most cases only main is sane).
  • Murali will investigate how MPI_Reinit deals with the issue. If they have a nice solution, we can reuse it; as it stands, auto jumping seems out of the question and must be delegated to users.
  • For now, it seems that we can support longjmp only with language support for it, which is not available in Fortran/C.
    • Note, however, that we have implemented transactions with macros that do setjmp/longjmp, and have used longjmp from error handlers. All of this is under the user's control, where function call nesting is not MPI's problem.
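
A minimal sketch of the user-controlled pattern mentioned above, in plain C with no MPI calls. All names (`TRY_BEGIN`, `TRY_RECOVER`, `TRY_END`, `user_error_handler`, `nested_operation`, `run_transaction`) are hypothetical and are not the macros the WG members actually wrote. The point it illustrates is that the setjmp is expanded by a macro into the caller's own frame, so that frame is still live when the handler calls longjmp from a nested call.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical names throughout; none of this is an MPI API. */
static jmp_buf recovery_point;   /* armed by TRY_BEGIN in the caller's frame */
static volatile int last_error;  /* error code recorded by the handler */

/* The macros expand setjmp directly in the calling function. Wrapping
 * setjmp in a helper function would not work: the helper's frame would
 * be gone by the time longjmp is called. */
#define TRY_BEGIN   if (setjmp(recovery_point) == 0) {
#define TRY_RECOVER } else {
#define TRY_END     }

/* Stand-in for a user-installed error handler: record the code, then
 * jump back to the frame that armed the recovery point. */
static void user_error_handler(int code)
{
    assert(code != 0);           /* 0 is reserved for success in this sketch */
    last_error = code;
    longjmp(recovery_point, 1);
}

/* Stand-in for a communication call nested below the transaction. */
static void nested_operation(int fail)
{
    if (fail)
        user_error_handler(42);
}

/* Returns 0 if the transaction body completed, or the recorded error
 * code if the handler jumped into the recovery branch. */
int run_transaction(int fail)
{
    volatile int result = -1;    /* volatile: read after a possible longjmp */

    TRY_BEGIN
        nested_operation(fail);
        result = 0;
    TRY_RECOVER
        result = last_error;
    TRY_END

    return result;
}
```

Here `run_transaction(0)` returns 0 and `run_transaction(1)` returns 42. The nesting depth between the transaction and the failing call is entirely the user's responsibility, which is the property the notes rely on.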

scoped reinit-like approaches

  • One may be able to set a "reinit" error handler on some communicator. The application could thus alternate between "reinit" phases and local recovery phases, or execute reinit on a section, and local recovery on another (or multiple independent instances of reinit, even).
  • Still subject to the longjmp complication. Also, where do we jump to, if not to just after MPI_Init? One proposal is that we come out of MPI_COMM_DUP; this could work, but it remains tentative as long as we are unsure it can be implemented cleanly.

global_revoke

  • Keita confirms this function would ease Fenix implementation.
  • Exploring the idea of "descendants_revoke", which would revoke all communicators created from the revoked communicator (i.e., calling it on MPI_COMM_WORLD would revoke everything but SELF and get_parent()).
    • The idea is amusing, but we are not sure yet whether it can be made to work, or how well the forum would receive these new concepts of descendant communicators/windows/files.
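
To make the "descendants" notion concrete, here is a toy model in plain C. It assumes that each communicator records the communicators created from it; that bookkeeping, the `comm_node` type, and `comm_add_child` are hypothetical and do not correspond to anything in MPI today. Only the name `descendants_revoke` comes from the discussion above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 8

/* Hypothetical model of a communicator that remembers its children. */
typedef struct comm_node {
    const char *name;
    int revoked;
    struct comm_node *children[MAX_CHILDREN];
    size_t nchildren;
} comm_node;

/* Record that `child` was created from `parent`.
 * Returns 0 on success, -1 if the parent has no free child slot. */
int comm_add_child(comm_node *parent, comm_node *child)
{
    assert(parent != NULL && child != NULL);
    if (parent->nchildren >= MAX_CHILDREN)
        return -1;
    parent->children[parent->nchildren++] = child;
    return 0;
}

/* Revoke `c` and everything created from it, recursively.
 * Returns the number of communicators newly revoked by this call. */
int descendants_revoke(comm_node *c)
{
    int count = 0;
    assert(c != NULL);
    if (!c->revoked) {
        c->revoked = 1;
        count = 1;
    }
    /* Keep descending even if `c` was already revoked: its children
     * may have been created earlier and may still be live. */
    for (size_t i = 0; i < c->nchildren; i++)
        count += descendants_revoke(c->children[i]);
    return count;
}
```

In this model, a communicator standing in for SELF is untouched by a revoke on the WORLD node simply because it was never registered as a child of it. Windows and files would need the same parent tracking, which is part of what makes the concept new.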