2017 09 13
Wesley Bland edited this page Sep 13, 2017 · 1 revision
- Intel - Wesley, Rob
- Argonne - Ken, Yanfei
- UTK - Aurelien
- Auburn - Nawrin
- LLNL - Ignacio, Murali
- UT Chattanooga - Tony
- ORNL - Geoffroy
- WG Time
- Briefly? discuss catastrophic errors again
- Move forward on process failure (ULFM, Reinit, etc.)
- Reading
- Read error handlers
- Went over slides and PDF for reading
- Want to change one sentence in advice to implementors in Section 8.2.
- This should be a small enough change to be acceptable. Will point it out separately.
- Discussed current proposal and decided that we're still happy with it.
- Global state of `MPI_GET_STATE` is ok because if any thread is catastrophic, all threads are catastrophic and can't recover anyway.
- If you're checking the state, you're probably going to do it in an error handler, so you'll know which error code to look for to find out about the error.
- Bill Gropp was asking us to look at things like what POSIX does for errors, but it's difficult to replicate that in MPI because of the much larger amount of state that MPI has to maintain across multiple processes. POSIX is more local and stateless (or the state lives in the user's data).
- We might end up needing more error classes so we can give the user specific information about errors.
- Might be ready to move forward on a December reading here.
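The argument for a process-wide state can be illustrated with a toy model. This is only a sketch, not MPI code: `ToyLibrary`, the state values, and the error codes are all hypothetical stand-ins, since the notes do not define what `MPI_GET_STATE` returns. It assumes that a catastrophic error flips one shared state, so any thread that asks afterwards sees the same answer, and that the handler already holds the error code when it checks the state.

```python
import threading

# Hypothetical names throughout: the notes do not define states or codes.
STATE_OK = "ok"
STATE_CATASTROPHIC = "catastrophic"
ERR_OTHER = 1
ERR_CATASTROPHIC = 2


class ToyLibrary:
    """Toy stand-in for a library that keeps one process-wide state."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = STATE_OK

    def get_state(self):
        # Every thread reads the same shared state.
        with self._lock:
            return self._state

    def report_error(self, code, handler):
        # Assumption: a catastrophic code flips the shared state
        # before the user's handler is invoked.
        if code == ERR_CATASTROPHIC:
            with self._lock:
                self._state = STATE_CATASTROPHIC
        return handler(self, code)


def handler(lib, code):
    # The handler knows which code it received, so it knows when
    # checking the state is worthwhile.
    if code == ERR_CATASTROPHIC and lib.get_state() == STATE_CATASTROPHIC:
        return "give up"
    return "handle locally"


def run_demo():
    lib = ToyLibrary()
    results = {}

    def failing_thread():
        results["failing"] = lib.report_error(ERR_CATASTROPHIC, handler)

    def other_thread():
        results["other_state"] = lib.get_state()

    # Run the threads one after the other to keep the demo deterministic.
    for target in (failing_thread, other_thread):
        t = threading.Thread(target=target)
        t.start()
        t.join()
    return results
```

In this model, `run_demo()` reports that the second thread also observes the catastrophic state, even though it never saw an error itself.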
- Aurelien proposed adding a `MPI_COMM_REVOKE_ALL` function to resolve the deadlock problem with overlapping communicators.
- Others were skeptical because you might always have to assume that you need to revoke all communicators any time you have overlapping communication.
- Aurelien asserted that having concurrent communication with overlapping communicators is not common and might not be as bad as we think.
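One way to picture the overlapping-communicator problem is the toy model below. It is a sketch under assumptions, not MPI code: the communicator names, rank numbers, and the semantics given to `revoke_all` (revoke every communicator that the calling rank belongs to) are hypothetical, since the notes do not define `MPI_COMM_REVOKE_ALL`. In the scenario, rank 3 has failed, rank 0 is blocked on `comm_a`, and rank 2 is blocked on `comm_b` waiting for rank 1, which has left for recovery.

```python
from dataclasses import dataclass


@dataclass
class Comm:
    members: frozenset
    revoked: bool = False


def make_scenario():
    # Two communicators that overlap in rank 1 (hypothetical layout).
    comms = {
        "comm_a": Comm(frozenset({0, 1, 3})),
        "comm_b": Comm(frozenset({1, 2})),
    }
    # rank -> communicator on which that rank is currently blocked
    pending = {0: "comm_a", 2: "comm_b"}
    return comms, pending


def still_blocked(comms, pending):
    # A blocked operation is only interrupted if its communicator is revoked.
    return sorted(r for r, name in pending.items() if not comms[name].revoked)


def revoke(comms, name):
    comms[name].revoked = True


def revoke_all(comms, rank):
    # Assumed semantics: revoke every communicator containing `rank`.
    for comm in comms.values():
        if rank in comm.members:
            comm.revoked = True


def compare():
    # Rank 1 notices the failure on comm_a and revokes only that communicator.
    comms, pending = make_scenario()
    revoke(comms, "comm_a")
    single = still_blocked(comms, pending)

    # Same scenario, but rank 1 revokes everything it participates in.
    comms, pending = make_scenario()
    revoke_all(comms, 1)
    every = still_blocked(comms, pending)
    return single, every
```

In this model, revoking only `comm_a` leaves rank 2 blocked on `comm_b`, while the revoke-all variant leaves no rank blocked. It also shows the cost the skeptics raised: every communicator that overlaps with the caller is torn down, whether or not it was affected.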