Flush the atm.log file at the end of init sequence #2962

Open
bartgol opened this issue Aug 22, 2024 · 8 comments

Comments

@bartgol
Contributor

bartgol commented Aug 22, 2024

Avoids confusion when the code hangs due to initialization issues: if atm.log is there, and filled, we can be sure EAMxx::init completed.
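A minimal sketch of the idea, assuming a plain `std::ofstream` stands in for the real atm.log logger (EAMxx's actual logging code is not shown in this thread). `ToyDriver`, `read_file`, and the log messages are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical stand-in for the atmosphere driver; not the real EAMxx class.
class ToyDriver {
public:
  explicit ToyDriver(const std::string& log_path) : m_log(log_path) {}

  void initialize() {
    assert(m_log.is_open());  // the log file must have been created
    m_log << "init: grids created\n";
    m_log << "init: fields allocated\n";
    m_log << "init: completed\n";
    // Without this flush, the lines above may still sit in the stream
    // buffer (and never reach disk) if the run hangs or is killed later.
    m_log.flush();
  }

private:
  std::ofstream m_log;
};

// Reads the file through an independent handle, the way a user
// inspecting the log from another shell would see it.
std::string read_file(const std::string& path) {
  std::ifstream in(path);
  std::ostringstream ss;
  ss << in.rdbuf();
  return ss.str();
}
```

With the flush in place, reading the file right after `initialize()` returns, while the driver still holds the file open, should already show the "init: completed" line.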

@oksanaguba
Contributor

add homme log flushing too?

@bartgol
Contributor Author

bartgol commented Aug 29, 2024

That's definitely important, I just don't know (yet?) how to do it. The homme log is created in Fortran, and I am not familiar with Fortran's flushing mechanisms. As far as I can tell, there is no direct way to flush a file in Fortran, short of closing it and reopening it. But I need to dig a bit more.

@ambrad
Member

ambrad commented Sep 25, 2024

This came up while I was helping someone. To address the homme_atm.log case, I think I'm going to try inserting call flush(iulog) right before this line: https://github.com/E3SM-Project/E3SM/blob/ae514946f348917eaa3c23405447642a843f8436/components/homme/src/share/parallel_mod.F90#L276

@bartgol
Contributor Author

bartgol commented Sep 25, 2024

I think this may only work if the rank issuing the abort is the same rank that holds the iulog handle (I think only masterproc opens the log file).

Edit: nope, masterproc creates the file, but all ranks open it, so it should be good.

@ambrad
Member

ambrad commented Sep 25, 2024

See E3SM-Project/E3SM#6646.

@ambrad
Member

ambrad commented Sep 25, 2024

This can't fix every problem. EAM example: rank > root calls abortmp. iulog is attached to e3sm.log, and the rank has no knowledge of atm.log. So it flushes e3sm.log and then calls MPI_Abort. Other ranks get shut down without flushing, so the root rank won't flush atm.log.

This should fix the typical case, where rank = root does input checking.

It should also handle the case where rank k prints to iulog, then calls abortmp. Whatever log it printed to (atm or e3sm) will get flushed, showing the reason abortmp was called.

To cover every case, I think we need to introduce MPI error handlers (https://docs.open-mpi.org/en/main/man-openmpi/man3/MPI_Comm_create_errhandler.3.html) so ranks can clean things up before exiting.

@ambrad
Member

ambrad commented Oct 1, 2024

For EAMxx's atm.log, I think flushing it in the catch block that was recently modified to call MPI_Abort would do the trick for most cases (any case that throws an exception that eventually leads to the top-level catch block).
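A rough sketch of that pattern, not the actual EAMxx catch block: `run_guarded` is a hypothetical helper, the log is a plain `std::ofstream`, and `abort_fn` stands in for the `MPI_Abort` call so the example builds without MPI.

```cpp
#include <cassert>
#include <exception>
#include <fstream>
#include <functional>
#include <stdexcept>
#include <string>

// Runs `body`. If it throws a std::exception, the reason is written to the
// log and the log is flushed BEFORE `abort_fn` is called.
// `abort_fn` is a placeholder for MPI_Abort (an assumption of this sketch).
int run_guarded(std::ofstream& log,
                const std::function<void()>& body,
                const std::function<void(int)>& abort_fn) {
  assert(log.is_open());
  try {
    body();
  } catch (const std::exception& e) {
    log << "ERROR: " << e.what() << '\n';
    log.flush();   // make the reason visible before the ranks are torn down
    abort_fn(1);   // in a real run this would not return
    return 1;
  }
  return 0;
}
```

The ordering is the point: once the abort call is issued the process may be killed at any moment, so anything still buffered at that time can be lost.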

Another case is EAMxx's homme_atm.log when Hxx throws an exception. Not sure whether the AD has a handle to this log. If it does, it could be flushed in the top-level catch block, too.

@bartgol
Contributor Author

bartgol commented Oct 2, 2024

I don't think the AD has a handle to that log, unfortunately. However, it should already get flushed. Here's what happens if the catch block executes:

  • the catch block is entered
  • the EAMxx singleton is cleaned up
  • when cleaning up the singleton, the AD is destroyed
  • when AD is destroyed, it calls finalize
  • AD::finalize finalizes (and destroys) all atm procs
  • HommeDynamics::finalize calls prim_finalize_f90
  • prim_finalize_f90 closes the log via close_homme_log

But close_homme_log does NOT call "flush" on it. Maybe that's the fix.

Edit: no, calling close(iulog) should already flush the file. It must be that when the code crashes, we don't get to that call. Perhaps a second exception is thrown while the code in the catch section is run?
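If the second-exception guess is right (it is only a hypothesis at this point), one defensive ordering would be to flush first and then guard the cleanup. The sketch below is hypothetical: `handle_failure` is not an EAMxx function, `flush_logs` is an assumed callback, and `finalize` stands in for the whole AD -> atm procs -> prim_finalize_f90 chain listed above.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Simulates a failure, then handles it the way a top-level catch block
// might. Returns a trace of the steps that actually ran.
std::vector<std::string> handle_failure(const std::function<void()>& flush_logs,
                                        const std::function<void()>& finalize) {
  assert(flush_logs && finalize);  // both callbacks must be set
  std::vector<std::string> trace;
  try {
    throw std::runtime_error("init failed");  // simulated original failure
  } catch (const std::exception& e) {
    trace.push_back(std::string("caught: ") + e.what());
    flush_logs();  // done first, so a failing cleanup cannot skip it
    trace.push_back("logs flushed");
    try {
      finalize();  // cleanup that eventually closes the log
      trace.push_back("finalized");
    } catch (const std::exception& e2) {
      // A second exception here would otherwise escape the handler and
      // skip everything after it, including the close that flushes.
      trace.push_back(std::string("finalize threw: ") + e2.what());
    }
  }
  return trace;
}
```

If `finalize` throws, the trace still contains "logs flushed", which is the property we want from the real handler.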
