Add MPI-IO collective reads of NetCDF-4 files #1405
Conversation
…tions within NetCDF. This has been tested on Cactus using the RRFS application with up to 24 ensemble members.
….ParallelStartup Merged with GFDL main
…t behavior is unchanged, user has to activate collective reads
Besides the inline comment, please remove the commented-out debug statements.
fms2_io/netcdf_io.F90
Outdated
if(string_compare(trim(fileobj%path), trim(fn_collective(i)), .true.)) fileobj%use_collective = .true.
enddo
if(fileobj%use_collective) then
  err = nf90_open(trim(fileobj%path), ior(NF90_NOWRITE, NF90_MPIIO), fileobj%ncid, comm=mpp_comm_private, info=MPI_INFO_NULL)
Does the use of mpp_comm_private here imply that all ranks associated with reads and writes to the named file will block until all MPI ranks associated with that communicator have contributed?
I'm only working with reads, but yes those reads are collective.
Debugging prints removed from fms2_io/include/netcdf_read_data.inc and fms2_io/netcdf_io.F90
Of course, the NF90_NOWRITE specifies read-only. My concern is that mpp_comm_private is not the correct communicator to be using here. For the general use of libFMS, mpp_comm_private is the MPI_COMM_WORLD global communicator, i.e. all MPI ranks for the job as a whole. For the UFS, mpp_comm_private in libFMS is initialized with a local communicator that is a subset of the global MPI ranks. I am trying to ensure the proper communicator will be in effect for all uses of parallel NetCDF within libFMS.
One intended use case involves a GCM with nested domains, thus the communicator used in this nf90_open() call might change. I have not done any testing with that case.
From line 862 of FV3/io/fv3atm_restart_io.F90
call read_restart(Phy_restart, ignore_checksum=ignore_rst_cksum)
The read_restart() routine is from fms2_io_mod
@dkokron Thanks. I am assuming Phy_restart is defined as FmsNetcdfDomainFile_t? I think the code is missing a call to mpp_set_current_pelist. If you can push what you currently have, I will take a better look. I think I can test it out with our unit tests.
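A minimal sketch of that suggestion; domain_pelist is an assumed name for the array of PEs that own the domain being read, not a variable from the actual fv3atm code:

! make the domain's pelist current so every rank that owns the domain enters the collective read
call mpp_set_current_pelist(domain_pelist)
call read_restart(Phy_restart, ignore_checksum=ignore_rst_cksum)
! restore the default (full) pelist afterwards
call mpp_set_current_pelist()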
Do you get better performance for the HAFS nested grid case?
Yes, Phy_restart is of type FmsNetcdfDomainFile_t.
I am still investigating.
@uramirez8707 Do you have access to the WCOSS2 test system 'acorn'?
I do not.
Pete Johnsen had these questions.
My responses:
…_get_current_pelist_comm to gain access to the appropriate communicator
…communicator associated with the current pelist
I'd need to mimic the following code using only FMS calls.
What I need is an MPI communicator that contains the ranks associated with each tile of a domain2d. I think the following should accomplish the task. Is there a better way?
@dkokron - I'm poking around in FMS and the domains structure to find the info you are asking about. We've been following your commits and discussing them internally. I'd like to propose another meeting to share some of our observations.
@bensonr I'm free 11am to 2pm central every day. Send me an invite.
How does one prepare and run the unit test suite when FMS is compiled using CMake?
Unfortunately our test suite is currently only available via the autotools build since it's our main supported build system. We have an open issue for it here: #827.
Some comments for discussion
fms2_io/include/netcdf_read_data.inc
Outdated
@@ -362,8 +362,12 @@ subroutine netcdf_read_data_2d(fileobj, variable_name, buf, unlim_dim_level, &
type is (integer(kind=i8_kind))
err = nf90_get_var(fileobj%ncid, varid, buf, start=c, count=e)
type is (real(kind=r4_kind))
! NetCDF does not have the ability to specify collective I/O at the file basis
! so we must activate at the variable level
if(fileobj%use_collective) err = nf90_var_par_access(fileobj%ncid, varid, nf90_collective)
This is inside an if (fileobj%is_root) then block, so only the first rank in the pelist is going to get to the nf90_var_par_access and nf90_get_var calls.
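A simplified sketch of the structure being described (not the verbatim FMS source): a collective nf90_get_var requires every rank in the communicator to make the call, so leaving it inside the is_root guard would have only the pelist root participate.

if (fileobj%is_root) then
  if (fileobj%use_collective) err = nf90_var_par_access(fileobj%ncid, varid, nf90_collective)
  ! only the root rank of the pelist ever reaches this call
  err = nf90_get_var(fileobj%ncid, varid, buf, start=c, count=e)
endif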
fms2_io/include/netcdf_read_data.inc
Outdated
@@ -362,8 +362,12 @@ subroutine netcdf_read_data_2d(fileobj, variable_name, buf, unlim_dim_level, &
type is (integer(kind=i8_kind))
err = nf90_get_var(fileobj%ncid, varid, buf, start=c, count=e)
type is (real(kind=r4_kind))
! NetCDF does not have the ability to specify collective I/O at the file basis
! so we must activate at the variable level
if(fileobj%use_collective) err = nf90_var_par_access(fileobj%ncid, varid, nf90_collective)
err = nf90_get_var(fileobj%ncid, varid, buf, start=c, count=e)
For domain-decomposed files, c and e correspond to the section of the data for all of the PEs in the pelist:
FMS/fms2_io/include/domain_read.inc
Lines 183 to 185 in 7f58528
call netcdf_read_data(fileobj, variable_name, buf_i4_kind, &
                      unlim_dim_level=unlim_dim_level, &
                      corner=c, edge_lengths=e, broadcast=.false.)
The netcdf_read_data calls will need to be modified in domain_read_*d so that each rank reads its own section.
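A hedged sketch of what that change could look like in domain_read_*d; the index names (isc/iec/jsc/jec, isg/jsg) and the assumption that the x and y dimensions sit in the first two slots of c and e are illustrative, not the actual FMS variable layout:

integer :: isc, iec, jsc, jec, isg, jsg
! each rank computes the corner and edge lengths of its own compute domain
call mpp_get_compute_domain(fileobj%domain, isc, iec, jsc, jec)
call mpp_get_global_domain(fileobj%domain, xbegin=isg, ybegin=jsg)
c(1) = isc - isg + 1
c(2) = jsc - jsg + 1
e(1) = iec - isc + 1
e(2) = jec - jsc + 1
! broadcast=.false. because every rank now reads its own section collectively
call netcdf_read_data(fileobj, variable_name, buf_i4_kind, &
                      unlim_dim_level=unlim_dim_level, &
                      corner=c, edge_lengths=e, broadcast=.false.)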
@uramirez8707 Those c and e arrays have had the appropriate values during all of my testing so far. We'll see if that changes after removing the io_layout=1,1 'fix'.
diff --git a/tools/fv_mp_mod.F90 b/tools/fv_mp_mod.F90
index 4f2c0f2..f1e6012 100644
--- a/tools/fv_mp_mod.F90
+++ b/tools/fv_mp_mod.F90
@@ -658,12 +658,12 @@ contains
!--- if io_layout\=(1,1) then read io_layout=io_layout (no change)
l_layout = mpp_get_io_domain_layout(domain)
call mpp_copy_domain(domain, domain_for_read)
- if (ALL(l_layout == 1)) then
- call mpp_get_layout(domain, l_layout)
+ !if (ALL(l_layout == 1)) then
+ ! call mpp_get_layout(domain, l_layout)
+ ! call mpp_define_io_domain(domain_for_read, l_layout)
+ !else
call mpp_define_io_domain(domain_for_read, l_layout)
- else
- call mpp_define_io_domain(domain_for_read, l_layout)
- endif
+ !endif
endif
deallocate(pe_start,pe_end)
fms2_io/include/netcdf_read_data.inc
Outdated
err = nf90_get_var(fileobj%ncid, varid, buf, start=c, count=e)
type is (real(kind=r8_kind))
if(fileobj%use_collective) err = nf90_var_par_access(fileobj%ncid, varid, nf90_collective)
I think if use_collective is true then bcast needs to be set to false:
FMS/fms2_io/include/netcdf_read_data.inc
Lines 149 to 153 in 7f58528
if (present(broadcast)) then
  bcast = broadcast
else
  bcast = .true.
endif
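A sketch of the suggested default, assuming use_collective is reachable from fileobj at this point in the routine; when every rank reads its own data collectively there is nothing left to broadcast:

if (present(broadcast)) then
  bcast = broadcast
else if (fileobj%use_collective) then
  bcast = .false.
else
  bcast = .true.
endif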
Good point. Thank you.
fms2_io/netcdf_io.F90
Outdated
! so we must activate at the variable level in netcdf_read_data_2d() and netcdf_read_data_3d()
if(fileobj%use_collective .and. fileobj%TileComm < 0) then
!write(6,'("netcdf_file_open: Open for collective read "A,I4)') trim(fileobj%path), szTile
err = nf90_open(trim(fileobj%path), ior(NF90_NOWRITE, NF90_MPIIO), fileobj%ncid, comm=fileobj%TileComm, info=MPI_INFO_NULL)
How will this work for reading metadata? Currently a root rank reads the data and broadcasts to the other ranks. Is this still needed?
@bensonr Do you want to maintain support for the following read scenarios?
- fileobj%is_root (single rank) reads then broadcasts. I believe this is the situation when the io_layout=(1,1) WAR isn't active.
- all ranks associated with a file read independently (parallel-independent). This is the situation when the io_layout=(1,1) WAR is active.
- all ranks associated with a file read collectively (parallel-collective).
Put another way, are you proposing UFS eliminates the io_layout=(1,1) work-around from their code?
@dkokron - I thought I answered this, but I see I never hit the comment button. I am proposing the UFS eliminate the workaround, as the new path through FMS with NC-4 files should behave the same way. Let me know your thoughts.
I'm only testing for NetCDF-4 format after a failed attempt to open for parallel-collective.
In order to obtain the new FMS path, the user needs to modify their code before each open_file() for which they want the new path (currently 4 file types out of the many 10s).
e.g.
+ Phy_restart%use_collective = .true.
+ Phy_restart%TileComm = mpp_get_tile_comm(fv_domain)
amiopen=open_file(Phy_restart, trim(fname), 'read', domain=fv_domain, is_restart=.true., dont_add_res_to_filename=.true.)
The way things are coded now (after removing the new path from those is_root blocks and removing the io_layout=(1,1) workaround), the default path is still read-broadcast.
Let me think about this some more.
fms2_io/netcdf_io.F90
Outdated
@@ -32,6 +32,7 @@ module netcdf_io_mod
use mpp_mod
use fms_io_utils_mod
use platform_mod
use mpi, only: MPI_INFO_NULL
This MPI dependency should be removed. It can be made public through mpp.
I will remove this dependency once FMS provides the needed functionality via some other means.
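A hedged sketch of what that could look like, assuming mpp exposes the MPI info handle under the name MPP_INFO_NULL (the naming mentioned later in this thread), so the direct use of the mpi module can be dropped:

! instead of:  use mpi, only: MPI_INFO_NULL
use mpp_mod, only: MPP_INFO_NULL  ! assumed export from mpp
err = nf90_open(trim(fileobj%path), ior(NF90_NOWRITE, NF90_MPIIO), fileobj%ncid, &
                comm=fileobj%TileComm, info=MPP_INFO_NULL)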
Does this PR still have a path forward? The RRFS development team would love to utilize it within a true FMS release.
@MatthewPyle-NOAA - yes, this still has a path forward. I estimate having something you can test with by Feb 8th.
Matthew,
Activating the new FMS feature in RRFS requires some code changes. I should run some testing in my sandbox before you try it out. For my testing to be relevant, I need an updated version of the RRFS code. I have version 0.6.1. Shun Liu has promised to get me the latest. Can you check with Shun on that request?
Dan
@dkokron I just pinged Shun via e-mail - hopefully one way or another we'll be able to point you at an updated model code by next week.
@dkokron @MatthewPyle-NOAA - see PR #1457
Haven't heard back from Shun.
On Mon, Feb 12, 2024, 12:10 PM MatthewPyle-NOAA wrote: Thanks @bensonr. @dkokron do you still need fresh RRFS code to test with?
Dan, we tested these modifications in the RRFS realtime parallel. With these changes, the run time of the 1-hour forecast is reduced from 1350 s to 950 s. I think these changes should be merged into develop. Thank you so much for your help.
I wouldn't add the changes to RRFS until we get an FMS release with the required modifications. Remember, my FMS build on Dogwood is NOT official.
@dkokron - there is now an alpha tag with the ability to declare a pelist and get the communicator directly, as well as to query a domain to get individual tile communicators or the full domain communicator. We'd like to include your PR in the release, so let us know how much time you think you'll need to merge/implement within the current working branch. Don't forget to include logic-specific unit tests for the MPI-IO capabilities.
As far as I am aware, the only place where I am allowed to do this sort of development is on Acorn. That system is down for software work until at least 1 March.
@bensonr Do you suggest overlaying the parallel IO work on top of 2024.01-alpha4?
@dkokron We discussed this a bit with Rusty; it would be best to put the updates in on the main branch instead.
I will overlay on main.
@bensonr @rem1776 The user application needs some way to tell FMS which MPI communicator is associated with which file. I did that in UFS using the following code. I don't see the equivalent of mpp_set_tile_comm() in the current FMS 'main'. Which FMS function should I be using to achieve the functionality of mpp_set_tile_comm()?
@dkokron - there are two new interfaces you can use to get the current tile commID (mpp_get_domain_tile_commid) and the full domain commID (mpp_get_domain_commid). For a single tile domain, the tile and domain commIDs should be the same. There is also a new optional commID argument to mpp_declare_pelist that will return the commID for any pelist group created.
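A hedged sketch of how the new interface could replace the mpp_get_tile_comm()/mpp_set_tile_comm() calls shown earlier in the thread, assuming mpp_get_domain_tile_commid is a function taking the domain (this is not the exact UFS code):

Phy_restart%use_collective = .true.
! commID for the ranks on this tile of fv_domain
Phy_restart%TileComm = mpp_get_domain_tile_commid(fv_domain)
amiopen = open_file(Phy_restart, trim(fname), 'read', domain=fv_domain, &
                    is_restart=.true., dont_add_res_to_filename=.true.)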
@bensonr I tried using mpp_get_domain_tile_commid() in the above sequence, but it failed at runtime. I'll investigate some more and report back here.
@bensonr Does FMS provide a way to access MPI_COMM_NULL?
I think I accidentally closed this PR when I renamed my fms.ParallelStartup branch to fms.ParallelStartupLegacy. The fms.ParallelStartup branch is now built on top of recent FMS:main. Should I open a new PR?
@dkokron Yeah, you'll need to open a new PR from the new branch. With GitHub it seems there's no way to change the branch you're merging in once the PR is created. And to answer your earlier question,
MPI_INFO_NULL and MPI_COMM_NULL are not the same thing. /opt/cray/pe/mpich/8.1.12/ofi/intel/19.0/include/mpif.h: INTEGER MPI_INFO_NULL
@dkokron - sorry, have been busy all day. It doesn't look like we are pushing out MPI_COMM_NULL in FMS. If you need that, we can implement it as part of your PR in a fashion similar to the way MPI_INFO_NULL is exposed as MPP_INFO_NULL. If you think we need to have a call for anything, let me know and we can set one up for tomorrow early morning or late afternoon.
@dkokron what happened to this PR?
@thomas-robinson It got closed accidentally when I updated the underlying branch to a more recent FMS branch. I will open a new PR once testing of the new code is complete.
New PR at
NetCDF-4, using the HDF5 file layout, has the ability to do parallel I/O in two different modes: the first mode is referred to as “independent” and the second as “collective”. The collective mode has been tested with a few NOAA workloads and shown to provide a substantial improvement in job startup time while reducing the negative impact on the underlying Lustre file system. The collective mode is not currently available in FMS (October 2023).
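A minimal netcdf-fortran sketch of the two modes described above; path, comm, buf, c, and e stand in for the caller's own variables, and the variable name queried is hypothetical:

integer :: err, ncid, varid
! open the file for parallel access across the ranks in comm
err = nf90_open(trim(path), ior(NF90_NOWRITE, NF90_MPIIO), ncid, comm=comm, info=MPI_INFO_NULL)
err = nf90_inq_varid(ncid, 'temperature', varid)
! the access mode is chosen per variable: nf90_independent lets each rank read
! on its own, nf90_collective makes the ranks cooperate on a single request
err = nf90_var_par_access(ncid, varid, nf90_collective)
err = nf90_get_var(ncid, varid, buf, start=c, count=e)  ! each rank passes its own start/count
err = nf90_close(ncid)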
This PR does not address parallel I/O via pNetCDF.
This PR adds an option to enable collective reads. The user controls that option via settings in input.nml. The default behavior is unchanged; the user has to activate collective reads through those input.nml settings.
Fixes #1322
How Has This Been Tested?
I have run an RRFS case on WCOSS2 with and without collective reads activated. The resulting binary restart files are zero-diff.
The compile time environment used to compile FMS was
Currently Loaded Modules:
The compile and runtime environment for RRFS was
Checklist:
The current CMake test for Parallel NetCDF via (nc-config --has-parallel) works in my case. Testing via (nc-config --has-parallel4) is preferred.
make distcheck passes