
Review implementation of GA_Sync #31

Open · bjpalmer opened this issue Apr 10, 2017 · 3 comments

@bjpalmer
Member

The current implementation of GA_Sync relies on pnga_pgroup_sync. If the group is not the world group, the call loops over all processes in the group and calls ARMCI_Fence(group_id, iproc). The current implementation of that function in the MPI3 port calls MPI_Win_flush(proc, win) on all windows associated with the world group. This is wrong, but I don't think there is any way to implement the ARMCI_Fence operation using MPI RMA that doesn't require an O(P) data structure.

In any case, I don't think this implementation is the way to go, since what we actually want is to flush all processes in the group. Unfortunately, we don't seem to have anything like ARMCI_FenceGroup. Such an operation could easily be implemented with MPI RMA but may cause problems for the other ports. Since there is a comex_fence_all function, it should be easy to implement this operation for any of the MPI-based ports, but we may run into problems with the existing IB port.
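For reference, here is a minimal sketch (not actual ComEx code) of what a group-scoped fence could look like in the MPI3 port. The per-group window table passed in is hypothetical bookkeeping, and the windows are assumed to be inside a passive-target access epoch (MPI_Win_lock_all):

```c
#include <mpi.h>

/* Hedged sketch of a group-scoped fence for the MPI3 port: complete all
 * outstanding RMA operations on every window associated with the group,
 * instead of on the windows of the world group.  The window table is an
 * assumed data structure, not something the current port keeps. */
void sketch_fence_group(MPI_Win *group_wins, int num_group_wins)
{
    for (int i = 0; i < num_group_wins; i++) {
        /* Completes this process's outstanding RMA to all targets of window i. */
        MPI_Win_flush_all(group_wins[i]);
    }
}
```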

@jeffhammond
Member

@bjpalmer I argue that this is the right way to implement it. It is how we do it in ARMCI-MPI (https://github.com/jeffhammond/armci-mpi/blob/master/src/util.c#L67).

In every MPI implementation that I am familiar with, MPI_Win_flush(_all) is only expensive when there are operations in flight that need to be completed.

In any case, an O(num_windows) implementation is more scalable than an O(num_processes) one, since most GA codes do not use that many windows. I don't know if it is still the case, but in the past GA had a hard-coded limit of 1000 windows, and I don't think NWChem uses more than 100 at a time.
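An illustrative sketch (not the ARMCI-MPI source) of the O(num_windows) approach described here: fencing a single target means flushing that target on each window, rather than walking all P processes. The window table argument is a stand-in for whatever list of windows the port keeps:

```c
#include <mpi.h>

/* Sketch: per-target fence implemented as a loop over windows.
 * Cost scales with the number of windows, not the number of processes. */
void sketch_fence_one_target(int proc, MPI_Win *wins, int num_wins)
{
    for (int i = 0; i < num_wins; i++) {
        /* Cheap when no operations targeting 'proc' are in flight on window i. */
        MPI_Win_flush(proc, wins[i]);
    }
}
```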

@bjpalmer
Member Author

Actually, now that I think about it, it shouldn't be too difficult to implement ARMCI_FenceGroup in the IB port: we just need to loop over all procs in the group and call the existing ARMCI_Fence.
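Something like the following sketch, assuming the ARMCI group API (ARMCI_Group_size, ARMCI_Absolute_id) is available in the IB port to translate group ranks to world ranks:

```c
#include "armci.h"

/* Sketch of the fallback described above for ports without a native group
 * fence: walk the group and fence each member with the existing per-process
 * ARMCI_Fence.  The group-rank-to-world-rank translation via
 * ARMCI_Absolute_id is an assumption about the available group API. */
void sketch_group_fence_fallback(ARMCI_Group *group)
{
    int nproc = 0;
    ARMCI_Group_size(group, &nproc);
    for (int i = 0; i < nproc; i++) {
        int world_rank = ARMCI_Absolute_id(group, i); /* group rank -> world rank */
        ARMCI_Fence(world_rank);
    }
}
```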

@bjpalmer
Member Author

I fooled around with this and got a version of ARMCI_GroupFence working in the ComEx branch. It didn't have any effect on the failures of perf2 etc. when running under the MPI3 port and using data types. After the call to ARMCI_Fence or ARMCI_GroupFence, the sync functions call pnga_msg_pgroup_sync, which is basically a wrapper for armci_msg_group_barrier. The ComEx implementation of that function then calls comex_barrier followed immediately by MPI_Barrier. The implementation of comex_barrier in src-mpi3 consists of a call to comex_fence_all followed by another call to MPI_Barrier, and comex_fence_all in turn calls MPI_Win_flush_all. So adding the ARMCI_GroupFence operation had no real effect on ga_sync or ga_pgroup_sync, since they were effectively calling MPI_Win_flush_all anyway. As near as I can figure, each call to sync leads to approximately the following sequence of MPI calls:

```
MPI_Win_flush_all
MPI_Win_flush_all
MPI_Barrier
MPI_Barrier
```

The implementation for the MPI3 port follows some of the other ComEx implementations, so there is likely a significant amount of redundancy in sync in some of the other ComEx ports as well. Apart from the cruft, we do not have a clean distinction between fence and barrier in our code. We should try to clean this up.
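As a rough illustration of where that cleanup could land, here is a hedged sketch of a de-duplicated sync, assuming the group's windows and communicator are available to the port: one completion step plus one barrier, instead of the doubled flush/barrier sequence shown above.

```c
#include <mpi.h>

/* Sketch only: fence once over the group's windows, then barrier once on the
 * group's communicator.  Both arguments are assumed bookkeeping, not
 * structures the current ports necessarily expose. */
void sketch_minimal_sync(MPI_Win *group_wins, int num_group_wins, MPI_Comm group_comm)
{
    for (int i = 0; i < num_group_wins; i++)
        MPI_Win_flush_all(group_wins[i]);  /* fence: complete outstanding RMA */
    MPI_Barrier(group_comm);               /* barrier: synchronize the group  */
}
```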
