
Review implementation of GA_Sync #31

Open · bjpalmer opened this issue Apr 10, 2017 · 3 comments

@bjpalmer
Member

The current implementation of GA_Sync relies on pnga_pgroup_sync. If the group is not the world group, the call loops over all processes in the group and calls ARMCI_Fence(group_id, iproc). The current implementation of that function in the MPI3 port calls MPI_Win_flush(proc, win) on all windows associated with the world group. This is wrong, but I don't think there is any way to implement the ARMCI_Fence operation using MPI RMA that doesn't require an O(P) data structure.

In any case, I don't think this implementation is the way to go, since what we actually want is to flush all processes in the group. Unfortunately, we don't seem to have anything like ARMCI_FenceGroup. Such an operation could easily be implemented with MPI RMA but may cause problems for the other ports. Since there is a comex_fence_all function, it should be easy to implement this operation for any of the MPI-based ports, but we may run into problems with the existing IB port.
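For reference, here is a minimal sketch (not actual ComEx code) of what a group-scoped fence could look like in the MPI3 port. The per-group window table passed in is hypothetical bookkeeping, and the windows are assumed to be inside a passive-target access epoch (MPI_Win_lock_all):

```c
#include <mpi.h>

/* Hedged sketch of a group-scoped fence for the MPI3 port: complete all
 * outstanding RMA operations on every window associated with the group,
 * instead of on the windows of the world group.  The window table is an
 * assumed data structure, not something the current port keeps. */
void sketch_fence_group(MPI_Win *group_wins, int num_group_wins)
{
    for (int i = 0; i < num_group_wins; i++) {
        /* Completes this process's outstanding RMA to all targets of window i. */
        MPI_Win_flush_all(group_wins[i]);
    }
}
```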

@jeffhammond
Member

@bjpalmer I argue that this is the right way to implement it. It is how we do it in ARMCI-MPI (https://github.com/jeffhammond/armci-mpi/blob/master/src/util.c#L67).

In every MPI implementation that I am familiar with, MPI_Win_flush(_all) is only expensive when there are operations in flight that need to be completed.

In any case, an O(num_windows) implementation is more scalable than an O(num_processes) one, since most GA codes do not use that many windows. I don't know if it is still the case, but in the past GA had a hard-coded limit of 1000 windows, and I don't think NWChem uses more than 100 at a time.
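An illustrative sketch (not the ARMCI-MPI source) of the O(num_windows) approach described here: fencing a single target means flushing that target on each window, rather than walking all P processes. The window table argument is a stand-in for whatever list of windows the port keeps:

```c
#include <mpi.h>

/* Sketch: per-target fence implemented as a loop over windows.
 * Cost scales with the number of windows, not the number of processes. */
void sketch_fence_one_target(int proc, MPI_Win *wins, int num_wins)
{
    for (int i = 0; i < num_wins; i++) {
        /* Cheap when no operations targeting 'proc' are in flight on window i. */
        MPI_Win_flush(proc, wins[i]);
    }
}
```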

@bjpalmer
Member Author

Actually, now that I think about it, it shouldn't be too difficult to implement ARMCI_FenceGroup in the IB port: we just need to loop over all procs in the group and call the existing ARMCI_Fence.
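Something like the following sketch, assuming the ARMCI group API (ARMCI_Group_size, ARMCI_Absolute_id) is available in the IB port to translate group ranks to world ranks:

```c
#include "armci.h"

/* Sketch of the fallback described above for ports without a native group
 * fence: walk the group and fence each member with the existing per-process
 * ARMCI_Fence.  The group-rank-to-world-rank translation via
 * ARMCI_Absolute_id is an assumption about the available group API. */
void sketch_group_fence_fallback(ARMCI_Group *group)
{
    int nproc = 0;
    ARMCI_Group_size(group, &nproc);
    for (int i = 0; i < nproc; i++) {
        int world_rank = ARMCI_Absolute_id(group, i); /* group rank -> world rank */
        ARMCI_Fence(world_rank);
    }
}
```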

@bjpalmer
Member Author

I fooled around with this and got a version of ARMCI_GroupFence working in the ComEx branch. It didn't have any effect on the failures of perf2 etc. when running under the MPI3 port and using data types. After the call to ARMCI_Fence or ARMCI_GroupFence, the sync functions call pnga_msg_pgroup_sync, which is basically a wrapper for armci_msg_group_barrier. The ComEx implementation of that function then calls comex_barrier followed immediately by MPI_Barrier. The implementation of comex_barrier in src-mpi3 consists of a call to comex_fence_all followed by another call to MPI_Barrier, and comex_fence_all in turn calls MPI_Win_flush_all. So adding the ARMCI_GroupFence operation had no real effect on ga_sync or ga_pgroup_sync, since they were effectively calling MPI_Win_flush_all anyway. As near as I can figure, each call to sync leads to approximately the following sequence of MPI calls:

```
MPI_Win_flush_all
MPI_Win_flush_all
MPI_Barrier
MPI_Barrier
```

The implementation for the MPI3 port follows some of the other ComEx implementations, so there is likely a significant amount of redundancy in sync in some of the other ComEx ports as well. Apart from the cruft, we do not have a clean distinction between fence and barrier in our code. We should try to clean this up.
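As a rough illustration of where that cleanup could land, here is a hedged sketch of a de-duplicated sync, assuming the group's windows and communicator are available to the port: one completion step plus one barrier, instead of the doubled flush/barrier sequence shown above.

```c
#include <mpi.h>

/* Sketch only: fence once over the group's windows, then barrier once on the
 * group's communicator.  Both arguments are assumed bookkeeping, not
 * structures the current ports necessarily expose. */
void sketch_minimal_sync(MPI_Win *group_wins, int num_group_wins, MPI_Comm group_comm)
{
    for (int i = 0; i < num_group_wins; i++)
        MPI_Win_flush_all(group_wins[i]);  /* fence: complete outstanding RMA */
    MPI_Barrier(group_comm);               /* barrier: synchronize the group  */
}
```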
