ch3: use group to build communicator vc tables #7242

Open
wants to merge 33 commits into
base: main

hzhou
Contributor

@hzhou hzhou commented Dec 19, 2024

Pull Request Description

Based on #7235, #7237

Now that the local_group and remote_group in MPIR_Comm can fully replace the functions of the mapper, refactor ch3 to use groups instead of the mapper in MPIDI_CH3I_Comm_commit_pre_hook.

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your companies PR approval manager.

@hzhou hzhou force-pushed the 2412_ch3_vcrt branch 6 times, most recently from 1ab1165 to 9cc9027 Compare December 20, 2024 15:58
@hzhou hzhou force-pushed the 2412_ch3_vcrt branch 5 times, most recently from 706bc6a to 033a063 Compare December 21, 2024 14:07
@hzhou hzhou changed the title ch3: refactor to remove the usage of comm mapper ch3: use group to build communicator vc tables Dec 21, 2024
@hzhou hzhou marked this pull request as ready for review December 21, 2024 15:51
@hzhou hzhou force-pushed the 2412_ch3_vcrt branch 4 times, most recently from 3a44553 to 27cdee9 Compare December 22, 2024 17:29
@hzhou
Contributor Author

hzhou commented Dec 22, 2024

test:mpich/ch3/most
test:mpich/ch4/most
All ✔️

hzhou added 10 commits December 22, 2024 15:43
Miscellaneous typo fixes to appease the spellchecker.
This test requires access to MPICH internals and thus won't be used
with the current design.
We no longer use this file.
Hide the internal fields of MPIR_Group from unnecessary access.

Outside group_util.c and group_impl.c, code only needs to assume the
MPIR_Lpid integer type, the creation routines based on an lpid map or an
lpid stride description, and the access routine to look up an lpid from
a group rank.
For most external usages, we only need MPIR_Group_rank_to_lpid.
Avoid accessing group internal fields.
Group similar functions together to facilitate refactoring.
There are no changes in this commit other than moving functions around.

The 4 incl/excl functions are very similar.

The 3 difference/intersection/union functions are very similar.
Use MPIR_Group_{rank_to_lpid,lpid_to_rank} to avoid directly accessing
MPIR_Group internal fields.

For most group creation routines, just populate an lpid lookup map and
call MPIR_Group_create_map to create the group.
* add option to use stride to describe group composition
* remove the linked list design
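
As a rough illustration of the map and stride descriptions mentioned above, the following is a minimal sketch, not MPICH's actual implementation. All `sketch_` names, the struct layout, and the use of `int64_t` for the lpid type are assumptions made for the example; only the idea of a rank-to-lpid lookup backed by either an explicit map or an (offset, stride) pair comes from the commit messages.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef int64_t sketch_lpid;            /* hypothetical stand-in for MPIR_Lpid */
#define SKETCH_LPID_INVALID (-1)

struct sketch_group {
    int size;
    int is_stride;                      /* 1: (offset, stride); 0: explicit map */
    sketch_lpid offset;
    sketch_lpid stride;
    sketch_lpid *map;                   /* owned by the group when is_stride == 0 */
};

/* Create a group from an explicit lpid map; the group takes ownership of map. */
static struct sketch_group *sketch_group_create_map(int size, sketch_lpid *map)
{
    struct sketch_group *g = calloc(1, sizeof(*g));
    if (!g)
        return NULL;
    g->size = size;
    g->map = map;
    return g;
}

/* Create a group whose rank r maps to lpid offset + r * stride. */
static struct sketch_group *sketch_group_create_stride(int size, sketch_lpid offset,
                                                       sketch_lpid stride)
{
    assert(stride > 0);                 /* the sketch only handles positive strides */
    struct sketch_group *g = calloc(1, sizeof(*g));
    if (!g)
        return NULL;
    g->size = size;
    g->is_stride = 1;
    g->offset = offset;
    g->stride = stride;
    return g;
}

static sketch_lpid sketch_group_rank_to_lpid(const struct sketch_group *g, int rank)
{
    if (rank < 0 || rank >= g->size)
        return SKETCH_LPID_INVALID;
    return g->is_stride ? g->offset + (sketch_lpid) rank * g->stride : g->map[rank];
}

/* Returns the group rank of lpid, or -1 if the process is not in the group. */
static int sketch_group_lpid_to_rank(const struct sketch_group *g, sketch_lpid lpid)
{
    if (g->is_stride) {
        sketch_lpid d = lpid - g->offset;
        if (d < 0 || d % g->stride != 0 || d / g->stride >= g->size)
            return -1;
        return (int) (d / g->stride);
    }
    for (int i = 0; i < g->size; i++) {
        if (g->map[i] == lpid)
            return i;
    }
    return -1;
}

static void sketch_group_free(struct sketch_group *g)
{
    free(g->map);                       /* NULL for stride groups, which is fine */
    free(g);
}
```

With this shape, a stride group needs no per-rank storage, while callers outside the group code only ever go through the two lookup functions.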
hzhou added 23 commits December 22, 2024 15:43
This is the same as MPID_Comm_get_lpid.

NOTE: we will remove MPID_Comm_get_lpid as well once we move the
ownership of lpid to the MPIR-layer.
There is no real difference between lpid and gpid. Thus rename gpid in
the device layer to lpid for clarification.

Replace the usage of uint64_t as the type of lpid with MPIR_Lpid. This
improves consistency.
We need a device-independent way of identifying processes. One way is to
use the combination of (world_idx, world_rank). Thus, we need to maintain
a list of worlds so that the world_idx points to the world record.

This may not fit the concept of an MPI group, but since groups need a
way to identify processes, it seems the most closely related place.

The first world, world_idx 0, is always initialized at init.

Due to session re-init, we need to make sure to reset num_worlds to 0 at
finalize.

New worlds will be added upon spawning or connecting dynamic processes
(to-be-implemented).
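
The world list lifecycle described above could look roughly like the sketch below. It is an illustration under assumptions, not the code in this PR: the `sketch_world` record, its fields, the fixed capacity, and all function names are hypothetical. Only `num_worlds`, `world_idx`, and the init/finalize behavior come from the commit message.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_MAX_WORLDS 8             /* arbitrary capacity for the sketch */

struct sketch_world {
    char name[32];                      /* hypothetical identifier for the world */
    int num_procs;
};

static struct sketch_world sketch_worlds[SKETCH_MAX_WORLDS];
static int sketch_num_worlds = 0;

/* Append a world record and return its world_idx, or -1 if the table is full. */
static int sketch_add_world(const char *name, int num_procs)
{
    if (sketch_num_worlds >= SKETCH_MAX_WORLDS)
        return -1;
    struct sketch_world *w = &sketch_worlds[sketch_num_worlds];
    strncpy(w->name, name, sizeof(w->name) - 1);
    w->name[sizeof(w->name) - 1] = '\0';
    w->num_procs = num_procs;
    return sketch_num_worlds++;
}

/* The first world, world_idx 0, is always registered at init. */
static void sketch_init(int world_size)
{
    int idx = sketch_add_world("world-0", world_size);
    assert(idx == 0);
    (void) idx;
}

/* Reset the count so that a later re-init starts again from world_idx 0. */
static void sketch_finalize(void)
{
    sketch_num_worlds = 0;
}
```

Without the reset in the finalize step, a second init would register the initial world at a nonzero index, which is the session re-init problem the commit message refers to.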
We need to reset num_worlds so that Session re-init will work.
Add builtin MPIR_GROUP_WORLD and MPIR_GROUP_SELF, so we can create
builtin communicators from builtin groups.
Internally the only reason to duplicate a group is to copy from NULL session
to a new session.  Otherwise, we can just use the same group and increment the
reference count.
Since builtin groups can be returned to users, they should be allowed
to be freed. They are reference counted anyway.
To make MPI group a first-class citizen, we will always have a group
before creating communicators, so that when the device layer activates
communicators, e.g. in MPID_Comm_commit_pre_hook, it can rely on the
group to look up the involved processes. It also removes the necessity
to maintain any other process addressing schemes.
In many places we just return MPIR_Group_empty without incrementing the
ref_count. This is fixable. But for now, let's avoid freeing it.
The init_comm does the release manually.
Add assertions to make sure the local_group and remote_group (for
inter communicators) are always set before MPID_Comm_commit_pre_hook.
Otherwise, the MPI_T functions may not be able to convert
builtin datatypes.
When we run tests as functions, stray output in MPI_Finalize, such
as the debug messages in debug builds, was not captured previously. This
patch makes sure we report such stray output as failures.
Now that we always have a group inside a communicator, we can simply
return the lpid from the group.

Because this will be used in the hot path, make it inline.
Add the following macros:
    MPIR_LPID_WORLD_INDEX
    MPIR_LPID_WORLD_RANK
    MPIR_LPID_FROM
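
The commit message does not state how these macros lay out the bits. The sketch below assumes a 64-bit lpid with the world index in the upper 32 bits and the world rank in the lower 32 bits; that split, the `SKETCH_` names, and the helper function are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t sketch_lpid;            /* hypothetical stand-in for MPIR_Lpid */

#define SKETCH_WORLD_RANK_BITS 32       /* assumed split, not taken from MPICH */

/* Pack (world_idx, world_rank) into one lpid; both must be non-negative. */
#define SKETCH_LPID_FROM(world_idx, world_rank) \
    (((sketch_lpid) (world_idx) << SKETCH_WORLD_RANK_BITS) | (sketch_lpid) (world_rank))

/* Recover the two components from a packed lpid. */
#define SKETCH_LPID_WORLD_INDEX(lpid) ((int) ((lpid) >> SKETCH_WORLD_RANK_BITS))
#define SKETCH_LPID_WORLD_RANK(lpid)  ((int) ((lpid) & 0xffffffffLL))

/* Checked wrapper for callers that cannot guarantee non-negative inputs. */
static inline sketch_lpid sketch_lpid_make(int world_idx, int world_rank)
{
    assert(world_idx >= 0 && world_rank >= 0);
    return SKETCH_LPID_FROM(world_idx, world_rank);
}
```

Under this assumed layout, processes in world 0 have an lpid equal to their world rank, so code that only ever sees the initial world is unaffected by the packing.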
Fix a typo in setting the size of MPIR_GROUP_SELF.

Add ref_count if we return MPIR_GROUP_EMPTY to prevent freeing the
builtin when it is released internally. Unfortunately, since users can
directly use MPI_GROUP_EMPTY, we can't keep ref_count accurate. But at
least we can keep it positive to prevent an actual free.
The builtin groups are in session NULL. We need to duplicate the groups
in MPIR_Group_from_session_pset_impl to return a group in the correct
session.
Groups are a natural place to host vcrt (virtual connection reference
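
The reference-count handling described above can be pictured with the following minimal sketch. It is not the MPICH code: the struct, the `sketch_` names, and the counter that stands in for an actual free are all hypothetical.

```c
#include <assert.h>

struct sketch_group {
    int ref_count;
    int size;
};

/* The builtin empty group starts with one reference held by the library. */
static struct sketch_group sketch_group_empty = { 1, 0 };

/* Counts how often a release would have freed a group (stand-in for free()). */
static int sketch_free_calls = 0;

/* Hand out the builtin empty group, adding a reference for the new holder. */
static struct sketch_group *sketch_return_empty(void)
{
    sketch_group_empty.ref_count++;
    return &sketch_group_empty;
}

/* Drop one reference; the group would be freed when the count reaches zero. */
static void sketch_group_release(struct sketch_group *g)
{
    assert(g->ref_count > 0);
    if (--g->ref_count == 0)
        sketch_free_calls++;
}
```

Because every internal return adds a reference, the matching internal release brings the count back to its initial value of 1 rather than to 0, so the builtin is never freed by this path.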
table). When communicators are duplicated, groups are simply inherited
and reference counted. Thus we won't end up with duplication of vcrt.
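
To illustrate why hosting the table in the group avoids duplication, here is a small sketch under assumptions: the structs, the lazy construction in `sketch_group_get_vcrt`, and every `sketch_` name are hypothetical and only mirror the idea that a duplicated communicator inherits the same reference-counted group.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_vcrt {                    /* stand-in for a vc reference table */
    int size;
    int *vc_ids;
};

struct sketch_group {
    int ref_count;
    int size;
    struct sketch_vcrt *vcrt;           /* built once, owned by the group */
};

struct sketch_comm {
    struct sketch_group *local_group;
};

static struct sketch_group *sketch_group_create(int size)
{
    struct sketch_group *g = calloc(1, sizeof(*g));
    if (!g)
        return NULL;
    g->ref_count = 1;
    g->size = size;
    return g;
}

/* Return the group's table, building it on first use (an assumption here). */
static struct sketch_vcrt *sketch_group_get_vcrt(struct sketch_group *g)
{
    if (!g->vcrt) {
        struct sketch_vcrt *t = malloc(sizeof(*t));
        if (!t)
            return NULL;
        t->size = g->size;
        t->vc_ids = calloc(g->size > 0 ? g->size : 1, sizeof(int));
        if (!t->vc_ids) {
            free(t);
            return NULL;
        }
        for (int i = 0; i < g->size; i++)
            t->vc_ids[i] = i;           /* placeholder connection ids */
        g->vcrt = t;
    }
    return g->vcrt;
}

/* Drop one reference; the table goes away together with the last holder. */
static void sketch_group_release(struct sketch_group *g)
{
    if (--g->ref_count > 0)
        return;
    if (g->vcrt) {
        free(g->vcrt->vc_ids);
        free(g->vcrt);
    }
    free(g);
}

/* Duplicating a communicator inherits the group by reference, table included. */
static void sketch_comm_dup(const struct sketch_comm *in, struct sketch_comm *out)
{
    in->local_group->ref_count++;
    out->local_group = in->local_group;
}
```

Both communicators then see the same table pointer, and the table is released exactly once, when the last reference to the group is dropped.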
Because the tmp_comm uses a temporary vc that doesn't belong to any pg,
it is incompatible with the new comm init process (that relies on lpid
lookup to construct vcrt tables).

It turns out we only need tmp_comm to perform basic send/recv
(MPIC_Sendrecv) and we don't need most of the facilities of a normal
communicator. Shortcutting the tmp_comm construction and destruction
greatly simplifies the code.
Replace the usage of mapper with comm->local_group and
comm->remote_group in MPIDI_CH3I_Comm_commit_pre_hook.
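
A group-driven table build could be sketched as follows. This is not the body of MPIDI_CH3I_Comm_commit_pre_hook: the flat process table, the `sketch_lpid_to_vc` lookup, and all types are hypothetical, and the sketch only shows the pattern of walking a group's ranks and resolving each lpid to a connection. For an intercommunicator the same routine would be applied to the remote group as well.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef int64_t sketch_lpid;            /* hypothetical stand-in for MPIR_Lpid */

struct sketch_group {
    int size;
    const sketch_lpid *lpids;           /* lpid of each group rank */
};

struct sketch_vc {
    sketch_lpid lpid;                   /* process this connection leads to */
};

/* Hypothetical process table: one vc per known process, indexed by lpid. */
#define SKETCH_NUM_PROCS 8
static struct sketch_vc sketch_all_vcs[SKETCH_NUM_PROCS];

/* Resolve an lpid to its vc, or NULL if the process is unknown. */
static struct sketch_vc *sketch_lpid_to_vc(sketch_lpid lpid)
{
    if (lpid < 0 || lpid >= SKETCH_NUM_PROCS)
        return NULL;
    sketch_all_vcs[lpid].lpid = lpid;
    return &sketch_all_vcs[lpid];
}

/* Build a rank-indexed vc table for one group; returns NULL on failure. */
static struct sketch_vc **sketch_build_vc_table(const struct sketch_group *g)
{
    assert(g->size >= 0);
    struct sketch_vc **table = malloc((g->size > 0 ? g->size : 1) * sizeof(*table));
    if (!table)
        return NULL;
    for (int rank = 0; rank < g->size; rank++) {
        table[rank] = sketch_lpid_to_vc(g->lpids[rank]);
        if (!table[rank]) {
            free(table);
            return NULL;
        }
    }
    return table;
}
```

The caller owns the returned table and frees it with `free`; the connections themselves stay in the process table.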
The only logic for whether to release a vc is whether this vc is for a
dynamic process. It has nothing to do with whether
MPI_Comm_disconnect is called. The semantics of MPI_Comm_disconnect are
just to wait for all communication to complete. It is orthogonal to how
the comm is destroyed.
In MPIR_Comm_create_inter, we know whether the remote group is empty
after the exchange, so it is unnecessary to create and commit the
intercomm and then delete it later. Simply don't create it in the first
place.

The device layer is not necessarily equipped to handle intercomm commit
with empty groups.
@hzhou hzhou requested a review from yfguo January 2, 2025 19:59