
TL/MLX5: fix fences in a2a's WQEs #1069

Open
samnordmann wants to merge 3 commits into master

Conversation

samnordmann
Collaborator

@samnordmann samnordmann commented Jan 9, 2025

[Edited]

What

Fix fences in WQEs.

QPs used by each node leader:

  • a QP with loop-back connection, which we call "UMR QP"
  • one QP per remote peer (+1 with loop-back connection), which we call "RDMA QP"

Here are the different WQEs in the algorithm:

  1. UMR WQE posted to the "UMR QP". The CPU then blocks until completion of this WQE.
  2. Transpose WQE + RDMA write WQE + atomic fetch_and_add WQE, all posted to the "RDMA QP".
  3. Wait-on-data WQE posted to the "UMR QP".

Conclusion regarding flags:

  • The UMR WQE doesn't need any fence since it is the first WQE posted.
  • The transpose WQE, which consumes the UMR output, doesn't need any fence: the CPU has already blocked until the UMR WQE completed.
  • The RDMA write WQE, which consumes the transpose output, needs only a small fence because it only uses local data.
  • The atomic fetch_and_add WQE needs a strong fence because it signals completion of the RDMA write to the remote peer.
  • The wait-on-data WQE doesn't need any fence.
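
For illustration, a minimal sketch of the fm_ce_se values this implies per WQE, assuming the MLX5_WQE_CTRL_* flag names from rdma-core's <infiniband/mlx5dv.h> (illustrative only, not the actual TL/MLX5 code):

#include <stdint.h>
#include <infiniband/mlx5dv.h>   /* MLX5_WQE_CTRL_* flag values */

/* Sketch only: fence/completion flags implied by the list above.                      */
static const uint8_t umr_fm_ce_se       = MLX5_WQE_CTRL_CQ_UPDATE;             /* no fence; request a CQE so the CPU can block on completion */
static const uint8_t transpose_fm_ce_se = 0;                                   /* no fence; the UMR WQE has already completed                 */
static const uint8_t write_fm_ce_se     = MLX5_WQE_CTRL_INITIATOR_SMALL_FENCE; /* small fence; consumes the transpose output (local data)     */
static const uint8_t atomic_fm_ce_se    = MLX5_WQE_CTRL_FENCE;                 /* strong fence; signals RDMA write completion to the peer     */
static const uint8_t wait_fm_ce_se      = 0;                                   /* no fence                                                    */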

@raminudelman

raminudelman commented Jan 9, 2025

Regarding:

UMR WQE doesn't need any fence since it is the first WQE posted

Do we have a single UMR per AllToAll operation? If the answer is yes - then I agree, no need for any fence (assuming we also have a barrier and the UMR does not change/modify any Mkey that maps memory that might be accessed by previous operations).

@samnordmann
Collaborator Author

Regarding:

UMR WQE doesn't need any fence since it is the first WQE posted

Do we have a single UMR per AllToAll operation? If the answer is yes - then I agree, no need for any fence (assuming we also have a barrier and the UMR does not change/modify any Mkey that maps memory that might be accessed by previous operations).

We have two UMRs per AllToAll, one for the send key and one for the recv key. Indeed, the src and recv buffers are assumed to be ready when the collective is called, and the UMRs are the first WQEs posted.

struct mlx5dv_qp_ex * mqp = mlx5dv_qp_ex_from_ibv_qp_ex(qp_ex); /* mlx5-specific view of the extended QP */
struct mlx5_wqe_ctrl_seg * ctrl;                                /* generic WQE control segment           */
struct mlx5_wqe_umr_ctrl_seg * umr_ctrl_seg;                    /* UMR-specific control segment          */
uint8_t fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;                     /* CE = 0x2: request a CQE for this WQE  */

It hangs because if you don't set CE to 0x2 (as specified by MLX5_WQE_CTRL_CQ_UPDATE), you won't get a CQE for the WQE you're posting...
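
For illustration, a minimal sketch (plain libibverbs, not the actual UCC code) of the completion-polling that hangs in that case; umr_cq is a hypothetical name for the CQ attached to the "UMR QP":

#include <infiniband/verbs.h>

/* Hypothetical helper: spin until one CQE arrives on the UMR QP's CQ.
 * Without MLX5_WQE_CTRL_CQ_UPDATE (CE = 0x2) on the UMR WQE, no CQE is
 * ever generated, so this loop spins forever and the a2a hangs.        */
static int wait_umr_completion(struct ibv_cq *umr_cq)
{
    struct ibv_wc wc;
    int n;

    do {
        n = ibv_poll_cq(umr_cq, 1, &wc); /* non-blocking poll for one CQE */
    } while (n == 0);

    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        return -1; /* polling error or failed completion */
    return 0;
}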

@samnordmann
Collaborator Author

samnordmann commented Jan 10, 2025


@raminudelman

Sorry, my previous PR description and analysis was misleading. I edited the description and further changed the flags, please see the last commit.

@raminudelman

I'm not sure why:

wait-on-data posted to "UMR QP"

Since IIUC it could also be posted on an "RDMA QP", but I don't think it matters.

Another note:
A clear optimization left on the table here is the fact that the CPU blocks until the UMR WQE is completed. This latency is exposed (from UCC's AllToAll perspective...). I would want to see a breakdown of how much latency we expose here at different system scales. If it's <5% of the AllToAll latency, there is probably no need to try to optimize it, but if it's more, I think it might be worthwhile. I expect it to be more clearly exposed at small scale rather than in large-scale systems.
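
A sketch of how that breakdown could be instrumented (the post/wait step is a placeholder, not the UCC TL/MLX5 API):

#include <time.h>

/* Hypothetical instrumentation around the blocking UMR phase. */
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
/* ... post the two UMR WQEs and block on their CQEs ... */
clock_gettime(CLOCK_MONOTONIC, &t1);
double umr_wait_us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                     (t1.tv_nsec - t0.tv_nsec) / 1e3;
/* Compare umr_wait_us to the end-to-end AllToAll latency at each scale. */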

@raminudelman

Regarding:

atomic fetch_and_add WQE needs a strong fence because it signals completion of RDMA write to the remote peer

IIRC, InfiniBand's ordering semantics guarantee that Atomic operations are executed in order (according to the message/PSN ordering) on the responder side. So, no need to indicate a fence in the atomic operation on the requestor side. @samnordmann, please double check this.
