
Legion: collective instance freeze on slingshot-11 #1729

Open
syamajala opened this issue Jul 26, 2024 · 7 comments
@syamajala (Contributor)

I believe using collective instances results in a startup freeze on slingshot-11. I have one commit of S3D that uses them (https://gitlab.com/legion_s3d/legion_s3d/-/commit/e797d71367683580933166a0080a3dbf3f98b978) and freezes at startup, and another commit (https://gitlab.com/legion_s3d/legion_s3d/-/commit/5455c8c03e67c32f2fcbee1120d7a40c37486823) where I specifically backed out those changes and it no longer freezes. We will probably need to investigate this with HPE.

@syamajala syamajala added the S3D label Jul 26, 2024
@elliottslaughter (Contributor)

  1. What scale is required to hit this freeze? (Nodes, ranks, GPUs/rank?)
  2. Can you confirm that the collective instance freeze definitely does not occur on at least one other network (e.g., Infiniband)? Ideally at a similar scale to above. This would help rule out a potential Legion issue.

@syamajala (Contributor, Author)

The subranks branch starts to freeze at 8 nodes. The tdb branch starts to freeze at 4 nodes, 4 ranks/node, 2 GPUs/rank. I can try tdb on sapling. It was working on blaze the last time I tried, but blaze is currently down.
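For reference, a Slurm launch along these lines should match the tdb configuration above. This is only a sketch: the executable name (s3d.x) and the Legion -ll:gpu runtime flag are assumptions here, not copied from the actual run scripts.

    srun -N 4 --ntasks-per-node=4 --gpus-per-task=2 ./s3d.x -ll:gpu 2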

@syamajala (Contributor, Author)

Actually, it looks like blaze is back up. I'll try it there.

@syamajala (Contributor, Author) commented Jul 26, 2024

The network drive on blaze is still down, so I couldn't run there, but I built and ran tdb on sapling. Ran on all 4 nodes, 4 ranks/node, 1 GPU/rank. It started up and ran fine. This is probably a slingshot issue.

@elliottslaughter (Contributor)

Sapling has 4 GPUs per node. Could we run 16 ranks, 4 ranks/node, 1 GPU/rank?

@syamajala (Contributor, Author)

Sorry, I meant 4 ranks/node, not 1 rank/node.

@lightsighter (Contributor)

I'm going to assume this is a Slingshot issue unless we can reproduce it on an Infiniband machine.
