
Legion: collective instance freeze on slingshot-11 #1729

Open
syamajala opened this issue Jul 26, 2024 · 7 comments
@syamajala (Contributor)

I believe using collective instances results in a startup freeze on slingshot-11. I have one commit of S3D that uses them (https://gitlab.com/legion_s3d/legion_s3d/-/commit/e797d71367683580933166a0080a3dbf3f98b978) and freezes at startup, and another commit (https://gitlab.com/legion_s3d/legion_s3d/-/commit/5455c8c03e67c32f2fcbee1120d7a40c37486823) where I specifically backed out those changes and it no longer freezes. We will probably need to investigate this with HPE.

@syamajala syamajala added the S3D label Jul 26, 2024
@elliottslaughter (Contributor)

  1. What scale is required to hit this freeze? (Nodes, ranks, GPUs/rank?)
  2. Can you confirm that the collective instance freeze definitely does not occur on at least one other network (e.g., Infiniband)? Ideally at a similar scale to above. This would help rule out a potential Legion issue.

@syamajala (Contributor, Author)

The subranks branch starts to freeze at 8 nodes. The tdb branch starts to freeze at 4 nodes, 4 ranks/node, 2 GPUs/rank. I can try tdb on sapling. It was working on blaze the last time I tried, but blaze is currently down.
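For reference, a Slurm launch along these lines should match the tdb configuration above. This is only a sketch: the executable name (s3d.x) and the Legion -ll:gpu runtime flag are assumptions here, not copied from the actual run scripts.

    srun -N 4 --ntasks-per-node=4 --gpus-per-task=2 ./s3d.x -ll:gpu 2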

@syamajala (Contributor, Author)

Actually, it looks like blaze is back up. I'll try it there.

@syamajala (Contributor, Author) commented Jul 26, 2024

The network drive on blaze is still down, so I couldn't run there, but I built and ran tdb on sapling. Ran on all 4 nodes, 4 ranks/node, 1 GPU/rank. It started up and ran fine. This is probably a slingshot issue.

@elliottslaughter (Contributor)

Sapling has 4 GPUs per node. Could we run 16 ranks, 4 ranks/node, 1 GPU/rank?

@syamajala (Contributor, Author)

Sorry, I meant 4 ranks/node, not 1 rank/node.

@lightsighter (Contributor)

I'm going to assume this is a Slingshot issue unless we can reproduce it on an Infiniband machine.
