Has the Ray solution been tested with other supported ways to run TorchX-Ray? For example, perhaps on minikube or a traditional HPC-like system.
The dist.ddp component has to work with all supported schedulers, and I suspect the Ray on OCP solution does not. There are two options:
Option 1: make it a standalone custom component. Rename it to something like rayocp.ddp, move it to a separate branch instead of upstreaming it to pytorch, follow the steps for registering custom components, and update any internal docs.
Option 2: test it with every scheduler that currently supports dist.ddp: local, docker, Kubernetes (Volcano on plain Kubernetes), Kubernetes-MCAD (plain Kubernetes or OCP), Slurm, AWS Batch, LSF, and GCP Batch.
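For the second route, a rough sketch of how the per-scheduler smoke runs could be enumerated is below. The scheduler identifiers in the mapping, the train.py script name, and the 1x2 job shape are assumptions for illustration and should be checked against the installed TorchX version. Each scheduler will also need its own run options (queue, namespace, image, and so on), which this sketch leaves out.

```python
import shlex
from typing import Dict, List

# Assumed mapping from the scheduler names listed above to the identifiers
# passed to `torchx run -s`. Verify these against your TorchX install.
SCHEDULERS: Dict[str, str] = {
    "local": "local_cwd",
    "docker": "local_docker",
    "Kubernetes (Volcano)": "kubernetes",
    "Kubernetes-MCAD": "kubernetes_mcad",
    "Slurm": "slurm",
    "AWS Batch": "aws_batch",
    "LSF": "lsf",
    "GCP Batch": "gcp_batch",
}


def ddp_command(scheduler: str, script: str = "train.py",
                nodes: int = 1, procs: int = 2) -> List[str]:
    """Build one `torchx run` invocation of dist.ddp for a given scheduler."""
    return [
        "torchx", "run", "-s", scheduler,
        "dist.ddp", "-j", f"{nodes}x{procs}", "--script", script,
    ]


def smoke_test_matrix(script: str = "train.py") -> Dict[str, List[str]]:
    """Return one command per scheduler, keyed by the human-readable name."""
    return {label: ddp_command(name, script) for label, name in SCHEDULERS.items()}


if __name__ == "__main__":
    # Print the commands instead of running them, so the list can be reviewed
    # and scheduler-specific options added by hand.
    for label, cmd in smoke_test_matrix().items():
        print(f"{label:24s} {shlex.join(cmd)}")
```

Printing rather than executing keeps the sketch safe to run anywhere; the output is a checklist of commands to adapt per cluster.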
Make sure it passes the Ray scheduler test: torchx/torchx/schedulers/test/ray_scheduler_test.py
After the steps above, make sure it also passes torchx/scripts/lint.sh and torchx/scripts/pyre.sh.
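The test, lint, and type-check steps could be chained with a small helper like the one below, which stops at the first failing check. This is only a sketch: it assumes the paths above resolve from the current working directory, that the test file can be run through pytest, and that the two scripts run under bash. The helper names are hypothetical and not part of TorchX.

```python
import subprocess
from typing import Callable, List, Optional, Sequence, Tuple

Check = Tuple[str, List[str]]

# Paths are copied from the checklist above; the pytest and bash invocations
# are assumptions about how the checks are launched.
CHECKS: List[Check] = [
    ("ray scheduler test",
     ["python", "-m", "pytest", "torchx/torchx/schedulers/test/ray_scheduler_test.py"]),
    ("lint", ["bash", "torchx/scripts/lint.sh"]),
    ("pyre", ["bash", "torchx/scripts/pyre.sh"]),
]


def _run(cmd: List[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(cmd).returncode


def first_failure(checks: Sequence[Check] = CHECKS,
                  run: Callable[[List[str]], int] = _run) -> Optional[str]:
    """Run checks in order; return the name of the first that fails, else None."""
    for name, cmd in checks:
        if run(cmd) != 0:
            return name
    return None


if __name__ == "__main__":
    failed = first_failure()
    print("all checks passed" if failed is None else f"failed: {failed}")
```

The `run` parameter exists so the ordering logic can be exercised without invoking the real tools.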
To contribute to TorchX, you also need a signed CLA in place. For IBM Research, I had to have my GitHub ID added at the corporate agreement level. I am not sure whether Red Hat has a similar process in place or whether you can sign it individually.
Motivation/Background
We don't want to maintain a fork of TorchX, and it would be nice if TorchX worked on OpenShift by default.
Alternatives
If we can't get the changes into a state that upstream will accept, we will have to continue maintaining them here indefinitely.
Additional context/links
Will update with upstream PR