Contribute torchx changes from our fork back upstream so that we can stop maintaining a custom fork #2

Open
KPostOffice opened this issue Apr 18, 2023 · 1 comment

Description

  1. Has the Ray solution been tested with the other supported ways to run TorchX-Ray? For example, on minikube or a traditional HPC-like system.
  2. The dist.ddp component has to work with all supported schedulers, and I suspect the Ray-on-OCP solution does not support all of them. One option is to make it a standalone custom component (see the sketch after this list): rename it to something like rayocp.ddp, keep it on a separate branch rather than upstreaming it to pytorch, follow the steps for registering custom components, and update any internal docs. Alternatively, test it with every scheduler that currently supports dist.ddp: local, docker, Kubernetes (Volcano on plain Kubernetes), Kubernetes-MCAD (plain Kubernetes or OCP), Slurm, AWS Batch, LSF, and GCP Batch.
  3. Make sure it passes the Ray scheduler test: torchx/torchx/schedulers/test/ray_scheduler_test.py.
  4. After the steps above, make sure it passes torchx/scripts/lint.sh and torchx/scripts/pyre.sh.
  5. In order to contribute to TorchX, you also need a signed CLA in place. For IBM Research, I had to have my GitHub ID added at a corporate-agreement level. I am not sure whether there is a process in place on the Red Hat side or whether you can sign it individually.
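
For the standalone-component route in item 2, a TorchX component is just a Python function that returns a specs.AppDef. The sketch below is a minimal, illustrative example only, assuming the public torchx.specs API; the rayocp.ddp name, image, resources, and launch arguments are placeholders, not what the fork actually ships.

```python
# Hypothetical rayocp/components.py -- illustrative sketch of a standalone
# "rayocp.ddp" component built on the public TorchX component API.
# The image, resources, and torchrun arguments below are assumptions.
import torchx.specs as specs


def ddp(
    *script_args: str,
    script: str = "train.py",
    image: str = "ghcr.io/example/rayocp-trainer:latest",  # placeholder image
    nnodes: int = 1,
    nproc_per_node: int = 1,
    name: str = "rayocp-ddp",
) -> specs.AppDef:
    """Distributed trainer component targeted at Ray on OpenShift."""
    return specs.AppDef(
        name=name,
        roles=[
            specs.Role(
                name="trainer",
                image=image,
                entrypoint="python",
                args=[
                    "-m", "torch.distributed.run",
                    f"--nnodes={nnodes}",
                    f"--nproc_per_node={nproc_per_node}",
                    script,
                    *script_args,
                ],
                num_replicas=nnodes,
                # Placeholder resource request; tune per workload.
                resource=specs.Resource(cpu=4, gpu=0, memMB=8192),
            )
        ],
    )
```

Registering such a component would then follow the upstream custom-components documentation so it resolves by name from the torchx CLI, and items 3 and 4 reduce to running pytest on torchx/schedulers/test/ray_scheduler_test.py plus scripts/lint.sh and scripts/pyre.sh from a torchx checkout.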

Motivation/Background

We don't want to have to maintain a fork of torchx, and it would also be nice if torchx worked on OpenShift by default.

Alternatives

If we can't get the changes into a state acceptable upstream, we will have to continue maintaining them here indefinitely.

Additional context/links

Will update with upstream PR

@MichaelClifford (Collaborator)

First PR submitted: torchx/pull/739

MichaelClifford moved this from In Progress to Done in Project CodeFlare Sprint Board on Aug 28, 2023
anishasthana moved this from Done to In Progress in Project CodeFlare Sprint Board on Aug 28, 2023
KPostOffice moved this from In Progress to Blocked in Project CodeFlare Sprint Board on Oct 30, 2023
KPostOffice moved this from Blocked to In Progress in Project CodeFlare Sprint Board on Nov 8, 2023