Contribute torchx changes from our fork back upstream so that we can stop maintaining a custom fork #2

Open
KPostOffice opened this issue Apr 18, 2023 · 1 comment

Description

  1. Has the Ray solution been tested with the other supported ways to run TorchX-Ray? For example, on minikube or a traditional HPC-like system.
  2. The dist.ddp component has to work with all supported schedulers, and I suspect the Ray-on-OCP solution does not support all of them. One option is to make it a standalone custom component (see the sketch after this list): rename it to something like rayocp.ddp, keep it on a separate branch rather than upstreaming it to pytorch, follow the steps for registering custom components, and update any internal docs. Alternatively, test it with every scheduler that currently supports dist.ddp: local, docker, Kubernetes (Volcano on plain Kubernetes), Kubernetes-MCAD (plain Kubernetes or OCP), Slurm, AWS Batch, LSF, and GCP Batch.
  3. Make sure it passes the Ray scheduler test: torchx/torchx/schedulers/test/ray_scheduler_test.py.
  4. After the steps above, make sure it passes torchx/scripts/lint.sh and torchx/scripts/pyre.sh.
  5. In order to contribute to TorchX, you also need a signed CLA in place. For IBM Research, I had to have my GitHub ID added at a corporate-agreement level. I am not sure whether there is a process in place on the Red Hat side or whether you can sign it individually.
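
For the standalone-component route in item 2, a TorchX component is just a Python function that returns a specs.AppDef. The sketch below is a minimal, illustrative example only, assuming the public torchx.specs API; the rayocp.ddp name, image, resources, and launch arguments are placeholders, not what the fork actually ships.

```python
# Hypothetical rayocp/components.py -- illustrative sketch of a standalone
# "rayocp.ddp" component built on the public TorchX component API.
# The image, resources, and torchrun arguments below are assumptions.
import torchx.specs as specs


def ddp(
    *script_args: str,
    script: str = "train.py",
    image: str = "ghcr.io/example/rayocp-trainer:latest",  # placeholder image
    nnodes: int = 1,
    nproc_per_node: int = 1,
    name: str = "rayocp-ddp",
) -> specs.AppDef:
    """Distributed trainer component targeted at Ray on OpenShift."""
    return specs.AppDef(
        name=name,
        roles=[
            specs.Role(
                name="trainer",
                image=image,
                entrypoint="python",
                args=[
                    "-m", "torch.distributed.run",
                    f"--nnodes={nnodes}",
                    f"--nproc_per_node={nproc_per_node}",
                    script,
                    *script_args,
                ],
                num_replicas=nnodes,
                # Placeholder resource request; tune per workload.
                resource=specs.Resource(cpu=4, gpu=0, memMB=8192),
            )
        ],
    )
```

Registering such a component would then follow the upstream custom-components documentation so it resolves by name from the torchx CLI, and items 3 and 4 reduce to running pytest on torchx/schedulers/test/ray_scheduler_test.py plus scripts/lint.sh and scripts/pyre.sh from a torchx checkout.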

Motivation/Background

We don't want to have to maintain a fork of torchx, and it would also be nice if torchx worked on OpenShift by default.

Alternatives

If we can't get the changes into a state acceptable upstream, we will have to continue maintaining them here indefinitely.

Additional context/links

Will update with upstream PR

@MichaelClifford (Collaborator)

First PR submitted: torchx/pull/739

MichaelClifford moved this from In Progress to Done in Project CodeFlare Sprint Board on Aug 28, 2023
anishasthana moved this from Done to In Progress in Project CodeFlare Sprint Board on Aug 28, 2023
KPostOffice moved this from In Progress to Blocked in Project CodeFlare Sprint Board on Oct 30, 2023
KPostOffice moved this from Blocked to In Progress in Project CodeFlare Sprint Board on Nov 8, 2023