[WIP] Added mcad bool for Job submission with no cluster #460

Closed

@Bobbins228 Bobbins228 commented Feb 14, 2024

Issue link

Closes RHOAIENG-1055
linked to pytorch/torchx#822

What changes have been made

Added an mcad bool for Job submission; it defaults to False.
When mcad is False, the job is submitted through the kueue_job torchx scheduler.
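
As a rough illustration of the switch described above, here is a minimal sketch of how a boolean flag could pick the scheduler name. The function select_scheduler is hypothetical and is not taken from this PR's diff; "kueue_job" is the scheduler named above, while "kubernetes_mcad" is an assumed name for the MCAD path.

```python
def select_scheduler(mcad: bool = False) -> str:
    """Return the torchx scheduler name to use for a job submission.

    Hypothetical helper for illustration only; the PR's real code may differ.
    """
    # "kueue_job" comes from the PR description.
    # "kubernetes_mcad" is an assumption about the MCAD scheduler's name.
    return "kubernetes_mcad" if mcad else "kueue_job"


# With the default (mcad=False) the Kueue scheduler is selected.
scheduler = select_scheduler()
```

Defaulting the flag to False means callers get the Kueue path unless they opt in to MCAD explicitly.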

Verification steps

  • Follow the testing steps in this PR but stop after installation.
  • Edit the pyproject.toml file:
    • Replace codeflare-torchx = "0.6.0.dev1" with torchx = {path = "/path/to/your/torchx/dist/torchx-0.7.0.dev0-py3-none-any.whl"}
  • Run poetry build, then pip install --force-reinstall /path/to/your/dist/codeflare_sdk-0.0.0.dev0-py3-none-any.whl
  • Run the jobs demo notebook with this configuration:
from codeflare_sdk.job.jobs import DDPJobDefinition
jobdef = DDPJobDefinition(
    name="mnistjob",
    script="mnist.py",
    # script="mnist_disconnected.py", # training script for disconnected environment
    scheduler_args={"namespace": "default", "local_queue": "YOUR LOCAL QUEUE"}, # ADD "priority_class":"kueue-priority-class-name" for testing priority
    j="1x1",
    gpu=0,
    cpu=1,
    max_retries=3,
    memMB=8000,
    image="quay.io/mcampbel/mnist-image:v0.0",
    mcad=False
)
job = jobdef.submit()
  • From there you can check the job's logs and status, and cancel it, just as with the mcad version.

Note: If you are testing priority, you need a workload priority class, and the workload associated with the job should include the priority you set.
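
For priority testing, the scheduler_args dictionary from the configuration above gains a "priority_class" entry. A minimal sketch, assuming a hypothetical helper build_scheduler_args (not part of codeflare-sdk) and using only the keys shown in this PR:

```python
from typing import Dict, Optional


def build_scheduler_args(
    namespace: str,
    local_queue: str,
    priority_class: Optional[str] = None,
) -> Dict[str, str]:
    """Assemble scheduler_args for a DDPJobDefinition (hypothetical helper)."""
    args = {"namespace": namespace, "local_queue": local_queue}
    # Only include the priority class when one is being tested.
    if priority_class is not None:
        args["priority_class"] = priority_class
    return args


# Placeholder values, as in the verification steps above.
scheduler_args = build_scheduler_args(
    "default", "YOUR LOCAL QUEUE", priority_class="kueue-priority-class-name"
)
```

The resulting dictionary can be passed as scheduler_args in the job definition shown in the verification steps.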

Note 2: This PR is purely for testing the kueue job scheduler. It will be converted to a "real" PR for adding Kueue Job submission once the codeflare-sdk torchx is synced with the upstream torchx repo.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Feb 14, 2024

openshift-ci bot commented Feb 14, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from bobbins228. For more information see the Kubernetes Code Review Process.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
