Added Kueue Job Scheduler #822
Hi @Bobbins228! Thank you for your pull request and welcome to our community.
Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
What does Kueue do that Volcano doesn't? Is there a reason to support more than one batch scheduler on Kubernetes?
Hi @ccharest93, |
One quick comment
Exciting to see this -- does this also work with the native K8s batch API?
e2de63a to d6ccf07
Looks like there are also valid pyre/lint issues
b43498f to 559c19a
@Bobbins228 there's been a bunch of cleanup on tests in on the |
@Bobbins228 did you try multi-node ddp with it? I think you'd need a JobSet for that which sets up a headless k8s service for workers to reach the master. See example here https://github.com/kubernetes-sigs/jobset/blob/main/examples/pytorch/resnet-cifar10/resnet.yaml Another way to approach this is perhaps to not be tightly coupled with a specific operator but rather just output a k8s standard Job (or even Pod) spec with the container image and allow it to be piped into other tools that can transform it into whatever that's suitable. But AFAICT something like a |
559c19a to 8e79a51
@xujyan @Bobbins228 looks like the ddp job on Kueue is having issues |
I would expect that. Like I suggested above you'd need to produce a
where the value is of the format |
And you'd need to install the JobSet controller in the test env too; see the README
Added a Kubernetes Kueue batch job scheduler based on the Kubernetes scheduler.
Note: the variable local_kueue="local-kueue-name" is required in the scheduler args for the queue-name label. For priority, add kueue_priority_class="kueue-priority-class-name" to the scheduler args.
Manual Testing
Set up a Kubernetes/OpenShift cluster
python3 setup.py bdist_wheel
pip install dist/torchx-0.7.0.dev0-py3-none-any.whl
torchx run --scheduler kueue_job --scheduler_args namespace=default,local_kueue="default-kueue",image_repo="user/alpine" utils.echo --image alpine:latest --msg hello
- should return something like kueue_job://torchx_user/1234
torchx status kueue_job://torchx_user/1234
- Suspended/JobResumed status
"annotations": {"key":"value"}
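For context on the status values mentioned above: a Kubernetes batch/v1 Job reports suspension through a condition of type `Suspended`, and a Job that has been resumed carries that condition with status `False` and reason `JobResumed`. Below is a minimal sketch of how such a condition list could be read. `describe_suspension` is a hypothetical helper written for illustration, not code from this PR, and the sample conditions in the usage lines are made up.

```python
from typing import Dict, List


def describe_suspension(conditions: List[Dict[str, str]]) -> str:
    """Summarize the `Suspended` condition from a batch/v1 Job status.

    Hypothetical helper for illustration only; the scheduler in this PR
    may map Job conditions to states differently.
    """
    for cond in conditions:
        if cond.get("type") != "Suspended":
            continue  # ignore unrelated conditions such as Complete/Failed
        if cond.get("status") == "True":
            return "Suspended"
        # status "False" means the Job was suspended earlier and has since
        # been resumed; Kubernetes sets the reason to "JobResumed".
        return cond.get("reason", "JobResumed")
    return "Unknown"


# Illustrative inputs (made up, not captured from a real cluster).
waiting = [{"type": "Suspended", "status": "True", "reason": "JobSuspended"}]
admitted = [{"type": "Suspended", "status": "False", "reason": "JobResumed"}]
print(describe_suspension(waiting))
print(describe_suspension(admitted))
```

A Job that is still waiting in the queue would be expected to show the first form, and an admitted Job the second.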
Integration test
bash setup_minikube_kueue.sh
python scripts/kueue_test.py --container_repo localhost:5000/torchx
python scripts/kueue_test.py --container_repo localhost:5000/torchx --dryrun
Test plan:
Created unit tests based on the Kubernetes scheduler unit tests.
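As a rough illustration of the scheduler args described in the note in this PR description, the sketch below shows how `local_kueue` and `kueue_priority_class` could be turned into Job labels. The label keys `kueue.x-k8s.io/queue-name` and `kueue.x-k8s.io/priority-class` are the ones documented by Kueue; that this PR uses exactly these keys, and that the Job is created with `suspend: true`, are assumptions. The function names `kueue_labels` and `job_manifest` are hypothetical and do not come from the PR.

```python
from typing import Any, Dict, Optional

# Label keys documented by Kueue; their use here mirrors the PR note
# but is an assumption about the actual implementation.
QUEUE_NAME_LABEL = "kueue.x-k8s.io/queue-name"
PRIORITY_CLASS_LABEL = "kueue.x-k8s.io/priority-class"


def kueue_labels(
    local_kueue: str, kueue_priority_class: Optional[str] = None
) -> Dict[str, str]:
    """Build Job labels from the two scheduler args (hypothetical helper)."""
    if not local_kueue:
        raise ValueError("local_kueue is required for the queue-name label")
    labels = {QUEUE_NAME_LABEL: local_kueue}
    if kueue_priority_class:
        labels[PRIORITY_CLASS_LABEL] = kueue_priority_class
    return labels


def job_manifest(name: str, image: str, labels: Dict[str, str]) -> Dict[str, Any]:
    """Return a plain batch/v1 Job dict carrying the Kueue labels."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # Assumption: the Job starts suspended and is resumed on admission.
            "suspend": True,
            "template": {
                "spec": {
                    "containers": [{"name": name, "image": image}],
                    "restartPolicy": "Never",
                }
            },
        },
    }


labels = kueue_labels("default-kueue", "kueue-priority-class-name")
manifest = job_manifest("echo", "alpine:latest", labels)
print(manifest["metadata"]["labels"])
```

Omitting `kueue_priority_class` in this sketch simply leaves the priority label off, which matches the note's description of it as optional.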