Add documentation for JAXJob #3877

Merged: 1 commit merged into kubeflow:master on Oct 22, 2024

Conversation

sandipanpanda (Contributor) opened this pull request.

google-oss-prow bot commented:
Hi @sandipanpanda. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Arhell (Member) left a comment:

/ok-to-test

@andreyvelich (Member) left a comment:

Thank you for doing this @sandipanpanda!
Let's merge it once we complete the JAX implementation in the Training Operator.
/hold
/assign @kubeflow/wg-training-leads @StefanoFioravanzo @hbelmiro for review

@hbelmiro (Contributor) left a comment:

/lgtm

google-oss-prow bot added the lgtm label on Sep 26, 2024.
@andreyvelich (Member) left a comment:

Thank you for adding this @sandipanpanda!
I left a few comments.

to run JAX training jobs on Kubernetes. The Kubeflow implementation of
the `JAXJob` is in the [`training-operator`](https://github.com/kubeflow/training-operator).

The current custom resource for JAX has been tested to run multiple processes on CPUs using [gloo](https://github.com/facebookincubator/gloo) for communication between CPUs.
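
For context on the excerpt above: gloo-backed multi-process CPU execution maps to JAX's distributed runtime. A minimal sketch of the per-process setup, assuming a JAX version that exposes the jax_cpu_collectives_implementation config option (this is not taken verbatim from the example script):

```
import jax

# Ask JAX to use gloo for cross-process collectives on CPU (assumed config
# option; availability depends on the JAX version).
jax.config.update("jax_cpu_collectives_implementation", "gloo")

# Join the cluster. With no arguments, JAX tries to auto-detect the
# coordinator address, process count, and process id from the environment;
# they can also be passed explicitly.
jax.distributed.initialize()
```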
A member commented:

@kubeflow/wg-training-leads @sandipanpanda Do we want to mention that we are looking for user feedback to run JAXJob on TPUs?

A member replied:

User feedback would be a great idea. But even if we get feedback for TPUs, I'm wondering whether we can implement TPU support in the upstream training-operator at all, since we do not have any verification infrastructure, right?

@andreyvelich (Member) replied on Oct 8, 2024:

I feel that it is fine for now, since we don't have infra for GPUs today either.
E.g. we say that you can run those examples on GPUs, but we don't validate them in our CI.

A member replied:

> I feel that it is fine for now, since we don't have infra for GPUs today either.
> E.g. we say that you can run those examples on GPUs, but we don't validate them in our CI.

Yeah, that's true, but GPUs are mostly generic devices, and many developers can access them. So we can improve the GPU-related mechanisms based on that access.

OTOH, TPUs are only available on Google Cloud, and fewer developers can access them than GPUs.
So my concern is that TPU-specific mechanisms would be abandoned and would soon stop working.

## Creating a JAX training job

You can create a training job by defining a `JAXJob` config file. See the manifests for the [simple JAXJob example](https://github.com/kubeflow/training-operator/blob/master/examples/jax/cpu-demo/demo.yaml).
You may change the config file based on your requirements.
A member commented:

What do you mean by changing the config file here?

@sandipanpanda (Author) replied:

I meant the Job config file, using the wording in the existing user guides. I'll make that clear.
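
For orientation, a JAXJob config along the lines of the linked example looks roughly like this (a sketch, not the authoritative manifest: the image name is illustrative and the jaxReplicaSpecs layout is assumed by analogy with the other training-operator CRDs; the linked demo.yaml is authoritative):

```
apiVersion: kubeflow.org/v1
kind: JAXJob
metadata:
  name: jaxjob-simple
  namespace: kubeflow
spec:
  jaxReplicaSpecs:
    Worker:
      replicas: 2               # number of JAX processes
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: jax
              image: example.com/jax-cpu-demo:latest  # illustrative image
              command: ["python3", "train.py"]
```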

```
kubectl get pods -n kubeflow -l training.kubeflow.org/job-name=jaxjob-simple
```
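
In addition to listing pods, you can query the JAXJob resource itself to check the overall job state (plain kubectl against the custom resource; the jaxjobs plural is assumed from the CRD's conventional naming):

```
kubectl get jaxjobs jaxjob-simple -n kubeflow -o yaml
```

The status.conditions field of the returned object records whether the job has been created, is running, or has succeeded.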

Training takes 5-10 minutes on a CPU cluster. You can inspect the logs to see the training progress.
A member commented:

What do you mean by training if we just compute an all-reduce sum across all JAX processes (https://github.com/kubeflow/training-operator/blob/master/examples/jax/cpu-demo/train.py#L39C15-L39C19)?

@sandipanpanda (Author) replied:

I used it in the context of using JAXJob for distributed training. I'll reword it to computation as that makes more sense.
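
For reference, the all-reduce under discussion is the canonical JAX multi-process pattern. A minimal sketch of that computation, with hypothetical environment variable names standing in for whatever the operator actually injects into each worker:

```
import os

import jax
import jax.numpy as jnp

# Join the cluster. COORDINATOR_ADDRESS, NUM_PROCESSES, and PROCESS_ID are
# hypothetical names; a real JAXJob worker receives equivalent settings
# from the operator.
jax.distributed.initialize(
    coordinator_address=os.environ["COORDINATOR_ADDRESS"],
    num_processes=int(os.environ["NUM_PROCESSES"]),
    process_id=int(os.environ["PROCESS_ID"]),
)

# One value per local device; psum sums them across every device in every
# process, so each worker ends up with the same global total.
xs = jnp.ones(jax.local_device_count())
total = jax.pmap(lambda x: jax.lax.psum(x, "i"), axis_name="i")(xs)
print(f"process {jax.process_index()}: global sum = {total}")
```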


```
PODNAME=$(kubectl get pods -l training.kubeflow.org/job-name=jaxjob-simple,training.kubeflow.org/replica-type=worker,training.kubeflow.org/replica-index=0 -o name -n kubeflow)
kubectl logs -f ${PODNAME} -n kubeflow
```
A member commented:

Can you please add the output of the logs here?

@terrytangyuan (Member) left a comment:

Thank you!

/lgtm

google-oss-prow bot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: terrytangyuan


@andreyvelich (Member) commented:

@sandipanpanda Did you get a chance to review the remaining comments?

@@ -10,7 +10,7 @@ weight = 10

 The Training Operator is a Kubernetes-native project for fine-tuning and scalable
 distributed training of machine learning (ML) models created with different ML frameworks such as
-PyTorch, TensorFlow, XGBoost, and others.
+PyTorch, TensorFlow, XGBoost, [JAX](https://jax.readthedocs.io/en/latest/), and others.
A member commented:

Suggested change:
-PyTorch, TensorFlow, XGBoost, [JAX](https://jax.readthedocs.io/en/latest/), and others.
+PyTorch, TensorFlow, XGBoost, JAX, and others.

For consistency across all supported frameworks.

the `JAXJob` is in the [`training-operator`](https://github.com/kubeflow/training-operator).

The current custom resource for JAX has been tested to run multiple processes on CPUs using [gloo](https://github.com/facebookincubator/gloo) for communication between CPUs.

A member commented:

Additionally, could you mention that the worker with replica index 0 is recognized as the JAX coordinator?
IIUC, we do not deploy dedicated JAX coordinator replicas, right?

Signed-off-by: Sandipan Panda <[email protected]>
@andreyvelich (Member) left a comment:

Looks great, thanks for the update @sandipanpanda 🎉
/lgtm
/hold for others
/assign @kubeflow/wg-training-leads

@tenzen-y (Member) left a comment:

/lgtm

@andreyvelich (Member) commented:

Thanks for this @sandipanpanda!
/hold cancel

google-oss-prow bot merged commit 463843e into kubeflow:master on Oct 22, 2024.
6 checks passed
sandipanpanda deleted the add-jax-doc branch on October 22, 2024 at 20:07.