[BUG] Cannot submit a Ray job to an existing cluster #5877

Open · 2 tasks done
gpgn opened this issue Oct 21, 2024 · 3 comments · May be fixed by flyteorg/flytesnacks#1765 or flyteorg/flytekit#2870
Labels: bug (Something isn't working) · documentation (Improvements or additions to documentation)

Comments

gpgn commented Oct 21, 2024

Describe the bug

I was asked to create an issue from this thread. The workaround is simple, but I'm documenting the issue in any case.

I’m setting up the integration with Ray, and it seems to work nicely when creating a fresh RayCluster using @task(task_config=RayJobConfig(worker_node_config=[WorkerNodeConfig(…)])).

I can see the cluster starting, the job getting scheduled and distributed, and completing successfully.

I’m having trouble using an existing RayCluster (in the same Kubernetes cluster), though. From the docs here I read that I should be able to use @task(task_config=RayJobConfig(address="<RAY_CLUSTER_ADDRESS>")).

However, when trying that, it turns out worker_node_config is a required argument. I tried passing an empty list instead:

@task(
    container_image=...,
    task_config=RayJobConfig(
        worker_node_config=[],  # No need to create a Ray cluster but argument is required, maybe just setting to empty list helps?
        address="http://kuberay-cluster-head-svc.kuberay.svc.cluster.local:8265/",  # Tried different ports here as well, like 10001
        runtime_env=...
    ),
)

But then it still tries to start a new RayCluster instead of using the existing one found at address:

❯ k get rayclusters.ray.io -A
NAMESPACE             NAME                                         DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
<flyte-project>-<flyte-domain>   ahvfr924w8k2vgvf97wp-n0-0-raycluster-crb9z                                       100m   500Mi    0      ready    2m25s
kuberay               kuberay-cluster                              1                 1                   2      3G       0      ready    3h37m
...

The address works fine if I just run:

k run kuberay-test --rm --tty -i --restart='Never' --image ... --command -- ray job submit --address http://kuberay-cluster-head-svc.kuberay.svc.cluster.local:8265/ -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
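
The same check also works from Python via Ray's job submission SDK, if that's easier to test with. A minimal sketch against the same address (JobSubmissionClient lives in ray.job_submission):

from ray.job_submission import JobSubmissionClient

# Talk to the existing cluster's job-submission HTTP endpoint (dashboard port 8265).
client = JobSubmissionClient("http://kuberay-cluster-head-svc.kuberay.svc.cluster.local:8265")
job_id = client.submit_job(
    entrypoint='python -c "import ray; ray.init(); print(ray.cluster_resources())"'
)
print(job_id)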

It looks like the worker_node_config argument has been required since the initial commit, and we can't find any code that submits a job without creating a new cluster. I'm not sure the docs example has ever worked.
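
For reference, the plugin's task config is a dataclass along these lines (paraphrased from flytekitplugins-ray; the exact field set varies by version, but worker_node_config is the only field without a default, which is what makes it required):

from dataclasses import dataclass
import typing

from flytekitplugins.ray import HeadNodeConfig, WorkerNodeConfig

@dataclass
class RayJobConfig:
    worker_node_config: typing.List[WorkerNodeConfig]  # no default, so always required
    head_node_config: typing.Optional[HeadNodeConfig] = None
    runtime_env: typing.Optional[dict] = None
    address: typing.Optional[str] = None  # accepted, but nothing appears to consume it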

This seems to work as a simple workaround:

import typing
import ray
from flytekit import task

@ray.remote
def f(i: int) -> int:  # the remote function used below; any @ray.remote function works
    return i * i

@task(container_image=<RAY_IMAGE>)
def ray_task_job_submit(n: int) -> typing.List[int]:
    ray.init(address="ray://kuberay-cluster-head-svc.kuberay.svc.cluster.local:10001")
    futures = [f.remote(i) for i in range(n)]
    return ray.get(futures)
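
(Note: the workaround connects over the Ray client protocol, i.e. the ray:// scheme on port 10001, whereas the ray job submit check above talks to the dashboard/job-submission HTTP endpoint on port 8265. Mixing up the two ports is an easy way to hit connection errors.)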

Expected behavior

I'm not sure whether this is something Flyte wants to support (the docs suggest it is, but there is no code to do it). Either the option to submit to an existing cluster could be removed from the docs, or the documented example could work as written: with worker_node_config omitted, submit the job to the cluster already running at address instead of starting a new RayCluster:

@task(
    container_image=...,
    task_config=RayJobConfig(
        address="http://kuberay-cluster-head-svc.kuberay.svc.cluster.local:8265/",
        runtime_env=...
    ),
)

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@gpgn gpgn added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Oct 21, 2024
@Sovietaced Sovietaced self-assigned this Oct 24, 2024
@Sovietaced Sovietaced added documentation Improvements or additions to documentation and removed untriaged This issues has not yet been looked at by the Maintainers labels Oct 24, 2024
Sovietaced (Contributor) commented:

I think the easiest thing to do is to rework the documentation to be correct for the current behavior. It looks like updating the module to submit work to an existing cluster would be non-trivial.

pingsutw (Member) commented:

@gpgn qq: do you want to use KubeRay to submit a job, or use Flyte to create a pod that connects to your Ray cluster?

gpgn commented Nov 3, 2024

@pingsutw We have an existing Ray cluster on our Kubernetes cluster and wanted to try and submit a job to that via Flyte.
