This repository has been archived by the owner on May 28, 2024. It is now read-only.
[BUG] workers do not launch on g5.12xlarges for the latest image 0.5.0. #125
Comments
Hi, please provide repro steps if possible, so that our team can help take a look!
I don't believe the AMI has the drivers installed for CUDA 12. Could that be the issue?
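If the driver is the suspect, one quick check is the maximum supported CUDA version that `nvidia-smi` prints in its header on the worker AMI. A minimal sketch of parsing that header (the sample line below is illustrative, not captured from the AMI in question):

```python
import re
from typing import Optional, Tuple

def cuda_version_from_nvidia_smi(header: str) -> Optional[Tuple[int, int]]:
    """Extract the max supported CUDA version from `nvidia-smi` header text.

    Returns (major, minor), or None if no CUDA version is present.
    """
    m = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", header)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

# Sample header line in the format `nvidia-smi` prints (illustrative values).
sample = "| NVIDIA-SMI 535.104.05  Driver Version: 535.104.05  CUDA Version: 12.2 |"

version = cuda_version_from_nvidia_smi(sample)
if version is None or version < (12, 0):
    print("Driver does not report CUDA 12 support; a CUDA 12 image may fail here.")
else:
    print(f"Driver reports CUDA {version[0]}.{version[1]} support.")
```

In practice you would feed in the real output of `nvidia-smi` from a worker node; if the reported version is below 12.0, a CUDA 12 based image would be consistent with the workers failing to come up.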
alanwguo pushed a commit that referenced this issue on Jan 25, 2024 (Signed-off-by: Max Pumperla <[email protected]>).
@sihanwang41, any update on investigating this issue?
FWIW, ray-llm is not deployable in its current state on images >=
+1 on this. I'm having to use 0.4.0; with 0.5.0 the deployment gets stuck in a DEPLOYING loop. @JGSweets (thanks for your comment, it got me up and running).
I'm stuck in a repeated deployment loop when using the image
anyscale/ray-llm:latest
on a g5.12xlarge instance. The workers never connect back to the head node, which leads me to believe the Docker image fails during deployment. I didn't notice any error logs reported to the head node; the cluster just loops between deploying and shutting down workers.
Possibly due to the CUDA updates, but I'm not 100% sure?
anyscale/ray-llm:0.4.0
launches as expected with no configuration changes.