This repository has been archived by the owner on May 28, 2024. It is now read-only.
[BUG] workers do not launch on g5.12xlarges for the latest image 0.5.0. #125
Comments
Hi, please provide repro steps if possible, so that our team can help take a look!
I don't believe the AMI has the drivers installed for CUDA 12. Could that be the issue?
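If the driver is the suspect, one quick check is the maximum supported CUDA version that `nvidia-smi` prints in its header on the worker AMI. A minimal sketch of parsing that header (the sample line below is illustrative, not captured from the AMI in question):

```python
import re
from typing import Optional, Tuple

def cuda_version_from_nvidia_smi(header: str) -> Optional[Tuple[int, int]]:
    """Extract the max supported CUDA version from `nvidia-smi` header text.

    Returns (major, minor), or None if no CUDA version is present.
    """
    m = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", header)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

# Sample header line in the format `nvidia-smi` prints (illustrative values).
sample = "| NVIDIA-SMI 535.104.05  Driver Version: 535.104.05  CUDA Version: 12.2 |"

version = cuda_version_from_nvidia_smi(sample)
if version is None or version < (12, 0):
    print("Driver does not report CUDA 12 support; a CUDA 12 image may fail here.")
else:
    print(f"Driver reports CUDA {version[0]}.{version[1]} support.")
```

In practice you would feed in the real output of `nvidia-smi` from a worker node; if the reported version is below 12.0, a CUDA 12 based image would be consistent with the workers failing to come up.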
alanwguo pushed a commit that referenced this issue on Jan 25, 2024 (Signed-off-by: Max Pumperla <[email protected]>).
@sihanwang41, any update on investigating this issue?
FWIW, ray-llm is not deployable in its current state on images >=
+1 on this. I'm having to use 0.4.0; with 0.5.0 the deployment gets stuck in a DEPLOYING loop. @JGSweets (thanks for your comment, it got me up and running).
I'm stuck in a repeated deployment loop when using the image
anyscale/ray-llm:latest
on a g5.12xlarge instance. The workers never connect back to the head node, which leads me to believe the Docker image fails during deployment. I didn't notice any error logs reported to the head node; the cluster just loops between deploying and shutting down workers.
Possibly due to the CUDA updates, but I'm not 100% sure?
anyscale/ray-llm:0.4.0
launches as expected with no configuration changes.