Just curious: does ray-llm fully leverage Ray Serve autoscaling (https://docs.ray.io/en/latest/serve/autoscaling-guide.html)?

It seems Ray Serve only supports `target_num_ongoing_requests_per_replica` and `max_concurrent_queries`. As we know, LLM output length varies widely, so request-count-based metrics are not a good fit for LLM workloads. How do you achieve better autoscaling support for LLMs? A sketch of the configuration I mean is below.
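For concreteness, here is a minimal sketch of the kind of Ray Serve autoscaling configuration the question is asking about. The `LLMDeployment` class and the specific numbers are hypothetical placeholders; the parameter names follow older Ray Serve releases (newer versions rename them to `max_ongoing_requests` / `target_ongoing_requests`).

```python
from ray import serve
from starlette.requests import Request


@serve.deployment(
    # Cap on in-flight requests per replica (older Ray Serve name).
    max_concurrent_queries=10,
    autoscaling_config={
        # Scale replicas so each one has roughly this many ongoing requests.
        "target_num_ongoing_requests_per_replica": 5,
        "min_replicas": 1,
        "max_replicas": 8,
    },
)
class LLMDeployment:
    async def __call__(self, request: Request) -> str:
        prompt = (await request.json())["prompt"]
        # Placeholder: a real deployment would run model inference here.
        return f"echo: {prompt}"


app = LLMDeployment.bind()
# serve.run(app)  # deploy with Ray Serve
```

Because both knobs count requests rather than tokens, a replica serving a few very long generations looks as "busy" as one serving many short ones, which is the mismatch the question points at.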