[Feature]: Automatic max-model-len or max-num-seqs #1077
Comments
I'm not sure I understand the issue. If you need the engine to limit the max_model_len to the amount your GPU can fit, then we already handle that, as you showed yourself in the logs. The "Maximum sequence length allowed in the cache" value reported in the logs is somewhat misleading, as it's a theoretical maximum based on the number of GPU blocks that could be allocated after the initial model profiling. This number isn't reliable either, and can vary significantly depending on your initial configuration.
This is why you might see different "Maximum sequence length allowed" values when starting the engine with different launch parameters. As for automatically determining the optimal balance between max_model_len and max_num_seqs, that is harder to do reliably. As an aside, you can try
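To make that "theoretical maximum" concrete, here is a rough sketch of the arithmetic behind the log line, assuming a standard paged KV cache. The function and parameter names are illustrative, not actual Aphrodite internals: the engine profiles a forward pass, measures the GPU memory left over, carves it into fixed-size blocks, and reports how many tokens those blocks could hold if a single sequence claimed all of them.

```python
# Illustrative sketch only -- not Aphrodite's actual code or identifiers.
def kv_bytes_per_block(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int, block_size: int) -> int:
    # K and V tensors, for every layer, for block_size tokens.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * block_size

def theoretical_max_seq_len(free_cache_bytes: int, bytes_per_block: int,
                            block_size: int) -> int:
    # "Maximum sequence length allowed in the cache": one sequence
    # hypothetically occupying every allocatable GPU block.
    num_gpu_blocks = free_cache_bytes // bytes_per_block
    return num_gpu_blocks * block_size
```

Because free_cache_bytes is measured after profiling with the current launch parameters, the reported maximum shifts whenever those parameters shift, which is what the comment above describes.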
Well, yes and no. If I set the model len slightly higher than would actually fit, then it is reduced. If I start the engine without specifying a length, it is taken from config.json and then the engine dies without reducing it.
Thank you for this. It explains what I was seeing and could find no explanation for. But it also illustrates my need very well. So, I take a model and load it up. I have no idea how much model-len I can use. I must enter something, because entering nothing ends in failure. So I enter 1024. Upon launching I see that, wow, more is available. I see that 64k is available, so I raise it to 32k and set max-num-seqs to 2. When launching I am shown that now there is only 12k available and it has to be reduced. I need a minimum of 16k, however, so I close and adjust max-num-seqs to 1 and model-len to 16k. This results in the engine reporting that 28k is available. So it is a sort of cat-and-mouse game.
Agreed. But I do not need you to determine the balance automatically. I will supply the balance by giving you one of the two values; you determine the other based on what is possible. Use case: I need to test different models and different settings (KV cache types and different -q quants) for best throughput. All of these parameters change the available memory. My requirement would be, say, 2048 model-len; everything else should be dedicated to max-num-seqs. As you also said, understanding how much cache is really available depends on how you launch the engine, so this ends up being just launching it many times until a good-enough situation is reached.
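What this use case asks for amounts to a one-line division over the cache's total token budget. A hedged sketch, with hypothetical helper names (real scheduling is more involved, because paged attention allocates blocks on demand rather than reserving max_model_len per sequence):

```python
def derive_max_num_seqs(total_cache_tokens: int, max_model_len: int) -> int:
    # User fixes model-len (e.g. 2048); dedicate the rest to concurrency.
    return max(1, total_cache_tokens // max_model_len)

def derive_max_model_len(total_cache_tokens: int, max_num_seqs: int) -> int:
    # User fixes concurrency; stretch context length as far as it goes.
    return max(1, total_cache_tokens // max_num_seqs)
```

This split is conservative: since blocks are only allocated as sequences grow, the sustainable concurrency in practice is usually higher than the worst-case division suggests.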
🚀 The feature, motivation and pitch
Feature Request: Automatically Adjust --max-model-len and --max-num-seqs Based on GPU Memory, Cache Size, and Other Parameters
Problem to Solve:
Currently, maximizing GPU memory usage in Aphrodite-engine requires trial and error to determine an appropriate balance between model length (--max-model-len) and the number of sequences (--max-num-seqs). This process involves multiple launches of the engine to assess cache availability after model loading, as well as determining how many sequences can be supported for a reasonable model length (e.g., 4096).
When starting the Aphrodite-engine, the log provides helpful information, such as the maximum sequence length allowed in the cache.
However, if the model length is set slightly higher than the cache allows, the engine adjusts it downward automatically.
This behavior suggests the engine can determine cache capacity and adjust settings dynamically. Yet, if no model length is specified, the engine defaults to the value in config.json (e.g., 131000), which may exceed the available memory and result in a CUDA OOM error.
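A sketch of the requested fallback (a hypothetical function, not engine code): when no --max-model-len is given, clamp the config.json value to what the profiled cache can actually hold, instead of failing with OOM later.

```python
def resolve_max_model_len(config_max_len: int,
                          user_max_len: int | None,
                          cache_capacity_tokens: int) -> int:
    # Honor an explicit user value if given; otherwise fall back to the
    # model's config.json limit -- but never beyond what the cache holds.
    requested = user_max_len if user_max_len is not None else config_max_len
    return min(requested, cache_capacity_tokens)
```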
Proposed Solution:
Upon loading the model, the engine should automatically limit --max-model-len to the highest value supported by the available GPU memory and cache size, factoring in parameters like -gmu and the total GPU memory size.
If --max-num-seqs is specified, the engine could divide the available cache proportionally to maximize GPU utilization while maintaining safe operation.
Alternatively, if --max-model-len is specified, the engine should calculate the maximum number of sequences (--max-num-seqs) that can safely fit in the cache.
This would eliminate the need for manual trial and error, making the engine more user-friendly and efficient.
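A worked example of the proposed derivation, under assumed numbers (a 24 GB card, 14 GB of fp16 weights, -gmu 0.9, a Llama-style model with 32 layers, 8 KV heads, head_dim 128, and an fp16 cache); all figures are illustrative:

```python
gpu_bytes     = 24 * 1024**3
weights_bytes = 14 * 1024**3
cache_bytes   = int(gpu_bytes * 0.9) - weights_bytes   # ~7.6 GiB left for KV cache
kv_per_token  = 2 * 32 * 8 * 128 * 2                   # K+V * layers * kv_heads * head_dim * fp16
total_tokens  = cache_bytes // kv_per_token            # ~62,000 tokens of cache

max_model_len = 2048                                   # user-supplied, as in the use case above
max_num_seqs  = total_tokens // max_model_len          # ~30 -> derived automatically
print(f"{total_tokens} cache tokens -> max_num_seqs = {max_num_seqs}")
```

Note that changing the quant or the KV-cache dtype changes kv_per_token, which is exactly why every settings tweak currently forces another relaunch.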
Additional Context:
The ability to automatically balance these parameters seems feasible given the log output and current error-handling mechanisms. However, if there are technical constraints or complexities preventing this, clarification on the challenges involved would be helpful.