Fixes #183: Preload models during server initialization to prevent request timeouts #227
base: main
Conversation
Can you update the PR title?
What is the behavior when a client sends a request while the server is still starting up?
Thank you for your feedback! I've updated the PR title as requested. Regarding your question about the behavior when a client sends a request while the server is starting up: the server currently initializes and preloads the models before accepting any incoming requests.
See my comment on #183
We thought long and hard about this and went through a series of changes. Where we landed is the following:
- a user should be loading and running the ollama server before running the llama stack. this is how it works for any other provider (e.g., if you have a remote provider like Fireworks, surely we can't be in the business of taking over their lifecycle.)
- when the stack starts, it consults the provider through a list_models() method to know which models it is running. for ollama, we invoke ollama.client.ps() for this. only these models are registered and available for serving (naturally!) — see the sketch below.
- registering new models dynamically is not permitted right now; we could make that work easily enough (by calling ollama.client.run()), but then it would have to be done by the user explicitly through the models/register llama stack API.
You can see this here: https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/adapters/inference/ollama/ollama.py#L61-L85
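A minimal sketch of the discovery flow from the second bullet above (purely illustrative: the adapter shape and the response field names are assumptions; the linked ollama.py shows the real implementation):

```python
# Illustrative sketch: discover which models a running Ollama server has
# loaded via ps() and expose only those. Not the actual llama-stack code.
from ollama import AsyncClient


class OllamaDiscoverySketch:
    def __init__(self, url: str = "http://localhost:11434") -> None:
        # The stack talks to an Ollama server the user has already started;
        # it never starts or manages that server itself.
        self.client = AsyncClient(host=url)

    async def list_models(self) -> list[str]:
        # ps() reports the models the Ollama server currently has loaded;
        # only these get registered and become available for serving.
        # (The exact response shape and field names depend on the ollama
        # client version; dict-style access and the "model" key are assumed.)
        response = await self.client.ps()
        return [m["model"] for m in response["models"]]
```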
The most important part of the comment above is: the Llama Stack cannot and will not take responsibility for managing the "operations" or "lifecycle" of a provider (which we consider to be remote.) Preloading is like that. We have introduced …
Hello, thank you for your detailed feedback and for directing me to your comment on #183. I understand that Llama Stack should not manage the operations or lifecycle of providers like Ollama, and that preloading models is considered part of that lifecycle management. Based on your guidance, I've updated the PR accordingly:
- Removed model preloading during server initialization: the adapter no longer attempts to preload models or manage the provider's lifecycle.
- The register_model() method in ollama.py now calls client.pull() so that users can explicitly register and load models via the /models/register API (see the sketch below).
- The adapter assumes that the Ollama server is already running and only interacts with it through its defined APIs.
Thank you for your guidance and support!
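A rough sketch of what the updated register_model() flow could look like (the method shape and parameter names are illustrative, not the exact llama-stack signature):

```python
# Illustrative sketch: pull a model only when the user explicitly registers
# it through the /models/register API, instead of preloading at startup.
async def register_model(self, model_id: str) -> None:
    # self.client is assumed to be an ollama.AsyncClient pointed at the
    # user's already-running server; pull() downloads the model if it is
    # not already present locally.
    await self.client.pull(model_id)
```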
Fix issue #183: Pre-download models during server initialization to prevent HTTP timeouts
This commit moves the model downloading logic from the chat_completion method to the initialize method in OllamaInferenceAdapter. By pre-loading required models during server startup, we ensure that large models (e.g., 16GB) are downloaded before serving requests, thus preventing HTTP request timeouts and aborted downloads during the first inference request.
Closes #183
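For context, a minimal sketch of the pre-download approach this commit message describes (the approach that was later removed per the review above; the adapter shape, model list, and names are assumptions, not llama-stack code):

```python
# Illustrative sketch only: pre-downloading models in initialize() so the
# first chat_completion call doesn't time out waiting on a multi-GB pull.
from ollama import AsyncClient

# Hypothetical list of models to pre-download at startup.
PRELOAD_MODELS = ["llama3.1:8b"]


class OllamaInferenceAdapterSketch:
    def __init__(self, url: str = "http://localhost:11434") -> None:
        self.client = AsyncClient(host=url)

    async def initialize(self) -> None:
        # Download required models up front, before the server starts
        # accepting inference requests.
        for model in PRELOAD_MODELS:
            await self.client.pull(model)
```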