
Fixes #183: Preload models during server initialization to prevent request timeouts #227

Open
wants to merge 1 commit into main

Conversation


@Cola-Rex Cola-Rex commented Oct 9, 2024

Fix issue #183: Pre-download models during server initialization to prevent HTTP timeouts

This commit moves the model downloading logic from the chat_completion method to the initialize method in OllamaInferenceAdapter. By pre-loading required models during server startup, we ensure that large models (e.g., 16GB) are downloaded before serving requests, thus preventing HTTP request timeouts and aborted downloads during the first inference request.

Closes #183
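Roughly, the shape of the change is the following (a simplified sketch, not the exact adapter code; the constructor and the `required_models` list are illustrative, and the calls assume the official ollama Python client):

```python
from ollama import AsyncClient


class OllamaInferenceAdapter:
    def __init__(self, url: str, required_models: list[str]) -> None:
        # Illustrative constructor: the real adapter is configured differently.
        self.client = AsyncClient(host=url)
        self.required_models = required_models

    async def initialize(self) -> None:
        # Pull every required model up front, so a multi-gigabyte download
        # happens at startup instead of inside the first inference request.
        for model in self.required_models:
            await self.client.pull(model)

    async def chat_completion(self, model: str, messages: list[dict]):
        # No download logic here anymore; the model is assumed to be present.
        return await self.client.chat(model=model, messages=messages)
```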

@facebook-github-bot facebook-github-bot added the CLA Signed label on Oct 9, 2024
Contributor

@terrytangyuan terrytangyuan left a comment


Can you update the PR title?

What is the behavior when a client sends a request while the server is still starting up?

@Cola-Rex Cola-Rex changed the title from "Fix issue #183" to "Fix issue #183 :Preload models during server initialization to prevent request timeouts" on Oct 10, 2024
@Cola-Rex
Author

Thank you for your feedback! I've updated the PR title as requested.

Regarding your question about the behavior when a client sends a request while the server is starting up:

  • The server currently initializes and preloads the models before accepting any incoming requests.
  • If a client sends a request during the initialization phase, it will receive a connection error because the server is not yet listening on the port.
  • Once initialization is complete and the server is running, it accepts and processes client requests as usual.
  • This approach ensures that all required models are loaded and ready, preventing request failures or timeouts due to missing models.

@Cola-Rex Cola-Rex changed the title from "Fix issue #183 :Preload models during server initialization to prevent request timeouts" to "Fixes #183: Preload models during server initialization to prevent request timeouts" on Oct 10, 2024
Contributor

@ashwinb ashwinb left a comment


See my comment on #183

We thought long and hard about this and went through a series of changes. Where we landed is the following:

  • a user should be loading and running the ollama server before running the llama stack. this is how it works for any other provider (e.g., if you have a remote provider like fireworks, surely we can't be in the business of taking over their lifecycle.)
  • when the stack starts, it consults the provider through a list_models() method to know which models it is running. for ollama, we invoke ollama.client.ps() for this. only these models are registered and available for serving (naturally!)
  • registering new models dynamically is not permitted right now, although we could make that work easily enough (by calling ollama.client.run()), but then it would have to be done by the user explicitly through the models/register llama stack API.

You can see this here: https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/adapters/inference/ollama/ollama.py#L61-L85
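In rough terms, the flow looks like this (a simplified sketch, not the exact code at the link; the tag-to-identifier mapping is illustrative and the ps() response shape depends on the ollama client version):

```python
from ollama import AsyncClient

# Illustrative mapping from ollama model tags to llama model identifiers;
# the real adapter keeps a fuller table.
OLLAMA_TO_LLAMA = {
    "llama3.1:8b-instruct-fp16": "Llama3.1-8B-Instruct",
}


async def list_models(client: AsyncClient) -> list[str]:
    # Ask the already-running ollama server which models it is currently
    # serving; the stack never starts, stops, or downloads anything itself.
    response = await client.ps()
    running = [m["name"] for m in response["models"]]
    # Only models ollama reports back are registered and made available.
    return [OLLAMA_TO_LLAMA[tag] for tag in running if tag in OLLAMA_TO_LLAMA]
```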

@ashwinb
Contributor

ashwinb commented Oct 10, 2024

The most important part of the comment above is this: the Llama Stack cannot and will not take responsibility for managing the "operations" or "lifecycle" of a provider (which we consider to be remote). Preloading falls into that category. We have introduced /models/register as a generic API, and as I commented above, should the user request it we can make the register_model() method in ollama.py call client.run(). Would that accomplish what you wanted?

@Cola-Rex Cola-Rex reopened this Oct 11, 2024
@Cola-Rex
Author

Hello,

Thank you for your detailed feedback and for directing me to your comment on #183.

I understand that Llama Stack should not manage the operations or lifecycle of providers like Ollama, and that preloading models is considered part of that lifecycle management. Based on your guidance, I've updated the PR accordingly:

  • Removed model preloading during server initialization:
    • The adapter no longer attempts to preload models or manage the provider's lifecycle.
    • Users are expected to start the Ollama server and load any required models independently before running Llama Stack.
  • Implemented the register_model() method (a sketch follows this list):
    • register_model() in ollama.py now calls client.pull(), letting users explicitly register and load models via the /models/register API.
    • This enables dynamic model registration when requested by the user, aligning with the approach you've outlined.
  • Updated the adapter to align with project guidelines:
    • The adapter assumes the Ollama server is already running and interacts with it only through its defined APIs.
    • It no longer manages any operations related to the provider's lifecycle.
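A minimal sketch of the new method (simplified; the actual signature in the PR may take the stack's model object rather than a bare string):

```python
async def register_model(self, model_id: str) -> None:
    # Called via the /models/register llama stack API: explicitly pull the
    # model through the ollama client, instead of downloading it implicitly
    # during server startup or inside the first inference request.
    await self.client.pull(model_id)
```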
I believe these changes address your concerns and align with the project's design principles. Please let me know if there's anything I've missed or if further adjustments are needed.

Thank you for your guidance and support!

@Cola-Rex Cola-Rex requested a review from ashwinb October 14, 2024 08:35
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Development

Successfully merging this pull request may close these issues.

ollama inference should verify models are downloaded before serving
4 participants