Enable continuous batching for ollama: performance improvement #930

Open
jeffbl opened this issue Dec 16, 2024 · 0 comments
jeffbl commented Dec 16, 2024

@pefortin messaged to say he's seeing roughly a 5x total output token rate with continuous batching enabled, although it's unclear which tools/models he was using. Regardless, this now appears to be supported in ollama:

ollama/ollama#1396

The work item is to enable this in ollama, then test simultaneous queries to confirm everything works and performance improves. This will apparently require additional memory for the input tokens of all concurrent requests, so memory usage should be monitored.
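For reference, ollama exposes environment variables that control how many requests each loaded model processes in parallel and how many can queue. A minimal sketch of what enabling this might look like (the exact values here are assumptions to be tuned against available memory):

```shell
# OLLAMA_NUM_PARALLEL: number of requests a model serves concurrently
# (each parallel slot reserves extra context memory for its input tokens).
# OLLAMA_MAX_QUEUE: how many additional requests may wait before 4xx errors.
export OLLAMA_NUM_PARALLEL=4
export OLLAMA_MAX_QUEUE=128

# Restart the server so the new settings take effect.
ollama serve
```

Worth watching memory with `ollama ps` (or plain `nvidia-smi`) after raising `OLLAMA_NUM_PARALLEL`, since each slot grows the KV-cache footprint.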
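To test simultaneous queries, something like the sketch below could fire N requests concurrently and compare aggregate token throughput against a sequential run. The `query_model` body here is a stand-in (the real version would POST to the ollama API, e.g. `http://localhost:11434/api/generate`); the simulated latency just makes the script runnable standalone.

```python
import concurrent.futures
import time

def query_model(prompt: str) -> int:
    # Placeholder for a real HTTP call to the ollama API
    # (e.g. POST http://localhost:11434/api/generate).
    # Simulated latency stands in for generation time.
    time.sleep(0.1)
    # Pretend "token count" so throughput can be summed.
    return len(prompt.split())

prompts = [f"prompt {i}" for i in range(8)]

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    # With continuous batching server-side, these 8 in-flight
    # requests should complete in far less than 8x one latency.
    tokens = list(pool.map(query_model, prompts))
elapsed = time.time() - start

print(f"{sum(tokens)} tokens in {elapsed:.2f}s")
```

Running the same loop with `max_workers=1` gives the sequential baseline; the ratio of the two total-token rates is the number to compare against the reported ~5x.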
