Enable continuous batching for ollama: performance improvement #930

Open
jeffbl opened this issue Dec 16, 2024 · 0 comments
jeffbl commented Dec 16, 2024

@pefortin messaged to say he's seeing roughly a 5x total output token rate with continuous batching enabled, although it's unclear which tools/models he was using. Regardless, this now appears to be supported in ollama:

ollama/ollama#1396

The work item is to enable this in ollama, then test simultaneous queries to confirm everything works and performance improves. This will apparently require additional memory for the input tokens of all concurrent requests, so memory usage should be monitored.
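For reference, ollama exposes environment variables that control how many requests each loaded model processes in parallel and how many can queue. A minimal sketch of what enabling this might look like (the exact values here are assumptions to be tuned against available memory):

```shell
# OLLAMA_NUM_PARALLEL: number of requests a model serves concurrently
# (each parallel slot reserves extra context memory for its input tokens).
# OLLAMA_MAX_QUEUE: how many additional requests may wait before 4xx errors.
export OLLAMA_NUM_PARALLEL=4
export OLLAMA_MAX_QUEUE=128

# Restart the server so the new settings take effect.
ollama serve
```

Worth watching memory with `ollama ps` (or plain `nvidia-smi`) after raising `OLLAMA_NUM_PARALLEL`, since each slot grows the KV-cache footprint.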
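To test simultaneous queries, something like the sketch below could fire N requests concurrently and compare aggregate token throughput against a sequential run. The `query_model` body here is a stand-in (the real version would POST to the ollama API, e.g. `http://localhost:11434/api/generate`); the simulated latency just makes the script runnable standalone.

```python
import concurrent.futures
import time

def query_model(prompt: str) -> int:
    # Placeholder for a real HTTP call to the ollama API
    # (e.g. POST http://localhost:11434/api/generate).
    # Simulated latency stands in for generation time.
    time.sleep(0.1)
    # Pretend "token count" so throughput can be summed.
    return len(prompt.split())

prompts = [f"prompt {i}" for i in range(8)]

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    # With continuous batching server-side, these 8 in-flight
    # requests should complete in far less than 8x one latency.
    tokens = list(pool.map(query_model, prompts))
elapsed = time.time() - start

print(f"{sum(tokens)} tokens in {elapsed:.2f}s")
```

Running the same loop with `max_workers=1` gives the sequential baseline; the ratio of the two total-token rates is the number to compare against the reported ~5x.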
