I have a Llama 3 8B model with IQ4_NL quantization. I want to use this model for inference on an edge device (ARM CPU). I am getting a speed of around 2.3 tokens per second (tested with llama-bench). I have tried `--no-mmap` and different thread counts; mmap doesn't change the speed much, and increasing the thread count only improves the speed up to a certain point. Are there any other arguments or configurations to increase the speed? I only have a CPU platform and cannot use a GPU for this.
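For reference, my benchmark runs look roughly like this (a minimal sketch; the model filename and thread counts are placeholders for my setup):

```sh
# Sweep a few thread counts with llama-bench to find where throughput stops scaling.
# -p / -n are the prompt and generation lengths used for the benchmark.
./llama-bench -m llama3-8b-iq4_nl.gguf -t 2,4,8 -p 512 -n 128
```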