I have a Llama 3 8B model with IQ4_NL quantization. I want to use this model for inference on an edge device (ARM CPU). I am getting a speed of around 2.3 tokens per second (tested with llama-bench). I have tried `--no-mmap` and different thread counts; mmap doesn't change the speed much, and increasing the thread count only improves the speed up to a certain point. Are there any other arguments or configurations to increase the speed? I only have a CPU platform and cannot use a GPU for this.
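For reference, my benchmark runs look roughly like this (a minimal sketch; the model filename and thread counts are placeholders for my setup):

```sh
# Sweep a few thread counts with llama-bench to find where throughput stops scaling.
# -p / -n are the prompt and generation lengths used for the benchmark.
./llama-bench -m llama3-8b-iq4_nl.gguf -t 2,4,8 -p 512 -n 128
```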