What backend? And yes, generation speed should drop significantly.
I know generation speed should slow down as the context fills up, since LLMs are autoregressive. However, should the drop in speed be as severe as what I am experiencing? I can't imagine running models at 32k or longer context sizes if the slowdown is already this substantial below 8k:
I am running llama.cpp, pulled 3 days ago, on my 7900 XTX with the following command:
llama-server --port 8999 -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf -ngl 999 --ctx-size 8192 -fa
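For reference, here is a minimal sketch of one way to measure the drop-off against a server started with the command above. It assumes the server is reachable on port 8999 (taken from the command) and that the /completion endpoint returns the timings object described in the llama.cpp server docs; the filler-prompt token count is only a rough estimate.

```python
# Rough benchmark sketch: request a short completion after prompts of
# increasing length and print the decode speed the server reports.
import json
import urllib.request

SERVER = "http://127.0.0.1:8999/completion"  # port taken from the command above


def generation_speed(prompt_tokens_approx: int, n_predict: int = 64) -> float:
    """Generate n_predict tokens after a filler prompt and return the
    server-reported decode speed in tokens per second."""
    # "lorem ipsum " is roughly 3 tokens; this is only a crude approximation.
    prompt = "lorem ipsum " * (prompt_tokens_approx // 3)
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        SERVER, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        timings = json.loads(resp.read())["timings"]
    return timings["predicted_per_second"]


if __name__ == "__main__":
    # Compare decode speed with a nearly empty context vs. a mostly full one.
    for fill in (128, 2048, 4096, 7000):
        print(f"~{fill:5d} prompt tokens: {generation_speed(fill):6.1f} t/s")
```

Comparing the reported tokens/s at each fill level should show how much of the slowdown comes from context depth alone rather than from anything backend-specific.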