speculative : experiments with Qwen2.5-Coder #10290
base: gg/speculative-fix-oob
Conversation
A draft size of 16 should make it viable to copy layers of the target model to the GPU dynamically during the batch processing / logit computation phase for the target, eliminating the need for large GPU memory to offload the target model. Fabrice Bellard recently added CPU offload to textsynth this way, and he says it is still fast with large batch sizes, though I have not had a chance to test it yet. Very rough testing of his GPU offload mode shows the latest textsynth to be about 3x faster than llama.cpp at prompt processing on a 4070 (I will be doing some detailed bench comparisons soon). Fabrice also does not keep the KV cache on the GPU; he keeps it in CPU memory, avoiding any need for large GPU memory for a large KV cache, which is a big advantage (while staying fast).

For my use cases (greedy sampling only), speculative decoding can be greatly simplified by only accepting greedily sampled matches between draft and target. If I could figure out how to implement Fabrice's dynamic layer-swap inference scheme in llama.cpp (not sure if the new backend scheme would allow me to create a new CUDA backend that works like that), I think it would open up large models on llama.cpp with consumer-grade GPUs. That is currently not viable, since any layers not offloaded cause a 10x or higher slowdown; RPC helps, but it would not be needed with dynamic layer offload, given the serial nature of layer processing.
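Not from this PR or from textsynth, but to make the idea concrete: below is a minimal C++/CUDA sketch of the kind of double-buffered per-layer weight streaming described above. `LayerWeights` and `compute_layer` are hypothetical stand-ins (llama.cpp's ggml backends do not expose such an interface), and for the copies to actually overlap with compute the host weights would need to live in pinned memory.

```cpp
// Sketch only: stream layer weights host -> GPU with a double buffer,
// overlapping the upload of layer i+1 with the compute of layer i.
#include <cuda_runtime.h>
#include <vector>

struct LayerWeights { const void * host_ptr; size_t nbytes; };

// stub: the real implementation would run attention + FFN for the batch
static void compute_layer(int /*il*/, const void * /*dev_weights*/, cudaStream_t /*stream*/) {}

void forward_streamed(const std::vector<LayerWeights> & layers, size_t max_layer_bytes) {
    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);

    void *      dev_buf [2];
    cudaEvent_t uploaded[2]; // weights for this buffer have arrived on the GPU
    cudaEvent_t computed[2]; // compute using this buffer has finished
    for (int i = 0; i < 2; ++i) {
        cudaMalloc(&dev_buf[i], max_layer_bytes);
        cudaEventCreate(&uploaded[i]);
        cudaEventCreate(&computed[i]);
    }

    for (size_t il = 0; il < layers.size(); ++il) {
        const int cur = il % 2;

        // don't overwrite this buffer until its previous compute is done
        cudaStreamWaitEvent(copy_stream, computed[cur], 0);
        cudaMemcpyAsync(dev_buf[cur], layers[il].host_ptr, layers[il].nbytes,
                        cudaMemcpyHostToDevice, copy_stream);
        cudaEventRecord(uploaded[cur], copy_stream);

        // compute the layer as soon as its weights have arrived
        cudaStreamWaitEvent(compute_stream, uploaded[cur], 0);
        compute_layer((int) il, dev_buf[cur], compute_stream);
        cudaEventRecord(computed[cur], compute_stream);
    }

    cudaStreamSynchronize(compute_stream);

    for (int i = 0; i < 2; ++i) {
        cudaFree(dev_buf[i]);
        cudaEventDestroy(uploaded[i]);
        cudaEventDestroy(computed[i]);
    }
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(compute_stream);
}
```

Whether this pays off depends on the batch being large enough that a layer's compute time roughly covers the PCIe transfer of the next layer's weights, which is exactly the regime a draft of 16 tokens is meant to create.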
For what it's worth, I am seeing a speedup from ~24 tokens/second with the 32B coder model alone to 55-60 tokens/second with the 0.5B draft. I did have to raise the allowed vocab-size difference in the compatibility check, as the vocab size difference between the 32B and 0.5B model is 128. Given the massive speed gains I am seeing, the current limit of 100 is probably too restrictive. Can't wait to see this available in the server!
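For reference, the check being discussed is the vocab-size compatibility guard in examples/speculative/speculative.cpp. The sketch below reproduces its rough shape from memory (the constant name, message, and exact API may differ from the source at this point in time), with the limit raised just enough for the 128-token difference to pass:

```cpp
// Approximate form of the draft/target vocab guard; names from memory.
#include <cstdio>
#include <cstdlib>
#include "llama.h"

// raised from 100 so that Qwen2.5-Coder 32B vs 0.5B (difference of 128) passes
#define SPEC_VOCAB_MAX_SIZE_DIFFERENCE 128

static bool spec_vocab_compatible(const llama_model * model_tgt, const llama_model * model_dft) {
    const int n_vocab_tgt = llama_n_vocab(model_tgt);
    const int n_vocab_dft = llama_n_vocab(model_dft);
    const int vocab_diff  = std::abs(n_vocab_tgt - n_vocab_dft);

    if (vocab_diff > SPEC_VOCAB_MAX_SIZE_DIFFERENCE) {
        fprintf(stderr, "draft and target vocab sizes differ too much (%d > %d)\n",
                vocab_diff, SPEC_VOCAB_MAX_SIZE_DIFFERENCE);
        return false;
    }
    return true;
}
```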
By the way, are those two changes you suggested going to be optional?
Seeing around 45 tokens/second with the code from the PR's branch, so the original was performing better in my testing. Using a 7900 XTX, in case that is relevant.
It is going to go backward in a big way with too much speculation, because there is a lot of overhead in computing all the logits on the target, and they get thrown out whenever the draft was not accurate. As far as I understand, there is no way to decode a parallel batch (a batch of the drafted samples on the target) and then decide later whether to produce the logits token by token, so all the logits have to be computed in parallel during the decode. That is a lot of overhead if they often all, or mostly, get tossed.

Speculative sampling on HumanEval (code) was covered in https://arxiv.org/pdf/2302.01318, and they didn't find diminishing returns all the way up to 7 tokens, but the efficiency curve is already flattening out at about 6 tokens. I would be very surprised to see efficiency gains much beyond that.

Still, even at 6 tokens there is some hope that the compute time of the batch is similar to a CPU->GPU layer weight copy, which would enable dynamic loading of the weights. I see that as the real big potential benefit of the speculative sampling approach (only the draft and one layer of the target would need to fit in the GPU), since it would get rid of all CPU compute for models that don't fit fully into GPU memory, and then 10x and higher speedups are inherent.
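To illustrate the flattening curve, here is a small back-of-the-envelope calculation (not data from the paper): assuming an independent per-token acceptance probability `alpha`, a draft of length `k` yields on average `(1 - alpha^(k+1)) / (1 - alpha)` tokens per target verification pass, so each extra drafted token buys less and less. The `alpha = 0.8` below is an illustrative value, not a measurement.

```cpp
// Expected tokens generated per target verification pass as a function of
// draft length, under an assumed independent per-token acceptance rate.
#include <cmath>
#include <cstdio>

int main() {
    const double alpha = 0.8; // assumed average acceptance rate (illustrative)

    for (int k = 1; k <= 12; ++k) {
        const double exp_tokens = (1.0 - std::pow(alpha, k + 1)) / (1.0 - alpha);
        printf("draft = %2d  ->  expected tokens per target pass = %.2f\n", k, exp_tokens);
    }
    return 0;
}
```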
While fixing #10289 and prompted by #5877 (comment) I did some tests with the new Qwen2.5-Coder models. I think the speculative approach can be viable with the following settings:
With these changes, typical coding assistance seems to benefit, since code blocks are speculated very efficiently, while during free-form text generation we don't waste time on speculation.
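As a rough illustration of why this helps: one way to avoid speculating on free-form text is to stop extending the draft as soon as the draft model's confidence drops. The sketch below uses hypothetical names (`build_draft`, `draft_next`, `p_min`, `n_draft_max`); it is not the PR's actual code or parameters, just the general shape of confidence-gated drafting.

```cpp
// Hypothetical sketch of confidence-gated drafting: predictable spans (e.g.
// code) get long drafts, free-form text gets little or no speculation.
#include <cstdint>
#include <functional>
#include <vector>

using llama_token = int32_t;

// one greedy step of the draft model: the sampled token and its probability
struct draft_step { llama_token tok; float p; };

std::vector<llama_token> build_draft(
        llama_token last_tok,
        int         n_draft_max, // illustrative knob, not the PR's parameter
        float       p_min,       // illustrative knob, not the PR's parameter
        const std::function<draft_step(llama_token)> & draft_next) {
    std::vector<llama_token> draft;

    for (int i = 0; i < n_draft_max; ++i) {
        const draft_step s = draft_next(last_tok);

        if (s.p < p_min) {
            break; // draft model is unsure -> don't waste a long target batch
        }

        draft.push_back(s.tok);
        last_tok = s.tok;
    }

    return draft; // verify with a single batched decode on the target
}
```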