speculative : experiments with Qwen2.5-Coder #10290
base: gg/speculative-fix-oob
Conversation
A draft size of 16 should make it viable to copy layers of the target model to the GPU dynamically during the batch processing / logit computation phase for the target, eliminating the need for large GPU memory to offload the target model. Fabrice Bellard recently added CPU offload to textsynth this way, and he says it is still fast with large batch sizes, though I have not had a chance to test it yet. Very rough testing of his GPU offload mode shows the latest textsynth to be about 3x faster than llama.cpp at prompt processing on a 4070 (I will be doing some detailed bench comparisons soon). Fabrice also does not keep the KV cache on the GPU; he keeps it in CPU memory, avoiding any need for large GPU memory for a large KV cache, which is a big advantage (while staying fast).

For my use cases (greedy sampling only), speculative decoding can be greatly simplified by only accepting greedily sampled matches between draft and target. If I could figure out how to implement Fabrice's dynamic layer-swap inference scheme in llama.cpp (not sure if the new backend scheme would allow me to create a new CUDA backend that works like that), I think it would open up large models on llama.cpp with consumer-grade GPUs. That is currently not viable, since any layers not offloaded cause a 10x or higher slowdown; RPC helps, but it would not be needed with dynamic layer offload, given the serial nature of layer processing.
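Not from this PR or from textsynth, but to make the idea concrete: below is a minimal C++/CUDA sketch of the kind of double-buffered per-layer weight streaming described above. `LayerWeights` and `compute_layer` are hypothetical stand-ins (llama.cpp's ggml backends do not expose such an interface), and for the copies to actually overlap with compute the host weights would need to live in pinned memory.

```cpp
// Sketch only: stream layer weights host -> GPU with a double buffer,
// overlapping the upload of layer i+1 with the compute of layer i.
#include <cuda_runtime.h>
#include <vector>

struct LayerWeights { const void * host_ptr; size_t nbytes; };

// stub: the real implementation would run attention + FFN for the batch
static void compute_layer(int /*il*/, const void * /*dev_weights*/, cudaStream_t /*stream*/) {}

void forward_streamed(const std::vector<LayerWeights> & layers, size_t max_layer_bytes) {
    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);

    void *      dev_buf [2];
    cudaEvent_t uploaded[2]; // weights for this buffer have arrived on the GPU
    cudaEvent_t computed[2]; // compute using this buffer has finished
    for (int i = 0; i < 2; ++i) {
        cudaMalloc(&dev_buf[i], max_layer_bytes);
        cudaEventCreate(&uploaded[i]);
        cudaEventCreate(&computed[i]);
    }

    for (size_t il = 0; il < layers.size(); ++il) {
        const int cur = il % 2;

        // don't overwrite this buffer until its previous compute is done
        cudaStreamWaitEvent(copy_stream, computed[cur], 0);
        cudaMemcpyAsync(dev_buf[cur], layers[il].host_ptr, layers[il].nbytes,
                        cudaMemcpyHostToDevice, copy_stream);
        cudaEventRecord(uploaded[cur], copy_stream);

        // compute the layer as soon as its weights have arrived
        cudaStreamWaitEvent(compute_stream, uploaded[cur], 0);
        compute_layer((int) il, dev_buf[cur], compute_stream);
        cudaEventRecord(computed[cur], compute_stream);
    }

    cudaStreamSynchronize(compute_stream);

    for (int i = 0; i < 2; ++i) {
        cudaFree(dev_buf[i]);
        cudaEventDestroy(uploaded[i]);
        cudaEventDestroy(computed[i]);
    }
    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(compute_stream);
}
```

Whether this pays off depends on the batch being large enough that a layer's compute time roughly covers the PCIe transfer of the next layer's weights, which is exactly the regime a draft of 16 tokens is meant to create.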
For what it's worth, I am seeing a speedup from ~24 tokens/second with the 32B coder model alone to 55-60 tokens/second with the 0.5B draft. I did have to raise the allowed vocab-size difference in the compatibility check, as the vocab size difference between the 32B and 0.5B model is 128. Given the massive speed gains I am seeing, the current limit of 100 is probably too restrictive. Can't wait to see this available in the server!
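For reference, the check being discussed is the vocab-size compatibility guard in examples/speculative/speculative.cpp. The sketch below reproduces its rough shape from memory (the constant name, message, and exact API may differ from the source at this point in time), with the limit raised just enough for the 128-token difference to pass:

```cpp
// Approximate form of the draft/target vocab guard; names from memory.
#include <cstdio>
#include <cstdlib>
#include "llama.h"

// raised from 100 so that Qwen2.5-Coder 32B vs 0.5B (difference of 128) passes
#define SPEC_VOCAB_MAX_SIZE_DIFFERENCE 128

static bool spec_vocab_compatible(const llama_model * model_tgt, const llama_model * model_dft) {
    const int n_vocab_tgt = llama_n_vocab(model_tgt);
    const int n_vocab_dft = llama_n_vocab(model_dft);
    const int vocab_diff  = std::abs(n_vocab_tgt - n_vocab_dft);

    if (vocab_diff > SPEC_VOCAB_MAX_SIZE_DIFFERENCE) {
        fprintf(stderr, "draft and target vocab sizes differ too much (%d > %d)\n",
                vocab_diff, SPEC_VOCAB_MAX_SIZE_DIFFERENCE);
        return false;
    }
    return true;
}
```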
By the way, are those two changes you suggested going to be optional?
Seeing around 45 tokens/second with the code from the PR's branch, so the original was performing better in my testing. Using a 7900 XTX, in case that is relevant.
It is going to go backward in a big way with too much speculation, because there is a lot of overhead in computing all the logits on the target, and they get thrown out whenever the draft was not accurate. As far as I understand, there is no way to decode a parallel batch (a batch of the drafted samples on the target) and then decide later whether to produce the logits token by token, so all the logits have to be computed in parallel during the decode. That is a lot of overhead if they often all, or mostly, get tossed.

Speculative sampling on HumanEval (code) was covered in https://arxiv.org/pdf/2302.01318, and they didn't find diminishing returns all the way up to 7 tokens, but the efficiency curve is already flattening out at about 6 tokens. I would be very surprised to see efficiency gains much beyond that.

Still, even at 6 tokens there is some hope that the compute time of the batch is similar to a CPU->GPU layer weight copy, which would enable dynamic loading of the weights. I see that as the real big potential benefit of the speculative sampling approach (only the draft and one layer of the target would need to fit in the GPU), since it would get rid of all CPU compute for models that don't fit fully into GPU memory, and then 10x and higher speedups are inherent.
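To illustrate the flattening curve, here is a small back-of-the-envelope calculation (not data from the paper): assuming an independent per-token acceptance probability `alpha`, a draft of length `k` yields on average `(1 - alpha^(k+1)) / (1 - alpha)` tokens per target verification pass, so each extra drafted token buys less and less. The `alpha = 0.8` below is an illustrative value, not a measurement.

```cpp
// Expected tokens generated per target verification pass as a function of
// draft length, under an assumed independent per-token acceptance rate.
#include <cmath>
#include <cstdio>

int main() {
    const double alpha = 0.8; // assumed average acceptance rate (illustrative)

    for (int k = 1; k <= 12; ++k) {
        const double exp_tokens = (1.0 - std::pow(alpha, k + 1)) / (1.0 - alpha);
        printf("draft = %2d  ->  expected tokens per target pass = %.2f\n", k, exp_tokens);
    }
    return 0;
}
```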
While fixing #10289 and prompted by #5877 (comment) I did some tests with the new Qwen2.5-Coder models. I think the speculative approach can be viable with the following settings:
With these changes, typical coding assistance seems to benefit, since code blocks are speculated very efficiently, while during free-form text generation we don't waste time on speculation.
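As a rough illustration of why this helps: one way to avoid speculating on free-form text is to stop extending the draft as soon as the draft model's confidence drops. The sketch below uses hypothetical names (`build_draft`, `draft_next`, `p_min`, `n_draft_max`); it is not the PR's actual code or parameters, just the general shape of confidence-gated drafting.

```cpp
// Hypothetical sketch of confidence-gated drafting: predictable spans (e.g.
// code) get long drafts, free-form text gets little or no speculation.
#include <cstdint>
#include <functional>
#include <vector>

using llama_token = int32_t;

// one greedy step of the draft model: the sampled token and its probability
struct draft_step { llama_token tok; float p; };

std::vector<llama_token> build_draft(
        llama_token last_tok,
        int         n_draft_max, // illustrative knob, not the PR's parameter
        float       p_min,       // illustrative knob, not the PR's parameter
        const std::function<draft_step(llama_token)> & draft_next) {
    std::vector<llama_token> draft;

    for (int i = 0; i < n_draft_max; ++i) {
        const draft_step s = draft_next(last_tok);

        if (s.p < p_min) {
            break; // draft model is unsure -> don't waste a long target batch
        }

        draft.push_back(s.tok);
        last_tok = s.tok;
    }

    return draft; // verify with a single batched decode on the target
}
```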