
speculative : experiments with Qwen2.5-Coder #10290

Draft

wants to merge 1 commit into base: gg/speculative-fix-oob
Conversation

ggerganov (Owner)

While fixing #10289, and prompted by #5877 (comment), I did some tests with the new Qwen2.5-Coder models. I think the speculative approach can be viable with the following settings:

  • Large draft size (>= 16)
  • Draft only very high-probability tokens. Otherwise stop the draft early.
  • Don't evaluate draft batches with fewer than 4 tokens (no need to waste compute)

With these changes, typical coding assistance seems to benefit: code blocks are speculated very efficiently, while during free-form text generation we don't waste time on speculation.

./llama-speculative \
    -m  models/qwen2.5-32b-coder-instruct/ggml-model-q8_0.gguf \
    -md models/qwen2.5-0.5b-coder-instruct/ggml-model-q4_0.gguf \
    -f ./test.txt -c 8192 -ngl 99 -ngld 99 --draft 32 --color \
    --sampling-seq k --top-k 1 --temp 0.0 -fa
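
For illustration, a rough C++ sketch of the drafting heuristic above, assuming hypothetical names (draft_step, build_draft) and illustrative thresholds rather than the actual examples/speculative/speculative.cpp code:

#include <vector>

// Hypothetical per-step draft result: the draft model's greedy token and the
// probability it assigned to that token (names are illustrative, not from the PR).
struct draft_step {
    int   token;
    float p_top;
};

// Keep drafting only while the draft model is confident, cap the draft size,
// and drop drafts that are too short to be worth a target-side evaluation.
// In the real loop the draft model is sampled step by step; here the steps are
// passed in precomputed to keep the sketch self-contained.
static std::vector<int> build_draft(const std::vector<draft_step> & steps,
                                    int   n_draft_max = 32,
                                    float p_min       = 0.9f,
                                    int   n_min       = 4) {
    std::vector<int> draft;
    for (const auto & s : steps) {
        if ((int) draft.size() >= n_draft_max || s.p_top < p_min) {
            break; // stop the draft early on a low-confidence token
        }
        draft.push_back(s.token);
    }
    if ((int) draft.size() < n_min) {
        draft.clear(); // fewer than n_min tokens: skip the verification batch
    }
    return draft;
}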

ggerganov added the demo (Demonstrate some concept or idea, not intended to be merged) label on Nov 14, 2024
steampunque commented Nov 14, 2024

A draft size of 16 should make it viable to copy layers of the target model to the GPU dynamically during the batch processing / logit computation phase for the target, eliminating the need for large GPU memory to offload the target model. Fabrice Bellard recently added CPU offload to textsynth this way, and he says it is still fast with large batch sizes, though I have not had a chance to test it yet. Very rough testing of his GPU offload mode shows the latest textsynth to be about 3x faster than llama.cpp at prompt processing on a 4070 (I will be doing some detailed bench comparisons soon). Fabrice also does not keep the KV cache on the GPU; he keeps it in CPU memory, avoiding any need for large GPU memory for a large KV cache, which is a big advantage (while keeping it fast).

For my use cases (greedy sampling only), speculative decoding can be greatly simplified by only accepting tokens where the greedily sampled draft and target match. If I could figure out how to implement Fabrice's dynamic layer swap inference scheme in llama.cpp (not sure whether the new backend scheme would allow me to create a new CUDA backend that works like that), I think it would open up large models on llama.cpp with consumer-grade GPUs. That is currently not viable, since any layers not offloaded give a 10x or higher slowdown; RPC helps, but it would not be needed with dynamic layer offload, given the serial nature of layer processing.
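
A minimal sketch of the greedy-only acceptance rule described here, assuming hypothetical inputs (the drafted tokens and the target model's argmax at each drafted position); it is not the actual speculative.cpp implementation:

#include <cstdint>
#include <vector>

using llama_token = int32_t;

// Accept the longest prefix of the draft where the target model's greedy pick
// at each position equals the drafted token. The caller then emits the target's
// token at the first mismatching position as the usual "bonus" token.
static size_t accept_greedy_prefix(const std::vector<llama_token> & draft,
                                   const std::vector<llama_token> & target_argmax) {
    size_t n_accept = 0;
    while (n_accept < draft.size() &&
           n_accept < target_argmax.size() &&
           draft[n_accept] == target_argmax[n_accept]) {
        ++n_accept;
    }
    return n_accept;
}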

Mushoz commented Nov 14, 2024

For what it's worth, I am seeing a speedup from ~24 tokens/second with the 32B coder model alone to 55-60 tokens/second with --draft 10, using the 0.5B model as the draft model. That is with the current llama.cpp codebase, i.e. without the changes in this PR.

I did have to change:

#define SPEC_VOCAB_MAX_SIZE_DIFFERENCE 100

to

#define SPEC_VOCAB_MAX_SIZE_DIFFERENCE 150

as the vocab size difference between the 32B and 0.5B models is 128. Given the massive speed gains I am seeing, the current limit of 100 is probably too restrictive.
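
For context, a sketch of the kind of vocab-size guard that this constant controls (paraphrased, not the exact speculative.cpp source; n_vocab_tgt and n_vocab_dft are assumed to come from the loaded target and draft models):

#include <cstdio>
#include <cstdlib>

#define SPEC_VOCAB_MAX_SIZE_DIFFERENCE 150 // raised from 100 as described above

// Reject draft/target pairs whose vocab sizes differ by more than the threshold.
static void check_vocab_compat(int n_vocab_tgt, int n_vocab_dft) {
    const int diff = std::abs(n_vocab_tgt - n_vocab_dft);
    if (diff > SPEC_VOCAB_MAX_SIZE_DIFFERENCE) {
        fprintf(stderr, "draft model vocab must closely match target model (difference %d > %d)\n",
                diff, SPEC_VOCAB_MAX_SIZE_DIFFERENCE);
        exit(1);
    }
}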

Can't wait to see this available in the server!

Mushoz commented Nov 14, 2024

By the way, those 2 changes you suggested:

  • Draft only very high-probability tokens. Otherwise stop the draft early.
  • Don't evaluate draft batches with less than 4 tokens (no need to waste compute)

Are those going to be optional?

Mushoz commented Nov 14, 2024

Seeing around 45 tokens / second with the code from the PR's branch, so the original was performing better in my testing. Using a 7900 xtx in case that is relevant.

steampunque

> Seeing around 45 tokens / second with the code from the PR's branch, so the original was performing better in my testing. Using a 7900 xtx in case that is relevant.

It's going to go backward in a big way with too much speculation, since there is a lot of overhead in computing all the logits on the target, which get thrown out if the draft was not accurate. As far as I understand, there is no way to decode a parallel batch (a batch of the drafted samples on the target) and only later decide whether to compute the logits token by token, so it is necessary to compute all the logits in parallel during the decode, which is a lot of overhead if they often all or mostly get tossed.

Speculative sampling for HumanEval (code) was covered in https://arxiv.org/pdf/2302.01318; they did not find diminishing returns all the way up to 7 tokens, but the efficiency curve is already flattening out at about 6 tokens. I would be very surprised to see efficiency gains much beyond that. Still, even at 6 tokens there might be some small hope that the compute time of the batch is similar to a CPU->GPU layer weight copy, enabling dynamic loading of the weights, which I see as the real big potential benefit of the speculative sampling approach (only the draft and one layer of the target need to fit on the GPU), since it would get rid of all CPU compute for models that don't fit fully into GPU memory, and then 10x and higher speedups follow inherently.
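
To illustrate why the curve flattens, here is a small calculation using the standard expected-acceptance formula from the speculative sampling literature, E[tokens per target pass] = (1 - alpha^(gamma+1)) / (1 - alpha), where alpha is the per-token acceptance rate and gamma the draft length (alpha = 0.8 is an assumed value, not a measurement from this thread):

#include <cmath>
#include <cstdio>

int main() {
    const double alpha = 0.8; // assumed per-token acceptance rate (illustrative only)
    for (int gamma = 1; gamma <= 16; ++gamma) {
        // Expected tokens generated per target-model pass with a draft of length gamma.
        const double expected = (1.0 - std::pow(alpha, gamma + 1)) / (1.0 - alpha);
        printf("draft %2d -> %.2f tokens per target pass\n", gamma, expected);
    }
    return 0;
}

At alpha = 0.8 the value quickly approaches its asymptote of 1/(1 - alpha) = 5, consistent with the flattening around 6 drafted tokens; the plateau only moves out as alpha approaches 1, which is why drafting only very high-probability tokens (as in this PR) can make much larger draft sizes worthwhile for code.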

Copilot wasn't able to review any files in this pull request.

Files not reviewed (1)
  • examples/speculative/speculative.cpp: Language not supported
Labels: demo (Demonstrate some concept or idea, not intended to be merged), examples
3 participants