System Info

CPU architecture: x86_64
GPU: 8 NVIDIA H200
Libraries
TensorRT-LLM: v0.14.0
CUDA: 12.4
NVIDIA driver version: 550.127.05

Setup Info

I'm attempting to use Medusa with TensorRT-LLM to accelerate inference of a fine-tuned Llama 3.1 70B model originally in FP16 precision. To achieve this, I first converted the model to FP8 precision and built it using the following commands:
quantize.py --model_dir=<FINE-TUNED MODEL DIR> --dtype=float16 --tp_size=1 --output_dir=<QUANTIZED MODEL DIR> --qformat=fp8 --kv_cache_dtype=fp8 --calib_dataset=<CALIB DATASET> --calib_size=512 --batch_size=8 --calib_max_seq_length=1024
trtllm-build --checkpoint_dir=<QUANTIZED MODEL DIR> --max_beam_width=1 --max_seq_len=131072 --max_input_len=130560 --max_num_tokens=32768 --max_batch_size=8 --context_fmha=enable --output_dir=<OUT DIR> --use_fp8_context_fmha=disable
I used this FP8 model to distill a dataset and then trained 3 Medusa heads. When evaluated on a validation dataset, the Medusa heads achieved the following token prediction accuracies with respect to the tokens generated by the original FP16 fine-tuned model:
TopK=0
> Head 0 Accuracy=0.6837761270606081
> Head 1 Accuracy=0.32617484167971394
> Head 2 Accuracy=0.1807497640902462
TopK=4
> Head 0 Accuracy=0.8547368890673448
> Head 1 Accuracy=0.5451475708643937
> Head 2 Accuracy=0.35749212612899667
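The per-head numbers above were computed along these lines (a simplified sketch; the tensor layout, the names, the use of PyTorch, and the mapping from the reported TopK values to a number of candidates are assumptions, not the exact evaluation code):

# Sketch: top-k accuracy of one Medusa head against the reference continuation.
# head_logits: [num_positions, vocab_size] logits produced by that head
# target_ids:  [num_positions] reference token the head should predict,
#              taken from the FP16 fine-tuned model's output
import torch

def head_topk_accuracy(head_logits: torch.Tensor, target_ids: torch.Tensor, num_candidates: int) -> float:
    # Count a hit when the reference token appears among the head's
    # num_candidates highest-scoring tokens at that position.
    topk_ids = head_logits.topk(num_candidates, dim=-1).indices      # [num_positions, num_candidates]
    hits = (topk_ids == target_ids.unsqueeze(-1)).any(dim=-1)        # [num_positions]
    return hits.float().mean().item()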
These results indicate that the Medusa heads are predicting the base model's tokens with reasonable accuracy.
Next, I built an FP8 model with Medusa heads and set max_draft_len=1:
quantize.py --model_dir=<FINE-TUNED MODEL DIR> --dtype=float16 --tp_size=1 --output_dir=<QUANTIZED MODEL DIR> --qformat=fp8 --kv_cache_dtype=fp8 --calib_dataset=<CALIB DATASET> --calib_size=512 --batch_size=8 --calib_max_seq_length=1024 --max_draft_len=1 --num_medusa_heads=3 --num_medusa_layers=1 --medusa_model_dir=<MEDUSA MODEL DIR>
trtllm-build --checkpoint_dir=<QUANTIZED MODEL DIR> --max_beam_width=1 --max_seq_len=131072 --max_input_len=130560 --max_num_tokens=32768 --max_batch_size=8 --context_fmha=enable --output_dir=<OUT DIR> --use_fp8_context_fmha=disable --speculative_decoding_mode=medusa --max_draft_len=1
Running this engine built with Medusa and a comparable engine built without Medusa in a framework that uses TensorRT-LLM's in-flight batching implementation, I observed the following p99 inference latencies:
FP8 model without Medusa: 2.526s
FP8 model with Medusa and medusa_choices="[[0]]": 2.271s
I'm adding the medusa_choices in the code as follows:
import ast

import tensorrt_llm.bindings.executor as trtllm  # assumed import behind the "trtllm" alias

# Attach the Medusa choices tree (e.g. "[[0]]") to the decoding config.
decoding_config = trtllm.DecodingConfig()
if medusa_choices is not None:
    decoding_config.medusa_choices = ast.literal_eval(medusa_choices)

executor_config = trtllm.ExecutorConfig(
    max_beam_width=max_beam_width,
    max_batch_size=max_batch_size,
    max_num_tokens=max_num_tokens,
    batching_type=trtllm.BatchingType.INFLIGHT,
    scheduler_config=trtllm.SchedulerConfig(trtllm.CapacitySchedulerPolicy.GUARANTEED_NO_EVICT),
    kv_cache_config=kv_cache_config,
    decoding_config=decoding_config,
    enable_chunked_context=enable_chunked_context,
    gpu_weights_percent=1,
)
session = trtllm.Executor(model_path, trtllm.ModelType.DECODER_ONLY, executor_config)
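For completeness, a minimal sketch of how a request can then be submitted to this executor, and of what a larger medusa_choices tree looks like. The token IDs and the example tree are made up for illustration, and the exact keyword for the token budget in the Request constructor differs between TensorRT-LLM versions, so positional arguments are used here:

# Sketch only: submit one request to the executor created above.
# The input IDs below are placeholders; they would come from the tokenizer.
input_ids = [1, 15043, 29892, 920]
request = trtllm.Request(input_ids, 64)  # (input_token_ids, max number of new tokens)
request_id = session.enqueue_request(request)
responses = session.await_responses(request_id)

# A larger, hypothetical Medusa choices tree: each inner list is a path of
# top-k indices through the heads, e.g. [0, 1] means head 0's best candidate
# followed by head 1's second-best. The number of nodes in the tree must not
# exceed the engine's max_draft_len.
larger_tree = "[[0], [1], [0, 0], [0, 1]]"
decoding_config.medusa_choices = ast.literal_eval(larger_tree)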
Questions

1. When I build a similar engine with max_draft_len=17 and run it with the same medusa_choices="[[0]]", I notice a clear increase in inference latency (p99 of 2.918s). Is this expected behavior due to the increased max_draft_len, even though I'm specifying to use only topk 0 of the first head?
2. Do you have any benchmarks that demonstrate the overhead introduced by increasing the Medusa choice tree size (and the max_draft_len with it)?
Hi @ValeGian
For (1), I think this is expected, for the following reason (it is a matter of compile-time-known vs. runtime-known dimensions).
With max_draft_len=1, TensorRT can choose a kernel that carries no extra overhead (e.g., no loop over draft tokens).
vs.
With max_draft_len=17, even when running with a single Medusa choice, TensorRT still has to build an engine that is valid for every draft length from 1 to 17. It may therefore pick a different kernel with an extra loop, and an optimization strategy that is balanced across all values from 1 to 17 rather than tuned for 1 as in the previous case. That added flexibility, and the need for reasonable performance across all possible shapes, can cost some performance.
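To illustrate the general principle with plain TensorRT (a toy sketch, not TRT-LLM code; the tensor name, shapes, and ranges below are made up):

# Toy illustration: the wider the dynamic range a dimension can take at runtime,
# the more shapes every selected kernel tactic must stay valid (and reasonably
# fast) for, which can exclude the fastest kernels available for a narrow range.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)

# A trivial network over a [batch, num_tokens, hidden] tensor.
inp = network.add_input("hidden_states", trt.float16, (-1, -1, 8192))
network.mark_output(network.add_identity(inp).get_output(0))

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
# max_draft_len=1  -> at most 1 + 1 = 2 tokens per decoding step
# max_draft_len=17 -> anywhere from 1 to 18 tokens per decoding step,
#                     so tactics must hold across the whole [1, 18] range.
profile.set_shape("hidden_states", (1, 1, 8192), (8, 2, 8192), (8, 18, 8192))  # (min, opt, max)
config.add_optimization_profile(profile)

engine = builder.build_serialized_network(network, config)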
For (2), unfortunately we do not have any benchmarks yet that demonstrate the impact of these parameters. We may add some in the near future.