TensorRT-LLM is 30-70% faster than llama.cpp on the same hardware #7043
KaelaSavia started this conversation in General
Replies: 2 comments
-
It's hard to take a comparison seriously when they start with a 25% smaller model, list that as an advantage, and then never bother to do any quality comparison. They also mention using …
Nonetheless, TensorRT is definitely faster than llama.cpp in pure GPU inference, and there are things that could be done to improve the performance of the CUDA backend, but this is not a good comparison.
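If anyone wants to sanity-check the speed claim themselves, the fairest first step is to run both runtimes on the exact same weights and quantization and measure tokens per second under identical prompt and generation settings. Here is a minimal sketch of the llama.cpp side using the llama-cpp-python bindings; the model path and generation settings are placeholders, not the configuration from the Reddit post:

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build)

# Placeholder model path: use the same weights/quantization you feed TensorRT-LLM.
llm = Llama(
    model_path="models/llama-2-7b.Q8_0.gguf",
    n_gpu_layers=-1,   # offload every layer to the GPU
    n_ctx=2048,
    verbose=False,
)

prompt = "Explain the difference between latency and throughput."
start = time.perf_counter()
out = llm(prompt, max_tokens=256, temperature=0.0)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.2f} s -> {generated / elapsed:.1f} tok/s")
```

Comparing the equivalent TensorRT-LLM run against a number produced this way, plus a quality check (e.g. perplexity) on the same data, would make the comparison far more meaningful.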
-
But TRTLLM doesn't support the P40.
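For context, the P40 is a Pascal card (compute capability 6.1), while TensorRT-LLM targets newer GPU generations, so this kind of comparison isn't even possible on that hardware. A quick way to check what your own GPU reports, assuming PyTorch is installed:

```python
import torch

# The Tesla P40 reports compute capability (6, 1), i.e. sm_61,
# which is older than the architectures TensorRT-LLM supports.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> sm_{major}{minor}")
```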
-
Hello,
I've been seeing news of TensorRT-LLM being quite a bit faster, and I've been wondering: is there any way we can resolve the performance discrepancy?
https://www.reddit.com/r/LocalLLaMA/comments/1cgofop/weve_benchmarked_tensorrtllm_its_3070_faster_on/
Would be quite neat to have a 70% perf boost, ngl.
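Before chasing a different runtime, it may also be worth confirming that the llama.cpp setup being compared is actually fully offloaded to the GPU, since partial offload is a common cause of large, misleading gaps. A small sketch with the llama-cpp-python bindings (the model path is a placeholder); with verbose=True the load log shows how many layers landed on the GPU:

```python
from llama_cpp import Llama

# n_gpu_layers=-1 requests full GPU offload; the verbose load log
# reports how many layers were actually placed on the GPU.
llm = Llama(
    model_path="models/llama-2-7b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,
    verbose=True,
)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```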