TensorRT-LLM is 30-70% faster than llama.cpp on the same hardware #7043
KaelaSavia started this conversation in General
Replies: 2 comments
-
It's hard to take a comparison seriously when they start with a 25% smaller model, list that as an advantage, and then never bother to do any quality comparison. They also mention using …
Nonetheless, TensorRT is definitely faster than llama.cpp in pure GPU inference, and there are things that could be done to improve the performance of the CUDA backend, but this is not a good comparison.
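If anyone wants to sanity-check the speed claim themselves, the fairest first step is to run both runtimes on the exact same weights and quantization and measure tokens per second under identical prompt and generation settings. Here is a minimal sketch of the llama.cpp side using the llama-cpp-python bindings; the model path and generation settings are placeholders, not the configuration from the Reddit post:

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build)

# Placeholder model path: use the same weights/quantization you feed TensorRT-LLM.
llm = Llama(
    model_path="models/llama-2-7b.Q8_0.gguf",
    n_gpu_layers=-1,   # offload every layer to the GPU
    n_ctx=2048,
    verbose=False,
)

prompt = "Explain the difference between latency and throughput."
start = time.perf_counter()
out = llm(prompt, max_tokens=256, temperature=0.0)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.2f} s -> {generated / elapsed:.1f} tok/s")
```

Comparing the equivalent TensorRT-LLM run against a number produced this way, plus a quality check (e.g. perplexity) on the same data, would make the comparison far more meaningful.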
-
But TRTLLM doesn't support the P40.
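For context, the P40 is a Pascal card (compute capability 6.1), while TensorRT-LLM targets newer GPU generations, so this kind of comparison isn't even possible on that hardware. A quick way to check what your own GPU reports, assuming PyTorch is installed:

```python
import torch

# The Tesla P40 reports compute capability (6, 1), i.e. sm_61,
# which is older than the architectures TensorRT-LLM supports.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> sm_{major}{minor}")
```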
-
Hello,
I've been seeing news of TensorRT-LLM being quite a bit faster, and I've been wondering: is there any way we can resolve the performance discrepancy?
https://www.reddit.com/r/LocalLLaMA/comments/1cgofop/weve_benchmarked_tensorrtllm_its_3070_faster_on/
Would be quite neat to have a 70% perf boost, ngl.
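Before chasing a different runtime, it may also be worth confirming that the llama.cpp setup being compared is actually fully offloaded to the GPU, since partial offload is a common cause of large, misleading gaps. A small sketch with the llama-cpp-python bindings (the model path is a placeholder); with verbose=True the load log shows how many layers landed on the GPU:

```python
from llama_cpp import Llama

# n_gpu_layers=-1 requests full GPU offload; the verbose load log
# reports how many layers were actually placed on the GPU.
llm = Llama(
    model_path="models/llama-2-7b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,
    verbose=True,
)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```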