Why is vLLM CPU backend using oneDNN kernels? #10694

sanketkaleoss · 2024-11-27T03:59:04Z

I was reviewing the logs of the kernels being called during vLLM CPU inference and noticed that it invokes CPU kernels written in C++ with intrinsics. However, the majority of CPU utilization is attributed to OpenBLAS and oneDNN. My question is: what component is responsible for calling oneDNN kernels, and why are the C++ kernels necessary if Torch is managing everything?

@bigPYJ1151, could you please explain this behavior? I need this clarification to optimize ARM CPU inference performance.

Answered by bigPYJ1151

bigPYJ1151 · 2024-11-27T05:09:02Z

oneDNN is used in two places:

1. nn.Linear on CPU uses oneDNN by default.
2. vLLM INT8 models need an INT8 GEMM kernel. The CUDA backend builds it on CUTLASS, and the CPU backend builds it on oneDNN.
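
To make "INT8 GEMM kernel" concrete, here is a minimal pure-Python sketch of what such a kernel computes. It assumes symmetric per-tensor quantization, which is an assumption for illustration; the thread does not say which scheme vLLM uses. The function names are hypothetical, and this is not how oneDNN or CUTLASS implement it. Real kernels do the same arithmetic with vectorized int8 instructions.

```python
# Illustrative only: reference semantics of a symmetric per-tensor INT8 GEMM.
# Not oneDNN, CUTLASS, or vLLM code; all names here are hypothetical.

def quantize(matrix, scale):
    """Map floats to integers in [-127, 127] using a single scale."""
    return [[max(-127, min(127, round(v / scale))) for v in row] for row in matrix]


def int8_gemm(a_q, b_q, a_scale, b_scale):
    """Multiply quantized matrices with integer accumulation, then dequantize."""
    rows, inner, cols = len(a_q), len(b_q), len(b_q[0])
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            acc = 0  # stands in for the int32 accumulator of a real kernel
            for k in range(inner):
                acc += a_q[i][k] * b_q[k][j]
            # One floating-point rescale per output element.
            out_row.append(acc * a_scale * b_scale)
        out.append(out_row)
    return out


a = [[0.5, -1.0], [0.25, 0.75]]
b = [[1.0, 0.0], [-0.5, 0.5]]
a_scale = b_scale = 1.0 / 127  # max |value| is 1.0 in both matrices
result = int8_gemm(quantize(a, a_scale), quantize(b, b_scale), a_scale, b_scale)
```

The result approximates the float product of a and b up to quantization error. The inner loop works only on integers, which is why a dedicated kernel is needed rather than the floating-point path used by nn.Linear.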

8 replies

bigPYJ1151 Nov 27, 2024

In my experience, the main hotspots are the linear layers and paged attention. The linear layers are compute-bound and need high-throughput FMA instructions. Paged attention is heavily memory-bound, so optimizing memory access matters more there.
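
The compute-bound versus memory-bound distinction can be estimated with arithmetic intensity (FLOPs per byte moved) and a simple roofline model. The sketch below is a back-of-the-envelope calculation, not a measurement. The shapes, the 2-byte element size, and the machine numbers (2000 GFLOP/s peak, 200 GB/s bandwidth) are all hypothetical, and it ignores caches and paging overhead.

```python
# Back-of-the-envelope arithmetic intensity (FLOPs per byte moved).
# Shapes and hardware numbers are hypothetical, for illustration only.

def linear_intensity(tokens, in_features, out_features, bytes_per_elem=2):
    """GEMM of (tokens x in) by (in x out): 2*M*K*N FLOPs."""
    flops = 2 * tokens * in_features * out_features
    bytes_moved = bytes_per_elem * (
        tokens * in_features            # read activations
        + in_features * out_features    # read weights
        + tokens * out_features         # write outputs
    )
    return flops / bytes_moved


def decode_attention_intensity(context_len, head_dim, bytes_per_elem=2):
    """One query vector against the cached K and V of a single head."""
    flops = 4 * context_len * head_dim                     # q.K^T plus weights.V
    bytes_moved = bytes_per_elem * 2 * context_len * head_dim  # read K and V
    return flops / bytes_moved


def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline model: the lower of the compute roof and the bandwidth roof."""
    return min(peak_gflops, bandwidth_gbs * intensity)


linear = linear_intensity(tokens=256, in_features=4096, out_features=4096)
attention = decode_attention_intensity(context_len=2048, head_dim=128)

linear_roof = attainable_gflops(linear, peak_gflops=2000, bandwidth_gbs=200)
attention_roof = attainable_gflops(attention, peak_gflops=2000, bandwidth_gbs=200)
```

Under these assumptions the batched linear layer reaches a few hundred FLOPs per byte and hits the compute roof, while decode attention sits at about 1 FLOP per byte and is capped by bandwidth. The linear-layer figure depends on batching: with tokens=1 the same formula also gives roughly 1 FLOP per byte.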

sanketkaleoss Nov 27, 2024
Author

Just one more question: the CPU utilization of paged_attention_v1 is less than 1%, whereas matmul takes up the majority of the utilization. Why does the paged_attention kernel account for such a small portion of the utilization?

bigPYJ1151 Nov 27, 2024

I'm not sure about your profiling method. With benchmark_throughput, the llama3-8b model, and 1000 input prompts on a 32-core x86 platform with AMX, the linear layers and paged attention take ~60% and ~25% of the time, respectively. If your CPU has no matrix multiplication accelerator, the linear layers will take more time and paged attention's share will drop.
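
The effect on the shares can be worked through with simple arithmetic. This sketch starts from the approximate ~60% and ~25% figures above and lumps the remaining ~15% into "other". The 4x GEMM slowdown and the 2x attention speedup are hypothetical factors chosen for illustration, not measurements from any ARM or x86 system.

```python
# Illustrative arithmetic based on the approximate shares quoted in the thread.
# The slowdown and speedup factors are hypothetical.

def rescaled_shares(shares, op, factor):
    """Return new time shares after `op` becomes `factor` times slower."""
    times = dict(shares)
    times[op] *= factor
    total = sum(times.values())
    return {name: t / total for name, t in times.items()}


def overall_speedup(share, kernel_speedup):
    """Amdahl's law: end-to-end speedup when one part of the run gets faster."""
    return 1.0 / ((1.0 - share) + share / kernel_speedup)


baseline = {"linear": 0.60, "paged_attention": 0.25, "other": 0.15}

# Assumed: GEMM runs 4x slower without a matrix accelerator.
no_accelerator = rescaled_shares(baseline, "linear", 4.0)

# Assumed: a better paged attention kernel runs 2x faster on the baseline machine.
attention_2x = overall_speedup(baseline["paged_attention"], 2.0)
```

With these assumed factors, paged attention falls from 25% to about 9% of the run time, and doubling the speed of a kernel with a 25% share gives an end-to-end speedup of about 1.14x. The smaller a kernel's share, the less its intrinsics matter to overall throughput.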

amd-lalithnc Nov 27, 2024

Thanks for the information. I was just wondering if using better intrinsics in CPU kernels would improve overall performance, or if most of the CPU utilization is caused by Torch kernels, in which case it might not matter. I read that most of the kernels will be invoked by torch.compile in vLLM in the future. Could you elaborate on that?

What is the current state of torch.compile in vLLM for CPUs? Are there any roadmap items or updates related to PyTorch's Inductor path?

sanketkaleoss Nov 28, 2024
Author

@bigPYJ1151 any update on this?
