Integrate PagedAttention Optimization custom kernel into vLLM #22
Conversation
- Can we have VLLM_USE_ROCM_CUSTOM_PAGED_ATTN on by default, i.e. require it to be set to false to use the old kernel? That's what we do with Triton vs. CK. (A possible default-on parse is sketched below.)
- Can you add the env variable setting to our performance guide? https://github.com/ROCm/vllm/blob/main/ROCm_performance.md
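A default-on environment variable could be parsed along these lines. This is only a minimal sketch of the requested behavior, not the code in this PR, and the helper name is hypothetical:

```python
import os

def use_rocm_custom_paged_attn() -> bool:
    # Hypothetical helper: treat the custom kernel as enabled unless the
    # variable is explicitly set to a falsy value such as "0" or "false".
    value = os.environ.get("VLLM_USE_ROCM_CUSTOM_PAGED_ATTN", "1")
    return value.strip().lower() not in ("0", "false", "off", "")
```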
The paged attention custom reduction kernel supports a maximum context length of 16K; beyond that it can produce incorrect results. A fix for this, along with further reduction-kernel performance improvements, will be available in a later drop.
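A caller-side guard for this limitation might look like the sketch below. The constant and function names are illustrative assumptions, not vLLM's actual dispatch logic:

```python
# Fall back to the existing paged attention path when the context
# exceeds 16K tokens, per the limitation noted above.
MAX_CUSTOM_KERNEL_CONTEXT_LEN = 16 * 1024

def should_use_custom_kernel(max_context_len: int, custom_enabled: bool) -> bool:
    return custom_enabled and max_context_len <= MAX_CUSTOM_KERNEL_CONTEXT_LEN
```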
@shajrawi Added to the documentation and enabled by default. Support for context lengths <= 32K will be added in the next PR after this one.
lgtm
This PR introduces the custom optimized PagedAttention kernel on ROCm. To use the custom PagedAttention kernel, set the following environment variable:
VLLM_USE_ROCM_CUSTOM_PAGED_ATTN=1
Currently this PagedAttention kernel supports the fp16 datatype, GQA ratios from 1 to 16, head size 128, and block size 16.
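As a rough illustration of the supported configuration listed above, a dispatch check might look like the following sketch. The function name and exact conditions are assumptions for illustration, not the kernel's actual dispatch code:

```python
import torch

def can_use_custom_paged_attn(dtype: torch.dtype,
                              num_query_heads: int,
                              num_kv_heads: int,
                              head_size: int,
                              block_size: int) -> bool:
    # Constraints taken from the PR description; this helper itself is
    # illustrative and not part of this PR.
    gqa_ratio = num_query_heads // num_kv_heads
    return (dtype == torch.float16
            and 1 <= gqa_ratio <= 16
            and head_size == 128
            and block_size == 16)
```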
The kernel was authored by @sanyalington.
cc: @shajrawi @sunway513