Decoding Attention

Decoding Attention is optimized for multi-head attention (MHA) in the decoding stage of LLM inference and runs on CUDA cores. It draws mainly on OpenPPL and Flash Attention. Its aim is to address the low tensor core utilization of Flash Attention in the decoding stage and to support more attention types and KV cache quantization optimizations. The computation is given by the expression below, where the tensors Q, K, V and O are FP16 or BF16. In some LLM decoding scenarios, Decoding Attention outperforms Flash Decoding (Flash Attention) and FlashInfer. It also supports variable-length, GQA / MQA and ALiBi inference scenarios.

O = Softmax(Q * K^T) * V
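
To make the expression concrete, here is a minimal pure-Python reference for one head with a single query token (Seq Q = 1), which is the shape of the decoding stage. It is only a sketch for checking results on tiny inputs, not the CUDA kernel in this repository. The function names and the `scale` parameter are assumptions for illustration: `scale=1.0` follows the expression as written, while `1/sqrt(head_dim)` is the common scaled-dot-product choice.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_attention_head(q, keys, values, scale=1.0):
    """Attention output for one head and one query token.

    q:      head_dim floats (the single query vector)
    keys:   seq_k rows of head_dim floats (the K cache)
    values: seq_k rows of head_dim floats (the V cache)
    scale:  hypothetical knob; 1.0 matches O = Softmax(Q * K^T) * V as written
    """
    # Q * K^T collapses to one dot product per cached key.
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    head_dim = len(values[0])
    # Softmax(...) * V is a weighted sum of the cached value rows.
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(head_dim)]


if __name__ == "__main__":
    # Both keys score equally, so the output is the mean of the value rows.
    print(decode_attention_head([1.0, 0.0],
                                [[0.0, 0.0], [0.0, 0.0]],
                                [[1.0, 2.0], [3.0, 4.0]]))
```

With Seq Q = 1, both matrix products reduce to matrix-vector products over the KV cache, which is why the decoding stage leaves little work for tensor cores.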

Support

Variable Length: Variable kv length inference
GQA / MQA: Group query attention / multi query attention inference
ALiBi: Attention with linear biases inference
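
The GQA / MQA and ALiBi entries above can be illustrated with a short sketch. It shows the usual head-sharing rule for grouped queries and the standard ALiBi slope schedule for a power-of-two head count. All function names are hypothetical, and the assumption that the decoding query sits at the last position (`seq_k - 1`) is mine rather than something stated in this README.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Return the K/V head shared by a given query head.

    MHA: num_kv_heads == num_q_heads. MQA: num_kv_heads == 1. GQA: in between.
    """
    if num_kv_heads <= 0 or num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a positive multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def alibi_slopes(num_heads):
    """ALiBi slopes 2^(-8 * (h + 1) / num_heads) for a power-of-two head count."""
    if num_heads <= 0 or num_heads & (num_heads - 1) != 0:
        raise ValueError("this sketch only covers power-of-two head counts")
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_decode_bias(slope, seq_k):
    """Bias added to each key score, assuming the query is at position seq_k - 1."""
    return [-slope * (seq_k - 1 - j) for j in range(seq_k)]


if __name__ == "__main__":
    print(kv_head_for_query_head(5, num_q_heads=32, num_kv_heads=8))
    print(alibi_slopes(8))
    print(alibi_decode_bias(0.5, 3))
```

The bias is added to the attention scores before the softmax, so more distant keys are penalized linearly and each head uses its own slope.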

Environment

OS: Linux
CMake Version: >= 3.16
GCC Version: >= 5.0
CUDA Version: >= 11.4
Others: gflags, ccache

sudo apt-get install libgflags-dev ccache
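
Before building, it can help to confirm that the installed tools meet the minimum versions listed above. The following is a heuristic sketch, not part of this repository: it runs each tool with `--version` and takes the first dotted number in the output, which may misread unusual version banners. The helper names are hypothetical.

```python
import re
import shutil
import subprocess

# Minimum versions copied from the Environment list above.
MINIMUMS = {"cmake": (3, 16), "gcc": (5, 0), "nvcc": (11, 4)}


def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints, or None."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(found, minimum):
    """Compare two version tuples, padding the shorter one with zeros."""
    width = max(len(found), len(minimum))
    found_padded = tuple(found) + (0,) * (width - len(found))
    minimum_padded = tuple(minimum) + (0,) * (width - len(minimum))
    return found_padded >= minimum_padded


def installed_version(tool):
    """Run `<tool> --version` and parse the output; None if the tool is not on PATH."""
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, "--version"], capture_output=True, text=True)
    return parse_version(result.stdout)


if __name__ == "__main__":
    for tool, minimum in MINIMUMS.items():
        found = installed_version(tool)
        if found is None:
            print(f"{tool}: not found or version not recognized")
        else:
            status = "ok" if meets_minimum(found, minimum) else "too old"
            print(f"{tool}: {'.'.join(map(str, found))} ({status})")
```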

Clone

git clone https://github.com/Bruce-Lee-LY/decoding_attention.git

CPP API

Build

NVIDIA A100

cd decoding_attention
./build_cpp.sh -a 80 -t Release -b OFF
./build_cpp.sh -a 80 -t Debug -b OFF

RTX3080Ti / RTX3090 / RTX A6000

cd decoding_attention
./build_cpp.sh -a 86 -t Release -b OFF
./build_cpp.sh -a 86 -t Debug -b OFF

Test

./run_cpp.sh

Benchmark

./run_cpp.sh

Performance

Process the C++ results in the log and plot them as a line chart.

cd tools/performance/cpp
./performance.sh

Python API

Install

cd decoding_attention
./install_python.sh

Test

./run_python.sh

Benchmark

./run_python.sh

Performance

Process the Python results in the log and plot them as a line chart.

cd tools/performance/python
./performance.sh

RTX3090

CUDA Version: 12.1
Head Num: 32
Head Dim: 128
Data Type: FP16

Seq Len

The performance of Decoding Attention is better when the sequence length is below 1536, while the performance of Flash Decoding (Flash Attention) and FlashInfer is better when the sequence length is above 1536.

Batch Size: 1
Seq Q: 1
Seq K: Seq Len
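
A back-of-envelope estimate helps explain why decoding attention is limited by memory traffic rather than by arithmetic. The sketch below is my own rough model, not a measurement from this repository. It counts only the two matrix-vector products and one read of the K and V caches, and it assumes plain MHA, where every query head has its own K/V head. The function name is hypothetical.

```python
def decode_attention_cost(batch, num_heads, head_dim, seq_k, bytes_per_elem=2):
    """Rough FLOP and byte counts for one decoding step (Seq Q = 1).

    Softmax and the Q/O traffic are ignored as lower-order terms.
    bytes_per_elem=2 corresponds to FP16 or BF16.
    """
    # 2 * seq_k * head_dim FLOPs for Q * K^T, and the same again for P * V.
    flops = batch * num_heads * 4 * seq_k * head_dim
    # K and V are each read once from the cache.
    kv_bytes = batch * num_heads * 2 * seq_k * head_dim * bytes_per_elem
    return flops, kv_bytes, flops / kv_bytes


if __name__ == "__main__":
    # Configuration from this section: Head Num 32, Head Dim 128, FP16, Batch Size 1.
    flops, kv_bytes, intensity = decode_attention_cost(1, 32, 128, 1536)
    print(flops, kv_bytes, intensity)
```

Under these assumptions the ratio is about 1 FLOP per byte for FP16 at any sequence length, far below what a GPU needs to keep its matrix units busy.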

Batch Size

Regardless of batch size, Decoding Attention performs better than Flash Decoding (Flash Attention) and FlashInfer.

Batch Size: Batch Size
Seq Q: 1
Seq K: 128

Reference

TODO

Kernel Optimization
KV Cache Quantization: FP8, Int8, Int4
