From b30dad36ed17b40c79bc9395734eb54808f78cdf Mon Sep 17 00:00:00 2001
From: Zihao Ye
Date: Sat, 3 Feb 2024 00:58:32 +0800
Subject: [PATCH] [Doc] Improve README and documentation. (#106)

---
 README.md                              |  5 ++++-
 docs/index.rst                         |  4 ++--
 docs/tutorials/recursive_attention.rst | 21 ++++++++++++---------
 3 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index 692ab2dd..1efb0a82 100644
--- a/README.md
+++ b/README.md
@@ -38,7 +38,10 @@ Using our PyTorch API is the easiest way to get started:
 We provide prebuilt wheels for Linux and you can try out FlashInfer with the following command:
 
 ```bash
-pip install flashinfer -i https://flashinfer.ai/whl/cu121/ # for CUDA 12.1, use cu118 for CUDA 11.8
+# For CUDA 12.1
+pip install flashinfer -i https://flashinfer.ai/whl/cu121/
+# For CUDA 11.8
+# pip install flashinfer -i https://flashinfer.ai/whl/cu118/
 ```
 
 or you can build from source:
diff --git a/docs/index.rst b/docs/index.rst
index d171a797..f13eda9d 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -8,7 +8,7 @@ Welcome to FlashInfer's documentation!
 
 `Blog `_ | `Discussion Forum `_ | `GitHub `_
 
-FlashInfer is a library for Language Languages Models that provides high-performance implementation of LLM GPU kernels such as FlashAttention, PageAttention and LoRA. FlashInfer focus on LLM serving and inference, and delivers state-the-art performance across diverse scenarios.
+FlashInfer is a library for Large Language Models that provides high-performance implementations of LLM GPU kernels such as FlashAttention, PageAttention and LoRA. FlashInfer focuses on LLM serving and inference, and delivers state-of-the-art performance across diverse scenarios.
 
 .. toctree::
    :maxdepth: 2
@@ -31,4 +31,4 @@ FlashInfer is a library for Language Languages Models that provides high-perform
    api/python/prefill
    api/python/cascade
    api/python/page
-   
\ No newline at end of file
+
diff --git a/docs/tutorials/recursive_attention.rst b/docs/tutorials/recursive_attention.rst
index ca14717f..260c1ff9 100644
--- a/docs/tutorials/recursive_attention.rst
+++ b/docs/tutorials/recursive_attention.rst
@@ -1,7 +1,7 @@
 .. _recursive-attention:
 
-Attention States and Recursive form of Self-Attention
-=====================================================
+Attention States and Recursive Attention
+========================================
 
 
 FlashInfer introduces the concept of **attention states**, which fully characterizes
@@ -21,23 +21,26 @@
 We can also generalize the value on index :math:`i` to index set :math:`I`:
 
 .. math::
 
-   \mathbf{v}(I)=\frac{\sum_{i\in I}\exp\left(s_i\right)\mathbf{v}_i}{\exp(s(I))}
+   \mathbf{v}(I) = \sum_{i\in I}\textrm{softmax}(s_i) \mathbf{v}_i = \frac{\sum_{i\in I}\exp\left(s_i\right)\mathbf{v}_i}{\exp(s(I))}
 
-The *attention state* of the index set :math:`i` can be defined as a tuple :math:`(s(I), \mathbf{v}(I))`.
-
-Then we can define the **merge** operator :math:`\oplus` of two attention states as:
+The :math:`\textrm{softmax}` function is restricted to the index set :math:`I`. Note that :math:`\mathbf{v}(\{1,2,\cdots, n\})` is the self-attention output of the entire sequence.
+The *attention state* of the index set :math:`I` can be defined as a tuple :math:`(s(I), \mathbf{v}(I))`. We can then define a binary **merge** operator :math:`\oplus` of two attention states (in practice we subtract the maximum value from :math:`s` to guarantee numerical stability; we omit this step here for simplicity):
 
 .. math::
 
   \begin{bmatrix}\mathbf{v}(I\cup J)\\s(I\cup J)\end{bmatrix}=\begin{bmatrix}\mathbf{v}(I)\\s(I)\end{bmatrix}\oplus\begin{bmatrix}\mathbf{v}(J)\\s(J)\end{bmatrix}=\begin{bmatrix} \frac{\mathbf{v}(I)\exp(s(I)) + \mathbf{v}(J)\exp(s(J))}{\exp(s(I)) + \exp(s(J))} \\ \log(\exp(s(I)) + \exp(s(J))) \end{bmatrix}
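+
+To make the merge operator concrete, below is a small NumPy sketch (illustrative only: it is not FlashInfer's implementation, and the helper names ``attention_state`` and ``merge_state`` are made up for this example). It checks that merging the attention states of two halves of a sequence reproduces full self-attention for a single query:
+
+.. code-block:: python
+
+   import numpy as np
+
+   def attention_state(q, k, v):
+       """Attention state (s(I), v(I)) of query q over a chunk of keys/values."""
+       s = k @ q                      # pre-softmax logits s_i for the chunk
+       m = np.max(s)                  # subtract the max for numerical stability
+       p = np.exp(s - m)
+       out = (p @ v) / p.sum()        # v(I): softmax-weighted average of values
+       lse = m + np.log(p.sum())      # s(I): log-sum-exp of the logits
+       return lse, out
+
+   def merge_state(state_a, state_b):
+       """Binary merge operator: (s(I), v(I)) ⊕ (s(J), v(J))."""
+       (s_a, v_a), (s_b, v_b) = state_a, state_b
+       m = max(s_a, s_b)              # shift by the max, as noted above
+       w_a, w_b = np.exp(s_a - m), np.exp(s_b - m)
+       return m + np.log(w_a + w_b), (w_a * v_a + w_b * v_b) / (w_a + w_b)
+
+   rng = np.random.default_rng(0)
+   q = rng.standard_normal(128)
+   k = rng.standard_normal((1024, 128))
+   v = rng.standard_normal((1024, 128))
+
+   # Attention over the whole sequence equals the merge of per-chunk states.
+   full = attention_state(q, k, v)
+   merged = merge_state(attention_state(q, k[:512], v[:512]),
+                        attention_state(q, k[512:], v[512:]))
+   assert np.allclose(full[0], merged[0])  # same log-sum-exp s
+   assert np.allclose(full[1], merged[1])  # same attention output v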
 
-The **attention state** on the entire sequence can be defined as:
+The **merge** operator can be generalized to any number of attention states:
 
 .. math::
+  \begin{bmatrix}\mathbf{v}(\bigcup_{i=1}^{n}I_i) \\ s(\bigcup_{i=1}^{n}I_i) \end{bmatrix} = \bigoplus_{i=1}^{n}\begin{bmatrix}\mathbf{v}(I_i) \\ s(I_i)\end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} \textrm{softmax}(s(I_i))\mathbf{v}(I_i) \\ \log(\sum_{i=1}^{n} \exp (s(I_i))) \end{bmatrix}
 
-   \begin{bmatrix}\mathbf{v}(\{1,2,\dots, n\})\\s(\{1,2,\dots, n\})\end{bmatrix} = \bigoplus_{i=1}^{n} \begin{bmatrix}\mathbf{v}_i\\s_i\end{bmatrix}
+The above n-ary merge operator is consistent with the binary merge operator, and we can prove the operator is *commutative* and *associative*. There are different ways to obtain the attention state of the entire sequence by merging the attention states of index subsets, and the final outcome is mathematically equivalent:
 
-Then :math:`\mathbf{v}(\{1,2,\dots, n\})` is the final attention output.
+.. image:: https://raw.githubusercontent.com/flashinfer-ai/web-data/main/tutorials/recursive-attention.png
+   :width: 600
+   :align: center
+   :alt: Recursive Attention
 
 .. note::