Replies: 2 comments
-
Please give our fork a look! Tenstorrent has implemented the paged kernels needed for vLLM: https://github.com/tenstorrent/vllm/blob/dev/tt_metal/README.md
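For readers unfamiliar with what "paged kernels" refers to here: vLLM stores the KV cache in fixed-size physical blocks, and each sequence keeps a block table mapping logical positions to those blocks. The following toy NumPy sketch illustrates that gather step only; the block layout, sizes, and function names are invented for illustration and are not the fork's actual implementation.

```python
import numpy as np

# Toy paged KV cache: fixed-size physical blocks, indexed per sequence
# through a "block table". All names/shapes here are illustrative only.
BLOCK_SIZE = 4
HEAD_DIM = 8
NUM_BLOCKS = 16

rng = np.random.default_rng(0)
kv_cache = rng.standard_normal((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM))

def gather_kv(block_table, seq_len):
    """Reassemble one sequence's contiguous K (or V) tensor from scattered blocks.

    block_table: physical block indices in logical order, e.g. [7, 2, 11]
    seq_len: number of valid tokens (the last block may be partially filled)
    """
    blocks = kv_cache[block_table]       # (num_blocks, BLOCK_SIZE, HEAD_DIM)
    flat = blocks.reshape(-1, HEAD_DIM)  # contiguous logical view
    return flat[:seq_len]                # trim padding in the last block

# A sequence of length 10 occupying physical blocks 7, 2, 11:
k = gather_kv(np.array([7, 2, 11]), seq_len=10)
print(k.shape)  # (10, 8)
```

A real paged-attention kernel fuses this gather into the attention computation instead of materializing the contiguous tensor, which is where the hardware-specific work lives.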
-
There are some RFCs related to hardware support in the issues; you can look into them.
-
Hi, vLLM community,
I want to make vLLM support a new piece of hardware, Tenstorrent's Grayskull (a general-purpose DLA, programmable much like CUDA devices, but not CUDA). After reading the documentation and the code, I have some understanding and some questions, and I need the community's help to check my understanding and clarify my thoughts. Please correct me if I have misunderstood anything.
My understandings
- At vLLM's core is PagedAttention, a highly optimized "memory paging mechanism" for the KV cache, implemented in CUDA (attention_kernel.cu).
- The CUDA operations are bound for use from Python in torch_bindings.cpp.
- To support new hardware, I would need to replace the PagedAttention CUDA kernel with a Tenstorrent Grayskull kernel (that will be a huge amount of work).

My questions
- In torch_binding.py I saw that a lot of operations are bound, but do I need to implement them all, or just paged_attention_v2()?
- Could I instead implement only a forward() function to adapt to vLLM's interface, without PagedAttention? Would it still work, just with worse performance?

Thank you for reading my long questions, and thanks in advance for the help :D
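On the second question, it may help to note that, functionally, PagedAttention computes ordinary scaled dot-product attention; the paging only changes how K and V are stored and read. So a correctness-first port could start from a plain, contiguous-KV forward pass and defer the paged kernel. Below is a minimal NumPy sketch of such a fallback; the function name and shapes are mine for illustration, not vLLM's actual backend API.

```python
import numpy as np

def naive_attention_forward(q, k, v):
    """Plain scaled dot-product attention over a contiguous KV cache.

    q: (1, d) query for the current decode step
    k, v: (t, d) keys/values for all cached positions
    Returns: (1, d) attention output.

    This is the same math a paged kernel computes; the paged version just
    reads k/v out of scattered cache blocks instead of one contiguous tensor.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (1, t)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over positions
    return weights @ v                             # (1, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8))
k = rng.standard_normal((5, 8))
v = rng.standard_normal((5, 8))
out = naive_attention_forward(q, k, v)
print(out.shape)  # (1, 8)
```

The trade-off is that without paging you lose the memory efficiency (no block-level sharing, more fragmentation), so throughput and maximum batch size suffer, but the outputs should match.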