Replies: 2 comments
-
Please give our fork a look! Tenstorrent has implemented the paged kernels needed for vLLM: https://github.com/tenstorrent/vllm/blob/dev/tt_metal/README.md
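For readers unfamiliar with what "paged kernels" refers to here: vLLM stores the KV cache in fixed-size physical blocks, and each sequence keeps a block table mapping logical positions to those blocks. The following toy NumPy sketch illustrates that gather step only; the block layout, sizes, and function names are invented for illustration and are not the fork's actual implementation.

```python
import numpy as np

# Toy paged KV cache: fixed-size physical blocks, indexed per sequence
# through a "block table". All names/shapes here are illustrative only.
BLOCK_SIZE = 4
HEAD_DIM = 8
NUM_BLOCKS = 16

rng = np.random.default_rng(0)
kv_cache = rng.standard_normal((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM))

def gather_kv(block_table, seq_len):
    """Reassemble one sequence's contiguous K (or V) tensor from scattered blocks.

    block_table: physical block indices in logical order, e.g. [7, 2, 11]
    seq_len: number of valid tokens (the last block may be partially filled)
    """
    blocks = kv_cache[block_table]       # (num_blocks, BLOCK_SIZE, HEAD_DIM)
    flat = blocks.reshape(-1, HEAD_DIM)  # contiguous logical view
    return flat[:seq_len]                # trim padding in the last block

# A sequence of length 10 occupying physical blocks 7, 2, 11:
k = gather_kv(np.array([7, 2, 11]), seq_len=10)
print(k.shape)  # (10, 8)
```

A real paged-attention kernel fuses this gather into the attention computation instead of materializing the contiguous tensor, which is where the hardware-specific work lives.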
-
There are some RFCs related to hardware support in the issues; you can look into them.
-
Hi, vLLM community,
I want to make vLLM support a new piece of hardware, Tenstorrent's Grayskull (a general-purpose DLA, programmable much like CUDA devices, but not CUDA). After reading the documentation and the code, I have some understanding and some questions, and I need the community's help to check my understanding and clarify my thoughts. Please correct me if I have misunderstood anything.
My understandings
- At vLLM's core is PagedAttention, a highly optimized "memory paging mechanism" for the KV cache, implemented in CUDA (attention_kernel.cu).
- The CUDA operations are bound for use from Python in torch_bindings.cpp.
- To support new hardware, I would need to replace the PagedAttention CUDA kernel with a Tenstorrent Grayskull kernel (that will be a huge amount of work).

My questions
- In torch_binding.py I saw that a lot of operations are bound, but do I need to implement them all, or just paged_attention_v2()?
- Could I instead implement only a forward() function to adapt to vLLM's interface, without PagedAttention? Would it still work, just with worse performance?

Thank you for reading my long questions, and thanks in advance for the help :D
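On the second question, it may help to note that, functionally, PagedAttention computes ordinary scaled dot-product attention; the paging only changes how K and V are stored and read. So a correctness-first port could start from a plain, contiguous-KV forward pass and defer the paged kernel. Below is a minimal NumPy sketch of such a fallback; the function name and shapes are mine for illustration, not vLLM's actual backend API.

```python
import numpy as np

def naive_attention_forward(q, k, v):
    """Plain scaled dot-product attention over a contiguous KV cache.

    q: (1, d) query for the current decode step
    k, v: (t, d) keys/values for all cached positions
    Returns: (1, d) attention output.

    This is the same math a paged kernel computes; the paged version just
    reads k/v out of scattered cache blocks instead of one contiguous tensor.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (1, t)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over positions
    return weights @ v                             # (1, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8))
k = rng.standard_normal((5, 8))
v = rng.standard_normal((5, 8))
out = naive_attention_forward(q, k, v)
print(out.shape)  # (1, 8)
```

The trade-off is that without paging you lose the memory efficiency (no block-level sharing, more fragmentation), so throughput and maximum batch size suffer, but the outputs should match.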