-
It's hard to answer this without more details about how you are implementing this. The backends have full control over how their memory is allocated, so the way you are formulating the question makes me think that you are modifying the CPU backend instead of creating a new backend, which would not be the recommended way to do this.
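To make that concrete: a backend owns its buffers, so the memory it hands out can come from anywhere, including a physically contiguous region exported by a DMA driver. Below is a minimal sketch assuming a u-dma-buf-style driver; the device node and sysfs path are assumptions to adapt to your platform. It wraps the mapping with `ggml_backend_cpu_buffer_from_ptr` as a shortcut; a proper new backend would implement its own buffer type instead (see `ggml-backend-impl.h` for the interface, which changes between versions).

```c
// Sketch: expose a DMA-capable region to ggml as a backend buffer.
// The /dev and /sys paths below assume a u-dma-buf style driver and
// are hypothetical -- adjust them for your FPGA platform.
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#include "ggml-backend.h"

int main(void) {
    const size_t size = 64u * 1024 * 1024; // size of the exported DMA region

    // Map the physically contiguous buffer exported by the driver.
    int fd = open("/dev/udmabuf0", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }
    void * base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    // The driver, not mmap, fixes the physical address; read it back so
    // it can be programmed into the accelerator's DMA engine.
    unsigned long phys_addr = 0;
    FILE * f = fopen("/sys/class/u-dma-buf/udmabuf0/phys_addr", "r");
    if (f) { fscanf(f, "%lx", &phys_addr); fclose(f); }

    // Shortcut: let ggml place tensors directly in this region by
    // wrapping the pointer as a host buffer. Tensors allocated here
    // then need no staging copy before a DMA transfer.
    ggml_backend_buffer_t buf = ggml_backend_cpu_buffer_from_ptr(base, size);

    // ... allocate tensors / load weights into `buf` here ...

    ggml_backend_buffer_free(buf);
    munmap(base, size);
    close(fd);
    return 0;
}
```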
-
My overall goal is to run LLM inference with FPGA-based accelerators using llama.cpp.
Currently I have managed to achieve this, but there is a bottleneck when moving the weight data from normal memory (the llama.cpp-allocated data structures) to the mmapped buffers allocated for DMA transfers.
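Concretely, the copy path I am describing looks roughly like this (identifiers are illustrative, not actual llama.cpp symbols):

```c
// Simplified view of the current bottleneck (illustrative names only).
// The weights already sit in memory owned by llama.cpp; before every
// DMA transfer they are copied again into the DMA-capable region.
#include <string.h>
#include <stddef.h>

void stage_tensor_for_dma(void * dma_buf,           /* mmapped DMA region */
                          const void * tensor_data, /* weights in normal memory */
                          size_t nbytes) {
    memcpy(dma_buf, tensor_data, nbytes); /* the copy I want to eliminate */
}
```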
My idea is that if we could implement a way to load the entire model, or parts of it, directly into the DMA buffers, I could potentially see significant performance gains when accelerating with DMA stream-based accelerators.
I know there is currently an option to mmap the model, but there seems to be no way to specify the physical address associated with the mmapped buffers.
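To illustrate why this matters: with the existing mmap path the kernel picks the physical pages backing the mapping, and they are neither contiguous nor pinned, so they cannot be handed to a DMA engine. What I am after is roughly the second function below, where the loader reads the weights straight into a driver-provided DMA region (a sketch with error handling omitted; `dma_base` would come from mapping the DMA driver's device node):

```c
// Plain file mmap vs. loading straight into a DMA region (sketch,
// error handling omitted).
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Current situation: the kernel chooses (and may move) the physical
// pages behind this mapping, so a DMA engine cannot use them directly.
void * map_model_file(const char * path, size_t size) {
    int fd = open(path, O_RDONLY);
    void * p = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    return p;
}

// What I want: fill the driver-provided DMA region directly while
// loading, so no second copy is needed before each transfer.
void * load_into_dma_region(const char * path, void * dma_base, size_t size) {
    int fd = open(path, O_RDONLY);
    ssize_t n = read(fd, dma_base, size); // chunked pread in practice
    close(fd);
    return n == (ssize_t) size ? dma_base : NULL;
}
```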
Any advice or pointers on where to look in the code would be helpful.
Also, let me know if there are any details I should add to make the discussion more fruitful.