-
It's hard to answer this without more details about how you are implementing this. The backends have full control over how their memory is allocated, so the way you are formulating the question makes me think that you are modifying the CPU backend instead of creating a new backend, which would not be the recommended way to do this.
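To make that concrete: a backend owns its buffers, so the memory it hands out can come from anywhere, including a physically contiguous region exported by a DMA driver. Below is a minimal sketch assuming a u-dma-buf-style driver; the device node and sysfs path are assumptions to adapt to your platform. It wraps the mapping with `ggml_backend_cpu_buffer_from_ptr` as a shortcut; a proper new backend would implement its own buffer type instead (see `ggml-backend-impl.h` for the interface, which changes between versions).

```c
// Sketch: expose a DMA-capable region to ggml as a backend buffer.
// The /dev and /sys paths below assume a u-dma-buf style driver and
// are hypothetical -- adjust them for your FPGA platform.
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#include "ggml-backend.h"

int main(void) {
    const size_t size = 64u * 1024 * 1024; // size of the exported DMA region

    // Map the physically contiguous buffer exported by the driver.
    int fd = open("/dev/udmabuf0", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }
    void * base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    // The driver, not mmap, fixes the physical address; read it back so
    // it can be programmed into the accelerator's DMA engine.
    unsigned long phys_addr = 0;
    FILE * f = fopen("/sys/class/u-dma-buf/udmabuf0/phys_addr", "r");
    if (f) { fscanf(f, "%lx", &phys_addr); fclose(f); }

    // Shortcut: let ggml place tensors directly in this region by
    // wrapping the pointer as a host buffer. Tensors allocated here
    // then need no staging copy before a DMA transfer.
    ggml_backend_buffer_t buf = ggml_backend_cpu_buffer_from_ptr(base, size);

    // ... allocate tensors / load weights into `buf` here ...

    ggml_backend_buffer_free(buf);
    munmap(base, size);
    close(fd);
    return 0;
}
```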
-
My overall goal is to run LLM inference with FPGA-based accelerators using llama.cpp.
Currently I have managed to achieve this, but there is a bottleneck when moving the weight data from normal memory (the llama.cpp-allocated data structures) to the mmapped buffers allocated for DMA transfers.
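Concretely, the copy path I am describing looks roughly like this (identifiers are illustrative, not actual llama.cpp symbols):

```c
// Simplified view of the current bottleneck (illustrative names only).
// The weights already sit in memory owned by llama.cpp; before every
// DMA transfer they are copied again into the DMA-capable region.
#include <string.h>
#include <stddef.h>

void stage_tensor_for_dma(void * dma_buf,           /* mmapped DMA region */
                          const void * tensor_data, /* weights in normal memory */
                          size_t nbytes) {
    memcpy(dma_buf, tensor_data, nbytes); /* the copy I want to eliminate */
}
```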
My idea is that if we could implement a way to load the entire model, or parts of it, directly into the DMA buffers, I could potentially see significant performance gains when accelerating with DMA stream-based accelerators.
I know there is currently an option to mmap the model, but there seems to be no way to specify the physical address associated with the mmapped buffers.
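To illustrate why this matters: with the existing mmap path the kernel picks the physical pages backing the mapping, and they are neither contiguous nor pinned, so they cannot be handed to a DMA engine. What I am after is roughly the second function below, where the loader reads the weights straight into a driver-provided DMA region (a sketch with error handling omitted; `dma_base` would come from mapping the DMA driver's device node):

```c
// Plain file mmap vs. loading straight into a DMA region (sketch,
// error handling omitted).
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Current situation: the kernel chooses (and may move) the physical
// pages behind this mapping, so a DMA engine cannot use them directly.
void * map_model_file(const char * path, size_t size) {
    int fd = open(path, O_RDONLY);
    void * p = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    return p;
}

// What I want: fill the driver-provided DMA region directly while
// loading, so no second copy is needed before each transfer.
void * load_into_dma_region(const char * path, void * dma_base, size_t size) {
    int fd = open(path, O_RDONLY);
    ssize_t n = read(fd, dma_base, size); // chunked pread in practice
    close(fd);
    return n == (ssize_t) size ? dma_base : NULL;
}
```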
Any advice or pointers on where to look in the code would be helpful.
Also, let me know if there are any details I should add to make the discussion more fruitful.