Replies: 4 comments 1 reply
-
It currently only supports "in-house" loras. https://github.com/ggerganov/llama.cpp/tree/master/examples/finetune
I missed this one. So much going on in this project. 😅 https://github.com/ggerganov/llama.cpp/tree/master/examples/export-lora
Don't be shy. Poke around the examples. That's the whole point.
-
Here is a copy/paste from my terminal session...
chris@Chris-Mac-mini llama.cpp % ./main --help | grep LoRA
-
Thanks for the response, guys! I have been creating LoRA adapters with mlx_lm.lora, but the output is in safetensors, and since the convert-lora-to-ggml.py script has been dropped from the project, I opened a request in the mlx project to export LoRAs as ggml. They were asking if llama.cpp supports LoRA in gguf, I think because they can already merge/fuse base models and LoRA adapters in the gguf format. I was able to modify an old copy of convert-lora-to-ggml.py to export the safetensors as ggml; now I see why it was dropped: it looks like layer names and formats keep changing.
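For anyone trying the same thing, here is a rough sketch of what that kind of conversion involves: load the safetensors adapter, remap the Hugging Face/PEFT-style tensor names to llama.cpp's naming, and serialize each tensor into a ggla-style container. The name patterns, magic/version values, and alignment below are assumptions based on an old copy of convert-lora-to-ggml.py and will likely need adjusting as layer names change:

```python
# Rough sketch of a safetensors -> ggml LoRA conversion. The tensor-name mapping
# and the ggla header layout are ASSUMPTIONS taken from an old convert-lora-to-ggml.py
# and may not match what current llama.cpp expects.
import re
import struct
import numpy as np
from safetensors.numpy import load_file

# Assumed mapping from HF/PEFT projection names to llama.cpp tensor names.
NAME_MAP = {
    "q_proj": "attn_q", "k_proj": "attn_k", "v_proj": "attn_v",
    "o_proj": "attn_output", "gate_proj": "ffn_gate",
    "down_proj": "ffn_down", "up_proj": "ffn_up",
}

def translate_name(hf_name: str) -> str:
    # e.g. "...layers.0.self_attn.q_proj.lora_A.weight" -> "blk.0.attn_q.weight.loraA"
    m = re.search(r"layers\.(\d+)\.(?:self_attn|mlp)\.(\w+)\.lora_(A|B)\.weight", hf_name)
    if m is None:
        raise ValueError(f"unhandled tensor name: {hf_name}")
    layer, proj, ab = m.groups()
    return f"blk.{layer}.{NAME_MAP[proj]}.weight.lora{ab}"

def write_ggla(tensors: dict, out_path: str, lora_r: int, lora_alpha: int) -> None:
    with open(out_path, "wb") as f:
        f.write(struct.pack("<I", 0x67676C61))            # magic "ggla" (assumed)
        f.write(struct.pack("<I", 1))                     # format version (assumed)
        f.write(struct.pack("<ii", lora_r, lora_alpha))
        for hf_name, data in tensors.items():
            name = translate_name(hf_name).encode("utf-8")
            data = data.astype(np.float32)
            f.write(struct.pack("<iii", data.ndim, len(name), 0))  # 0 = F32
            for dim in reversed(data.shape):
                f.write(struct.pack("<i", dim))
            f.write(name)
            f.write(b"\x00" * (-f.tell() % 32))           # pad data to 32-byte alignment
            data.tofile(f)

if __name__ == "__main__":
    # "adapters.safetensors" and the r/alpha values are hypothetical; take them
    # from the adapter_config.json of the actual mlx_lm.lora run.
    tensors = load_file("adapters.safetensors")
    write_ggla(tensors, "adapters.ggla", lora_r=16, lora_alpha=32)
```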
-
I've run into more or less all of the problems mentioned here. I can load a ggla into memory, but I don't think it's being properly applied. I'm still going through all of the LoRA code to see whether the problem is with inference or with ggml misapplying the LoRAs at load time. As an aside, I ended up writing my own npz/safetensors -> ggla converter, but I think one of the problems there is that the attn_q layers need to be transposed/permuted correctly for llama3 because of the layout ggml expects. This is going to be more or less the same for LoRAs with any model that llama.cpp expects to be in a particular format.
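For reference, this is a sketch of the head-interleaving row permutation that llama.cpp's HF-to-GGUF converters apply to attn_q/attn_k weights. If that is what an adapter is missing, it should only affect the lora_B factor, since delta_W = B @ A and the permutation acts on output rows. The shapes and the exact call sites are illustrative assumptions:

```python
# Sketch of the attn_q/attn_k row permutation used by llama.cpp HF converters
# (rotary-embedding layout difference). For a LoRA, permuting the rows of
# delta_W = B @ A is the same as permuting the rows of B; lora_A is untouched.
import numpy as np

def permute_rows(w: np.ndarray, n_head: int) -> np.ndarray:
    # Reorder output rows from (head, half, rot) to (head, rot, half) ordering.
    return (w.reshape(n_head, 2, w.shape[0] // n_head // 2, *w.shape[1:])
             .swapaxes(1, 2)
             .reshape(w.shape))

# Hypothetical shapes for a q_proj adapter.
n_head, d_model, r = 8, 512, 8
lora_a = np.random.randn(r, d_model).astype(np.float32)   # (r, in)
lora_b = np.random.randn(d_model, r).astype(np.float32)   # (out, r)
lora_b_permuted = permute_rows(lora_b, n_head)
# Sanity check: permuting B then multiplying equals permuting the full delta.
assert np.allclose(lora_b_permuted @ lora_a, permute_rows(lora_b @ lora_a, n_head))
```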
-
I know llama.cpp supports GGML, but I was wondering if GGUF is supported.