Support LoRA hotswapping and multiple LoRAs at a time #1817
Conversation
Still working on this. Just added support to the OpenAI-compatible server for hot-swapping LoRAs via model aliases. This allows fast serving of different LoRA adapters that extend the same base model with minimal switching overhead. Example server config:

```json
{
  "host": "0.0.0.0",
  "port": 8080,
  "models": [
    {
      "model_alias": "mistral",
      "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
      "verbose": true
    },
    {
      "model_alias": "mistral-magicoder",
      "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
      "lora_adapters": {
        "./magicoder-lora-mistral-7b-v0.1.gguf": 1.0
      },
      "verbose": true
    },
    {
      "model_alias": "mistral-conllpp",
      "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
      "lora_adapters": {
        "./conllpp-lora-mistral-7b-v0.1.gguf": 1.0
      },
      "verbose": true
    }
  ]
}
```

Then calling the OpenAI-compatible API with one of these model aliases serves the corresponding LoRA configuration, hot-swapping adapters over the shared base model.
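For example, a request naming one of the aliases is served with that alias's LoRA configuration. A minimal sketch, assuming the server above is running locally on port 8080 and using the openai client package (not part of this PR):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local llama-cpp-python server.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

# "mistral-magicoder" is a model_alias from the config above; the server
# applies the matching LoRA adapter on top of the shared base model.
response = client.completions.create(
    model="mistral-magicoder",
    prompt="Write a Python function that reverses a string.",
    max_tokens=64,
)
print(response.choices[0].text)
```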
This seems to be a cool feature to have. Any idea when this will be available?

The code is pretty much done and working. I plan to tidy it up a little this weekend, ready for review and (hopefully) merge.

Thanks Rich. Let me know when I can try it out.
The code is ready for review now, thanks for your patience! @hrsmanian, if you want to try it out before it is merged, a guide for usage is here: https://gist.github.com/richdougherty/dd4961a72fafbff5216b4bc9f48b1147
dest="lora", | ||
) | ||
|
||
class MultiTupleAction(argparse.Action): |
Needed this fancy argparse action to match the llama.cpp argument format, which takes two arguments:
https://github.com/ggerganov/llama.cpp/blob/master/common/arg.cpp#L1546-L1551
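For reference, a minimal sketch of this kind of argparse action (not the exact MultiTupleAction from this PR) that accepts repeated --lora-scaled PATH SCALE pairs might look like:

```python
import argparse

class PathScaleAction(argparse.Action):
    """Collect (path, scale) tuples from repeated --lora-scaled PATH SCALE flags."""

    def __call__(self, parser, namespace, values, option_string=None):
        path, scale = values  # nargs=2 guarantees exactly two values per occurrence
        items = list(getattr(namespace, self.dest) or [])
        items.append((path, float(scale)))
        setattr(namespace, self.dest, items)

parser = argparse.ArgumentParser()
parser.add_argument(
    "--lora-scaled",
    nargs=2,
    metavar=("PATH", "SCALE"),
    action=PathScaleAction,
    dest="lora_scaled",
    default=[],
)

args = parser.parse_args(["--lora-scaled", "adapter_a.gguf", "0.5",
                          "--lora-scaled", "adapter_b.gguf", "1.0"])
print(args.lora_scaled)  # [('adapter_a.gguf', 0.5), ('adapter_b.gguf', 1.0)]
```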
```python
):
    print("error: failed to apply lora adapter")
    return
for lora_path, scale in [(pth, 1.0) for pth in self.params.lora] + self.params.lora_scaled:
```
I didn't test this extensively, but this code at least worked up to this point - the actual example failed later for me, for unrelated reasons.
```python
# when the llama_lora_adapters are freed.
def clear_lora_adapter():
    self.lora_adapter = None
self.model._exit_stack.callback(clear_lora_adapter)
```
This seemed to be a clean way to keep the reference back to the parent `LlamaModel` up to date.
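As an illustration of the pattern (a simplified sketch, not this PR's code): registering a callback on the owner's exit stack means the back-reference is cleared automatically when the underlying resources are released.

```python
import contextlib

class AdapterHolder:
    """Toy example: clears its adapter reference when its exit stack is closed."""

    def __init__(self):
        self._exit_stack = contextlib.ExitStack()
        self.lora_adapter = object()  # stand-in for a loaded adapter handle
        self._exit_stack.callback(self._clear_lora_adapter)

    def _clear_lora_adapter(self):
        self.lora_adapter = None

    def close(self):
        self._exit_stack.close()  # frees resources and runs the callback

holder = AdapterHolder()
holder.close()
assert holder.lora_adapter is None  # reference did not outlive the resource
```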
```diff
@@ -243,7 +242,7 @@ def __init__(
         ) # keep a reference to the array so it is not gc'd
         self.model_params.tensor_split = self._c_tensor_split
         self.model_params.vocab_only = vocab_only
-        self.model_params.use_mmap = use_mmap if lora_path is None else False
+        self.model_params.use_mmap = use_mmap
```
Memory mapping is supported for LoRAs now because llama.cpp no longer merges the LoRA into the base model, so the original model GGUF is unchanged and can be mapped.
See the equivalent change in llama.cpp:
https://github.com/ggerganov/llama.cpp/pull/8332/files#diff-201cbc8fd17750764ed4a0862232e81503550c201995e16dc2e2766754eaa57aL688
```python
self._stack.callback(free_lora_adapter)

# Dict from LoRA path to wrapper
self._lora_adapters_paths: Dict[str, internals.LlamaLoraAdapter] = {}
```
The `Llama` wrapper maintains a map to the low-level wrappers. We use the LoRA adapter path to key these objects. In theory we could track them using some sort of handle object, but this seemed OK for now.
```diff
@@ -174,8 +174,7 @@ def __init__(
             offload_kqv: Offload K, Q, V to GPU.
             flash_attn: Use flash attention.
             last_n_tokens_size: Maximum number of tokens to keep in the last_n_tokens deque.
-            lora_base: Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model.
-            lora_path: Path to a LoRA file to apply to the model.
+            lora_adapters: Paths to LoRA adapter files and the scale to apply to them at (scale of 0.0 will not be used during inference).
```
This is a breaking API change, reflecting the change upstream. Is this OK?
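For illustration, the new constructor argument would be used roughly like this (a sketch with placeholder paths, reflecting the docstring above):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-v0.1.Q4_K_S.gguf",
    lora_adapters={
        "./magicoder-lora-mistral-7b-v0.1.gguf": 1.0,  # applied at full strength
        "./conllpp-lora-mistral-7b-v0.1.gguf": 0.0,    # loaded but not used during inference
    },
)
```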
```diff
@@ -453,6 +434,7 @@ def free_lora_adapter():
         self._candidates = internals.LlamaTokenDataArray(n_vocab=self._n_vocab)

         self.n_tokens = 0
+        self.tokens_lora_adapters: Tuple[Tuple[str, float]] = () # Adapters that processed tokens
```
Tracks the LoRA adapters that were used to generate the `n_tokens`. A call to `reset()` sets this to the currently active adapters.
```python
self.lora_adapters[lora_path] = scale
self._lora_adapters_active = tuple(sorted(
    filter(lambda path_scale: path_scale[1] != 0.0, self.lora_adapters.items())
))
```
Cache a tuple with the active adapters - sorted and filtered so it can be used as a canonical cache key.
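A small illustration of why sorting and filtering gives a canonical key (names here are illustrative, not the PR's exact code):

```python
def active_adapters(lora_adapters: dict) -> tuple:
    # Drop zero-scaled adapters and sort by path so equivalent configurations
    # always produce the same tuple.
    return tuple(sorted((p, s) for p, s in lora_adapters.items() if s != 0.0))

a = {"b.gguf": 1.0, "a.gguf": 0.5, "c.gguf": 0.0}
b = {"a.gguf": 0.5, "c.gguf": 0.0, "b.gguf": 1.0}
assert active_adapters(a) == active_adapters(b)  # insertion order and zero scales don't matter
```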
```python
if type(key) == LlamaCacheKey:
    return key
else:
    return LlamaCacheKey(active_lora_adapters=(), tokens=tuple(key))
```
Provides backwards compatibility for existing code which passes a `Sequence[int]` to access the cache. This format is still supported, although existing cached values won't be found.
```python
values.pop('lora_adapters', None)  # Different LoRA adapters can be hot-swapped
return values

if hot_swappable_settings(new_settings) == hot_swappable_settings(current_settings):
```
This fast path lets us avoid loading the model again if it is the same model. If we are able to hot-swap then we can just update the LoRA adapter scales (loading new LoRAs if needed) and then exit early.
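A hedged sketch of the idea (helper and attribute names follow the diff and the rest of this discussion, but this is not the PR's exact code):

```python
def hot_swappable_settings(settings: dict) -> dict:
    values = dict(settings)
    values.pop("lora_adapters", None)  # different LoRA adapters can be hot-swapped
    return values

def try_hot_swap(llm, current_settings: dict, new_settings: dict) -> bool:
    """Return True if the existing model can be reused with updated LoRA scales."""
    if hot_swappable_settings(new_settings) != hot_swappable_settings(current_settings):
        return False  # different base model or config: a full reload is needed

    desired = new_settings.get("lora_adapters") or {}
    current = llm.lora_adapters or {}
    for path in set(desired) | set(current):
        scale = desired.get(path, 0.0)
        if current.get(path, 0.0) != scale:
            llm.set_lora_adapter_scale(path, scale)  # loads the LoRA if needed
    return True
```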
Thanks Rich. Seems to work as expected overall.
```python
self.lora_scale = lora_scale
self.lora_path = lora_path
self.lora_adapters = (
    lora_adapters if lora_adapters is None else {}
```
@richdougherty this should be changed to the below; otherwise, even when a lora_adapters dictionary is passed it will be set to {}:

```python
lora_adapters if lora_adapters else None
```
That sounds like a memory leak for sure. Can you give me a bit more information about how you're running it? Are you loading (and unloading) LoRA adapters? Are you using the HTTP server or the API? Also, any info on how you're measuring memory usage, how fast it is growing, etc. might be useful. Thanks!
@richdougherty, I am running offline inference in a loop, loading one conversation at a time:

```python
import llama_cpp

llm = llama_cpp.Llama("gguf_models/Llama-3.2-3B-Instruct-f16.gguf", n_gpu_layers=-1, verbose=False, n_ctx=6000)
# loading both the adapters
infile = "val_English.jsonl"

prompt_1 = f"""<|start_header_id|>system<|end_header_id|> You are a helpful AI assistant for conversation summarization<|eot_id|><|start_header_id|>user<|end_header_id|> {model_prompt}: {dialogue}<|eot_id|><|start_header_id|>assistant<|end_header_id|> """
```
Hi, is any more information needed on this? Kindly let me know.
Hi @hrsmanian, apologies for the time to get back. Based on your explanation - GPU memory usage increasing - it sounds like a leak in the GPU memory allocated by llama.cpp. This could indicate either a bug in llama.cpp's LoRA adapter code or - more likely! - a bug in the bindings that I wrote. A bug in the llama-cpp-python bindings in this PR would be something like incorrectly using the llama.cpp API, causing extra GPU allocations for LoRA adapters and perhaps failing to deallocate them.

Unfortunately, I only have a CPU for inference, but I believe I should still be able to spot an incorrect usage of the llama.cpp LoRA API by watching RAM usage for the Python process. This is because llama.cpp stores the models and adapters in process RAM when using CPU inference, so any mistakes with allocating or deallocating LoRA adapters should show up directly as an increase in the process's virtual memory usage.

Note: a Python object leak in the wrapper objects might not show up, since the Python heap is garbage-collected and might not visibly grow with each object. But the memory allocated by llama.cpp does not go through the Python garbage-collected heap - it is allocated directly - so watching the Python process memory should be good enough to show the kind of leak you're describing.

### Test script

I've written a little script to test this, using psutil to log the process memory usage at each step. I do not see a leak using this to test - certainly nothing on the order of 4MB for each inference. Would you mind checking the script on your machine as well? I can also test your script if you want to link to the models / adapters and dataset you use, but I understand that these might be private. You can try it using the commands below.

For my tests I am using the model and adapters described in the guide I wrote before: https://gist.github.com/richdougherty/dd4961a72fafbff5216b4bc9f48b1147 . This test only uses very short prompts and completions. I tried a slightly larger prompt and completion further down in this comment, but perhaps you can test using your dataset as well.

```bash
export MODEL_GGUF=$(huggingface-cli download TheBloke/Mistral-7B-v0.1-GGUF mistral-7b-v0.1.Q4_K_S.gguf)
export ADAPTER1_GGUF=./adapters/lora_tldr_headline_gen.gguf
export ADAPTER2_GGUF=./adapters/lora_tldr_content_gen.gguf

pip install psutil
python memtest.py 2>&1
```
memtest.py:

```python
import os

model_gguf = os.environ['MODEL_GGUF']
adapter1_gguf = os.environ['ADAPTER1_GGUF']
adapter2_gguf = os.environ['ADAPTER2_GGUF']

import psutil

process = psutil.Process()

prev_vms = 0
prev_rss = 0

def log_mem(msg):
    global prev_rss, prev_vms
    pmem = process.memory_info()
    vms = pmem.vms
    rss = pmem.rss
    delta_vms = vms - prev_vms
    delta_rss = rss - prev_rss
    print(f'====== {msg:<40} {vms:>16,} ({delta_vms:>+16,}) {rss:>16,} ({delta_rss:>+16,}) ======')
    prev_vms = vms
    prev_rss = rss

log_mem('initial')

import llama_cpp
log_mem('imported llama_cpp')

llm = llama_cpp.Llama(model_gguf)
log_mem('loaded model')

i = 0
for i in range(0, 100):
    # Create a pattern of enablement so we can see all patterns of enabled/disabled
    # as well as having sequences where no changes happen.
    desired_adapter1_scale = i // 2 % 2 * 1.0  # Enable 2 out of every 4 times
    desired_adapter2_scale = i // 4 % 2 * 1.0  # Enable 4 out of every 8 times

    # Check current state - note that we treat the initial state when they are not
    # loaded as 0.0 to ensure we have a couple of tests without them loaded
    lora_adapters = llm.lora_adapters or {}
    current_adapter1_scale = lora_adapters.get(adapter1_gguf, 0.0)
    current_adapter2_scale = lora_adapters.get(adapter2_gguf, 0.0)

    if current_adapter1_scale != desired_adapter1_scale:
        llm.set_lora_adapter_scale(adapter1_gguf, desired_adapter1_scale)
        log_mem(f'after set adapter 1 scale {desired_adapter1_scale}')
    if current_adapter2_scale != desired_adapter2_scale:
        llm.set_lora_adapter_scale(adapter2_gguf, desired_adapter2_scale)
        log_mem(f'after set adapter 2 scale {desired_adapter2_scale}')

    llm.create_completion(seed=12345, temperature=0, max_tokens=16, prompt=str(i))
    log_mem(f'after completion "{i}"')
```

When I run this I see initial allocations in virtual memory (first column), but it stays stable after the adapters have been loaded. The RAM usage stays the same after various loads and unloads.
### Small RSS changes

Note that I am seeing a small (128k) occasional increase in resident memory usage (last column), which could be a different kind of leak - for example, Python VM operations such as GC not reclaiming everything straight away. I don't think this is your memory leak, though, because I would expect llama.cpp-allocated memory to be reflected in an increase in virtual memory (first column), not just an increase in resident memory. Nonetheless, it is worth keeping an eye on.
### Testing larger prompt and completion size

The previous test only used short numbers as prompts, with a very small max token size. A slightly larger test might show a leak. I patched the test to use:

```python
llm.create_completion(seed=12345, temperature=0, max_tokens=256, prompt=str(i) + ' the quick brown fox jumped over the lazy dog who knows what will come next with a longer prompt')
```
Thanks Rich. I am still seeing a memory leak on the GPU. Will try a previous build without your changes and keep you posted.
Thanks for checking. To confirm, you ran the script I posted above? If you are still seeing the leak, my theory is that there's a leak in the llama.cpp CUDA implementation, which is why you're seeing it but I'm not seeing it with the CPU backend. Currently I don't think the leak is in the Python bindings, because if it were then we should see it for both backends. This is just my theory, though - I would definitely want more info to confirm it, e.g. testing different backends, or trying to replicate it in llama.cpp directly. If you're able, running the script above would be good. If you don't have a chance, I should be able to use a cloud server with a GPU to test. (I am investigating how to do that.) Thanks a lot for your interest and for testing!
Have a decent repro now:

```
====== after completion "1" 49,240,797,184 ( +532,480) 1,255,649,280 ( +659,456) ======
```

Now below is the memory log when the adapter is set - GPU memory increasing constantly:

```
====== after completion "1" 49,240,805,376 ( +532,480) 1,255,985,152 ( +598,016) ======
```
And if I set the adapter only once, outside the loop, then there is no increase in GPU memory:

```
====== after completion "0" 49,362,345,984 ( +34,562,404,352) 1,307,557,888 ( +186,347,520) ======
```
Interesting behavior. If I just set one adapter outside the loop, increase max_tokens to 256, and start inference, I see memory increase by 8MB across inferences:

```
====== after completion "0" 49,363,697,664 ( +34,563,756,032) 1,309,958,144 ( +188,268,544) ======
```
Thanks for confirming that. To summarise the info:

(Note: I assume this means the adapter is set in the loop using the code I sent?)

I may try to write the same loop using the llama.cpp C++ library directly, to isolate any issues from the Python bindings in this PR. (You are welcome to have a go at writing the C++ if you wish; otherwise I will get to it this week.) I suspect an issue in the llama.cpp C++ layer, given the way the behaviour varies with different backends, but we will need a clean repro to isolate that and get help from the llama.cpp devs. I will try to reproduce on GPU and maybe another backend like Vulkan, since CPU is not showing anything for me.

Another thing you could do that might clarify when memory is leaked would be to log the memory messages after any LoRA set-adapter calls. That will show memory allocated by the LoRA load operation (if any).

Also, perhaps we should log or vary max_tokens, since that seems relevant?
All your statements above are true.

Can you share how to run the llama.cpp command line? I can run it on a GPU I have access to.
Good idea to try the llama.cpp command line. The compiled llama.cpp for the Python bindings is in the vendor subdirectory. There is a normal llama.cpp CLI, but I'm not sure if it supports running multiple completions in a single session. Perhaps you can try running the server and then calling it multiple times with curl or via the UI? It's in the examples/server subdirectory: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md You can load a LoRA with --lora or --lora-scaled. It should be possible to set the seed, max tokens, etc. to match the test case.
Hi @hrsmanian, here is a Bash script to test against llama.cpp. First, compile the llama-server binary. This should be done in the llama-cpp-python source directory:

```bash
cd vendor/llama.cpp/
make llama-server
```

Then run the script below:

```bash
#!/bin/bash

# Function to clean up server process
cleanup() {
    local exit_code=$?
    echo "Cleaning up..."
    if [ ! -z "$SERVER_PID" ]; then
        kill $SERVER_PID 2>/dev/null
        wait $SERVER_PID 2>/dev/null
    fi
    exit $exit_code
}

# Set up trap for script exit
trap cleanup EXIT

# Start llama-server in background
./llama-server \
    --model "$MODEL_GGUF" \
    --lora "$ADAPTER1_GGUF" &

# Save server PID
SERVER_PID=$!

# Wait for server to start up
sleep 5

# Function to log memory usage
log_memory() {
    local msg=$1

    # Get virtual and resident memory (ps reports KB)
    local mem=$(ps -o vsz=,rss= -p $SERVER_PID)
    local vsz=$(echo $mem | cut -d' ' -f1)
    local rss=$(echo $mem | cut -d' ' -f2)

    # Convert to bytes
    vsz=$((vsz * 1024))
    rss=$((rss * 1024))

    # Calculate deltas
    if [ -z "$PREV_VSZ" ]; then
        PREV_VSZ=$vsz
        PREV_RSS=$rss
    fi
    local delta_vsz=$((vsz - PREV_VSZ))
    local delta_rss=$((rss - PREV_RSS))

    # Format with commas for readability
    printf "====== %-40s %'16d (%+'16d) %'16d (%+'16d) ======\n" \
        "$msg" $vsz $delta_vsz $rss $delta_rss

    PREV_VSZ=$vsz
    PREV_RSS=$rss
}

# Log initial memory state
log_memory "initial"

# Run completions in a loop
for i in {1..100}; do
    curl --silent --request POST \
        --url http://127.0.0.1:8080/completion \
        --header "Content-Type: application/json" \
        --data "{\"seed\":12345,\"max_tokens\":16,\"temperature\":0,\"prompt\": \"$i\"}" \
        > /dev/null
    log_memory "after completion \"$i\""
done
```

When I run it, the output shows some memory growth, but it stabilises after a while. The server might allocate IO buffers, perhaps it's doing caching, etc. It probably needs more analysis to know if there is a leak, but I thought I'd share the script so you can look at the GPU memory usage. For a really pure reproduction we may need to write C++ code that uses the plain llama.cpp API, but testing with the llama-server app first is a good start.
Any progress on this? This would be a really helpful feature.
This is a PR to add support for loading and changing LoRA adapters at runtime as introduced into llama.cpp in ggerganov/llama.cpp#8332 by @ngxson. Adding this support should allow things like loading a base model, then swapping adapters in and out to support different features and behaviours. This could be really useful in smaller environments where we might use smaller models but want to support a variety of capabilities. (This appears to be the approach taken by some commercial mobile device makers.)
The list of changes from upstream is in ggerganov/llama.cpp#8332.
I have made some llama-cpp-python changes to enable this support:

- Added `_internals.LlamaLoraAdapter` to wrap llama.cpp's `llama_lora_adapter`
- Managed the lifetime of the underlying `llama_lora_adapter` objects correctly

I have an example of usage through the API and via the server here: https://gist.github.com/richdougherty/dd4961a72fafbff5216b4bc9f48b1147#file-lora-md
Example API usage:
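A sketch of usage through the Python API, based on the methods exercised elsewhere in this thread (`set_lora_adapter_scale` and the `lora_adapters` mapping); paths are placeholders:

```python
from llama_cpp import Llama

llm = Llama(model_path="./mistral-7b-v0.1.Q4_K_S.gguf")

# Enable a coding adapter and run a completion.
llm.set_lora_adapter_scale("./magicoder-lora-mistral-7b-v0.1.gguf", 1.0)
out = llm.create_completion("Write a hello-world program in Python.", max_tokens=32)
print(out["choices"][0]["text"])

# Hot-swap: disable the first adapter and enable another, without reloading the base model.
llm.set_lora_adapter_scale("./magicoder-lora-mistral-7b-v0.1.gguf", 0.0)
llm.set_lora_adapter_scale("./conllpp-lora-mistral-7b-v0.1.gguf", 1.0)
out = llm.create_completion("Tag the named entities in: Rich lives in New Zealand.", max_tokens=32)
print(out["choices"][0]["text"])
```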