Support LoRA hotswapping and multiple LoRAs at a time #1817

Open · richdougherty wants to merge 1 commit into main
Conversation

@richdougherty (Contributor) commented Oct 30, 2024

This is a PR to add support for loading and changing LoRA adapters at runtime as introduced into llama.cpp in ggerganov/llama.cpp#8332 by @ngxson. Adding this support should allow things like loading a base model, then swapping adapters in and out to support different features and behaviours. This could be really useful in smaller environments where we might use smaller models but want to support a variety of capabilities. (This appears to be the approach taken by some commercial mobile device makers.)

The changes from upstream in ggerganov/llama.cpp#8332 are:

  • Refactor lora API
  • Allow hot-swapping lora
  • Added struct llama_lora_adapter to keep track of loaded lora

I have made some llama-cpp-python changes to enable this support:

  1. Updated C wrappers
  2. Added _internals.LlamaLoraAdapter to wrap llama.cpp's llama_lora_adapter
  3. Modified wrapper lifecycle to free llama_lora_adapter correctly
  4. Added high level API in Llama wrapper - now supports a dict of LoRA adapters to reflect llama.cpp's support for multiple LoRAs; also has method for changing LoRA scales
  5. Updated cache to have LoRA adapter weights in cache keys, because different active LoRAs will have different cache state
  6. Updated server to support hot-swapping LoRAs when a base model is shared

I have an example of usage through the API and via the server here: https://gist.github.com/richdougherty/dd4961a72fafbff5216b4bc9f48b1147#file-lora-md

Example API usage:

>>> import llama_cpp
>>> llm = llama_cpp.Llama("<model>") # Can also add LoRAs in dict here
>>> llm.lora_adapters
>>> llm.create_completion(seed=12345, temperature=0, max_tokens=256, prompt=".....", stop=["\n"])
>>> llm.set_lora_adapter_scale('./adapters/lora_tldr_headline_gen.gguf', 1.0)
>>> llm.lora_adapters
{'./adapters/lora_tldr_headline_gen.gguf': 1.0}
>>> llm.create_completion(seed=12345, temperature=0, max_tokens=256, prompt=".....", stop=["\n"])
>>> llm.set_lora_adapter_scale('./adapters/lora_tldr_headline_gen.gguf', 0.0)
>>> llm.lora_adapters
{'./adapters/lora_tldr_headline_gen.gguf': 0.0}
>>> llm.set_lora_adapter_scale('./adapters/lora_tldr_content_gen.gguf', 1.0)
>>> llm.lora_adapters
{'./adapters/lora_tldr_headline_gen.gguf': 0.0, './adapters/lora_tldr_content_gen.gguf': 1.0}
>>> llm.create_completion(seed=12345, temperature=0, max_tokens=256, prompt=".....", stop=["\n"])

@richdougherty (Contributor, Author) commented Nov 2, 2024

Still working on this. Just added support to the OpenAI-compatible server for hot-swapping LoRAs via model aliases. This allows fast serving of different LoRA adapters that extend the same base model with minimal switching overhead.

{
    "host": "0.0.0.0",
    "port": 8080,
    "models": [
        {
          "model_alias": "mistral",
          "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
          "verbose": true
        },
        {
          "model_alias": "mistral-magicoder",
          "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
          "lora_adapters": {
            "./magicoder-lora-mistral-7b-v0.1.gguf": 1.0
          },
          "verbose": true
        },
        {
          "model_alias": "mistral-conllpp",
          "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
          "lora_adapters": {
            "./conllpp-lora-mistral-7b-v0.1.gguf": 1.0
          },
          "verbose": true
        }
    ]
}

Then calling the OpenAI-compatible API with "model": "mistral", "model": "mistral-magicoder", or "model": "mistral-conllpp" will result in a hot-swap, e.g.

Hot-swapping model, setting existing LoRA adapter scales to 0.0.
Hot-swapping model, setting LoRA adapter scales for mistral-conllpp.
llama_lora_adapter_init_internal: loading lora adapter from './conllpp-lora-mistral-7b-v0.1.gguf' ...
llama_lora_adapter_init_internal: CPU_Mapped LoRA buffer size =    13.00 MiB
llama_lora_adapter_init_internal: loaded 128 tensors from lora file
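
For reference, a minimal client-side sketch (not part of the PR) of how this hot-swap path could be exercised, assuming the config above is listening on port 8080 and the server exposes its usual OpenAI-compatible /v1/completions endpoint:

import requests

for alias in ("mistral", "mistral-magicoder", "mistral-conllpp"):
    resp = requests.post(
        "http://127.0.0.1:8080/v1/completions",
        json={"model": alias, "prompt": "def fibonacci(n):", "max_tokens": 32},
    )
    # Each change of alias triggers a hot-swap of LoRA scales on the shared base model
    print(alias, resp.json()["choices"][0]["text"])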

@hrsmanian

This seems to be a cool feature to have. Any idea when this will be available?

@richdougherty (Contributor, Author)

The code is pretty much done and working. I plan to tidy it up a little this weekend, ready for review and (hopefully) merge.

@hrsmanian

Thanks Rich. Let me know when I can try it out.

@richdougherty marked this pull request as ready for review November 24, 2024 08:16
@richdougherty (Contributor, Author)

The code is ready for review now, thanks for your patience!

@hrsmanian , if you want to try it out before it is merged, a guide for usage is here: https://gist.github.com/richdougherty/dd4961a72fafbff5216b4bc9f48b1147

dest="lora",
)

class MultiTupleAction(argparse.Action):
Contributor Author

Needed this fancy argparse action to match the llama.cpp argument format, which takes two arguments:

https://github.com/ggerganov/llama.cpp/blob/master/common/arg.cpp#L1546-L1551
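
For illustration, here is a minimal sketch (not the PR's actual MultiTupleAction; the names are made up) of an argparse action that consumes two values per flag, in the spirit of llama.cpp's --lora-scaled FNAME SCALE:

import argparse

class PathScaleAction(argparse.Action):  # hypothetical stand-in for MultiTupleAction
    def __call__(self, parser, namespace, values, option_string=None):
        # nargs=2 guarantees exactly two values: a path and a scale
        path, scale = values
        pairs = getattr(namespace, self.dest) or []
        pairs.append((path, float(scale)))
        setattr(namespace, self.dest, pairs)

parser = argparse.ArgumentParser()
parser.add_argument("--lora-scaled", dest="lora_scaled", action=PathScaleAction,
                    nargs=2, metavar=("PATH", "SCALE"), default=[])
args = parser.parse_args(["--lora-scaled", "adapter.gguf", "0.5"])
print(args.lora_scaled)  # [('adapter.gguf', 0.5)]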

):
    print("error: failed to apply lora adapter")
    return
for lora_path, scale in [(pth, 1.0) for pth in self.params.lora] + self.params.lora_scaled:
Contributor Author

I didn't test this extensively, but this code at least worked up to this point; the actual example failed later for me for unrelated reasons.

# when the llama_lora_adapters are freed.
def clear_lora_adapter():
    self.lora_adapter = None
self.model._exit_stack.callback(clear_lora_adapter)
@richdougherty (Contributor, Author), Nov 24, 2024

This seemed to be a clean way to keep the reference back to the parent LlamaModel up to date.
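
As a rough sketch of the pattern (the class names here are illustrative, not the PR's actual wrappers): the child registers a callback on the parent's exit stack so the back-reference is dropped when the parent is closed.

import contextlib

class ParentModel:  # stand-in for the LlamaModel wrapper
    def __init__(self):
        self._exit_stack = contextlib.ExitStack()

    def close(self):
        self._exit_stack.close()  # runs registered callbacks in LIFO order

class ChildAdapter:  # stand-in for the LoRA adapter wrapper
    def __init__(self, model):
        self.model = model
        model._exit_stack.callback(self._clear)  # drop back-reference on close

    def _clear(self):
        self.model = None

model = ParentModel()
adapter = ChildAdapter(model)
model.close()
assert adapter.model is None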

@@ -243,7 +242,7 @@ def __init__(
) # keep a reference to the array so it is not gc'd
self.model_params.tensor_split = self._c_tensor_split
self.model_params.vocab_only = vocab_only
self.model_params.use_mmap = use_mmap if lora_path is None else False
self.model_params.use_mmap = use_mmap
Contributor Author

Memory mapping is supported for LoRAs now because llama.cpp no longer merges the LoRA into the base model, so the original model GGUF is unchanged and can be mapped.

See the equivalent change in llama.cpp:
https://github.com/ggerganov/llama.cpp/pull/8332/files#diff-201cbc8fd17750764ed4a0862232e81503550c201995e16dc2e2766754eaa57aL688


self._stack.callback(free_lora_adapter)
# Dict from LoRA path to wrapper
self._lora_adapters_paths: Dict[str, internals.LlamaLoraAdapter] = {}
Contributor Author

The Llama wrapper maintains a map to the low-level wrappers. We use the LoRA adapter path to key these objects. In theory we could track them using some sort of handle object, but this seemed OK for now.

@@ -174,8 +174,7 @@ def __init__(
offload_kqv: Offload K, Q, V to GPU.
flash_attn: Use flash attention.
last_n_tokens_size: Maximum number of tokens to keep in the last_n_tokens deque.
lora_base: Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model.
lora_path: Path to a LoRA file to apply to the model.
lora_adapters: Paths to LoRA adapter files and the scale to apply them at (an adapter with a scale of 0.0 will not be used during inference).
Contributor Author

This is a breaking API change, reflecting the change upstream. Is this OK?
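
As a hedged sketch of what the new constructor call looks like under this change (the paths are placeholders; compare with the old single lora_path argument):

import llama_cpp

# lora_adapters maps adapter paths to scales; per the docstring above, a scale of
# 0.0 means the adapter is not used during inference, and scales can be changed
# later with set_lora_adapter_scale.
llm = llama_cpp.Llama(
    "base-model.gguf",
    lora_adapters={"adapter-a.gguf": 1.0, "adapter-b.gguf": 0.0},
)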

@@ -453,6 +434,7 @@ def free_lora_adapter():
self._candidates = internals.LlamaTokenDataArray(n_vocab=self._n_vocab)

self.n_tokens = 0
self.tokens_lora_adapters: Tuple[Tuple[str, float]] = () # Adapters that processed tokens
Contributor Author

Tracks the LoRA adapters that were used to generate the n_tokens. A call to reset() sets this to the currently active adapters.

self.lora_adapters[lora_path] = scale
self._lora_adapters_active = tuple(sorted(
    filter(lambda path_scale: path_scale[1] != 0.0, self.lora_adapters.items())
))
Contributor Author

Cache a tuple with the active adapters - sorted and filtered so it can be used as a canonical cache key.
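
A tiny runnable illustration of why the sort-and-filter step makes the tuple canonical (the function name here is illustrative): dicts built in different orders, or with inactive adapters, produce the same key.

def active_adapters(lora_adapters: dict) -> tuple:
    # drop zero scales, then sort, so insertion order no longer matters
    return tuple(sorted(p for p in lora_adapters.items() if p[1] != 0.0))

a = {"headline.gguf": 1.0, "content.gguf": 0.0}
b = {"content.gguf": 0.0, "headline.gguf": 1.0}
assert active_adapters(a) == active_adapters(b) == (("headline.gguf", 1.0),)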

if type(key) == LlamaCacheKey:
    return key
else:
    return LlamaCacheKey(active_lora_adapters=(), tokens=tuple(key))
Contributor Author

Provides backwards compatibility for existing code which passes a Sequence[int] to access the cache. This format is still supported, although existing cached values won't be found.
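
For illustration, a self-contained sketch of that compatibility shim (LlamaCacheKey is mocked up here as a NamedTuple; the real class may differ):

from typing import NamedTuple, Tuple

class LlamaCacheKey(NamedTuple):  # simplified stand-in
    active_lora_adapters: Tuple[Tuple[str, float], ...]
    tokens: Tuple[int, ...]

def to_cache_key(key):
    # Sequence[int] keys from older callers are wrapped with no active adapters
    if isinstance(key, LlamaCacheKey):
        return key
    return LlamaCacheKey(active_lora_adapters=(), tokens=tuple(key))

print(to_cache_key([1, 2, 3]))
# LlamaCacheKey(active_lora_adapters=(), tokens=(1, 2, 3))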

values.pop('lora_adapters', None) # Different LoRA adapters can be hot-swapped
return values

if hot_swappable_settings(new_settings) == hot_swappable_settings(current_settings):
Contributor Author

This fast path lets us avoid loading the model again if it is the same model. If we are able to hot-swap then we can just update the LoRA adapter scales (loading new LoRAs if needed) and then exit early.
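
Roughly, the check amounts to comparing the settings with the hot-swappable field stripped out (a simplified sketch, assuming settings are plain dicts):

def hot_swappable_settings(settings: dict) -> dict:
    values = dict(settings)
    values.pop("lora_adapters", None)  # different LoRA adapters can be hot-swapped
    return values

current = {"model": "./mistral-7b-v0.1.Q4_K_S.gguf", "lora_adapters": {"a.gguf": 1.0}}
new = {"model": "./mistral-7b-v0.1.Q4_K_S.gguf", "lora_adapters": {"b.gguf": 1.0}}

if hot_swappable_settings(new) == hot_swappable_settings(current):
    print("same base settings: just update LoRA scales, no model reload")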

@hrsmanian

The code is ready for review now, thanks for your patience!

@hrsmanian , if you want to try it out before it is merged, a guide for usage is here: https://gist.github.com/richdougherty/dd4961a72fafbff5216b4bc9f48b1147

Thanks Rich. Seems to work as expected overall.
However, when I run it on thousands of conversations, I see a constant increase in GPU memory. Could there be a memory leak? Or should we completely disable the LoRA adapters?

self.lora_scale = lora_scale
self.lora_path = lora_path
self.lora_adapters = (
lora_adapters if lora_adapters is None else {}
@niranjanakella, Nov 27, 2024

@richdougherty this should be changed to the below; otherwise, even when a lora_adapters dictionary is passed, it will be set to {}:

lora_adapters if lora_adapters else None

@richdougherty (Contributor, Author)

However, when I run it on thousands of conversations, I see a constant increase in GPU memory. Could there be a memory leak? Or should we completely disable the LoRA adapters?

That sounds like a memory leak for sure. Can you give me a bit more information about how you're running it? Are you loading (and unloading) LoRA adapters? Are you using the http server or the API? Also any info on how you're measuring memory usage, how fast it is growing, etc might be useful. Thanks!

@hrsmanian

@richdougherty, I am running offline inference in a loop, loading one conversation at a time.
In another window, I run nvidia-smi to watch the GPU memory usage. It increases by roughly 4 MB for each inference.
I load the model and set the LoRA adapters as below.

import llama_cpp
import json
import time

llm = llama_cpp.Llama("gguf_models/Llama-3.2-3B-Instruct-f16.gguf", n_gpu_layers=-1, verbose=False, n_ctx=6000)

#loading both the adapters
llm.set_lora_adapter_scale('gguf_models/lora_3b_exp42_f16.gguf', 0.0)
llm.set_lora_adapter_scale('gguf_models/lora_3b_exp43_f16.gguf', 0.0)

infile = "val_English.jsonl"
fp = open (infile, "r")
time1 = 0.0
time2 = 0.0
count = 0
for line in fp:
    json_data = json.loads(line.strip())
    dialogue = json_data['dialogue']

    model_prompt = f"Summarize the text"

    s1 = time.time()
    ############## activate 1st adapter and disable 2nd adapter
    llm.set_lora_adapter_scale('gguf_models/lora_3b_exp42_f16.gguf', 1.0)
    llm.set_lora_adapter_scale('gguf_models/lora_3b_exp43_f16.gguf', 0.0)
    print(f"LoRA State1: {llm.lora_adapters}")

    prompt_1 = f"""<|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant for conversation summarization<|eot_id|><|start_header_id|>user<|end_header_id|>

{model_prompt}: {dialogue}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
    output = llm.create_completion(seed=12345, temperature=0.01, top_p=0.99, top_k=250, max_tokens=256, prompt=prompt_1, stop=['<|eot_id|>'])
    time1 += time.time() - s1
    print_out1 = f"""MODEL OUTPUT:\n\n {output['choices'][0]['text']}

Usage:
Input Tokens: {output['usage']['prompt_tokens']}
Output Tokens: {output['usage']['completion_tokens']}
Total Tokens: {output['usage']['total_tokens']}

"""
    print(print_out1)

@hrsmanian

Hi, is any more information needed on this? Kindly let me know.

@richdougherty (Contributor, Author) commented Dec 6, 2024

Hi @hrsmanian, apologies for the time to get back. Based on your explanation - seeing GPU memory usage increasing - it sounds like a leak in the llama.cpp allocated GPU memory. This could indicate either a bug in llama.cpp's LoRA adapter code or - more likely! - a bug in the bindings that I wrote.

A bug in the llama-cpp-python bindings in this PR would be something like incorrectly using the llama.cpp API and therefore causing those extra GPU allocations for LoRA adapters and perhaps forgetting to deallocate somehow.

Unfortunately, I only have a CPU for inference, but I believe I should be able to spot incorrect usage of the llama.cpp LoRA API by watching RAM usage grow in the Python process. With CPU inference, llama.cpp stores the models and adapters in process RAM, so any mistakes in allocating or deallocating LoRA adapters should show up directly as an increase in the Python process's virtual memory usage.

Note: a Python object memory leak in the wrapper objects might not show up, since the garbage-collected Python heap might not visibly grow with each leaked object. But looking at the Python process RAM usage should be good enough to show the kind of leak you're describing in llama.cpp-allocated memory: llama.cpp does not allocate from the Python heap, so its allocations show up directly in the process memory usage.

Test script

I've written a little script to test this. I'm using psutil to check the Python process RAM usage after each operation. This works for me for testing a leak when using CPU for inference.

I do not see a leak using this to test. Certainly nothing on the order of 4MB for each inference. Would you mind checking the script on your machine as well? I can also test your script if you want to link to your models / adapters and dataset you use to test, but I understand that these might be private.

You can try it as-is, using psutil to view RAM usage. To help debug the GPU memory leak, perhaps you could edit the script so that it also reports GPU usage. There are a couple of libraries that can help with this (not tested by me): https://github.com/pmav99/nvsmi or https://pypi.org/project/nvidia-ml-py/ . You could patch the log_mem function to report GPU memory usage; that could give a useful log showing any leaks in GPU memory.
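
For example, something along these lines using nvidia-ml-py (untested by me, and it assumes a single GPU at index 0) could be added next to log_mem:

import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
_gpu_handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def log_gpu_mem(msg):
    info = pynvml.nvmlDeviceGetMemoryInfo(_gpu_handle)
    print(f'====== {msg:<40} GPU used: {info.used:>16,} bytes ======')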

For my tests I am using the model and adapters described in the guide I wrote before: https://gist.github.com/richdougherty/dd4961a72fafbff5216b4bc9f48b1147 . This test only tests very short prompts and completions. I tried a slightly larger prompt and completion further down in this comment, but perhaps you can test using your dataset as well.

export MODEL_GGUF=$(huggingface-cli download TheBloke/Mistral-7B-v0.1-GGUF mistral-7b-v0.1.Q4_K_S.gguf)
export ADAPTER1_GGUF=./adapters/lora_tldr_headline_gen.gguf 
export ADAPTER2_GGUF=./adapters/lora_tldr_content_gen.gguf

pip install psutil

python memtest.py 2>&1

memtest.py

import os

model_gguf = os.environ['MODEL_GGUF']
adapter1_gguf = os.environ['ADAPTER1_GGUF']
adapter2_gguf = os.environ['ADAPTER2_GGUF']

import psutil

process = psutil.Process()

prev_vms = 0
prev_rss = 0
def log_mem(msg):
  global prev_rss, prev_vms
  pmem = process.memory_info()
  vms = pmem.vms
  rss = pmem.rss
  delta_vms = vms - prev_vms
  delta_rss = rss - prev_rss

  print(f'====== {msg:<40} {vms:>16,} ({delta_vms:>+16,}) {rss:>16,} ({delta_rss:>+16,}) ======')

  prev_vms = vms
  prev_rss = rss

log_mem('initial')

import llama_cpp

log_mem('imported llama_cpp')

llm = llama_cpp.Llama(model_gguf)

log_mem('loaded model')

i = 0
for i in range(0, 100):

  # Create a pattern of enablement so we can see all patterns of enabled/disabled
  # as well as having sequences where no changes happen.
  desired_adapter1_scale = i // 2 % 2 * 1.0 # Enable 2 out of every 4 times
  desired_adapter2_scale = i // 4 % 2 * 1.0 # Enable 4 out of every 8 times

  # Check current state - note that we treat the initial state when they are not
  # loaded as 0.0 to ensure we have a couple of tests without them loaded
  lora_adapters = llm.lora_adapters or {}
  current_adapter1_scale = lora_adapters.get(adapter1_gguf, 0.0)
  current_adapter2_scale = lora_adapters.get(adapter2_gguf, 0.0)

  if current_adapter1_scale != desired_adapter1_scale:
    llm.set_lora_adapter_scale(adapter1_gguf, desired_adapter1_scale)
    log_mem(f'after set adapter 1 scale {desired_adapter1_scale}')
  if current_adapter2_scale != desired_adapter2_scale:
    llm.set_lora_adapter_scale(adapter2_gguf, desired_adapter2_scale)
    log_mem(f'after set adapter 2 scale {desired_adapter2_scale}')

  llm.create_completion(seed=12345, temperature=0, max_tokens=16, prompt=str(i))
  log_mem(f'after completion "{i}"')

When I run this I see initial allocations in virtual memory (first column), but it stays stable after the adapters have been loaded. The RAM usage stays the same after various loads and unloads.

python memtest.py 2>/dev/null

====== initial                                        36,880,384 (     +36,880,384)       19,529,728 (     +19,529,728) ======
====== imported llama_cpp                            314,400,768 (    +277,520,384)       42,971,136 (     +23,441,408) ======
====== loaded model                                4,740,259,840 (  +4,425,859,072)    4,262,637,568 (  +4,219,666,432) ======
====== after completion "0"                        4,774,264,832 (     +34,004,992)    4,263,817,216 (      +1,179,648) ======
====== after completion "1"                        4,774,264,832 (              +0)    4,263,817,216 (              +0) ======
====== after set adapter 1 scale 1.0               4,789,526,528 (     +15,261,696)    4,279,021,568 (     +15,204,352) ======
====== after completion "2"                        4,789,526,528 (              +0)    4,279,021,568 (              +0) ======
====== after completion "3"                        4,789,526,528 (              +0)    4,279,021,568 (              +0) ======
====== after set adapter 1 scale 0.0               4,789,526,528 (              +0)    4,279,021,568 (              +0) ======
====== after set adapter 2 scale 1.0               4,803,383,296 (     +13,856,768)    4,292,915,200 (     +13,893,632) ======
====== after completion "4"                        4,803,383,296 (              +0)    4,292,915,200 (              +0) ======
====== after completion "5"                        4,803,383,296 (              +0)    4,292,915,200 (              +0) ======
====== after set adapter 1 scale 1.0               4,803,383,296 (              +0)    4,292,915,200 (              +0) ======
====== after completion "6"                        4,803,383,296 (              +0)    4,292,915,200 (              +0) ======
====== after completion "7"                        4,803,383,296 (              +0)    4,292,915,200 (              +0) ======
====== after set adapter 1 scale 0.0               4,803,383,296 (              +0)    4,292,915,200 (              +0) ======
====== after set adapter 2 scale 0.0               4,803,383,296 (              +0)    4,292,915,200 (              +0) ======
====== after completion "8"                        4,803,383,296 (              +0)    4,292,915,200 (              +0) ======
...

Small RSS changes

Note that I am seeing a small (128 KB) occasional increase in resident memory usage (last column), which could be a different kind of leak, for example Python VM operations such as GC not reclaiming everything straight away. I don't think this is your memory leak though, because I would expect llama.cpp-allocated memory to be reflected in an increase in virtual memory (first column), not just in resident memory. Nonetheless, it's worth keeping an eye on.
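
If you want to check whether that small growth comes from Python-level objects rather than llama.cpp, one option (not part of the test above) is to snapshot allocations with tracemalloc around a batch of completions:

import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()
# ... run llm.create_completion(...) a few times here ...
after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:5]:
    print(stat)  # top Python-level allocation growth by source line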

...
====== after set adapter 1 scale 1.0               4,803,383,296 (              +0)    4,292,915,200 (              +0) ======
====== after completion "10"                       4,803,383,296 (              +0)    4,293,046,272 (        +131,072) ======
====== after completion "11"                       4,803,383,296 (              +0)    4,293,046,272 (              +0) ======
...
====== after set adapter 2 scale 0.0               4,803,383,296 (              +0)    4,293,046,272 (              +0) ======
====== after completion "72"                       4,803,383,296 (              +0)    4,293,177,344 (        +131,072) ======
====== after completion "73"                       4,803,383,296 (              +0)    4,293,177,344 (              +0) ======
...
====== after set adapter 2 scale 1.0               4,803,383,296 (              +0)    4,293,177,344 (              +0) ======
====== after completion "100"                      4,803,383,296 (              +0)    4,293,308,416 (        +131,072) ======
====== after completion "101"                      4,803,383,296 (              +0)    4,293,308,416 (              +0) ======
...
====== after completion "253"                      4,803,383,296 (              +0)    4,293,308,416 (              +0) ======
====== after set adapter 1 scale 1.0               4,803,383,296 (              +0)    4,293,308,416 (              +0) ======
...

Testing larger prompt and completion size

The previous test only tested completions for short numbers as prompts with very small max token size. A slightly larger test might show a leak.

I patched the create_completion call to generate something a bit larger. This used more memory but didn't seem to leak either.

  llm.create_completion(seed=12345, temperature=0, max_tokens=256, prompt=str(i) + ' the quick brown fox jumped over the lazy dog who knows what will come next with a longer prompt')
====== initial                                        36,884,480 (     +36,884,480)       19,660,800 (     +19,660,800) ======
====== imported llama_cpp                            314,400,768 (    +277,516,288)       42,971,136 (     +23,310,336) ======
====== loaded model                                4,740,255,744 (  +4,425,854,976)    4,262,821,888 (  +4,219,850,752) ======
====== after completion "0"                        4,774,723,584 (     +34,467,840)    4,267,278,336 (      +4,456,448) ======
====== after completion "1"                        4,774,723,584 (              +0)    4,267,278,336 (              +0) ======
====== after set adapter 1 scale 1.0               4,789,907,456 (     +15,183,872)    4,282,482,688 (     +15,204,352) ======
====== after completion "2"                        4,789,907,456 (              +0)    4,282,482,688 (              +0) ======
====== after completion "3"                        4,789,907,456 (              +0)    4,282,482,688 (              +0) ======
====== after set adapter 1 scale 0.0               4,789,907,456 (              +0)    4,282,482,688 (              +0) ======
====== after set adapter 2 scale 1.0               4,803,760,128 (     +13,852,672)    4,296,376,320 (     +13,893,632) ======
====== after completion "4"                        4,803,760,128 (              +0)    4,296,507,392 (        +131,072) ======
====== after completion "5"                        4,803,760,128 (              +0)    4,296,507,392 (              +0) ======
====== after set adapter 1 scale 1.0               4,803,760,128 (              +0)    4,296,507,392 (              +0) ======
====== after completion "6"                        4,803,760,128 (              +0)    4,296,507,392 (              +0) ======
====== after completion "7"                        4,803,760,128 (              +0)    4,296,507,392 (              +0) ======
====== after set adapter 1 scale 0.0               4,803,760,128 (              +0)    4,296,507,392 (              +0) ======
====== after set adapter 2 scale 0.0               4,803,760,128 (              +0)    4,296,507,392 (              +0) ======
====== after completion "8"                        4,803,760,128 (              +0)    4,296,507,392 (              +0) ======
====== after completion "9"                        4,803,760,128 (              +0)    4,296,507,392 (              +0) ======
====== after set adapter 1 scale 1.0               4,803,760,128 (              +0)    4,296,507,392 (              +0) ======
====== after completion "10"                       4,803,760,128 (              +0)    4,296,769,536 (        +262,144) ======
====== after completion "11"                       4,803,760,128 (              +0)    4,296,769,536 (              +0) ======
====== after set adapter 1 scale 0.0               4,803,760,128 (              +0)    4,296,769,536 (              +0) ======
====== after set adapter 2 scale 1.0               4,803,760,128 (              +0)    4,296,769,536 (              +0) ======
====== after completion "12"                       4,803,760,128 (              +0)    4,296,769,536 (              +0) ======
====== after completion "13"                       4,803,760,128 (              +0)    4,296,769,536 (              +0) ======
====== after set adapter 1 scale 1.0               4,803,760,128 (              +0)    4,296,769,536 (              +0) ======
====== after completion "14"                       4,803,760,128 (              +0)    4,296,769,536 (              +0) ======
====== after completion "15"                       4,803,760,128 (              +0)    4,296,769,536 (              +0) ======
====== after set adapter 1 scale 0.0               4,803,760,128 (              +0)    4,296,769,536 (              +0) ======
====== after set adapter 2 scale 0.0               4,803,760,128 (              +0)    4,296,769,536 (              +0) ======

@hrsmanian

Thanks Rich. I am still seeing a memory leak on the GPU. Will try a previous build without your changes and keep you posted.

@richdougherty (Contributor, Author)

Thanks for checking. To confirm, you ran the script I posted above?

If you are still seeing the leak, my theory is that there's a leak in the llama.cpp CUDA implementation, which would explain why you're seeing the leak while I'm not seeing it with the CPU backend.

Currently I don't think the leak is in the Python bindings, because if it were, we should see it with both backends.

This is just my theory though. I would definitely want more info to confirm it, e.g. testing different backends and trying to replicate it in llama.cpp directly.

If you're able then, running the script above would be good. If you don't have a chance then I should be able to use a cloud server with a GPU to test. (I am investigating how to do that.)

Thanks a lot for your interest and for testing!

@hrsmanian

I have a decent repro now.
I added the nvidia-smi output to your script; also, the model being used is one I trained. Below is the output snapshot when no adapter is used. GPU memory remains constant. All good.

====== after completion "1" 49,240,797,184 ( +532,480) 1,255,649,280 ( +659,456) ======
GPU Memory Used: [6729]
====== after completion "2" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "3" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "4" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "5" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "6" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "7" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "8" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "9" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "10" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "11" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "12" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "13" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "14" 49,240,797,184 ( +0) 1,255,649,280 ( +0) ======
GPU Memory Used: [6729]
====== after completion "15"

Now below is the memory log when an adapter is set. GPU memory increases constantly.

====== after completion "1" 49,240,805,376 ( +532,480) 1,255,985,152 ( +598,016) ======
GPU Memory Used: [6729]
====== after completion "2" 49,362,423,808 ( +121,618,432) 1,313,374,208 ( +57,389,056) ======
GPU Memory Used: [6773]
====== after completion "3" 49,362,423,808 ( +0) 1,313,374,208 ( +0) ======
GPU Memory Used: [6773]
====== after completion "4" 49,449,160,704 ( +86,736,896) 1,332,838,400 ( +19,464,192) ======
GPU Memory Used: [6811]
====== after completion "5" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6811]
====== after completion "6" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6819]
====== after completion "7" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6819]
====== after completion "8" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6819]
====== after completion "9" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6819]
====== after completion "10" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6821]
====== after completion "11" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6821]
====== after completion "12" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6823]
====== after completion "13" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6823]
====== after completion "14" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======
GPU Memory Used: [6825]
====== after completion "15" 49,449,160,704 ( +0) 1,332,838,400 ( +0) ======

@hrsmanian

And if I set the adapter only once outside the loop, there is no increase in GPU memory.

====== after completion "0" 49,362,345,984 ( +34,562,404,352) 1,307,557,888 ( +186,347,520) ======
GPU Memory Used: [6773]
====== after completion "1" 49,362,878,464 ( +532,480) 1,308,160,000 ( +602,112) ======
GPU Memory Used: [6773]
====== after completion "2" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "3" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "4" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "5" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "6" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "7" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "8" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "9" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "10" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "11" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "12" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "13" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "14" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======
GPU Memory Used: [6773]
====== after completion "15" 49,362,878,464 ( +0) 1,308,160,000 ( +0) ======

@hrsmanian

Interesting behavior. If I just set one adapter outside the loop, increase max_tokens to 256, and start inference, I see memory increase by about 8 MB per inference.

====== after completion "0" 49,363,697,664 ( +34,563,756,032) 1,309,958,144 ( +188,268,544) ======
GPU Memory Used: [6779]
====== after completion "1" 49,363,697,664 ( +0) 1,309,958,144 ( +0) ======
GPU Memory Used: [6787]
====== after completion "2" 49,363,697,664 ( +0) 1,309,958,144 ( +0) ======
GPU Memory Used: [6795]
====== after completion "3" 49,363,697,664 ( +0) 1,309,958,144 ( +0) ======
GPU Memory Used: [6803]
====== after completion "4" 49,363,832,832 ( +135,168) 1,310,285,824 ( +327,680) ======
GPU Memory Used: [6811]
====== after completion "5" 49,363,832,832 ( +0) 1,310,285,824 ( +0) ======
GPU Memory Used: [6819]
====== after completion "6" 49,363,832,832 ( +0) 1,310,285,824 ( +0) ======
GPU Memory Used: [6827]
====== after completion "7" 49,363,968,000 ( +135,168) 1,310,285,824 ( +0) ======
GPU Memory Used: [6835]
====== after completion "8" 49,363,968,000 ( +0) 1,310,285,824 ( +0) ======
GPU Memory Used: [6843]
====== after completion "9" 49,397,522,432 ( +33,554,432) 1,310,474,240 ( +188,416) ======
GPU Memory Used: [6851]
====== after completion "10" 49,397,657,600 ( +135,168) 1,310,474,240 ( +0) ======
GPU Memory Used: [6859]
====== after completion "11" 49,397,657,600 ( +0) 1,310,474,240 ( +0) ======
GPU Memory Used: [6867]
====== after completion "12" 49,397,657,600 ( +0) 1,310,474,240 ( +0) ======
GPU Memory Used: [6875]
====== after completion "13" 49,431,212,032 ( +33,554,432) 1,310,474,240 ( +0) ======
GPU Memory Used: [6883]
====== after completion "14" 49,431,347,200 ( +135,168) 1,310,474,240 ( +0) ======
GPU Memory Used: [6891]
====== after completion "15" 49,431,347,200 ( +0) 1,310,474,240 ( +0) ======
GPU Memory Used: [6899]

@richdougherty (Contributor, Author) commented Dec 11, 2024

Thanks for confirming that. To summarise the info:

when no adapter is used. GPU memory remains constant

when adapter is set. GPU memory increasing constantly

(note: I assume this means the adapter is set in the loop using the code I sent?)

set adapter only once outside the loop, then no increase in gpu memory

set one adapter outside the loop and increase the max_tokens=256 and start inference, I see memory increase by 8MB across inference

I may try to write the same loop code using the llama.cpp C++ library directly, to try and isolate any issues from the Python bindings in this PR. (You are welcome to have a go at writing the C++ if you wish; otherwise I will get to it this week.) I suspect an issue in the llama.cpp C++ layer due to the way it varies with different backends, but we will need a nice repro to isolate that and get help from the llama.cpp devs.

I will try to reproduce on GPU and maybe another backend like Vulkan, since CPU is not showing anything for me.

Another thing you could do that might clarify when memory is leaked would be to also log these messages after any LoRA set-adapter calls. That would show memory allocated by the LoRA load operation (if any).

===== after set adapter 2 scale 0.0               4,803,383,296 (              +0)    4,293,046,272 (              +0) ======

Also perhaps we should log or vary the max_tokens since that seems relevant?

@hrsmanian

All your statements above are true.
I can summarize even further:

  • When no adapter is set, there is no memory increase
  • When an adapter is set inside or outside the loop and max_tokens=16, memory increases, but at a small rate
  • When an adapter is set inside or outside the loop and max_tokens=256, memory increases by about 8 MB for each inference

Can you share how to run the llama.cpp command line? I can run it on a GPU I have access to.

@richdougherty (Contributor, Author)

Good idea to try the llama.cpp command line.

The compiled llama.cpp for the Python bindings is in the vendor subdirectory.

There is a normal llama.cpp CLI, but I'm not sure if it supports running multiple completions in a single session.

Perhaps you can try running the server and then calling it multiple times with curl or via the UI?

It's in the examples/server subdirectory.

https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md

You can load a LoRA with --lora or --lora-scaled. It should be possible to set the seed, max tokens, etc. to match the test case.

@richdougherty (Contributor, Author)

Hi @hrsmanian, here is a Bash script to test against llama.cpp.

First, compile the llama-server binary. This should be done from the llama-cpp-python source directory:

cd vendor/llama.cpp/
make llama-server

Then run the below script, llama-server-memtest.sh.

#!/bin/bash

# Function to clean up server process
cleanup() {
    local exit_code=$?
    echo "Cleaning up..."
    if [ ! -z "$SERVER_PID" ]; then
        kill $SERVER_PID 2>/dev/null
        wait $SERVER_PID 2>/dev/null
    fi
    exit $exit_code
}

# Set up trap for script exit
trap cleanup EXIT

# Start llama-server in background
./llama-server \
  --model "$MODEL_GGUF" \
  --lora "$ADAPTER1_GGUF" &

# Save server PID
SERVER_PID=$!

# Wait for server to start up
sleep 5

# Function to log memory usage
log_memory() {
    local msg=$1
    # Get virtual and resident memory in bytes
    local mem=$(ps -o vsz=,rss= -p $SERVER_PID)
    local vsz=$(echo $mem | cut -d' ' -f1)
    local rss=$(echo $mem | cut -d' ' -f2)
    
    # Convert to bytes (ps shows KB)
    vsz=$((vsz * 1024))
    rss=$((rss * 1024))
    
    # Calculate deltas
    if [ -z "$PREV_VSZ" ]; then
        PREV_VSZ=$vsz
        PREV_RSS=$rss
    fi
    
    local delta_vsz=$((vsz - PREV_VSZ))
    local delta_rss=$((rss - PREV_RSS))
    
    # Format with commas for readability
    printf "====== %-40s %'16d (%+'16d) %'16d (%+'16d) ======\n" \
        "$msg" $vsz $delta_vsz $rss $delta_rss
    
    PREV_VSZ=$vsz
    PREV_RSS=$rss
}

# Log initial memory state
log_memory "initial"

# Run completions in a loop
for i in {1..100}; do
    curl --silent --request POST \
        --url http://127.0.0.1:8080/completion \
        --header "Content-Type: application/json" \
        --data "{\"seed\":12345,\"max_tokens\":16,\"temperature\":0,\"prompt\": \"$i\"}" \
        > /dev/null
    
    log_memory "after completion \"$i\""
done

When I run it I get output like:

$ ./llama-server-memtest.sh 2>&1 | tee server-memtest.log
...run for awhile...
^C <interrupt>

$ cat server-memtest.log | grep ===
====== initial                                    11,023,482,880 (              +0)    8,471,502,848 (              +0) ======
====== after completion "1"                       11,023,482,880 (              +0)    8,471,502,848 (              +0) ======
====== after completion "2"                       11,090,591,744 (     +67,108,864)    8,471,764,992 (        +262,144) ======
====== after completion "3"                       11,157,700,608 (     +67,108,864)    8,471,896,064 (        +131,072) ======
====== after completion "4"                       11,224,809,472 (     +67,108,864)    8,471,896,064 (              +0) ======
====== after completion "5"                       11,291,918,336 (     +67,108,864)    8,471,896,064 (              +0) ======
====== after completion "6"                       11,291,918,336 (              +0)    8,471,896,064 (              +0) ======
====== after completion "7"                       11,291,918,336 (              +0)    8,471,896,064 (              +0) ======
====== after completion "8"                       11,359,027,200 (     +67,108,864)    8,471,896,064 (              +0) ======
====== after completion "9"                       11,359,027,200 (              +0)    8,471,896,064 (              +0) ======
====== after completion "10"                      11,359,027,200 (              +0)    8,471,896,064 (              +0) ======
====== after completion "11"                      11,359,027,200 (              +0)    8,472,027,136 (        +131,072) ======
====== after completion "12"                      11,359,027,200 (              +0)    8,472,027,136 (              +0) ======
====== after completion "13"                      11,359,027,200 (              +0)    8,472,027,136 (              +0) ======
====== after completion "14"                      11,359,027,200 (              +0)    8,472,027,136 (              +0) ======
...
====== after completion "28"                      11,359,027,200 (              +0)    8,472,027,136 (              +0) ======
====== after completion "29"                      11,359,027,200 (              +0)    8,472,027,136 (              +0) ======
====== after completion "30"                      11,359,027,200 (              +0)    8,472,027,136 (              +0) ======

There is memory growth, but it stabilises after a while. The server might allocate IO buffers, it might be doing caching, etc., so it probably needs more analysis to know whether there is a leak. I thought I'd share the script so you can use it to look at GPU memory usage. For a really pure reproduction we may need to write C++ code that uses the plain llama.cpp API, but testing with the llama-server app first is a good start.

@SubatomicPlanets

Any progress on this? This would be a really helpful feature.
