Support LoRA hotswapping and multiple LoRAs at a time #1817
Conversation
Still working on this. Just added support to the OpenAI-compatible server for hot-swapping LoRAs via model aliases. This allows fast serving of different LoRA adapters that extend the same base model with minimal switching overhead. Example server config:

```json
{
  "host": "0.0.0.0",
  "port": 8080,
  "models": [
    {
      "model_alias": "mistral",
      "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
      "verbose": true
    },
    {
      "model_alias": "mistral-magicoder",
      "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
      "lora_adapters": {
        "./magicoder-lora-mistral-7b-v0.1.gguf": 1.0
      },
      "verbose": true
    },
    {
      "model_alias": "mistral-conllpp",
      "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
      "lora_adapters": {
        "./conllpp-lora-mistral-7b-v0.1.gguf": 1.0
      },
      "verbose": true
    }
  ]
}
```

Then calling the OpenAI-compatible API with one of these model aliases serves the corresponding LoRA configuration, hot-swapping adapters over the shared base model.
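For example, a request naming one of the aliases is served with that alias's LoRA configuration. A minimal sketch, assuming the server above is running locally on port 8080 and using the openai client package (not part of this PR):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local llama-cpp-python server.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

# "mistral-magicoder" is a model_alias from the config above; the server
# applies the matching LoRA adapter on top of the shared base model.
response = client.completions.create(
    model="mistral-magicoder",
    prompt="Write a Python function that reverses a string.",
    max_tokens=64,
)
print(response.choices[0].text)
```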
This seems to be a cool feature to have. Any idea when this will be available?

The code is pretty much done and working. I plan to tidy it up a little this weekend, ready for review and (hopefully) merge.

Thanks Rich. Let me know when I can try it out.
The code is ready for review now, thanks for your patience! @hrsmanian, if you want to try it out before it is merged, a guide for usage is here: https://gist.github.com/richdougherty/dd4961a72fafbff5216b4bc9f48b1147
dest="lora", | ||
) | ||
|
||
class MultiTupleAction(argparse.Action): |
Needed this fancy argparse action to match the llama.cpp argument format, which takes two arguments:
https://github.com/ggerganov/llama.cpp/blob/master/common/arg.cpp#L1546-L1551
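For reference, a minimal sketch of this kind of argparse action (not the exact MultiTupleAction from this PR) that accepts repeated --lora-scaled PATH SCALE pairs might look like:

```python
import argparse

class PathScaleAction(argparse.Action):
    """Collect (path, scale) tuples from repeated --lora-scaled PATH SCALE flags."""

    def __call__(self, parser, namespace, values, option_string=None):
        path, scale = values  # nargs=2 guarantees exactly two values per occurrence
        items = list(getattr(namespace, self.dest) or [])
        items.append((path, float(scale)))
        setattr(namespace, self.dest, items)

parser = argparse.ArgumentParser()
parser.add_argument(
    "--lora-scaled",
    nargs=2,
    metavar=("PATH", "SCALE"),
    action=PathScaleAction,
    dest="lora_scaled",
    default=[],
)

args = parser.parse_args(["--lora-scaled", "adapter_a.gguf", "0.5",
                          "--lora-scaled", "adapter_b.gguf", "1.0"])
print(args.lora_scaled)  # [('adapter_a.gguf', 0.5), ('adapter_b.gguf', 1.0)]
```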
```python
):
    print("error: failed to apply lora adapter")
    return
for lora_path, scale in [(pth, 1.0) for pth in self.params.lora] + self.params.lora_scaled:
```
I didn't test this extensively, but this code at least worked up to this point - the actual example failed later for me, for unrelated reasons.
```python
# when the llama_lora_adapters are freed.
def clear_lora_adapter():
    self.lora_adapter = None
self.model._exit_stack.callback(clear_lora_adapter)
```
This seemed to be a clean way to keep the reference back to the parent `LlamaModel` up to date.
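As an illustration of the pattern (a simplified sketch, not this PR's code): registering a callback on the owner's exit stack means the back-reference is cleared automatically when the underlying resources are released.

```python
import contextlib

class AdapterHolder:
    """Toy example: clears its adapter reference when its exit stack is closed."""

    def __init__(self):
        self._exit_stack = contextlib.ExitStack()
        self.lora_adapter = object()  # stand-in for a loaded adapter handle
        self._exit_stack.callback(self._clear_lora_adapter)

    def _clear_lora_adapter(self):
        self.lora_adapter = None

    def close(self):
        self._exit_stack.close()  # frees resources and runs the callback

holder = AdapterHolder()
holder.close()
assert holder.lora_adapter is None  # reference did not outlive the resource
```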
```diff
@@ -243,7 +242,7 @@ def __init__(
         ) # keep a reference to the array so it is not gc'd
         self.model_params.tensor_split = self._c_tensor_split
         self.model_params.vocab_only = vocab_only
-        self.model_params.use_mmap = use_mmap if lora_path is None else False
+        self.model_params.use_mmap = use_mmap
```
Memory mapping is supported for LoRAs now because llama.cpp no longer merges the LoRA into the base model, so the original model GGUF is unchanged and can be mapped.
See the equivalent change in llama.cpp:
https://github.com/ggerganov/llama.cpp/pull/8332/files#diff-201cbc8fd17750764ed4a0862232e81503550c201995e16dc2e2766754eaa57aL688
```python
self._stack.callback(free_lora_adapter)

# Dict from LoRA path to wrapper
self._lora_adapters_paths: Dict[str, internals.LlamaLoraAdapter] = {}
```
The `Llama` wrapper maintains a map to the low-level wrappers. We use the LoRA adapter path to key these objects. In theory we could track them using some sort of handle object, but this seemed OK for now.
```diff
@@ -174,8 +174,7 @@ def __init__(
             offload_kqv: Offload K, Q, V to GPU.
             flash_attn: Use flash attention.
             last_n_tokens_size: Maximum number of tokens to keep in the last_n_tokens deque.
-            lora_base: Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model.
-            lora_path: Path to a LoRA file to apply to the model.
+            lora_adapters: Paths to LoRA adapter files and the scale to apply to them at (scale of 0.0 will not be used during inference).
```
This is a breaking API change, reflecting the change upstream. Is this OK?
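For illustration, the new constructor argument would be used roughly like this (a sketch with placeholder paths, reflecting the docstring above):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-v0.1.Q4_K_S.gguf",
    lora_adapters={
        "./magicoder-lora-mistral-7b-v0.1.gguf": 1.0,  # applied at full strength
        "./conllpp-lora-mistral-7b-v0.1.gguf": 0.0,    # loaded but not used during inference
    },
)
```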
```diff
@@ -453,6 +434,7 @@ def free_lora_adapter():
         self._candidates = internals.LlamaTokenDataArray(n_vocab=self._n_vocab)

         self.n_tokens = 0
+        self.tokens_lora_adapters: Tuple[Tuple[str, float]] = () # Adapters that processed tokens
```
Tracks the LoRA adapters that were used to generate the `n_tokens`. A call to `reset()` sets this to the currently active adapters.
```python
self.lora_adapters[lora_path] = scale
self._lora_adapters_active = tuple(sorted(
    filter(lambda path_scale: path_scale[1] != 0.0, self.lora_adapters.items())
))
```
Cache a tuple with the active adapters - sorted and filtered so it can be used as a canonical cache key.
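A small illustration of why sorting and filtering gives a canonical key (names here are illustrative, not the PR's exact code):

```python
def active_adapters(lora_adapters: dict) -> tuple:
    # Drop zero-scaled adapters and sort by path so equivalent configurations
    # always produce the same tuple.
    return tuple(sorted((p, s) for p, s in lora_adapters.items() if s != 0.0))

a = {"b.gguf": 1.0, "a.gguf": 0.5, "c.gguf": 0.0}
b = {"a.gguf": 0.5, "c.gguf": 0.0, "b.gguf": 1.0}
assert active_adapters(a) == active_adapters(b)  # insertion order and zero scales don't matter
```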
```python
if type(key) == LlamaCacheKey:
    return key
else:
    return LlamaCacheKey(active_lora_adapters=(), tokens=tuple(key))
```
Provides backwards compatibility for existing code which passes a `Sequence[int]` to access the cache. This format is still supported, although existing cached values won't be found.
```python
values.pop('lora_adapters', None)  # Different LoRA adapters can be hot-swapped
return values

if hot_swappable_settings(new_settings) == hot_swappable_settings(current_settings):
```
This fast path lets us avoid loading the model again if it is the same model. If we are able to hot-swap then we can just update the LoRA adapter scales (loading new LoRAs if needed) and then exit early.
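A hedged sketch of the idea (helper and attribute names follow the diff and the rest of this discussion, but this is not the PR's exact code):

```python
def hot_swappable_settings(settings: dict) -> dict:
    values = dict(settings)
    values.pop("lora_adapters", None)  # different LoRA adapters can be hot-swapped
    return values

def try_hot_swap(llm, current_settings: dict, new_settings: dict) -> bool:
    """Return True if the existing model can be reused with updated LoRA scales."""
    if hot_swappable_settings(new_settings) != hot_swappable_settings(current_settings):
        return False  # different base model or config: a full reload is needed

    desired = new_settings.get("lora_adapters") or {}
    current = llm.lora_adapters or {}
    for path in set(desired) | set(current):
        scale = desired.get(path, 0.0)
        if current.get(path, 0.0) != scale:
            llm.set_lora_adapter_scale(path, scale)  # loads the LoRA if needed
    return True
```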
Thanks Rich. Seems to work as expected overall.
```python
self.lora_scale = lora_scale
self.lora_path = lora_path
self.lora_adapters = (
    lora_adapters if lora_adapters is None else {}
```
@richdougherty this should be changed to the below; otherwise, even when a lora_adapters dictionary is passed it will be set to {}:

```python
lora_adapters if lora_adapters else None
```
That sounds like a memory leak for sure. Can you give me a bit more information about how you're running it? Are you loading (and unloading) LoRA adapters? Are you using the HTTP server or the API? Also, any info on how you're measuring memory usage, how fast it is growing, etc. might be useful. Thanks!
@richdougherty, I am running offline inference in a loop, loading one conversation at a time:

```python
import llama_cpp

llm = llama_cpp.Llama("gguf_models/Llama-3.2-3B-Instruct-f16.gguf", n_gpu_layers=-1, verbose=False, n_ctx=6000)
# loading both the adapters
infile = "val_English.jsonl"

prompt_1 = f"""<|start_header_id|>system<|end_header_id|> You are a helpful AI assistant for conversation summarization<|eot_id|><|start_header_id|>user<|end_header_id|> {model_prompt}: {dialogue}<|eot_id|><|start_header_id|>assistant<|end_header_id|> """
```
Hi, is any more information needed on this? Kindly let me know.
Hi @hrsmanian, apologies for the time to get back. Based on your explanation - GPU memory usage increasing - it sounds like a leak in the GPU memory allocated by llama.cpp. This could indicate either a bug in llama.cpp's LoRA adapter code or - more likely! - a bug in the bindings that I wrote. A bug in the llama-cpp-python bindings in this PR would be something like incorrectly using the llama.cpp API, causing extra GPU allocations for LoRA adapters and perhaps failing to deallocate them.

Unfortunately, I only have a CPU for inference, but I believe I should still be able to spot an incorrect usage of the llama.cpp LoRA API by watching RAM usage for the Python process. This is because llama.cpp stores the models and adapters in process RAM when using CPU inference, so any mistakes with allocating or deallocating LoRA adapters should show up directly as an increase in the process's virtual memory usage.

Note: a Python object leak in the wrapper objects might not show up, since the Python heap is garbage-collected and might not visibly grow with each object. But the memory allocated by llama.cpp does not go through the Python garbage-collected heap - it is allocated directly - so watching the Python process memory should be good enough to show the kind of leak you're describing.

### Test script

I've written a little script to test this, using psutil to log the process memory usage at each step. I do not see a leak using this to test - certainly nothing on the order of 4MB for each inference. Would you mind checking the script on your machine as well? I can also test your script if you want to link to the models / adapters and dataset you use, but I understand that these might be private. You can try it using the commands below.

For my tests I am using the model and adapters described in the guide I wrote before: https://gist.github.com/richdougherty/dd4961a72fafbff5216b4bc9f48b1147 . This test only uses very short prompts and completions. I tried a slightly larger prompt and completion further down in this comment, but perhaps you can test using your dataset as well.

```bash
export MODEL_GGUF=$(huggingface-cli download TheBloke/Mistral-7B-v0.1-GGUF mistral-7b-v0.1.Q4_K_S.gguf)
export ADAPTER1_GGUF=./adapters/lora_tldr_headline_gen.gguf
export ADAPTER2_GGUF=./adapters/lora_tldr_content_gen.gguf

pip install psutil
python memtest.py 2>&1
```
memtest.py:

```python
import os

model_gguf = os.environ['MODEL_GGUF']
adapter1_gguf = os.environ['ADAPTER1_GGUF']
adapter2_gguf = os.environ['ADAPTER2_GGUF']

import psutil

process = psutil.Process()

prev_vms = 0
prev_rss = 0

def log_mem(msg):
    global prev_rss, prev_vms
    pmem = process.memory_info()
    vms = pmem.vms
    rss = pmem.rss
    delta_vms = vms - prev_vms
    delta_rss = rss - prev_rss
    print(f'====== {msg:<40} {vms:>16,} ({delta_vms:>+16,}) {rss:>16,} ({delta_rss:>+16,}) ======')
    prev_vms = vms
    prev_rss = rss

log_mem('initial')

import llama_cpp
log_mem('imported llama_cpp')

llm = llama_cpp.Llama(model_gguf)
log_mem('loaded model')

i = 0
for i in range(0, 100):
    # Create a pattern of enablement so we can see all patterns of enabled/disabled
    # as well as having sequences where no changes happen.
    desired_adapter1_scale = i // 2 % 2 * 1.0  # Enable 2 out of every 4 times
    desired_adapter2_scale = i // 4 % 2 * 1.0  # Enable 4 out of every 8 times

    # Check current state - note that we treat the initial state when they are not
    # loaded as 0.0 to ensure we have a couple of tests without them loaded
    lora_adapters = llm.lora_adapters or {}
    current_adapter1_scale = lora_adapters.get(adapter1_gguf, 0.0)
    current_adapter2_scale = lora_adapters.get(adapter2_gguf, 0.0)

    if current_adapter1_scale != desired_adapter1_scale:
        llm.set_lora_adapter_scale(adapter1_gguf, desired_adapter1_scale)
        log_mem(f'after set adapter 1 scale {desired_adapter1_scale}')
    if current_adapter2_scale != desired_adapter2_scale:
        llm.set_lora_adapter_scale(adapter2_gguf, desired_adapter2_scale)
        log_mem(f'after set adapter 2 scale {desired_adapter2_scale}')

    llm.create_completion(seed=12345, temperature=0, max_tokens=16, prompt=str(i))
    log_mem(f'after completion "{i}"')
```

When I run this I see initial allocations in virtual memory (first column), but it stays stable after the adapters have been loaded. The RAM usage stays the same after various loads and unloads.
### Small RSS changes

Note that I am seeing a small (128k) occasional increase in resident memory usage (last column), which could be a different kind of leak - for example, Python VM operations such as GC not reclaiming everything straight away. I don't think this is your memory leak, though, because I would expect llama.cpp-allocated memory to be reflected in an increase in virtual memory (first column), not just an increase in resident memory. Nonetheless, it is worth keeping an eye on.
### Testing larger prompt and completion size

The previous test only used short numbers as prompts, with a very small max token size. A slightly larger test might show a leak. I patched the test to use:

```python
llm.create_completion(seed=12345, temperature=0, max_tokens=256, prompt=str(i) + ' the quick brown fox jumped over the lazy dog who knows what will come next with a longer prompt')
```
Thanks Rich. I am still seeing a memory leak on the GPU. Will try a previous build without your changes and keep you posted.
Thanks for checking. To confirm, you ran the script I posted above? If you are still seeing the leak, my theory is that there's a leak in the llama.cpp CUDA implementation, which is why you're seeing it but I'm not seeing it with the CPU backend. Currently I don't think the leak is in the Python bindings, because if it were then we should see it for both backends. This is just my theory, though - I would definitely want more info to confirm it, e.g. testing different backends, or trying to replicate it in llama.cpp directly. If you're able, running the script above would be good. If you don't have a chance, I should be able to use a cloud server with a GPU to test. (I am investigating how to do that.) Thanks a lot for your interest and for testing!
Have a decent repro now:

```
====== after completion "1" 49,240,797,184 ( +532,480) 1,255,649,280 ( +659,456) ======
```

Now below is the memory log when the adapter is set - GPU memory increasing constantly:

```
====== after completion "1" 49,240,805,376 ( +532,480) 1,255,985,152 ( +598,016) ======
```
And if I set the adapter only once, outside the loop, then there is no increase in GPU memory:

```
====== after completion "0" 49,362,345,984 ( +34,562,404,352) 1,307,557,888 ( +186,347,520) ======
```
Interesting behavior. If I just set one adapter outside the loop, increase max_tokens to 256, and start inference, I see memory increase by 8MB across inferences:

```
====== after completion "0" 49,363,697,664 ( +34,563,756,032) 1,309,958,144 ( +188,268,544) ======
```
Thanks for confirming that. To summarise the info:

(Note: I assume this means the adapter is set in the loop using the code I sent?)

I may try to write the same loop using the llama.cpp C++ library directly, to isolate any issues from the Python bindings in this PR. (You are welcome to have a go at writing the C++ if you wish; otherwise I will get to it this week.) I suspect an issue in the llama.cpp C++ layer, given the way the behaviour varies with different backends, but we will need a clean repro to isolate that and get help from the llama.cpp devs. I will try to reproduce on GPU and maybe another backend like Vulkan, since CPU is not showing anything for me.

Another thing you could do that might clarify when memory is leaked would be to log the memory messages after any LoRA set-adapter calls. That will show memory allocated by the LoRA load operation (if any).

Also, perhaps we should log or vary max_tokens, since that seems relevant?
All your statements above are true.

Can you share how to run the llama.cpp command line? I can run it on a GPU I have access to.
Good idea to try the llama.cpp command line. The compiled llama.cpp for the Python bindings is in the vendor subdirectory. There is a normal llama.cpp CLI, but I'm not sure if it supports running multiple completions in a single session. Perhaps you can try running the server and then calling it multiple times with curl or via the UI? It's in the examples/server subdirectory: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md You can load a LoRA with --lora or --lora-scaled. It should be possible to set the seed, max tokens, etc. to match the test case.
Hi @hrsmanian, here is a Bash script to test against llama.cpp. First, compile the llama-server binary. This should be done in the llama-cpp-python source directory:

```bash
cd vendor/llama.cpp/
make llama-server
```

Then run the script below:

```bash
#!/bin/bash

# Function to clean up server process
cleanup() {
    local exit_code=$?
    echo "Cleaning up..."
    if [ ! -z "$SERVER_PID" ]; then
        kill $SERVER_PID 2>/dev/null
        wait $SERVER_PID 2>/dev/null
    fi
    exit $exit_code
}

# Set up trap for script exit
trap cleanup EXIT

# Start llama-server in background
./llama-server \
    --model "$MODEL_GGUF" \
    --lora "$ADAPTER1_GGUF" &

# Save server PID
SERVER_PID=$!

# Wait for server to start up
sleep 5

# Function to log memory usage
log_memory() {
    local msg=$1

    # Get virtual and resident memory (ps reports KB)
    local mem=$(ps -o vsz=,rss= -p $SERVER_PID)
    local vsz=$(echo $mem | cut -d' ' -f1)
    local rss=$(echo $mem | cut -d' ' -f2)

    # Convert to bytes
    vsz=$((vsz * 1024))
    rss=$((rss * 1024))

    # Calculate deltas
    if [ -z "$PREV_VSZ" ]; then
        PREV_VSZ=$vsz
        PREV_RSS=$rss
    fi
    local delta_vsz=$((vsz - PREV_VSZ))
    local delta_rss=$((rss - PREV_RSS))

    # Format with commas for readability
    printf "====== %-40s %'16d (%+'16d) %'16d (%+'16d) ======\n" \
        "$msg" $vsz $delta_vsz $rss $delta_rss

    PREV_VSZ=$vsz
    PREV_RSS=$rss
}

# Log initial memory state
log_memory "initial"

# Run completions in a loop
for i in {1..100}; do
    curl --silent --request POST \
        --url http://127.0.0.1:8080/completion \
        --header "Content-Type: application/json" \
        --data "{\"seed\":12345,\"max_tokens\":16,\"temperature\":0,\"prompt\": \"$i\"}" \
        > /dev/null
    log_memory "after completion \"$i\""
done
```

When I run it, the output shows some memory growth, but it stabilises after a while. The server might allocate IO buffers, perhaps it's doing caching, etc. It probably needs more analysis to know if there is a leak, but I thought I'd share the script so you can look at the GPU memory usage. For a really pure reproduction we may need to write C++ code that uses the plain llama.cpp API, but testing with the llama-server app first is a good start.
Any progress on this? This would be a really helpful feature.
This is a PR to add support for loading and changing LoRA adapters at runtime as introduced into llama.cpp in ggerganov/llama.cpp#8332 by @ngxson. Adding this support should allow things like loading a base model, then swapping adapters in and out to support different features and behaviours. This could be really useful in smaller environments where we might use smaller models but want to support a variety of capabilities. (This appears to be the approach taken by some commercial mobile device makers.)
The list of changes from upstream is in ggerganov/llama.cpp#8332.
I have made some llama-cpp-python changes to enable this support:

- Added `_internals.LlamaLoraAdapter` to wrap llama.cpp's `llama_lora_adapter`
- Managed the lifetime of the underlying `llama_lora_adapter` objects correctly

I have an example of usage through the API and via the server here: https://gist.github.com/richdougherty/dd4961a72fafbff5216b4bc9f48b1147#file-lora-md
Example API usage:
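A sketch of usage through the Python API, based on the methods exercised elsewhere in this thread (`set_lora_adapter_scale` and the `lora_adapters` mapping); paths are placeholders:

```python
from llama_cpp import Llama

llm = Llama(model_path="./mistral-7b-v0.1.Q4_K_S.gguf")

# Enable a coding adapter and run a completion.
llm.set_lora_adapter_scale("./magicoder-lora-mistral-7b-v0.1.gguf", 1.0)
out = llm.create_completion("Write a hello-world program in Python.", max_tokens=32)
print(out["choices"][0]["text"])

# Hot-swap: disable the first adapter and enable another, without reloading the base model.
llm.set_lora_adapter_scale("./magicoder-lora-mistral-7b-v0.1.gguf", 0.0)
llm.set_lora_adapter_scale("./conllpp-lora-mistral-7b-v0.1.gguf", 1.0)
out = llm.create_completion("Tag the named entities in: Rich lives in New Zealand.", max_tokens=32)
print(out["choices"][0]["text"])
```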