Replies: 3 comments 2 replies
-
Same issue for me as well, with the Llama 2 13B model.
-
So running inference with the quantized model + LoRA adapter is 2-3x slower than running inference with the quantized base model, or did you compare against the unquantized model?
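If it helps, here is a minimal sketch of how I would measure both cases. The model id, adapter path, and prompt below are placeholders, not taken from your setup:

```python
# Sketch: compare generation speed of a 4-bit quantized base model against the
# same model with a LoRA adapter attached. Model id and adapter path are placeholders.
import time

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_id = "tiiuae/falcon-7b-instruct"   # assumed base model
adapter_path = "path/to/qlora-adapter"  # hypothetical adapter directory

tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
inputs = tokenizer("Explain LoRA in one sentence.", return_tensors="pt").to("cuda")

def tokens_per_second(model, max_new_tokens=64):
    # Greedy generation, timed with CUDA synchronization so the measurement is honest.
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / (time.perf_counter() - start)

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
base = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto", trust_remote_code=True
)
print("quantized base:       ", tokens_per_second(base), "tok/s")

# Attaching the adapter injects LoRA layers into the same model object,
# so the base model is benchmarked first.
adapted = PeftModel.from_pretrained(base, adapter_path)
print("quantized base + LoRA:", tokens_per_second(adapted), "tok/s")
```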
-
I think the overhead is expected. When you use PEFT models, especially LoRA, there is an overhead during inference caused by the LoRA layers: each adapted layer computes both the frozen base weights and the low-rank LoRA branch and then sums the two results, and during generation this extra work can add up to a considerable slowdown. You can overcome it by "merging" the adapter weights into the base model, since the LoRA update is just an additive low-rank matrix that can be folded back into the original weight matrix:

```python
model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path, trust_remote_code=True, torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(model, args.adapter_path).to(args.device)
model = model.merge_and_unload()  # fold the LoRA weights into the base model
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
```

More details about it here: https://huggingface.co/docs/peft/conceptual_guides/lora#merge-lora-weights-into-the-base-model
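To see why merging does not change the model's outputs, here is a small numerical sketch (the shapes, rank, and scaling below are made up for illustration) of the identity the merge relies on: W·x + (alpha/r)·B·A·x equals (W + (alpha/r)·B·A)·x, so the low-rank update is paid for once, offline, instead of at every forward pass.

```python
# Tiny numerical check (illustrative shapes, not tied to any real checkpoint):
# computing the base path and the LoRA path separately gives the same result
# as folding the low-rank update into the weight matrix once.
import torch

torch.manual_seed(0)
d_out, d_in, r, alpha = 16, 32, 4, 8        # hypothetical layer sizes, LoRA rank, alpha
W = torch.randn(d_out, d_in)                # frozen base weight
A = torch.randn(r, d_in)                    # LoRA down-projection
B = torch.randn(d_out, r)                   # LoRA up-projection
x = torch.randn(d_in)

scaling = alpha / r
unmerged = W @ x + scaling * (B @ (A @ x))  # two branches per forward pass
merged_W = W + scaling * (B @ A)            # fold the update into W once
merged = merged_W @ x                       # single matmul per forward pass

print(torch.allclose(unmerged, merged, atol=1e-5))  # True
```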
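After merging, the object returned by merge_and_unload() behaves like a plain transformers model, so you can sanity-check that no LoRA modules remain and generate as usual. Continuing from the snippet above (the prompt and generation settings are just examples):

```python
# Continues from the snippet above: after merge_and_unload() no LoRA modules
# should be left, so generation runs without the extra per-layer branch.
assert not any("lora" in name for name, _ in model.named_modules())

inputs = tokenizer("Write a haiku about autumn.", return_tensors="pt").to(args.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```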
-
I have created a PEFT adapter for falcon-7b-instruct using QLoRA fine-tuning, and I am running inference with it on a T4. However, inference with the adapted model is 2-3x slower than inference with the base model directly. This surprised me, because I thought that the QLoRA adapter, being much smaller than the base model and representing only a perturbation of its weights, would not make inference all that much slower. What might I do to investigate the cause of the slowdown and speed it up? My full inference script is as follows: