Replies: 3 comments 2 replies
-
Same issue for me as well, with the Llama 2 13B model.
-
So running inference with the quantized model + LoRA adapter is 2-3x slower than running inference with the quantized base model, or did you compare against the unquantized model?
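If it helps, here is a minimal sketch of how I would measure both cases. The model id, adapter path, and prompt below are placeholders, not taken from your setup:

```python
# Sketch: compare generation speed of a 4-bit quantized base model against the
# same model with a LoRA adapter attached. Model id and adapter path are placeholders.
import time

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_id = "tiiuae/falcon-7b-instruct"   # assumed base model
adapter_path = "path/to/qlora-adapter"  # hypothetical adapter directory

tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
inputs = tokenizer("Explain LoRA in one sentence.", return_tensors="pt").to("cuda")

def tokens_per_second(model, max_new_tokens=64):
    # Greedy generation, timed with CUDA synchronization so the measurement is honest.
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / (time.perf_counter() - start)

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
base = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto", trust_remote_code=True
)
print("quantized base:       ", tokens_per_second(base), "tok/s")

# Attaching the adapter injects LoRA layers into the same model object,
# so the base model is benchmarked first.
adapted = PeftModel.from_pretrained(base, adapter_path)
print("quantized base + LoRA:", tokens_per_second(adapted), "tok/s")
```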
-
I think the overhead is expected. When you use PEFT models, especially LoRA, there is an overhead during inference caused by the LoRA layers: each adapted layer computes both the frozen base weights and the low-rank LoRA branch and then sums the two results, and during generation this extra work can add up to a considerable slowdown. You can overcome it by "merging" the adapter weights into the base model, since the LoRA update is just an additive low-rank matrix that can be folded back into the original weight matrix:

```python
model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path, trust_remote_code=True, torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(model, args.adapter_path).to(args.device)
model = model.merge_and_unload()  # fold the LoRA weights into the base model
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
```

More details about it here: https://huggingface.co/docs/peft/conceptual_guides/lora#merge-lora-weights-into-the-base-model
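To see why merging does not change the model's outputs, here is a small numerical sketch (the shapes, rank, and scaling below are made up for illustration) of the identity the merge relies on: W·x + (alpha/r)·B·A·x equals (W + (alpha/r)·B·A)·x, so the low-rank update is paid for once, offline, instead of at every forward pass.

```python
# Tiny numerical check (illustrative shapes, not tied to any real checkpoint):
# computing the base path and the LoRA path separately gives the same result
# as folding the low-rank update into the weight matrix once.
import torch

torch.manual_seed(0)
d_out, d_in, r, alpha = 16, 32, 4, 8        # hypothetical layer sizes, LoRA rank, alpha
W = torch.randn(d_out, d_in)                # frozen base weight
A = torch.randn(r, d_in)                    # LoRA down-projection
B = torch.randn(d_out, r)                   # LoRA up-projection
x = torch.randn(d_in)

scaling = alpha / r
unmerged = W @ x + scaling * (B @ (A @ x))  # two branches per forward pass
merged_W = W + scaling * (B @ A)            # fold the update into W once
merged = merged_W @ x                       # single matmul per forward pass

print(torch.allclose(unmerged, merged, atol=1e-5))  # True
```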
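After merging, the object returned by merge_and_unload() behaves like a plain transformers model, so you can sanity-check that no LoRA modules remain and generate as usual. Continuing from the snippet above (the prompt and generation settings are just examples):

```python
# Continues from the snippet above: after merge_and_unload() no LoRA modules
# should be left, so generation runs without the extra per-layer branch.
assert not any("lora" in name for name, _ in model.named_modules())

inputs = tokenizer("Write a haiku about autumn.", return_tensors="pt").to(args.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```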
-
I have created a PEFT adapter for falcon-7b-instruct using QLoRA fine-tuning, and I am running inference with it on a T4. However, inference with the adapted model is 2-3x slower than inference with the base model directly. This surprised me, because I thought that the QLoRA adapter, being much smaller than the base model and representing only a perturbation of its weights, would not make inference all that much slower. What might I do to investigate the cause of the slowdown and speed it up? My full inference script is as follows: