
V100 GPU: inference with the quantized model Yi-34B-Chat-4bits is very slow #484

Open
zxdposter opened this issue Apr 3, 2024 · 7 comments
Labels
question Further information is requested

Comments


zxdposter commented Apr 3, 2024

Reminder

  • I have searched the GitHub Discussions and issues and have not found anything similar to this.

Environment

- OS: CentOS 7.9
- Python: 3.11.6
- PyTorch: 2.1.2
- CUDA: 12.4

Current Behavior

On a V100 GPU, loading the quantized model Yi-34B-Chat-4bits gives very slow inference: a single reply takes about 200 seconds. GPU memory usage is about 20 GB, with 10 GB still free.
Is there any way to fix this?

I have searched the issues; several people have run into the same problem, but none of the threads have a solution.

Expected Behavior

No response

Steps to Reproduce

import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = '/home/Yi-34B-Chat-4bits'

# 4-bit GPTQ config using the exllama kernel and fp16 CUDA ops.
nf4_config = GPTQConfig(
    bits=4,
    use_exllama=True,
    max_input_length=2048,
    use_cuda_fp16=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    quantization_config=nf4_config,
).eval()

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False, trust_remote_code=True)

DEVICE = "cuda"
DEVICE_ID = "0"
CUDA_DEVICE = f"{DEVICE}:{DEVICE_ID}" if DEVICE_ID else DEVICE
device_use = torch.device(CUDA_DEVICE)
# Note: with device_map="auto" the weights are already placed on the GPU,
# so this .to() call is redundant.
model = model.to(device_use)

def chat(input):
    input_ids = tokenizer.apply_chat_template(conversation=[{"role": "user", "content": input}], tokenize=True,
                                              add_generation_prompt=True,
                                              return_tensors='pt')
    # No max_new_tokens is set, so generation runs until EOS or the model's configured limit.
    output_ids = model.generate(input_ids.to(device_use))
    response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
    return response

time1 = time.time()
result = chat('你是谁')  # prompt: "Who are you?"
print(f'Elapsed {time.time() - time1} s:', result)

Output:
Elapsed 194.49595594406128 s: I am an intelligent assistant developed by 01.AI (零一万物). My name is Yi. The researchers at 01.AI trained me on large amounts of text data, from which I learned the patterns and associations of language, so I can generate text, answer questions, and translate between languages. I can help users answer questions, provide information, and handle all kinds of language-related tasks. I am not a real person but am made of code and algorithms; still, I do my best to imitate human communication so I can interact with users better. If you have any questions or need help, just let me know!
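
As a rough diagnostic (a sketch built on the reproduction script above, not part of the original report; chat_timed is an illustrative name): bounding max_new_tokens and reporting tokens per second separates slow per-token decoding from the model simply generating a long reply.

# Hypothetical timing helper; reuses tokenizer, model, and device_use from the
# script above. Reports throughput so slow decoding can be told apart from
# long generations.
def chat_timed(prompt, max_new_tokens=256):
    input_ids = tokenizer.apply_chat_template(
        conversation=[{"role": "user", "content": prompt}],
        tokenize=True,
        add_generation_prompt=True,
        return_tensors='pt',
    ).to(device_use)
    start = time.time()
    # Bounding max_new_tokens rules out unbounded generation as the cause.
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    elapsed = time.time() - start
    n_new = output_ids.shape[1] - input_ids.shape[1]
    print(f'{n_new} new tokens in {elapsed:.1f} s -> {n_new / elapsed:.2f} tok/s')
    return tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)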

Anything Else?

No response


lyan62 commented Apr 18, 2024

@zxdposter Hi, how many GPUs are needed for 34B inference? Does it require multiple GPUs?


ffhelly commented Apr 24, 2024

Same here. 8× 4090, and it is still very slow.

@devillaws

Hi, which GPTQ version are you using? I could not find an AutoGPTQ build for PyTorch 2.1.2 on the official release page.

@zxdposter
Author

@lyan62 It needs roughly 20-30 GB of GPU memory.
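
For reference, one way to confirm the actual footprint on a given machine (a minimal sketch assuming torch with CUDA is installed; not from the original thread):

import torch

# Print per-GPU memory usage; mem_get_info returns (free, total) in bytes.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f'cuda:{i}: {(total - free) / 1024**3:.1f} GiB used of {total / 1024**3:.1f} GiB')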

@GoodDayUp

It really is too slow. Is there any good way around this?

@ChinesePainting

It may be that torch ended up unusable because you installed directly with pip install -r requirements.txt.
Check whether torch actually works, or whether "CUDA extension not installed." is printed when the model starts.
I fixed it by setting up a fresh environment:
1. Remove the torch line from requirements.txt.
2. Find the PyTorch build that matches your CUDA version. For example, mine is CUDA 11.8, and I saw that GPTQ supports PyTorch down to 2.1.0.
3. Here are all of my install commands:
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
(see https://pytorch.org/get-started/previous-versions/)
pip install -r requirements.txt
(with the torch line already removed from the file)
pip install auto-gptq==0.5.1 --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
(see https://github.com/AutoGPTQ/AutoGPTQ/blob/main/docs/INSTALLATION.md)
pip install --upgrade transformers optimum
(it reported that I was missing the optimum library, so I upgraded both together to keep them compatible)
After that, I found it much faster than before.
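
A minimal sanity check for the failure mode described above (a sketch assuming torch and auto-gptq are installed; the final import simply surfaces auto-gptq's own warning):

import torch

print(torch.__version__)          # should show a CUDA build, e.g. 2.1.0+cu118
print(torch.cuda.is_available())  # must be True, otherwise inference falls back to slow paths
print(torch.version.cuda)         # the CUDA version torch was built against

# auto-gptq prints "CUDA extension not installed." at import time when its
# compiled kernels are missing; importing it here surfaces that warning.
import auto_gptq  # noqa: F401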

@zxdposter
Author


@ChinesePainting Thanks for the solution; I will try it out.

Haijian06 added the question label Jul 25, 2024