
V100 GPU: inference with the quantized model Yi-34B-Chat-4bits is very slow #484

Open
zxdposter opened this issue Apr 3, 2024 · 7 comments
Labels
question Further information is requested

Comments


zxdposter commented Apr 3, 2024

Reminder

  • I have searched the GitHub Discussions and issues and have not found anything similar to this.

Environment

- OS: CentOS 7.9
- Python: 3.11.6
- PyTorch: 2.1.2
- CUDA: 12.4

Current Behavior

On a V100 GPU, loading the quantized model Yi-34B-Chat-4bits gives very slow inference: a single reply takes about 200 seconds. GPU memory usage is about 20 GB, with 10 GB still free.
Is there any way to fix this?

I have searched the issues; several people have run into the same problem, but none of the threads have a solution.

Expected Behavior

No response

Steps to Reproduce

import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = '/home/Yi-34B-Chat-4bits'

# 4-bit GPTQ config using the exllama kernel and fp16 CUDA ops.
nf4_config = GPTQConfig(
    bits=4,
    use_exllama=True,
    max_input_length=2048,
    use_cuda_fp16=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    quantization_config=nf4_config,
).eval()

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False, trust_remote_code=True)

DEVICE = "cuda"
DEVICE_ID = "0"
CUDA_DEVICE = f"{DEVICE}:{DEVICE_ID}" if DEVICE_ID else DEVICE
device_use = torch.device(CUDA_DEVICE)
# Note: with device_map="auto" the weights are already placed on the GPU,
# so this .to() call is redundant.
model = model.to(device_use)

def chat(input):
    input_ids = tokenizer.apply_chat_template(conversation=[{"role": "user", "content": input}], tokenize=True,
                                              add_generation_prompt=True,
                                              return_tensors='pt')
    # No max_new_tokens is set, so generation runs until EOS or the model's configured limit.
    output_ids = model.generate(input_ids.to(device_use))
    response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
    return response

time1 = time.time()
result = chat('你是谁')  # prompt: "Who are you?"
print(f'Elapsed {time.time() - time1} s:', result)

Output:
Elapsed 194.49595594406128 s: I am an intelligent assistant developed by 01.AI (零一万物). My name is Yi. The researchers at 01.AI trained me on large amounts of text data, from which I learned the patterns and associations of language, so I can generate text, answer questions, and translate between languages. I can help users answer questions, provide information, and handle all kinds of language-related tasks. I am not a real person but am made of code and algorithms; still, I do my best to imitate human communication so I can interact with users better. If you have any questions or need help, just let me know!
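
As a rough diagnostic (a sketch built on the reproduction script above, not part of the original report; chat_timed is an illustrative name): bounding max_new_tokens and reporting tokens per second separates slow per-token decoding from the model simply generating a long reply.

# Hypothetical timing helper; reuses tokenizer, model, and device_use from the
# script above. Reports throughput so slow decoding can be told apart from
# long generations.
def chat_timed(prompt, max_new_tokens=256):
    input_ids = tokenizer.apply_chat_template(
        conversation=[{"role": "user", "content": prompt}],
        tokenize=True,
        add_generation_prompt=True,
        return_tensors='pt',
    ).to(device_use)
    start = time.time()
    # Bounding max_new_tokens rules out unbounded generation as the cause.
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    elapsed = time.time() - start
    n_new = output_ids.shape[1] - input_ids.shape[1]
    print(f'{n_new} new tokens in {elapsed:.1f} s -> {n_new / elapsed:.2f} tok/s')
    return tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)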

Anything Else?

No response


lyan62 commented Apr 18, 2024

@zxdposter Hi, how many GPUs are needed for 34B inference? Does it require multiple GPUs?


ffhelly commented Apr 24, 2024

Same here. 8× 4090, and it is still very slow.

@devillaws

Hi, which GPTQ version are you using? I could not find an AutoGPTQ build for PyTorch 2.1.2 on the official release page.

@zxdposter
Author

@lyan62 It needs roughly 20-30 GB of GPU memory.
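
For reference, one way to confirm the actual footprint on a given machine (a minimal sketch assuming torch with CUDA is installed; not from the original thread):

import torch

# Print per-GPU memory usage; mem_get_info returns (free, total) in bytes.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f'cuda:{i}: {(total - free) / 1024**3:.1f} GiB used of {total / 1024**3:.1f} GiB')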

@GoodDayUp

It really is too slow. Is there any good way around this?

@ChinesePainting

It may be that torch ended up unusable because you installed directly with pip install -r requirements.txt.
Check whether torch actually works, or whether "CUDA extension not installed." is printed when the model starts.
I fixed it by setting up a fresh environment:
1. Remove the torch line from requirements.txt.
2. Find the PyTorch build that matches your CUDA version. For example, mine is CUDA 11.8, and I saw that GPTQ supports PyTorch down to 2.1.0.
3. Here are all of my install commands:
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
(see https://pytorch.org/get-started/previous-versions/)
pip install -r requirements.txt
(with the torch line already removed from the file)
pip install auto-gptq==0.5.1 --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
(see https://github.com/AutoGPTQ/AutoGPTQ/blob/main/docs/INSTALLATION.md)
pip install --upgrade transformers optimum
(it reported that I was missing the optimum library, so I upgraded both together to keep them compatible)
After that, I found it much faster than before.
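
A minimal sanity check for the failure mode described above (a sketch assuming torch and auto-gptq are installed; the final import simply surfaces auto-gptq's own warning):

import torch

print(torch.__version__)          # should show a CUDA build, e.g. 2.1.0+cu118
print(torch.cuda.is_available())  # must be True, otherwise inference falls back to slow paths
print(torch.version.cuda)         # the CUDA version torch was built against

# auto-gptq prints "CUDA extension not installed." at import time when its
# compiled kernels are missing; importing it here surfaces that warning.
import auto_gptq  # noqa: F401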

@zxdposter
Author


@ChinesePainting Thanks for the solution; I will try it out.

Haijian06 added the question label Jul 25, 2024