Inference with GPU took too much GPU RAM #16
Comments
1B params need ~4GB of memory in fp32 format, so 34GB = 26GB of llama weights + 6GB extra. I guess that using fp16 mode to shrink the llama weights to 13GB, it should be 13GB + 6GB = 19GB. Let me test poolsize < 10 on an A100 GPU later.
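The arithmetic above can be reproduced with a short, illustrative calculation (the 6GB of runtime overhead is the commenter's estimate, not derived here):

```python
# Illustrative estimate of weight memory by parameter count and dtype width.
def weight_gb(n_params: float, bytes_per_param: int) -> float:
    """Return the approximate size of the weights in GiB."""
    return n_params * bytes_per_param / 1024**3

print(weight_gb(7e9, 4))  # fp32: ~26 GiB
print(weight_gb(7e9, 2))  # fp16: ~13 GiB
```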
Thanks for replying. I don't know why, but somehow it still uses 34GB even when I switched to the fp16 model.
@tpoisonooo Thank you for your great work! However, I have the same problem as @DungMinhDao. I converted the model (7B) to fp16 using the tool script https://github.com/tpoisonooo/llama.onnx/blob/main/tools/convert-fp32-to-fp16.py. The model file is half the size of the original fp32 model, but 32GB of memory is still not enough to load the fp16 model. Is there anything wrong?
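For reference, here is a minimal sketch of how an fp32 → fp16 ONNX conversion is commonly done with `onnxconverter-common`; the repo's `tools/convert-fp32-to-fp16.py` may differ in details, and the file names below are placeholders:

```python
# Sketch: convert an ONNX model's fp32 weights to fp16.
# Paths are placeholders; the actual llama.onnx tool script may differ.
import onnx
from onnxconverter_common import float16

model = onnx.load("llama-fp32.onnx")                   # placeholder input path
model_fp16 = float16.convert_float_to_float16(model)   # cast initializers and ops to fp16
onnx.save(model_fp16, "llama-fp16.onnx")               # placeholder output path
```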
I tried inferencing with GPU after making some modifications to the code: I found all the files with `import onnxruntime` and added `import torch` before it (to make sure all the necessary CUDA-related libs were loaded). I also uninstalled `onnxruntime` and installed `onnxruntime-gpu` instead. It ran fast, but it took 34GB of GPU memory to load the model. I tried lowering `--poolsize`, but the situation didn't change (and with `--poolsize` less than 10, some parts of the model can't be loaded into either GPU or CPU).
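A minimal sketch of the kind of change described above, assuming the session is created directly with onnxruntime's CUDA execution provider (the model file name is a placeholder, not a file from this repo):

```python
# Sketch: load an ONNX model on GPU with onnxruntime-gpu.
import torch        # imported first so the CUDA libraries get loaded into the process
import onnxruntime as ort

# "decoder.onnx" is a placeholder model path.
session = ort.InferenceSession(
    "decoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # check that CUDAExecutionProvider is actually in use
```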