Inference with GPU took too much GPU RAM #16
Comments
1B params need ~4GB of memory in fp32 format, so 34GB = 26GB of llama weights + 6GB extra. I guess that using fp16 mode to shrink the llama weights to 13GB, it should be 13GB + 6GB = 19GB. Let me test poolsize < 10 on an A100 GPU later.
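The arithmetic above can be reproduced with a short, illustrative calculation (the 6GB of runtime overhead is the commenter's estimate, not derived here):

```python
# Illustrative estimate of weight memory by parameter count and dtype width.
def weight_gb(n_params: float, bytes_per_param: int) -> float:
    """Return the approximate size of the weights in GiB."""
    return n_params * bytes_per_param / 1024**3

print(weight_gb(7e9, 4))  # fp32: ~26 GiB
print(weight_gb(7e9, 2))  # fp16: ~13 GiB
```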
Thanks for replying. I don't know why, but somehow it still uses 34GB even when I switched to the fp16 model.
@tpoisonooo Thank you for your great work! However, I have the same problem as @DungMinhDao. I converted the model (7B) to fp16 using the tool script https://github.com/tpoisonooo/llama.onnx/blob/main/tools/convert-fp32-to-fp16.py. The model file is half the size of the original fp32 model, but 32GB of memory is still not enough to load the fp16 model. Is there anything wrong?
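For reference, here is a minimal sketch of how an fp32 → fp16 ONNX conversion is commonly done with `onnxconverter-common`; the repo's `tools/convert-fp32-to-fp16.py` may differ in details, and the file names below are placeholders:

```python
# Sketch: convert an ONNX model's fp32 weights to fp16.
# Paths are placeholders; the actual llama.onnx tool script may differ.
import onnx
from onnxconverter_common import float16

model = onnx.load("llama-fp32.onnx")                   # placeholder input path
model_fp16 = float16.convert_float_to_float16(model)   # cast initializers and ops to fp16
onnx.save(model_fp16, "llama-fp16.onnx")               # placeholder output path
```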
I tried inferencing with GPU after making some modifications to the code: I found all the files with `import onnxruntime` and added `import torch` before it (to make sure all the necessary CUDA-related libs were loaded). I also uninstalled `onnxruntime` and installed `onnxruntime-gpu` instead. It ran fast, but it took 34GB of GPU memory to load the model. I tried lowering `--poolsize`, but the situation didn't change (and with `--poolsize` less than 10, some parts of the model can't be loaded into either GPU or CPU).
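A minimal sketch of the kind of change described above, assuming the session is created directly with onnxruntime's CUDA execution provider (the model file name is a placeholder, not a file from this repo):

```python
# Sketch: load an ONNX model on GPU with onnxruntime-gpu.
import torch        # imported first so the CUDA libraries get loaded into the process
import onnxruntime as ort

# "decoder.onnx" is a placeholder model path.
session = ort.InferenceSession(
    "decoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # check that CUDAExecutionProvider is actually in use
```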