I've looked through a number of GPTQ forks and so far found nothing on this, so I thought to ask here: the quantization (compression) examples all show e.g. CUDA_VISIBLE_DEVICES=0 for that step, then multiple devices for benchmarking and inference. See, for example, the language generation section of https://github.com/qwopqwop200/GPTQ-for-LLaMa

I have plenty of CPU RAM, yet seemingly can't quantize llama-30b on just one of my 3060 (12 GB) GPUs. If that process could be split among GPUs, I think it would fit into two of them, much as --auto-devices can split almost any model for inference.

Is there a fundamental limitation that requires quantization to run on a single GPU?
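For reference, this is the kind of multi-GPU split I have in mind, borrowed from how Transformers/Accelerate device maps work for inference. It's only a rough sketch: the model path and memory caps are placeholders, and I don't know whether the GPTQ quantization scripts can consume a device map like this at all.

```python
# Hypothetical sketch: spread a LLaMA checkpoint across two 12 GB GPUs plus
# CPU RAM using a device map, similar to what --auto-devices does for inference.
# This only illustrates the "split among GPUs" idea; the GPTQ-for-LLaMa
# quantization scripts use their own loading/offload logic, so whether the
# quantization pass itself can work from such a map depends on the fork.

import torch
from transformers import AutoModelForCausalLM

MODEL_PATH = "path/to/llama-30b-hf"  # placeholder path, not from the original post

# Cap per-GPU usage just below 12 GB and let the remainder spill to CPU RAM.
max_memory = {0: "11GiB", 1: "11GiB", "cpu": "120GiB"}

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.float16,
    device_map="auto",       # requires the `accelerate` package
    max_memory=max_memory,
)

# Shows which layers ended up on which device (GPU 0, GPU 1, or CPU).
print(model.hf_device_map)
```

Capping each GPU below 12 GB forces the remaining layers onto CPU RAM, which is roughly the behavior I get from --auto-devices at inference time and would like to have during quantization.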
Replies: 2 comments

- I, too, would love to know this.
- Are there any solutions?