ROCm 5.5.0 needs to be installed beforehand, read this if you haven't done it.
Transformers inference with 13.6GB VRAM:
INFO:Loading 7B-hf...
INFO:Loaded the model in 6.34 seconds.
Output generated in 9.37 seconds (21.24 tokens/s, 199 tokens, context 27, seed 1005554483)
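For reference, this fp16 Transformers path boils down to roughly the following outside the webui. This is only a sketch: the model path matches the log above but is assumed to be a local directory, and the prompt is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/7B-hf"  # assumed local path, matching the log above
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # fp16 weights, roughly 13-14GB VRAM for a 7B model
    device_map="auto",          # place layers on the ROCm GPU (exposed via the CUDA API)
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```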
8-bit BitsAndBytes inference with 7.9GB VRAM:
INFO:Loading 7B-hf...
INFO:Loaded the model in 6.65 seconds.
Output generated in 40.70 seconds (4.89 tokens/s, 199 tokens, context 6, seed 603994963)
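The 8-bit path is essentially the same load call with `load_in_8bit=True`, assuming a bitsandbytes build that actually works on ROCm; the path is again a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/7B-hf"  # assumed local path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,  # int8 weights via bitsandbytes, about 8GB VRAM for a 7B model
    device_map="auto",
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```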
4-bit GPTQ inference with 4.5GB VRAM:
INFO:Loading Wizard-Vicuna-7B-Uncensored-GPTQ...
INFO:Found the following quantized model: models/Wizard-Vicuna-7B-Uncensored-GPTQ/Wizard-Vicuna-7B-Uncensored-GPTQ-4bit-128g.no-act-order.safetensors
INFO:Loaded the model in 2.10 seconds.
Output generated in 17.17 seconds (17.42 tokens/s, 299 tokens, context 52, seed 722332956)
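The webui loads this pre-quantized safetensors file through its GPTQ-for-LLaMa code path. As an illustration only (this is not the webui's code, and it assumes the AutoGPTQ library runs on this ROCm setup, which may not hold), a similar pre-quantized model could be loaded standalone like this:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "models/Wizard-Vicuna-7B-Uncensored-GPTQ"  # directory from the log above
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",        # ROCm devices are exposed through the CUDA API
    use_safetensors=True,   # load the 4-bit .safetensors checkpoint
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```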
8-bit BitsAndBytes has low performance at the moment, so GPTQ is recommended.
4-bit BitsAndBytes inference and GPTQ quantization don't work for now, and llama.cpp is not guaranteed to work either.
The Dockerfile can be found here. For more info, please visit https://github.com/evshiron/rocm_lab.