I'm trying to do a robust quantization over 34 languages, so I have a rather large dataset for the llama-imatrix file. On my M3 Max MacBook this will take >260 hrs, and it seems to crash every time I wake the screensaver, so I decided to try vast.ai hosting, using the ghcr.io/ggerganov/llama.cpp:full-cuda image from this repo. Something is definitely amiss, though. On my MacBook M3 Max, the ETA reported by llama-imatrix is ~260 hrs. The first time I tried, the system I chose was an AMD Epyc with 4x4090; the ETA was 660 hrs. I was expecting an ~8-fold improvement, not a ~3-fold slowdown. I used a script to execute the command, though, and it didn't record the commands used, so I worried it might not have captured the number of GPUs correctly and ended up passing bad params. So I tried again, this time on a Xeon server with 4x4090s.

This ETA was 636 hours (sorry, I cut it off in the version of the log that I downloaded). Why is it taking almost 3x as long to do this on a 4x4090 machine as it does locally?
The CUDA backend also does not support BF16, so most of the model is running on the CPU. Try an F16 model instead.

Also note that `-ngl -1` does not work the way you might expect; no layers will be offloaded that way. Use a large number to offload the entire model instead, e.g. `-ngl 99`.
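For anyone following along, a minimal sketch of what that advice looks like in practice. The file names and paths are placeholders, and exact flag behavior can vary between llama.cpp versions, so treat this as a starting point rather than the exact command:

```sh
# Produce an F16 GGUF (a BF16 GGUF would fall back to the CPU on the CUDA backend).
python convert_hf_to_gguf.py /path/to/hf-model --outtype f16 --outfile model-f16.gguf

# Compute the importance matrix with all layers offloaded to the GPUs.
./llama-imatrix -m model-f16.gguf -f calibration-34-langs.txt -o imatrix.dat -ngl 99
```

The key points are the F16 model and `-ngl 99`; with both in place the imatrix pass should actually run on the 4090s instead of the host CPU.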