Replies: 2 comments 1 reply
-
I believe so. In fact, isn't that essentially what BLAS offload is? Run a GGUF model with llama.cpp as the model loader, and toggle on Tensor Cores if applicable. Set n-gpu-layers to somewhere around 10-25; I find the right value depends on the model in question. The model will load what it can into GPU memory and offload the rest to system RAM/CPU. I suggest never letting your video memory get more than 98% full; aim for 90-93% while the model is running and generating a response. If it hits 99-100% it can freeze. If that happens, lower n-gpu-layers by 1 until it never maxes out video memory. It's a lot slower than GPU-only, but I can still run a 70B model on my 4080, though only at about 2K context, maybe 4K, while on smaller models I can enjoy 8-32K context. As for how to set this all up, it may be a bit complicated. I don't know how much the auto installer does these days, but even a year ago I had to do most of it manually; there are guides for it here and there in these Discussions. You will likely have to install other programs, and I think you need an NVIDIA GPU too.
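To illustrate the same partial-offload idea outside the web UI, here is a minimal sketch using the llama-cpp-python bindings (an assumption on my part; the thread is about the web UI's llama.cpp loader, which exposes the same n-gpu-layers knob). The model path is a placeholder and the layer count is just a starting point you would tune per model, as described above.

```python
from llama_cpp import Llama

# Hypothetical GGUF path -- replace with your own model file.
MODEL_PATH = "models/llama-2-70b.Q4_K_M.gguf"

# n_gpu_layers controls how many transformer layers are offloaded to VRAM;
# the rest stay in system RAM and run on the CPU. Start around 10-25 and
# lower it by 1 if VRAM usage climbs past ~93%.
llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=20,   # partial GPU offload
    n_ctx=2048,        # modest context to keep some VRAM headroom
)

output = llm("Q: What is partial GPU offload? A:", max_tokens=64)
print(output["choices"][0]["text"])
```

In the web UI, the equivalent setting should be the n-gpu-layers field on the llama.cpp loader.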
-
In fact, because of the NVIDIA driver's default settings, GPU shared memory (system RAM fallback) is enabled by default. So if the LLM doesn't fit into VRAM, the graphics card will automatically spill over into system memory. But you will soon find out how slow this solution is: communication between VRAM and system RAM over PCIe is a serious bottleneck, and it's not a drop from 30 t/s to 10 t/s but rather to 1-2 t/s. However, compared to CPU-only inference you will see a substantial improvement in prompt processing time; it can be dozens of times faster.
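One way to tell whether a model has spilled into shared system memory is to watch VRAM usage while it generates. A minimal sketch using the nvidia-ml-py (pynvml) bindings, assuming a single GPU at index 0; this is my own illustration, not something the web UI does for you.

```python
import pynvml

# Query current VRAM usage on GPU 0; run this while the model is generating.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

used_pct = 100.0 * mem.used / mem.total
print(f"VRAM: {mem.used / 2**30:.1f} GiB / {mem.total / 2**30:.1f} GiB ({used_pct:.0f}%)")

# If VRAM sits near 100% and generation drops to 1-2 t/s, the driver's
# sysmem fallback is likely kicking in; reduce n-gpu-layers until there
# is headroom again.
pynvml.nvmlShutdown()
```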
-
Hey everyone,
Is there an integrated way to use Unified Memory to share system RAM with the graphics card for loading larger models (NVIDIA CUDA)? I'm aware that this might reduce performance, but the alternative—using the CPU—is not an option. With the CPU, I get only 1-3 tokens/s, compared to 30 tokens/s normally. Even if Unified Memory reduces it to 10 tokens/s, it's still much faster than using the CPU.