Replies: 2 comments 1 reply
-
I believe so. In fact, isn't that essentially what BLAS offload is? Run a GGUF model with llama.cpp as the model loader, and toggle on Tensor Cores if applicable. Set n-gpu-layers to somewhere around 10-25; I find the right value depends on the model in question. The model will load what it can into GPU memory and offload the rest to system RAM/CPU. I suggest never letting your video memory get more than 98% full; aim for 90-93% while the model is running and generating a response. If it hits 99-100% it can freeze. If that happens, lower n-gpu-layers by 1 until it never maxes out video memory. It's a lot slower than GPU-only, but I can still run a 70B model on my 4080, though only at about 2K context, maybe 4K, while on smaller models I can enjoy 8-32K context. As for how to set this all up, it may be a bit complicated. I don't know how much the auto installer does these days, but even a year ago I had to do most of it manually; there are guides for it here and there in these Discussions. You will likely have to install other programs, and I think you need an NVIDIA GPU too.
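To illustrate the same partial-offload idea outside the web UI, here is a minimal sketch using the llama-cpp-python bindings (an assumption on my part; the thread is about the web UI's llama.cpp loader, which exposes the same n-gpu-layers knob). The model path is a placeholder and the layer count is just a starting point you would tune per model, as described above.

```python
from llama_cpp import Llama

# Hypothetical GGUF path -- replace with your own model file.
MODEL_PATH = "models/llama-2-70b.Q4_K_M.gguf"

# n_gpu_layers controls how many transformer layers are offloaded to VRAM;
# the rest stay in system RAM and run on the CPU. Start around 10-25 and
# lower it by 1 if VRAM usage climbs past ~93%.
llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=20,   # partial GPU offload
    n_ctx=2048,        # modest context to keep some VRAM headroom
)

output = llm("Q: What is partial GPU offload? A:", max_tokens=64)
print(output["choices"][0]["text"])
```

In the web UI, the equivalent setting should be the n-gpu-layers field on the llama.cpp loader.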
-
In fact, because of the NVIDIA driver's default settings, GPU shared memory (system RAM fallback) is enabled by default. So if the LLM doesn't fit into VRAM, the graphics card will automatically spill over into system memory. But you will soon find out how slow this solution is: communication between VRAM and system RAM over PCIe is a serious bottleneck, and it's not a drop from 30 t/s to 10 t/s but rather to 1-2 t/s. However, compared to CPU-only inference you will see a substantial improvement in prompt processing time; it can be dozens of times faster.
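One way to tell whether a model has spilled into shared system memory is to watch VRAM usage while it generates. A minimal sketch using the nvidia-ml-py (pynvml) bindings, assuming a single GPU at index 0; this is my own illustration, not something the web UI does for you.

```python
import pynvml

# Query current VRAM usage on GPU 0; run this while the model is generating.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

used_pct = 100.0 * mem.used / mem.total
print(f"VRAM: {mem.used / 2**30:.1f} GiB / {mem.total / 2**30:.1f} GiB ({used_pct:.0f}%)")

# If VRAM sits near 100% and generation drops to 1-2 t/s, the driver's
# sysmem fallback is likely kicking in; reduce n-gpu-layers until there
# is headroom again.
pynvml.nvmlShutdown()
```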
-
Hey everyone,
Is there an integrated way to use Unified Memory to share system RAM with the graphics card for loading larger models (NVIDIA CUDA)? I'm aware that this might reduce performance, but the alternative—using the CPU—is not an option. With the CPU, I get only 1-3 tokens/s, compared to 30 tokens/s normally. Even if Unified Memory reduces it to 10 tokens/s, it's still much faster than using the CPU.