-
Thankfully, it seems some work is being done to make llama.cpp interoperate with other programs. Crossing my fingers that we can use llama.cpp in text-generation-webui in the near future. Running on an i7-12700KF I get:
500 ms/token -> 30B Model
Since I am limited to 8 GB of VRAM, CPU inference is the only way for me, and probably for the vast majority of people, to run a model larger than 7B. Implementing support for llama.cpp could be the gateway to much higher adoption of text-generation-webui, since the default user experience of llama.cpp is lacking. text-generation-webui would allow much better and more advanced use of the model, pushing the boundaries of what is conceivably possible on consumer hardware.
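As a rough illustration of what hooking llama.cpp into a Python webui could look like, here is a minimal sketch that shells out to the llama.cpp `main` binary. The paths, flags, and output handling are assumptions for illustration, not the actual text-generation-webui integration:

```python
# Minimal sketch of one possible integration path: shelling out to the
# llama.cpp CLI from Python. The binary/model paths and the output
# handling below are assumptions, not actual text-generation-webui code.
import subprocess

LLAMA_BIN = "./llama.cpp/main"                     # assumed path to the compiled binary
MODEL_PATH = "./models/30B/ggml-model-q4_0.bin"    # assumed 4-bit quantized model file

def generate(prompt: str, n_tokens: int = 128, threads: int = 8) -> str:
    """Run llama.cpp in one-shot mode and return the generated text."""
    result = subprocess.run(
        [LLAMA_BIN, "-m", MODEL_PATH, "-p", prompt,
         "-n", str(n_tokens), "-t", str(threads)],
        capture_output=True, text=True, check=True,
    )
    out = result.stdout
    # llama.cpp echoes the prompt before the completion, so strip it if present.
    return out[len(prompt):] if out.startswith(prompt) else out

if __name__ == "__main__":
    print(generate("Building a website can be done in 10 simple steps:"))
```

A real integration would more likely keep the model loaded between requests and stream tokens back to the UI, rather than paying the model-load cost on every call as this one-shot sketch does.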
-
Hello,
https://github.com/ggerganov/llama.cpp
I wanted to know if you would be willing to integrate llama.cpp into your webui. With this implementation, we would be able to run the 4-bit version of LLaMA 30B with just 20 GB of RAM (no GPU required), and only 4 GB of RAM would be needed for the 7B (4-bit) model. Combining your repository with ggerganov's would give us the best of both worlds.
If anyone is wondering what speed to expect from CPU-only inference, here are the averages I got on my Intel Core i7-10700K:
160 ms/token -> 7B Model
350 ms/token -> 13B Model
760 ms/token -> 30B Model
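For context, here is a quick back-of-the-envelope sketch (Python) that turns those ms/token averages into tokens per second and an approximate wall-clock time per reply; the 200-token reply length is just an assumed example:

```python
# Convert the ms/token averages above into tokens/s and an approximate
# wall-clock time per reply. The 200-token reply length is an assumption.
ms_per_token = {"7B": 160, "13B": 350, "30B": 760}
reply_tokens = 200

for model, ms in ms_per_token.items():
    tokens_per_s = 1000 / ms
    reply_seconds = reply_tokens * ms / 1000
    print(f"{model}: {tokens_per_s:.1f} tokens/s, ~{reply_seconds:.0f} s per {reply_tokens}-token reply")
# e.g. 30B: 1.3 tokens/s, ~152 s per 200-token reply
```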