
Introduction

This document gives instructions for running a large language model (LLM) using only the CPU. Running an LLM on a high-end GPU gives faster inference, but not everyone has access to that kind of hardware.

The instructions will let you run any quantized, fine-tuned LLaMA-based model in GGML format. GGML is a binary file format for storing model weights that supports many different quantization levels.

For the best accuracy-to-size ratio I'll be using the 5-bit quantized models, but the 4-bit models give similar (slightly degraded) results.
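GGML builds of a model are usually published as several files, one per quantization level, distinguished by a suffix in the filename. A few representative examples (names follow the common GGML naming convention; the exact set of variants differs from repo to repo):

llama-2-13b-chat.ggmlv3.q4_0.bin     # 4-bit, smallest, most quality loss
llama-2-13b-chat.ggmlv3.q4_K_M.bin   # 4-bit k-quant, good size/quality balance
llama-2-13b-chat.ggmlv3.q5_K_M.bin   # 5-bit k-quant, used in this guide
llama-2-13b-chat.ggmlv3.q8_0.bin     # 8-bit, near full quality, largest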

There are many models to choose from. Since Meta recently released its Llama 2 model, I'll use that one in these instructions.

llama.cpp

llama.cpp is an implementation of the LLaMA inference engine written in C/C++. It supports running models on the CPU as well as on Apple M1/M2 GPUs.

To download and build (requires gcc + make):

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
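
If the build succeeds, you should end up with a main binary in the repository root (this is where the GGML-era versions of llama.cpp placed it). A quick sanity check:

# should print the usage text and available options
./main --help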

The model

We'll be using a 5-bit quantized GGML version of the Llama 2 13B chat model:

Go to the model page by following this link.

Then click the Files and versions tab.


Then download the llama-2-13b-chat.ggmlv3.q5_K_M.bin file.
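
Alternatively, the file can be fetched from the command line. The URL below is an assumption based on where this filename is commonly hosted (TheBloke's Llama-2-13B-chat-GGML repo on Hugging Face); verify it against the link above:

# URL is an assumption - check the model page above for the real location
wget https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q5_K_M.bin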


Copy this file into the llama.cpp/models directory.
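
For example, assuming the file landed in ~/Downloads and you are in the llama.cpp directory:

# source path is an assumption - adjust to wherever you saved the file
cp ~/Downloads/llama-2-13b-chat.ggmlv3.q5_K_M.bin ./models/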

Running the model

Before running the model, check whether there is a recommended prompt template. Depending on how the model was trained and fine-tuned, you will get more accurate results by using the recommended template.

Click Model card on the Hugging Face model page.


Scroll down until you see the prompt template.

You can modify the template to insert your own questions or instructions.
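
For reference, the Llama 2 chat models are generally prompted with a template of this shape (double-check the model card for the exact wording; the system message and question below are placeholders):

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

What is the capital of France? [/INST]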

Here is an example of a script I use to test the models. Download the script, modify it as appropriate, and then run it:

./promptLlama2.sh
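
The script itself is only linked above, not reproduced here. As a rough guide, here is a minimal sketch of such a wrapper, assuming llama.cpp's main binary, the model file in models/, and the prompt template above (the sampling flags and the default question are illustrative):

#!/bin/bash
# promptLlama2.sh - hypothetical sketch of a test wrapper for llama.cpp
# Model path assumes the file was copied into models/ as described above.
MODEL=./models/llama-2-13b-chat.ggmlv3.q5_K_M.bin

# The question is a placeholder - edit it, or pass your own as the first argument.
QUESTION=${1:-"What is the capital of France?"}

# Wrap the question in the Llama 2 chat template and run inference.
# -n limits the number of generated tokens; --temp controls sampling randomness.
./main -m "$MODEL" -n 512 --temp 0.7 \
  -p "[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

$QUESTION [/INST]"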