Merge pull request #1082 from zigabrencic/docs/open-llm-suport

Support for Open LLMs
AntonOsika · Apr 8, 2024 · d00be76 · d00be76
2 parents 2bda71b + 4e7b072
commit d00be76
Showing 5 changed files with 198 additions and 6 deletions.
diff --git a/docs/examples/open_llms/README.md b/docs/examples/open_llms/README.md
@@ -0,0 +1,56 @@
+# Test that the Open LLM is running
+
+First, start the server using only the CPU:
+
+```bash
+export model_path="TheBloke/CodeLlama-13B-GGUF/codellama-13b.Q8_0.gguf"
+python -m llama_cpp.server --model $model_path
+```
+
+Or with GPU support (recommended):
+
+```bash
+python -m llama_cpp.server --model TheBloke/CodeLlama-13B-GGUF/codellama-13b.Q8_0.gguf --n_gpu_layers 1
+```
+
+If more layers fit on your GPU, set `--n_gpu_layers` to a higher number.
+
+To find the number of layers available, run the above command and look for `llm_load_tensors: offloaded 1/41 layers to GPU` in the output.
+
+## Test API call
+
+Set the environment variables:
+
+```bash
+export OPENAI_API_BASE="http://localhost:8000/v1"
+export OPENAI_API_KEY="sk-xxx"
+export MODEL_NAME="CodeLlama"
+```
+
+Then ping the model from `python` using the `OpenAI` API:
+
+```bash
+python examples/open_llms/openai_api_interface.py
+```
+
+If you're not using `CodeLlama`, make sure to change the `MODEL_NAME` parameter.
+
+Or using `curl`:
+
+```bash
+curl --request POST \
+     --url http://localhost:8000/v1/chat/completions \
+     --header "Content-Type: application/json" \
+     --data '{ "model": "CodeLlama", "messages": [{"role": "user", "content": "Who are you?"}], "max_tokens": 60}'
+```
+
+If this works, also make sure that the `langchain` interface works, since that's how `gpte` interacts with LLMs.
+
+## Langchain test
+
+```bash
+export MODEL_NAME="CodeLlama"
+python examples/open_llms/langchain_interface.py
+```
+
+That's it 🤓 Time to go back to [running the example](/docs/open_models.md#running-the-example) and give `gpte` a try.
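The README above suggests reading the server output for a line such as `llm_load_tensors: offloaded 1/41 layers to GPU`. A minimal sketch of extracting those two numbers in Python follows. It assumes the log wording stays exactly as quoted in the README, and the helper name `parse_offloaded_layers` is hypothetical, not part of `llama-cpp-python`.

```python
import re
from typing import Optional, Tuple

# Pattern for the log line quoted in the README, e.g.
# "llm_load_tensors: offloaded 1/41 layers to GPU".
# Assumption: your server version prints this exact wording.
OFFLOAD_RE = re.compile(r"llm_load_tensors: offloaded (\d+)/(\d+) layers to GPU")


def parse_offloaded_layers(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (offloaded, total) from captured server output, or None if absent."""
    match = OFFLOAD_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

If the first number is lower than the second, there may be room to raise `--n_gpu_layers`, memory permitting.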
diff --git a/docs/examples/open_llms/langchain_interface.py b/docs/examples/open_llms/langchain_interface.py
@@ -0,0 +1,17 @@
+import os
+
+from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
+from langchain_openai import ChatOpenAI
+
+model = ChatOpenAI(
+    model=os.getenv("MODEL_NAME"),
+    temperature=0.1,
+    callbacks=[StreamingStdOutCallbackHandler()],
+    streaming=True,
+)
+
+prompt = (
+    "Provide me with only the code for a simple python function that sums two numbers."
+)
+
+model.invoke(prompt)
diff --git a/docs/examples/open_llms/openai_api_interface.py b/docs/examples/open_llms/openai_api_interface.py
@@ -0,0 +1,21 @@
+import os
+
+from openai import OpenAI
+
+client = OpenAI(
+    base_url=os.getenv("OPENAI_API_BASE"), api_key=os.getenv("OPENAI_API_KEY")
+)
+
+response = client.chat.completions.create(
+    model=os.getenv("MODEL_NAME"),
+    messages=[
+        {
+            "role": "user",
+            "content": "Provide me with only the code for a simple python function that sums two numbers.",
+        },
+    ],
+    temperature=0.7,
+    max_tokens=200,
+)
+
+print(response.choices[0].message.content)
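Both example scripts above read their settings with `os.getenv`, which returns `None` when a variable is unset. A small pre-flight check can make a missing variable easier to spot before a request is sent. This is only a sketch: the helper name `missing_env_vars` is hypothetical, and the list of required names is taken from the README's `export` lines.

```python
from typing import List, Mapping, Sequence

# Variables the README asks you to export before running the examples.
REQUIRED_VARS = ("OPENAI_API_BASE", "OPENAI_API_KEY", "MODEL_NAME")


def missing_env_vars(
    environ: Mapping[str, str], required: Sequence[str] = REQUIRED_VARS
) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not environ.get(name)]
```

Calling `missing_env_vars(os.environ)` at the top of a script and stopping when the list is non-empty would be one way to use it.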
diff --git a/docs/open_models.md b/docs/open_models.md
@@ -1,24 +1,119 @@
 Using with open/local models
 ============================
 
-You can integrate `gpt-engineer` with open-source models by leveraging an OpenAI-compatible API. One such API is provided by the [text-generator-ui _extension_ openai](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/openai/README.md).
+**Use `gpte` first with OpenAI models to get a feel for the `gpte` tool.**
+
+**Then go play with experimental Open LLMs 🐉 support and try not to get 🔥!!**
+
+At the moment, the best option for coding is still the `gpt-4` models provided by OpenAI. But open models are catching up and are a good free and privacy-oriented alternative if you have the proper hardware.
+
+You can integrate `gpt-engineer` with open-source models by leveraging an OpenAI-compatible API.
+
+We provide a minimal and clean solution below. It is not the only way to use open/local models, but it is the one we tested and would recommend to most users.
+
+For more details on why the solution below is recommended, see [this blog post](https://zigabrencic.com/blog/2024-02-21).
 
 Setup
 -----
 
-To get started, first set up the API with the Runpod template, as per the [instructions](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/openai/README.md).
+As the inference engine, we recommend [llama.cpp](https://github.com/ggerganov/llama.cpp) with its `python` bindings, `llama-cpp-python`.
+
+We chose `llama.cpp` because:
+
+- 1.) It supports the largest number of hardware acceleration backends.
+- 2.) It supports a diverse set of open LLMs.
+- 3.) `llama-cpp-python` is written in `python` directly on top of the `llama.cpp` inference engine.
+- 4.) It supports the `OpenAI` API and the `langchain` interface.
+
+To install `llama-cpp-python`, follow the official [installation docs](https://llama-cpp-python.readthedocs.io/en/latest/) or, for macOS with Metal support, [these docs](https://llama-cpp-python.readthedocs.io/en/latest/install/macos/).
+
+If you want to benefit from proper hardware acceleration on your machine, make sure to set the proper compiler flags before installing the package.
+
+- `linux`: `CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"`
+- `macos` with Metal support: `CMAKE_ARGS="-DLLAMA_METAL=on"`
+- `windows`: `$env:CMAKE_ARGS = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"`
+
+This will enable the `pip` installer to compile `llama.cpp` with the proper hardware acceleration backend.
+
+Then run:
+
+```bash
+pip install llama-cpp-python
+```
+
+For our use case, we also need to set up the web server that the `llama-cpp-python` library provides. To install:
+
+```bash
+pip install 'llama-cpp-python[server]'
+```
+
+For detailed use consult the [`llama-cpp-python` docs](https://llama-cpp-python.readthedocs.io/en/latest/server/).
+
+Before we proceed we need to obtain the model weights in the `gguf` format. That should be a single file on your disk.
+
+In case you have weights in other formats, check the `llama-cpp-python` docs for conversion to the `gguf` format.
+
+Models in other formats (`ggml`, `.safetensors`, etc.) won't work with the solution described below without prior conversion to the `gguf` file format!
+
+Which open model to use?
+==================
+
+Your best choices would be:
+
+- CodeLlama 70B
+- Mixtral 8x7B
+
+We are still testing this part, but the larger the model you can run, the better. Responses might be slower (in tokens/s), but code quality will be higher.
+
+To test that the open LLM `gpte` setup works, we recommend starting with a smaller model. You can download the weights of [CodeLlama-13B-GGUF by `TheBloke`](https://huggingface.co/TheBloke/CodeLlama-13B-GGUF). Choose the largest model version you can run (for example `Q6_K`), since quantisation degrades LLM performance.
+
+Feel free to try out larger models on your hardware and see what happens.
 
 Running the Example
--------------------
+==================
+
+To see that your setup works, check [test open LLM setup](examples/open_llms/README.md).
+
+If the above tests work, proceed 😉
+
+To check that `gpte` works with `CodeLlama`, we recommend creating a project with the following `prompt` file content:
+
+```
+Write a python script that sums up two numbers. Provide only the `sum_two_numbers` function and nothing else.
+
+Provide two tests:
 
-Once the API is set up, you can find the host and the exposed TCP port by checking your Runpod dashboard.
+assert(sum_two_numbers(100, 10) == 110)
+assert(sum_two_numbers(10.1, 10) == 20.1)
+```
 
-Then, you can use the port and host to run the following example using WizardCoder-Python-34B hosted on Runpod:
+Now run the LLM in a separate terminal:
 
+```bash
+python -m llama_cpp.server --model $model_path --n_batch 256 --n_gpu_layers 30
 ```
-  OPENAI_API_BASE=http://<host>:<port>/v1 python -m gpt_engineer.cli.main benchmark/pomodoro_timer --steps benchmark TheBloke_WizardCoder-Python-34B-V1.0-GPTQ
+
+Then, in another terminal window, set the following environment variables:
+
+```bash
+export OPENAI_API_BASE="http://localhost:8000/v1"
+export OPENAI_API_KEY="sk-xxx"
+export MODEL_NAME="CodeLlama"
+export LOCAL_MODEL=true
 ```
 
+And run `gpt-engineer` with the following command:
+
+```bash
+gpte <project_dir> $MODEL_NAME --lite --temperature 0.1
+```
+
+The `--lite` mode is needed for now because open models currently behave worse when given too many instructions. The temperature is set to `0.1` to get results that are as consistent as possible.
+
+That's it.
+
+*If something doesn't work as expected, or you figure out how to improve the open LLM support, please let us know.*
+
 Using Azure models
 ==================
 

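The run steps documented in `docs/open_models.md` above (set four environment variables, then call `gpte` with `--lite` and `--temperature 0.1`) can also be scripted. The sketch below only assembles the environment and the argument list. The helper names `local_llm_env` and `gpte_command` are hypothetical, and the values are copied from the docs in this diff.

```python
from typing import Dict, List


def local_llm_env(base_env: Dict[str, str], model_name: str = "CodeLlama") -> Dict[str, str]:
    """Copy base_env and add the variables listed in the docs."""
    env = dict(base_env)
    env.update(
        {
            "OPENAI_API_BASE": "http://localhost:8000/v1",
            "OPENAI_API_KEY": "sk-xxx",  # placeholder key, as in the docs
            "MODEL_NAME": model_name,
            "LOCAL_MODEL": "true",
        }
    )
    return env


def gpte_command(project_dir: str, model_name: str) -> List[str]:
    """Build the documented `gpte` invocation as an argument list."""
    return ["gpte", project_dir, model_name, "--lite", "--temperature", "0.1"]
```

With the server already running in another terminal, something like `subprocess.run(gpte_command(project_dir, "CodeLlama"), env=local_llm_env(dict(os.environ)), check=True)` would then start the run, where `project_dir` is your own project folder.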
diff --git a/gpt_engineer/applications/cli/main.py b/gpt_engineer/applications/cli/main.py
@@ -76,6 +76,7 @@ def load_env_if_needed():
         load_dotenv()
     if os.getenv("OPENAI_API_KEY") is None:
         load_dotenv(dotenv_path=os.path.join(os.getcwd(), ".env"))
+
     openai.api_key = os.getenv("OPENAI_API_KEY")
 
     if os.getenv("ANTHROPIC_API_KEY") is None:
@@ -480,6 +481,8 @@ def main(
 
     if ai.token_usage_log.is_openai_model():
         print("Total api cost: $ ", ai.token_usage_log.usage_cost())
+    elif os.getenv("LOCAL_MODEL"):
+        print("Total api cost: $ 0.0 since we are using local LLM.")
     else:
         print("Total tokens used: ", ai.token_usage_log.total_tokens())
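One detail of the new `elif os.getenv("LOCAL_MODEL")` branch is worth noting: any non-empty string is truthy in Python, so `LOCAL_MODEL=false` would also take the local branch. A stricter check could look like the sketch below. It is not part of this commit; the helper name `is_local_model` and the accepted spellings are assumptions chosen for illustration.

```python
from typing import Mapping

# Assumption: these spellings mean "enabled"; the commit itself defines no such list.
TRUTHY = {"1", "true", "yes", "on"}


def is_local_model(environ: Mapping[str, str]) -> bool:
    """Stricter alternative to `bool(os.getenv("LOCAL_MODEL"))`."""
    return environ.get("LOCAL_MODEL", "").strip().lower() in TRUTHY
```

The documented setting, `export LOCAL_MODEL=true`, behaves the same under both checks.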