A LeapfrogAI API-compatible vLLM wrapper for quantized and unquantized model inference across GPU infrastructures.
See the LeapfrogAI documentation website for system requirements and dependencies.
- LeapfrogAI API for a fully RESTful application
The default model included with this backend in this repository's officially released images is a 4-bit quantization of the Synthia-7b model.
All of the commands in this sub-section are executed within the `packages/vllm` sub-directory.
Optionally, you can specify a different model during Zarf creation:
```bash
uds zarf package create --confirm --set MODEL_REPO_ID=defenseunicorns/Hermes-2-Pro-Mistral-7B-4bit-32g --set MODEL_REVISION=main
```
If you decide to use a different model, you will likely need to change the generation and engine runtime configurations. Please see the Zarf Package Config and the values override file for details on which runtime parameters can be modified. These parameters are model-specific and can be found in the HuggingFace model cards and/or configuration files (e.g., prompt templates).
For example, during Zarf deployment, you can override the Zarf Package Config defaults by doing the following:
```bash
uds zarf package deploy zarf-package-vllm-amd64-dev.tar.zst --confirm --set ENFORCE_EAGER=True
```
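Multiple runtime parameters can be overridden in a single deploy command by chaining `--set` flags. In the sketch below, `ENFORCE_EAGER` comes from the example above, while `TENSOR_PARALLEL_SIZE` and `GPU_MEMORY_UTILIZATION` are illustrative variable names only; verify the exact names and defaults in the Zarf Package Config and values override file before using them.

```bash
# Sketch only: chain several --set overrides in one deploy command.
# Variable names other than ENFORCE_EAGER are assumptions and must be
# checked against the Zarf Package Config for this package.
uds zarf package deploy zarf-package-vllm-amd64-dev.tar.zst --confirm \
  --set ENFORCE_EAGER=True \
  --set TENSOR_PARALLEL_SIZE=1 \
  --set GPU_MEMORY_UTILIZATION=0.9
```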
To build and deploy the vllm backend Zarf package into an existing UDS Kubernetes cluster:
> **Important:** Execute the following commands from the root of the LeapfrogAI repository.
```bash
pip install 'huggingface_hub[cli,hf_transfer]' # Used to download the model weights from huggingface
make build-vllm LOCAL_VERSION=dev
uds zarf package deploy packages/vllm/zarf-package-vllm-*-dev.tar.zst --confirm
```
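Once the deployment completes, a quick way to confirm the backend came up is to look for the vLLM pod. The `leapfrogai` namespace below is an assumption; substitute whichever namespace your UDS cluster uses for LeapfrogAI packages.

```bash
# Check that the vLLM backend pod is running.
# The "leapfrogai" namespace is an assumption; adjust to your cluster.
kubectl get pods -n leapfrogai | grep vllm
```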
In local development, the `config.yaml` and `.env.example` must be modified if the model has been changed from the default. The LeapfrogAI SDK picks up `config.yaml` automatically, and the `.env` file must be sourced into the Python environment.
> **Important:** Execute the following commands from this sub-directory.
Create a `.env` file based on the `.env.example`:
```bash
cp .env.example .env
source .env
```
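To confirm that the variables were actually exported into the current shell, you can grep the environment. The `LFAI` prefix below is an assumption; check `.env.example` for the real variable names.

```bash
# Sanity check that the .env values are exported.
# The LFAI prefix is an assumption; see .env.example for the actual names.
env | grep -i lfai
```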
As necessary, modify the existing `config.yaml`:
```bash
vim config.yaml
```
To run the vllm backend locally:
```bash
# Install dev and runtime dependencies
make install

# Clone Model
python src/model_download.py

# Start the model backend
make dev
```
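With `make dev` running, the backend should be listening for connections from the LeapfrogAI API/SDK. Port `50051` below is a common gRPC default and is an assumption here; confirm the actual port in `config.yaml` or the Makefile.

```bash
# Confirm a listener is up on the expected port (50051 is an assumption).
ss -lnt | grep 50051
```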
To run the Docker container, use the following Makefile commands. `LOCAL_VERSION` must be consistent across the two Make commands.
In the root of the LeapfrogAI repository:
```bash
LOCAL_VERSION=dev make sdk-wheel
```
In the root of this vLLM sub-directory:
```bash
LOCAL_VERSION=dev make docker
```
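If you prefer to start the container by hand rather than through the Makefile, an invocation along the following lines may work. The image name, tag, GPU flag, and port are all assumptions; match them to what `make docker` actually builds and to your container runtime.

```bash
# Sketch only: image name/tag, port, and GPU flag are assumptions.
# Check the Makefile and the output of `make docker` for the real values.
docker run --rm --gpus all \
  --env-file .env \
  -p 50051:50051 \
  ghcr.io/defenseunicorns/leapfrogai/vllm:dev
```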