```toml
# For Mac (CPU and GPU), Windows (CPU and CUDA), or Linux (CPU and CUDA)
llm_client = "*"
```
This will download and build llama.cpp. See build.md for other features and backends, such as mistral.rs.
```rust
use llm_client::prelude::*;

// Loads the largest quant available based on your VRAM or system memory
let llm_client = LlmClient::llama_cpp()
    .mistral7b_instruct_v0_3() // Uses a preset model
    .init() // Downloads the model from Hugging Face and starts the inference interface
    .await?;
```
Several of the most common models are available as presets. Loading local models is also fully supported; see models.md for more information.
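As a rough sketch of what loading a local model might look like (the method name here is illustrative, not the crate's actual API; see models.md for the real loader methods):

```rust
use llm_client::prelude::*;

// Hypothetical: point the backend at a local GGUF file instead of a preset.
// The `local_quant_file_path` method name is an assumption for illustration.
let llm_client = LlmClient::llama_cpp()
    .local_quant_file_path("/models/my-model.Q4_K_M.gguf")
    .init()
    .await?;
```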
- Automated builds with support for CPU, CUDA, and macOS
- Easy model presets and quant selection
- Novel cascading prompt system for CoT and NLP workflows. DIY workflow creation supported!
- Breadth of configuration options (sampler params, retry logic, prompt caching, logit bias, grammars, etc.)
- API support for OpenAI, Anthropic, Perplexity, and any OpenAI-compatible API
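As a sketch of how a remote backend might be selected through the same client interface (the builder method names below are illustrative assumptions, not the crate's confirmed API):

```rust
use llm_client::prelude::*;

// Hypothetical: the same LlmClient interface, backed by a remote API
// instead of a local llama.cpp server. Method names are assumptions.
let llm_client = LlmClient::openai()
    .gpt_4_o_mini() // A preset remote model (illustrative)
    .init()
    .await?;
```

The design intent, per the source, is that local and API backends are interchangeable behind one client.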
In addition to basic LLM inference, llm_client is primarily designed for controlled generation using step-based cascade workflows. This prompting system runs pre-defined workflows that control and constrain both the overall structure of generation and individual tokens during inference. This allows the implementation of specialized workflows for specific tasks, shaping LLM outputs toward intended, reproducible outcomes.
```rust
// Receives a 'primitive' output
let response: u32 = llm_client.reason().integer()
    .instructions()
    .set_content("Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?")
    .return_primitive().await?;
assert_eq!(response, 1);
```
This runs the one-round reason cascading prompt workflow with an integer output.
This method significantly improves the reliability of LLM use cases. For example, there are test cases in this repo that can be used to benchmark an LLM. Accuracy increases substantially when basic inference is compared against a constrained outcome combined with a CoT-style cascading prompt workflow. The decision workflow, which runs N CoT workflows across a temperature gradient, approaches 100% accuracy on the test cases.
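The decision workflow described above might look roughly like the following (a sketch only: the `decision()` and `boolean()` builder methods are assumptions modeled on the `reason().integer()` example, not confirmed API):

```rust
// Hypothetical sketch: run the CoT reason workflow N times across a
// temperature gradient and resolve the votes into a single primitive.
// Builder method names are assumptions for illustration.
let response: bool = llm_client.reason().decision().boolean()
    .instructions()
    .set_content("Is the sky blue on a clear day?")
    .return_primitive()
    .await?;
```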
I have a full breakdown of this in my blog post, "Step-Based Cascading Prompts: Deterministic Signals from the LLM Vibe Space."
Jump to the llm_client crate's readme.md to find out how to use them.
- device config - customizing your inference config
- basic completion - the most basic request available
- basic primitive - returns the request primitive
- reason - a cascade workflow that performs CoT reasoning before returning a primitive
- decision - uses the reason workflow N times across a temperature gradient
- extract urls - a cascade workflow that extracts all URLs from text that match a predicate
- Improve the cascading workflow API to make it easier to use.
- Refactor the benchmarks module for easy model comparison.
- WebUI client for local consumption.
- Server mode for "LLM-in-a-box" deployments.
- Full Rust inference via mistral.rs or candle.
- llm_utils is a sibling crate that was split from llm_client. If you just need prompting, tokenization, model loading, etc., I suggest using the llm_utils crate on its own.
- llm_interface is a sub-crate of llm_client. It is the backend for LLM inference.
- llm_devices is a sub-crate of llm_client. It contains device and build management behavior.
- llama.cpp is used in server mode for LLM inference as the current default.
- mistral.rs is available for basic use, but is a WIP.
Shelby Jenkins - Here or LinkedIn