2. Usage
There are two ways to use Aphrodite Engine: through the OpenAI API server, or directly through the provided LLM class.
Aphrodite provides two REST API servers: OpenAI and KoboldAI. Below is an example of serving Llama 3 8B on 2 GPUs:
aphrodite run meta-llama/Meta-Llama-3-8B -tp 2
You can query the server via curl:
curl http://localhost:2242/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B",
"prompt": "Every age it seems is tainted by the greed of men. Rubbish to one such as I,",
"stream": false,
"mirostat_mode": 2,
"mirostat_tau": 6.5,
"mirostat_eta": 0.2
}'
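The same request can be sent from a Python script. The following is a minimal sketch using only the standard library, assuming the server launched above is listening on localhost:2242. The helper names build_completion_request and complete are illustrative and not part of Aphrodite, and the response is assumed to follow the OpenAI completions shape (the text under choices[0]).

```python
import json
import urllib.request

# Assumed base URL: matches the curl example above.
BASE_URL = "http://localhost:2242"


def build_completion_request(prompt, model="meta-llama/Meta-Llama-3-8B", **sampler):
    """Build (but do not send) a POST request for /v1/completions."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    payload.update(sampler)  # extra sampler settings, e.g. mirostat_mode=2
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **sampler):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **sampler)) as resp:
        body = json.load(resp)
    # Assumption: OpenAI-style response with the text under choices[0].
    return body["choices"][0]["text"]
```

With the server running, calling complete("Once upon a time", mirostat_mode=2, mirostat_tau=6.5, mirostat_eta=0.2) mirrors the curl request above.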
To enable the KoboldAI API, simply launch the OpenAI endpoint with the --launch-kobold-api flag.
And the curl request:
curl -X 'POST' \
'http://localhost:2242/api/v1/generate' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Niko the kobold stalked carefully down the alley, his small scaly figure obscured by a dusky cloak that fluttered lightly in the cold winter breeze.",
"max_context_length": 32768,
"max_length": 512,
"stream": false,
"mirostat_mode": 2,
"mirostat_tau": 6.5,
"mirostat_eta": 0.2
}'
Keep in mind that -tp 2 uses the first 2 visible GPUs. Adjust that value based on the number of available GPUs.
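Which GPUs count as "visible" can be controlled with the standard CUDA_VISIBLE_DEVICES environment variable. The sketch below builds a launch command and environment for a chosen set of GPUs. The launch_config helper is hypothetical, and the only Aphrodite options it uses are the ones shown above.

```python
import os


def launch_config(model, gpu_ids):
    """Return (command, env) for launching the server on the given GPUs.

    CUDA_VISIBLE_DEVICES restricts which devices the process can see,
    so passing -tp len(gpu_ids) then spans exactly those GPUs.
    """
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    env = dict(os.environ)  # copy, so the current process is not modified
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    command = ["aphrodite", "run", model, "-tp", str(len(gpu_ids))]
    return command, env
```

Passing the result to subprocess.Popen(command, env=env) would, for example, start the server on GPUs 2 and 3 when gpu_ids is [2, 3].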
You can also use Aphrodite without setting up a REST API server, for example directly in your own scripts.
First, import the LLM class to handle the model-related configurations, and SamplingParams for specifying sampler settings.
from aphrodite import LLM, SamplingParams
Then, define a single prompt or a list of prompts for the model.
prompts = [
    "What is a man? A miserable little",
    "Once upon a time",
]
Specify the sampling parameters:
sampling_params = SamplingParams(temperature=1.1, min_p=0.05)
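To give a feel for what these two settings do, here is a plain-Python sketch of temperature scaling followed by min-p filtering on a toy set of logits. It illustrates the general technique only, not Aphrodite's implementation: the order in which the engine applies its samplers may differ, and min_p_filter is a hypothetical helper.

```python
import math


def min_p_filter(logits, temperature=1.1, min_p=0.05):
    """Return renormalized probabilities after temperature scaling and min-p."""
    # Temperature > 1 flattens the distribution, < 1 sharpens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Min-p: drop tokens whose probability is below min_p * (top probability).
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]
```

For logits such as [5.0, 4.0, 0.0], the last token falls below the threshold and is removed, while the first two are renormalized to sum to 1.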
Define the model to use:
llm = LLM(model="mistralai/Mistral-7B-v0.1", tensor_parallel_size=2)
outputs = llm.generate(prompts, sampling_params)
The llm.generate method will use the loaded model to process the prompts. You can then print out the responses:
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Output: {generated_text!r}")
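In a script it is often handier to collect the results than to print them. The sketch below relies only on the attributes used in the loop above (output.prompt and output.outputs[0].text). The collect_texts helper is hypothetical and not part of Aphrodite.

```python
def collect_texts(outputs):
    """Return a list of (prompt, generated_text) pairs, one per request.

    Only the first completion of each request is kept, matching the
    output.outputs[0] access used when printing.
    """
    return [(output.prompt, output.outputs[0].text) for output in outputs]
```

A list of pairs is used rather than a dict so that repeated prompts are not collapsed into a single entry.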