3. Engine Options
Below is the full list of options available for the Aphrodite Engine API server.
--model <model_name_or_path>
Name or path of the HuggingFace model to use, example:
--tokenizer <tokenizer_name_or_path>
Name or path of the HuggingFace tokenizer. Defaults to the same value as --model.
--revision <revision>
The specific model branch to use. It can be a branch, a tag, or a commit ID. If unspecified, the main branch will be used.
--tokenizer-mode {auto,slow}
The tokenizer mode to use. "auto" will use the fast tokenizer if available.
--trust-remote-code
Trust remote code shipped with the model repository. Some models require this in order to load.
--download-dir <directory>
The directory to download the weights to and load them from. Defaults to the default HF cache (~/.cache).
--load-format {auto,pt,safetensors,npcache,dummy}
The format of the model weights to load. Will also filter the download from HuggingFace if they provide multiple formats for the model.
- "auto" will load safetensors by default and fall back to the PyTorch bin format.
- "pt" will load in pytorch bin format.
- "safetensors" will load in safetensors format.
- "npcache" will load in pytorch format and store a numpy cache to speed up the loading.
- "dummy" will init the weights with random values - for testing purposes.
--dtype {auto,half,float16,bfloat16,float,float32}
The data type to use for loading the model.
- "auto" will use FP16 for FP32/FP16 models, and BF16 for BF16 models.
- "half" = FP16.
- "float16" = FP16.
- "bfloat16" will load the weights in BF16.
- "float" = FP32
- "float32" = FP32
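The mapping in the list above can be written out as a small lookup. This is only a sketch of the documented behaviour; `resolve_dtype` and the `checkpoint_dtype` argument are hypothetical names, not part of Aphrodite.

```python
def resolve_dtype(requested: str, checkpoint_dtype: str) -> str:
    """Return the precision the weights would load in (illustrative only)."""
    aliases = {
        "half": "float16",
        "float16": "float16",
        "bfloat16": "bfloat16",
        "float": "float32",
        "float32": "float32",
    }
    if requested == "auto":
        # FP32 and FP16 checkpoints load as FP16; BF16 checkpoints stay BF16.
        return "bfloat16" if checkpoint_dtype == "bfloat16" else "float16"
    if requested not in aliases:
        raise ValueError(f"unknown dtype: {requested}")
    return aliases[requested]


print(resolve_dtype("auto", "float32"))   # float16
print(resolve_dtype("float", "float16"))  # float32
```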
--max-model-len <length>
Model context length. If unspecified, will be automatically set to the model's default. If set to a higher number, will automatically adjust RoPE scaling to increase model's max context length.
--worker-use-ray
Use Ray for distributed serving. It's on by default if using more than 1 GPU. Useful for 1 GPU scenarios too for a slight increase in throughput.
--pipeline-parallel-size (-pp) <size>
Number of pipeline stages. Currently unsupported.
--tensor-parallel-size (-tp) <size>
The number of tensor parallel replicas. Set this to the number of GPUs you want to use.
--max-parallel-loading-workers <workers>
Load model sequentially in multiple batches to avoid RAM OOM when using tensor parallelism and larger models.
--block-size {8,16,32}
Token block size for contiguous chunks of tokens. Larger blocks may reduce the overhead of managing many small blocks but could lead to inefficiencies if the granularity of the cached activations is too coarse. Smaller blocks provide finer granularity, which can be more efficient if the model frequently accesses small, non-contiguous chunks of tokens.
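To get a feel for the granularity trade-off, the arithmetic below counts how many blocks a sequence occupies and how many token slots in its last block stay unused. It assumes the cache is allocated in whole blocks; the function names are made up for illustration.

```python
import math


def blocks_needed(num_tokens: int, block_size: int) -> int:
    # A partially filled block still occupies a whole block.
    return math.ceil(num_tokens / block_size)


def unused_slots(num_tokens: int, block_size: int) -> int:
    return blocks_needed(num_tokens, block_size) * block_size - num_tokens


for size in (8, 16, 32):
    print(size, blocks_needed(1000, size), unused_slots(1000, size))
```

For a 1000-token sequence, larger blocks mean fewer blocks to manage but more unused slots in the final block.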
--seed <seed>
The initial random seed to use.
--swap-space <size>
CPU swap space size (GiB) per GPU. This space is utilized to temporarily store the states of requests when they can't be held in GPU memory, particularly useful when requesting best_of > 1.
--gpu-memory-utilization (-gmu) <fraction>
The fraction of GPU memory to be used for the model, which can range from 0 to 1. For example, a value of 0.5 would reserve 50% of the GPU for Aphrodite. If unspecified, will be set to 0.9 (90% of the GPU memory).
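As a quick sanity check, the reserved amount is simply the fraction multiplied by the card's total memory. The helper below is a hypothetical illustration, not an Aphrodite function.

```python
def reserved_gib(total_gib: float, gpu_memory_utilization: float = 0.9) -> float:
    """GiB of GPU memory reserved for a given utilization fraction."""
    if not 0 < gpu_memory_utilization <= 1:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return total_gib * gpu_memory_utilization


print(reserved_gib(24.0, 0.5))  # 12.0 GiB on a 24 GiB card
```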
--max-num-batched-tokens <tokens>
Maximum number of tokens that can be processed in a single batch during model execution. It helps control the memory usage and the size of the computation for each batch. If the number of tokens from the incoming requests exceeds this limit, the scheduler will create multiple batches to process the requests.
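The splitting behaviour can be pictured with a simplified first-come-first-served grouping. The real scheduler takes more into account (running sequences, memory, padding), so treat this as a rough sketch; `split_into_batches` is a hypothetical name.

```python
def split_into_batches(prompt_lengths, max_num_batched_tokens):
    """Group prompt lengths greedily so no batch exceeds the token limit."""
    batches, current, used = [], [], 0
    for n in prompt_lengths:
        if n > max_num_batched_tokens:
            raise ValueError("a single prompt exceeds the batch limit")
        if used + n > max_num_batched_tokens:
            # The next prompt does not fit: close this batch and start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        batches.append(current)
    return batches


print(split_into_batches([1000, 1500, 800, 2000], 2560))
# [[1000, 1500], [800], [2000]]
```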
--max-num-seqs <sequences>
Maximum number of requests to be processed in parallel per iteration.
--max-paddings <paddings>
Specify the maximum number of padding tokens that can be added to a batch. Padding is used to ensure that all sequences in a batch have the same length, which is necessary for efficient parallel processing on GPUs. This limit prevents excessive padding, which can waste computation and memory.
--disable-log-stats
Disable logging statistics.
--disable-log-requests
Disable the request logging.
--max-log-len <length>
Maximum length of the logs. By default it's set to infinite. Set to 0 to disable logging the prompts.
--quantization (-q) {awq,gguf,gptq,quip,squeezellm,None}
Method used to quantize the weights. Pair with --dtype float16.
--enforce-eager
Use eager-mode PyTorch and disable CUDA graphs. If this flag is not set, eager mode and CUDA graphs are used in a hybrid fashion for maximum performance and flexibility. CUDA graphs are a CUDA feature that captures a sequence of operations so that it can be replayed later. This can reduce the overhead of launching many small kernels and improve performance, especially for repetitive tasks.
--max-context-len-to-capture <length>
Maximum context length covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode. CUDA graphs use more memory, so this option is important for managing the trade-off between the performance gains from using CUDA graphs and the memory overhead associated with capturing larger graphs.
--disable-custom-all-reduce
Disable the custom all-reduce kernels for multi-GPU setups. The kernels are enabled by default; disable them if you experience instabilities.
--kv-cache-dtype {fp8_e5m2,None}
The data type to use for the KV cache. Set to fp8_e5m2 for memory savings with a slight increase in throughput.
--served-model-name <name>
Pretty name to use in the API. If not specified, the model name will be the same as the --model arg.
--max-length <length>
The maximum length to report to the API. Does not affect the engine. Kobold API only.
--api-keys [key1,key2]
Authorization API key for the OpenAI server. Not applicable to Kobold API. Optional.
--chat-template <path>
Path to the jinja file with the chat template. By default, it will use the model's specified template (if available). If not found, will disable Chat Completions. OpenAI API only.
--response-role <role>
The role name to return for OpenAI chat completions.
--ssl-keyfile <path>
Path to the SSL key file.
--ssl-certfile <path>
Path to the SSL certificate file.
n
The number of output sequences to return for a prompt.
best_of
Number of output sequences to generate from the prompt. From these best_of sequences, the top n sequences are returned. By default, it's set equal to n. If use_beam_search is True, it'll be treated as beam width.
presence_penalty
Penalize new tokens based on whether they appear in the generated text so far. Setting it to higher than 0 encourages the model to use new tokens, and lower than zero encourages the model to repeat tokens. Disabled: 0.
frequency_penalty
Penalize new tokens based on their frequency in the generated text so far. Setting it to higher than 0 encourages the model to use new tokens, and lower than zero encourages the model to repeat tokens. Disabled: 0.
repetition_penalty
Penalize new tokens based on whether they appear in the prompt and the generated text so far. Values higher than 1 encourage the model to use new tokens, while lower than 1 encourage the model to repeat tokens. Disabled: 1.
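The three penalties above can be sketched on a plain list of logits. This assumes the commonly used definitions (subtractive presence and frequency penalties over generated tokens, and a multiplicative repetition penalty over prompt plus generated tokens); the engine's actual implementation may differ in detail, and `apply_penalties` is a hypothetical helper.

```python
from collections import Counter


def apply_penalties(logits, prompt_ids, output_ids,
                    presence_penalty=0.0, frequency_penalty=0.0,
                    repetition_penalty=1.0):
    out = list(logits)
    # Repetition penalty: tokens seen in the prompt or the output so far.
    for tok in set(prompt_ids) | set(output_ids):
        if out[tok] > 0:
            out[tok] /= repetition_penalty
        else:
            out[tok] *= repetition_penalty
    # Presence and frequency penalties: tokens seen in the output so far.
    for tok, count in Counter(output_ids).items():
        out[tok] -= frequency_penalty * count + presence_penalty
    return out


print(apply_penalties([2.0, 1.0, -1.0, 0.5], prompt_ids=[0], output_ids=[1, 1, 2],
                      presence_penalty=0.5, frequency_penalty=0.25,
                      repetition_penalty=2.0))
# [1.0, -0.5, -2.75, 0.5]
```

With the "disabled" values (0, 0, and 1) the logits come back unchanged.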
temperature
Control the randomness of the output. Lower values make the model more deterministic, while higher values make the model more random. Disabled: 1.
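Temperature is conventionally applied by dividing the logits before the softmax. A minimal sketch with a hypothetical helper name:

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token; higher temperatures flatten the distribution.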
dynatemp_range
Allows the user to use a Dynamic Temperature that scales based on the entropy of token probabilities (normalized by the maximum possible entropy for a distribution so it scales well across different K values). Controls the variability of token probabilities. Dynamic Temperature takes minimum and maximum temperature values; the minimum temperature is calculated as temperature - dynatemp_range, and the maximum temperature as temperature + dynatemp_range. Disabled: 0.
dynatemp_exponent
The exponent value for dynamic temperature. Defaults to 1. Higher values will trend towards lower temperatures, lower values will trend toward higher temperatures.
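One way to read the two dynamic temperature parameters together: the normalized entropy, raised to the exponent, interpolates between the minimum and maximum temperature. The sketch below assumes that interpolation form; `dynamic_temperature` is a hypothetical helper, not the engine's code.

```python
import math


def dynamic_temperature(probs, temperature, dynatemp_range, dynatemp_exponent=1.0):
    min_temp = temperature - dynatemp_range
    max_temp = temperature + dynatemp_range
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(probs))  # entropy of a uniform distribution
    normalized = entropy / max_entropy if max_entropy > 0 else 0.0
    # normalized is in [0, 1], so a larger exponent pulls the result
    # towards min_temp.
    return min_temp + (max_temp - min_temp) * normalized ** dynatemp_exponent


confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(dynamic_temperature(confident, 1.0, 0.5))
print(dynamic_temperature(uncertain, 1.0, 0.5))
```

A confident distribution ends up near the minimum temperature and a flat one near the maximum.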
smoothing_factor
The smoothing factor to use for Quadratic Sampling.
top_p
Control the cumulative probability of the top tokens to consider. Disabled: 1.
top_k
Control the number of top tokens to consider. Disabled: -1.
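top_p and top_k both truncate the sorted distribution and renormalize what is left. A minimal sketch over a list of probabilities, using a hypothetical helper and the disabled values given above as defaults:

```python
def top_k_top_p_filter(probs, top_k=-1, top_p=1.0):
    """Return {token_id: renormalized probability} for the surviving tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k != -1:
        order = order[:top_k]  # keep only the k most likely tokens
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:  # stop once the nucleus covers top_p
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


probs = [0.5, 0.3, 0.15, 0.05]
print(sorted(top_k_top_p_filter(probs, top_k=2)))    # [0, 1]
print(sorted(top_k_top_p_filter(probs, top_p=0.7)))  # [0, 1]
```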
top_a
Controls the threshold probability for tokens, reducing randomness when AI certainty is high. Does not significantly affect output creativity. Disabled: 0.
min_p
Controls the minimum probability for a token to be considered, relative to the probability of the most likely token. Disabled: 0.
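Both top_a and min_p derive a cutoff from the probability of the most likely token. The sketch below assumes the commonly cited formulas (top_a times the squared top probability, and min_p times the top probability); it is not taken from the engine's source, and `threshold_filter` is a hypothetical name.

```python
def threshold_filter(probs, top_a=0.0, min_p=0.0):
    """Drop tokens below the stricter of the two thresholds, then renormalize."""
    p_max = max(probs)
    threshold = max(top_a * p_max ** 2, min_p * p_max)
    kept = [i for i, p in enumerate(probs) if p >= threshold]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


probs = [0.6, 0.25, 0.1, 0.05]
print(sorted(threshold_filter(probs, min_p=0.2)))  # [0, 1]
print(sorted(threshold_filter(probs, top_a=0.2)))  # [0, 1, 2]
```

Because the cutoff scales with the top probability, both filters prune more aggressively when the model is confident.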
tfs
Tail-Free Sampling. Eliminates low probability tokens after identifying a plateau in sorted token probabilities. It minimally affects the creativity of the output and is best used for longer texts. Disabled: 1.
eta_cutoff
Used in Eta sampling, it adapts the cutoff threshold based on the entropy of the token probabilities, optimizing token selection. Value is in units of 1e-4. Disabled: 0.
epsilon_cutoff
Used in Epsilon sampling, it sets a simple probability threshold for token selection. Value is in units of 1e-4. Disabled: 0.
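The "units of 1e-4" convention means a value of 10 corresponds to a probability threshold of 0.001. The sketch below shows that conversion for both cutoffs. The eta threshold formula, min(eps, sqrt(eps) * exp(-entropy)), follows the usual description of eta sampling and is an assumption here, as are the helper names.

```python
import math


def _keep(probs, threshold):
    kept = [i for i, p in enumerate(probs) if p >= threshold]
    # Always keep at least the most likely token.
    return kept or [max(range(len(probs)), key=lambda i: probs[i])]


def epsilon_filter(probs, epsilon_cutoff=0.0):
    return _keep(probs, epsilon_cutoff * 1e-4)


def eta_filter(probs, eta_cutoff=0.0):
    eps = eta_cutoff * 1e-4
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # The threshold shrinks as the distribution's entropy grows.
    return _keep(probs, min(eps, math.sqrt(eps) * math.exp(-entropy)))


probs = [0.9, 0.0995, 0.0005]
print(epsilon_filter(probs, epsilon_cutoff=10))  # [0, 1]
print(eta_filter(probs, eta_cutoff=10))          # [0, 1]
```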
typical_p
This method regulates the information content in the generated text by sorting tokens based on the sum of entropy and the natural logarithm of token probability. It has a strong effect on output content but still maintains creativity even at low settings. Disabled: 1.
mirostat_mode
The mirostat mode to use. Only 2 is currently supported. Mirostat is an adaptive decoding algorithm that generates text with a predetermined perplexity value, providing control over repetitions and thus ensuring high-quality, coherent, and fluent text. Disabled: 0.
mirostat_tau
The target "surprise" value that Mirostat works towards. Range is from 0 to infinity.
mirostat_eta
The learning rate at which Mirostat updates its internal surprise value. Range is from 0 to infinity.
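To show how mirostat_tau and mirostat_eta interact, here is a sketch of a single Mirostat 2 step as the algorithm is usually described: tokens whose surprise exceeds the running value mu are dropped, one token is sampled, and mu is nudged towards the target. Starting mu at 2 * tau is a common convention and an assumption here, and `mirostat_v2_step` is a hypothetical helper.

```python
import math
import random


def mirostat_v2_step(probs, mu, tau, eta, rng):
    """Return (sampled token id, updated mu)."""
    # Keep tokens whose surprise (-log2 p) does not exceed mu.
    kept = [i for i, p in enumerate(probs) if p > 0 and -math.log2(p) <= mu]
    if not kept:
        kept = [max(range(len(probs)), key=lambda i: probs[i])]
    total = sum(probs[i] for i in kept)
    weights = [probs[i] / total for i in kept]
    pick = rng.choices(range(len(kept)), weights=weights, k=1)[0]
    observed = -math.log2(weights[pick])
    # Move mu against the error between observed and target surprise.
    return kept[pick], mu - eta * (observed - tau)


rng = random.Random(0)
tau, eta = 3.0, 0.1
mu = 2 * tau
for _ in range(3):
    token, mu = mirostat_v2_step([0.5, 0.25, 0.125, 0.125], mu, tau, eta, rng)
    print(token, round(mu, 3))
```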
use_beam_search
Whether to use beam search instead of normal sampling.
length_penalty
Penalize sequences based on their length. Used in beam search.
early_stopping
Controls the stopping condition for beam search. It accepts the following values: True, where the generation stops as soon as there are best_of complete candidates; False, where a heuristic is applied and the generation stops when it is very unlikely to find better candidates; "never", where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm).
stop
List of strings (words) that stop the generation when they are generated. The returned output will not contain the stop strings.
stop_token_ids
List of token IDs that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens (e.g. EOS).
include_stop_str_in_output
Whether to include the stop strings in output text. Default: False.
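The effect of stop and include_stop_str_in_output on the returned text can be sketched as a truncation at the earliest stop string. This only illustrates the documented behaviour on an already generated string; the engine checks stop conditions during generation, and `truncate_at_stop` is a hypothetical helper.

```python
def truncate_at_stop(text, stop, include_stop_str_in_output=False):
    earliest = None
    for s in stop:
        if not s:
            continue  # ignore empty stop strings
        idx = text.find(s)
        if idx != -1 and (earliest is None or idx < earliest[0]):
            earliest = (idx, s)
    if earliest is None:
        return text
    idx, s = earliest
    return text[:idx + len(s)] if include_stop_str_in_output else text[:idx]


text = "Hello there.\nUser: hi"
print(repr(truncate_at_stop(text, ["\nUser:"])))        # 'Hello there.'
print(repr(truncate_at_stop(text, ["\nUser:"], True)))  # 'Hello there.\nUser:'
```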
ignore_eos
Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.
max_tokens
The maximum number of tokens to generate per output sequence.
logprobs
Number of log probabilities to return per output token. Note that the implementation follows the OpenAI API: The return result includes the log probabilities on the logprobs most likely tokens, as well as the chosen tokens. The API will always return the log probability of the sampled token, so there may be up to logprobs+1 elements in the response.
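The "up to logprobs+1" rule follows from merging the sampled token into the top-N set. A small sketch with a hypothetical helper:

```python
import math


def token_logprobs(probs, sampled_id, logprobs):
    """Log probabilities of the top `logprobs` tokens plus the sampled token."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    result = {i: math.log(probs[i]) for i in order[:logprobs]}
    # The sampled token is always reported, even if it is outside the top N.
    result[sampled_id] = math.log(probs[sampled_id])
    return result


probs = [0.5, 0.3, 0.15, 0.05]
print(sorted(token_logprobs(probs, sampled_id=3, logprobs=2)))  # [0, 1, 3]
print(sorted(token_logprobs(probs, sampled_id=0, logprobs=2)))  # [0, 1]
```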
prompt_logprobs
Number of log probabilities to return per prompt token.
custom_token_bans
List of token IDs to ban from being generated.
skip_special_tokens
Whether to skip special tokens in the output. Default: True.
spaces_between_special_tokens
Whether to add spaces between special tokens in the output. Default: True.
logits_processors
List of LogitsProcessors to change the probability of token prediction at runtime. Aliased to logit_bias in the API request body.
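Conceptually, a logit bias adds an offset to chosen token logits, while custom_token_bans removes tokens outright. The sketch below assumes an OpenAI-style logit_bias mapping of token ID to additive bias and models a ban as setting the logit to negative infinity; `apply_bias_and_bans` is a hypothetical helper, not the engine's processor.

```python
def apply_bias_and_bans(logits, logit_bias=None, custom_token_bans=None):
    out = list(logits)
    for tok, bias in (logit_bias or {}).items():
        out[int(tok)] += bias  # JSON keys arrive as strings
    for tok in custom_token_bans or []:
        out[tok] = float("-inf")  # a banned token can never be sampled
    return out


print(apply_bias_and_bans([1.0, 2.0, 3.0], logit_bias={"1": 5.0}, custom_token_bans=[2]))
# [1.0, 7.0, -inf]
```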