6. FAQ & Issues
Aphrodite by default uses 90% of your GPU(s) VRAM. To limit this behaviour, set -gmu 0.5 or gpu_memory_utilization=0.5 to use 50% instead.
If you're using a quantized model (GPTQ, AWQ, etc.), make sure you're correctly specifying this when launching the model via the -q flag, e.g. -q gptq or quantization=gptq.
If you're running a Mistral or Mixtral-based model, try limiting your model's max context length (it launches with 32k by default) using the --max-model-len option, e.g. --max-model-len 8192 or max_model_len=8192.
If the issue persists, try increasing or decreasing the --gpu-memory-utilization value.
You can also use the FP8 KV cache option with --kv-cache-dtype fp8 to save even more memory. Note that this does not require an H100/4090 GPU.
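To see why a shorter context and an FP8 KV cache both help, here is a rough back-of-the-envelope estimate of the KV cache needed to hold one sequence at full context length. It is only a sketch: the function name is made up for illustration, and the layer count, KV head count, and head dimension are assumed values taken from the published Mistral-7B configuration. The memory Aphrodite actually reserves also depends on its block size and on --gpu-memory-utilization.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size for one sequence of num_tokens tokens.

    Every token stores one key vector and one value vector per layer,
    hence the leading factor of 2. The defaults are assumed
    Mistral-7B-like dimensions, not values read from Aphrodite.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


GIB = 1024 ** 3

for max_model_len in (32768, 8192):
    fp16 = kv_cache_bytes(max_model_len, bytes_per_elem=2) / GIB
    fp8 = kv_cache_bytes(max_model_len, bytes_per_elem=1) / GIB
    print(f"{max_model_len} tokens: fp16 {fp16:.2f} GiB, fp8 {fp8:.2f} GiB")
```

Under these assumptions, a full 32k-token sequence needs about 4 GiB of cache in FP16 and 2 GiB in FP8, while an 8192-token sequence needs about 1 GiB and 0.5 GiB respectively.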
Aphrodite supports automatic RoPE extension. Simply specify your desired context length with the --max-model-len option.
If you install via the pip package, then yes. If building from source, CUDA 11.8 or lower is required at the moment.
Try increasing --max-num-seqs (the default value is 256).
Windows users will have to rely on WSL for now. Native Windows support is currently being worked on.
Due to limitations from the upstream ROCm fork of Flash Attention, only a handful of datacenter-grade AMD GPUs are supported. Work is being done to support consumer AMD hardware.
There are multiple contributing factors:
- Initializing a distributed environment involves setting up communication between the GPUs, which can take some time, depending on how the GPUs communicate with each other (NVLink, PCIe, etc).
- Aphrodite Engine profiles the memory usage when initializing the KV cache. This involves running a forward pass with dummy inputs to profile the memory usage of the model, which can be significantly time-consuming, especially so in distributed environments.
- Allocating memory blocks on the GPU and CPU can take a significant amount of time, especially when the number of blocks is large.
The increase in init time isn't necessarily linear with the number of GPUs because each additional GPU adds overhead for communication and synchronization between GPUs. The memory profiling and cache initialization steps are performed for each GPU, which can lead to an increase in init time.
You might be seeing an error similar to this:
The detected CUDA version (12.1) mismatches the version that was used to compile PyTorch (11.8). Please make sure to use the same CUDA versions.
This is normally due to your environment referring to the global installation of CUDA and not the one in your current env. Run which nvcc and note down the output. For example, if your output is /home/anon/miniconda3/envs/aphrodite/bin/nvcc, run the command below.
Warning: Don't forget to replace /home/anon with the actual path from the which nvcc output!
export CUDA_HOME=/home/anon/miniconda3/envs/aphrodite
Then run the installation command again.
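If you would rather not edit the path by hand, the same derivation can be scripted. The sketch below assumes nvcc lives at <root>/bin/nvcc, as in the example path above, and simply strips the last two path components; the helper names are made up for illustration.

```python
import shutil
from pathlib import Path
from typing import Optional


def cuda_home_from_nvcc(nvcc_path: str) -> str:
    """Return the CUDA root for an nvcc binary located at <root>/bin/nvcc."""
    return str(Path(nvcc_path).parent.parent)


def suggest_cuda_home() -> Optional[str]:
    """Build the export line for the nvcc found on PATH, or None if there is none."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    return f"export CUDA_HOME={cuda_home_from_nvcc(nvcc)}"


if __name__ == "__main__":
    print(suggest_cuda_home() or "nvcc not found on PATH")
```

Run it inside the activated environment so that the nvcc it finds is the same one that which nvcc reports, then paste the printed export line into your shell.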
On some GPU configurations, you may see this error:
ncclInternalError: Internal check failed.
Last error:
No NVML device handle. Skipping nvlink detection.
This happens if you're doing tensor parallelism (multi-GPU) on NVLinked NVIDIA GPUs and they don't support P2P. Please run this command before running the server:
export NCCL_P2P_DISABLE=1
Alternatively, you can prepend NCCL_P2P_DISABLE=1 to your server launch command.
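If you start the server from a Python script rather than a shell, the equivalent is to pass the variable through the child process environment. This is a minimal sketch: env_with_p2p_disabled is a hypothetical helper, and launch_cmd is a stand-in that only echoes the variable, so replace it with your real launch command.

```python
import os
import subprocess
import sys


def env_with_p2p_disabled(base_env=None):
    """Return a copy of the environment with NCCL_P2P_DISABLE=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_P2P_DISABLE"] = "1"
    return env


# Stand-in for the real server launch command (an assumption for this sketch):
# the child process only prints the variable to show that it was inherited.
launch_cmd = [sys.executable, "-c",
              "import os; print(os.environ['NCCL_P2P_DISABLE'])"]

result = subprocess.run(launch_cmd, env=env_with_p2p_disabled(),
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())
```

Copying the environment instead of mutating os.environ keeps the setting scoped to the server process.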
This is likely due to Docker container port forwarding issues. We're trying to have the container scrape from the host, which is an unusual use-case and not easily fixed. An easy solution is to use cloudflared to forward your local port to a public URL.
Download the cloudflared binary and run the following:
chmod +x cloudflared-linux-amd64
./cloudflared-linux-amd64 tunnel --url localhost:2242
Then, edit prometheus.yaml and make the following changes:
  global:
    scrape_interval: 1s
    evaluation_interval: 1s
  scrape_configs:
    - job_name: aphrodite-engine
      metrics_path: /metrics
+     scheme: https
      static_configs:
        - targets:
-           - 'host.docker.internal:2242'
+           - 'your_cloudflare_url.trycloudflare.com'
Replace the URL appropriately, then launch the container with docker compose up.
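Note that the targets entry takes a bare host (optionally with a port), while the protocol is set separately through the scheme key. cloudflared prints a full https:// URL, so the prefix and any trailing slash have to be dropped before pasting. The helper below is a small sketch of that conversion; scrape_target is a hypothetical name, not part of Prometheus or Aphrodite.

```python
from urllib.parse import urlsplit


def scrape_target(tunnel_url: str) -> str:
    """Turn a tunnel URL into a host[:port] string for static_configs targets."""
    # urlsplit only fills in netloc when a scheme (or a leading //) is present.
    parts = urlsplit(tunnel_url if "://" in tunnel_url else f"//{tunnel_url}")
    if not parts.netloc:
        raise ValueError(f"no host found in {tunnel_url!r}")
    return parts.netloc


print(scrape_target("https://your_cloudflare_url.trycloudflare.com/"))
```

The printed value is what goes under targets; the https part is already covered by the scheme: https line in the config.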