From 1953248ae15601c1010ddd53f97a71964c3b0d28 Mon Sep 17 00:00:00 2001 From: DarkLight1337 Date: Tue, 31 Dec 2024 04:24:41 +0000 Subject: [PATCH 1/4] Reorganize Getting Started section Signed-off-by: DarkLight1337 --- docs/source/design/arch_overview.md | 3 +-- docs/source/design/multiprocessing.md | 2 +- docs/source/{usage => getting_started}/faq.md | 0 .../cpu-arm.md} | 0 .../cpu-x86.md} | 6 +++--- .../gpu-cuda.md} | 4 ++-- .../gpu-rocm.md} | 0 .../hpu-gaudi.md} | 2 ++ .../getting_started/installation/index.md | 17 +++++++++++++++++ .../neuron.md} | 0 .../openvino.md} | 2 +- .../tpu.md} | 0 .../xpu.md} | 0 docs/source/getting_started/quickstart.md | 2 +- .../{debugging.md => troubleshooting.md} | 11 ++++++----- docs/source/index.md | 14 +++----------- docs/source/models/generative_models.md | 2 +- docs/source/models/pooling_models.md | 2 +- docs/source/serving/distributed_serving.md | 2 +- docs/source/usage/spec_decode.md | 4 ++-- docs/source/usage/structured_outputs.md | 2 +- vllm/utils.py | 2 +- 22 files changed, 44 insertions(+), 33 deletions(-) rename docs/source/{usage => getting_started}/faq.md (100%) rename docs/source/getting_started/{arm-installation.md => installation/cpu-arm.md} (100%) rename docs/source/getting_started/{cpu-installation.md => installation/cpu-x86.md} (95%) rename docs/source/getting_started/{installation.md => installation/gpu-cuda.md} (99%) rename docs/source/getting_started/{amd-installation.md => installation/gpu-rocm.md} (100%) rename docs/source/getting_started/{gaudi-installation.md => installation/hpu-gaudi.md} (99%) create mode 100644 docs/source/getting_started/installation/index.md rename docs/source/getting_started/{neuron-installation.md => installation/neuron.md} (100%) rename docs/source/getting_started/{openvino-installation.md => installation/openvino.md} (91%) rename docs/source/getting_started/{tpu-installation.md => installation/tpu.md} (100%) rename docs/source/getting_started/{xpu-installation.md => installation/xpu.md} (100%) rename docs/source/getting_started/{debugging.md => troubleshooting.md} (94%) diff --git a/docs/source/design/arch_overview.md b/docs/source/design/arch_overview.md index 475a3e5fa9ddc..2f1280c047672 100644 --- a/docs/source/design/arch_overview.md +++ b/docs/source/design/arch_overview.md @@ -77,8 +77,7 @@ python -m vllm.entrypoints.openai.api_server --model That code can be found in . -More details on the API server can be found in the {doc}`OpenAI Compatible -Server ` document. +More details on the API server can be found in the [OpenAI-Compatible Server](#openai-compatible-server) document. ## LLM Engine diff --git a/docs/source/design/multiprocessing.md b/docs/source/design/multiprocessing.md index 34564413b34f6..da87638e5b743 100644 --- a/docs/source/design/multiprocessing.md +++ b/docs/source/design/multiprocessing.md @@ -2,7 +2,7 @@ ## Debugging -Please see the [Debugging Tips](#debugging-python-multiprocessing) +Please see the [Troubleshooting](#troubleshooting-python-multiprocessing) page for information on known issues and how to solve them. 
## Introduction diff --git a/docs/source/usage/faq.md b/docs/source/getting_started/faq.md similarity index 100% rename from docs/source/usage/faq.md rename to docs/source/getting_started/faq.md diff --git a/docs/source/getting_started/arm-installation.md b/docs/source/getting_started/installation/cpu-arm.md similarity index 100% rename from docs/source/getting_started/arm-installation.md rename to docs/source/getting_started/installation/cpu-arm.md diff --git a/docs/source/getting_started/cpu-installation.md b/docs/source/getting_started/installation/cpu-x86.md similarity index 95% rename from docs/source/getting_started/cpu-installation.md rename to docs/source/getting_started/installation/cpu-x86.md index c3d3f715ed804..e5574ee33c56a 100644 --- a/docs/source/getting_started/cpu-installation.md +++ b/docs/source/getting_started/installation/cpu-x86.md @@ -1,6 +1,6 @@ -(installation-cpu)= +(installation-x86)= -# Installation with CPU +# Installation with x86 CPU vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16. vLLM CPU backend supports the following vLLM features: @@ -151,4 +151,4 @@ $ python examples/offline_inference.py $ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp ``` - - Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like [Nginx](../serving/deploying_with_nginx.md) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md). + - Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like [Nginx](#nginxloadbalancer) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md). diff --git a/docs/source/getting_started/installation.md b/docs/source/getting_started/installation/gpu-cuda.md similarity index 99% rename from docs/source/getting_started/installation.md rename to docs/source/getting_started/installation/gpu-cuda.md index 996fb346f43d4..9e65fb9cb0cd9 100644 --- a/docs/source/getting_started/installation.md +++ b/docs/source/getting_started/installation/gpu-cuda.md @@ -1,6 +1,6 @@ -(installation)= +(installation-cuda)= -# Installation +# Installation with CUDA vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries. 
diff --git a/docs/source/getting_started/amd-installation.md b/docs/source/getting_started/installation/gpu-rocm.md similarity index 100% rename from docs/source/getting_started/amd-installation.md rename to docs/source/getting_started/installation/gpu-rocm.md diff --git a/docs/source/getting_started/gaudi-installation.md b/docs/source/getting_started/installation/hpu-gaudi.md similarity index 99% rename from docs/source/getting_started/gaudi-installation.md rename to docs/source/getting_started/installation/hpu-gaudi.md index 1f2ee62860dec..37d1621170152 100644 --- a/docs/source/getting_started/gaudi-installation.md +++ b/docs/source/getting_started/installation/hpu-gaudi.md @@ -1,3 +1,5 @@ +(installation-gaudi)= + # Installation with Intel® Gaudi® AI Accelerators This README provides instructions on running vLLM with Intel Gaudi devices. diff --git a/docs/source/getting_started/installation/index.md b/docs/source/getting_started/installation/index.md new file mode 100644 index 0000000000000..760a9594a971a --- /dev/null +++ b/docs/source/getting_started/installation/index.md @@ -0,0 +1,17 @@ +(installation-index)= + +# Installation + +```{toctree} +:maxdepth: 1 + +gpu-cuda +gpu-rocm +cpu-x86 +cpu-arm +hpu-gaudi +tpu +xpu +openvino +neuron +``` diff --git a/docs/source/getting_started/neuron-installation.md b/docs/source/getting_started/installation/neuron.md similarity index 100% rename from docs/source/getting_started/neuron-installation.md rename to docs/source/getting_started/installation/neuron.md diff --git a/docs/source/getting_started/openvino-installation.md b/docs/source/getting_started/installation/openvino.md similarity index 91% rename from docs/source/getting_started/openvino-installation.md rename to docs/source/getting_started/installation/openvino.md index 8b43c0a90447f..687cfc98f0d6c 100644 --- a/docs/source/getting_started/openvino-installation.md +++ b/docs/source/getting_started/installation/openvino.md @@ -2,7 +2,7 @@ # Installation with OpenVINO -vLLM powered by OpenVINO supports all LLM models from {doc}`vLLM supported models list <../models/supported_models>` and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support, as well as on both integrated and discrete Intel® GPUs ([the list of supported GPUs](https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html#gpu)). OpenVINO vLLM backend supports the following advanced vLLM features: +vLLM powered by OpenVINO supports all LLM models from [vLLM supported models list](#supported-models) and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support, as well as on both integrated and discrete Intel® GPUs ([the list of supported GPUs](https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html#gpu)). 
OpenVINO vLLM backend supports the following advanced vLLM features: - Prefix caching (`--enable-prefix-caching`) - Chunked prefill (`--enable-chunked-prefill`) diff --git a/docs/source/getting_started/tpu-installation.md b/docs/source/getting_started/installation/tpu.md similarity index 100% rename from docs/source/getting_started/tpu-installation.md rename to docs/source/getting_started/installation/tpu.md diff --git a/docs/source/getting_started/xpu-installation.md b/docs/source/getting_started/installation/xpu.md similarity index 100% rename from docs/source/getting_started/xpu-installation.md rename to docs/source/getting_started/installation/xpu.md diff --git a/docs/source/getting_started/quickstart.md b/docs/source/getting_started/quickstart.md index 9c8b7e4f592c9..ff216f8af30f9 100644 --- a/docs/source/getting_started/quickstart.md +++ b/docs/source/getting_started/quickstart.md @@ -23,7 +23,7 @@ $ conda activate myenv $ pip install vllm ``` -Please refer to the {ref}`installation documentation ` for more details on installing vLLM. +Please refer to the [installation documentation](#installation-index) for more details on installing vLLM. (offline-batched-inference)= diff --git a/docs/source/getting_started/debugging.md b/docs/source/getting_started/troubleshooting.md similarity index 94% rename from docs/source/getting_started/debugging.md rename to docs/source/getting_started/troubleshooting.md index 19eb699572a08..5a0310da0f2cb 100644 --- a/docs/source/getting_started/debugging.md +++ b/docs/source/getting_started/troubleshooting.md @@ -1,8 +1,8 @@ -(debugging)= +(troubleshooting)= -# Debugging Tips +# Troubleshooting -This document outlines some debugging strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible. +This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible. ```{note} Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated. @@ -47,6 +47,7 @@ You might also need to set `export NCCL_SOCKET_IFNAME=` If vLLM crashes and the error trace captures it somewhere around `self.graph.replay()` in `vllm/worker/model_runner.py`, it is a CUDA error inside CUDAGraph. To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the {class}`~vllm.LLM` class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error. +(troubleshooting-incorrect-hardware-driver)= ## Incorrect hardware/driver If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly. 
@@ -139,7 +140,7 @@ A multi-node environment is more complicated than a single-node one. If you see Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes. ``` -(debugging-python-multiprocessing)= +(troubleshooting-python-multiprocessing)= ## Python multiprocessing ### `RuntimeError` Exception @@ -150,7 +151,7 @@ If you have seen a warning in your logs like this: WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously initialized. We must use the `spawn` multiprocessing start method. Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See - https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing + https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing for more information. ``` diff --git a/docs/source/index.md b/docs/source/index.md index 34f9c4caebe6f..12a985d48022e 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -58,18 +58,11 @@ For more information, check out the following: :caption: Getting Started :maxdepth: 1 -getting_started/installation -getting_started/amd-installation -getting_started/openvino-installation -getting_started/cpu-installation -getting_started/gaudi-installation -getting_started/arm-installation -getting_started/neuron-installation -getting_started/tpu-installation -getting_started/xpu-installation +getting_started/installation/index getting_started/quickstart -getting_started/debugging +getting_started/troubleshooting getting_started/examples/examples_index +getting_started/faq ``` ```{toctree} @@ -110,7 +103,6 @@ usage/structured_outputs usage/spec_decode usage/compatibility_matrix usage/performance -usage/faq usage/engine_args usage/env_vars usage/usage_stats diff --git a/docs/source/models/generative_models.md b/docs/source/models/generative_models.md index 35e0302b86619..383299d61b5dd 100644 --- a/docs/source/models/generative_models.md +++ b/docs/source/models/generative_models.md @@ -120,7 +120,7 @@ outputs = llm.chat(conversation, chat_template=custom_template) ## Online Inference -Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs: +Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs: - [Completions API](#completions-api) is similar to `LLM.generate` but only accepts text. - [Chat API](#chat-api) is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs) for models with a chat template. diff --git a/docs/source/models/pooling_models.md b/docs/source/models/pooling_models.md index 76c96c9edcc5d..12ded68eb30b5 100644 --- a/docs/source/models/pooling_models.md +++ b/docs/source/models/pooling_models.md @@ -106,7 +106,7 @@ A code example can be found here: for more information. +After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script](#troubleshooting-incorrect-hardware-driver) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. 
Setting environment variables when you create the cluster is the recommended way. See  for more information.
```

```{warning}
diff --git a/docs/source/usage/spec_decode.md b/docs/source/usage/spec_decode.md
index 8302da81b6173..8c52c97a41e48 100644
--- a/docs/source/usage/spec_decode.md
+++ b/docs/source/usage/spec_decode.md
@@ -182,7 +182,7 @@ speculative decoding, breaking down the guarantees into three key areas:
 3. **vLLM Logprob Stability**
    \- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the same
    request across runs. For more details, see the FAQ section
-   titled *Can the output of a prompt vary across runs in vLLM?* in the {ref}`FAQs `.
+   titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).

 **Conclusion**

@@ -195,7 +195,7 @@ can occur due to following factors:

 **Mitigation Strategies**

-For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the {ref}`FAQs `.
+For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).

 ## Resources for vLLM contributors

diff --git a/docs/source/usage/structured_outputs.md b/docs/source/usage/structured_outputs.md
index 7292012e36a26..26c09bb0d8a0c 100644
--- a/docs/source/usage/structured_outputs.md
+++ b/docs/source/usage/structured_outputs.md
@@ -18,7 +18,7 @@ The following parameters are supported, which must be added as extra parameters:
 - `guided_whitespace_pattern`: used to override the default whitespace pattern for guided json decoding.
 - `guided_decoding_backend`: used to select the guided decoding backend to use.

-You can see the complete list of supported parameters on the [OpenAI Compatible Server](../serving/openai_compatible_server.md) page.
+You can see the complete list of supported parameters on the [OpenAI-Compatible Server](#openai-compatible-server) page.

 Now let´s see an example for each of the cases, starting with the `guided_choice`, as it´s the easiest one:

diff --git a/vllm/utils.py b/vllm/utils.py
index 8ef07d2c326a3..aadeddabf8b55 100644
--- a/vllm/utils.py
+++ b/vllm/utils.py
@@ -1938,7 +1938,7 @@ def _check_multiproc_method():
             "the `spawn` multiprocessing start method. Setting "
             "VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. 
" "See https://docs.vllm.ai/en/latest/getting_started/" - "debugging.html#python-multiprocessing " + "troubleshooting.html#python-multiprocessing " "for more information.") os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn" From 407f5b28c7866ad13c18697c64b16d5b9a1d22d5 Mon Sep 17 00:00:00 2001 From: DarkLight1337 Date: Tue, 31 Dec 2024 05:45:09 +0000 Subject: [PATCH 2/4] Update headers Signed-off-by: DarkLight1337 --- docs/source/getting_started/installation/cpu-arm.md | 2 +- docs/source/getting_started/installation/cpu-x86.md | 2 +- docs/source/getting_started/installation/gpu-cuda.md | 2 +- docs/source/getting_started/installation/gpu-rocm.md | 2 +- docs/source/getting_started/installation/hpu-gaudi.md | 2 +- docs/source/getting_started/installation/index.md | 2 ++ docs/source/getting_started/installation/neuron.md | 2 +- docs/source/getting_started/installation/openvino.md | 2 +- docs/source/getting_started/installation/tpu.md | 2 +- docs/source/getting_started/installation/xpu.md | 2 +- 10 files changed, 11 insertions(+), 9 deletions(-) diff --git a/docs/source/getting_started/installation/cpu-arm.md b/docs/source/getting_started/installation/cpu-arm.md index 799b597b3ad5d..a46e2c010600d 100644 --- a/docs/source/getting_started/installation/cpu-arm.md +++ b/docs/source/getting_started/installation/cpu-arm.md @@ -2,7 +2,7 @@ # Installation for ARM CPUs -vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform. This guide provides installation instructions specific to ARM. For additional details on supported features, refer to the x86 platform documentation covering: +vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform. This guide provides installation instructions specific to ARM. For additional details on supported features, refer to the [x86 CPU documentation](#installation-x86) covering: - CPU backend inference capabilities - Relevant runtime environment variables diff --git a/docs/source/getting_started/installation/cpu-x86.md b/docs/source/getting_started/installation/cpu-x86.md index e5574ee33c56a..bbb2d1872ef39 100644 --- a/docs/source/getting_started/installation/cpu-x86.md +++ b/docs/source/getting_started/installation/cpu-x86.md @@ -1,6 +1,6 @@ (installation-x86)= -# Installation with x86 CPU +# Installation for x86 CPUs vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16. vLLM CPU backend supports the following vLLM features: diff --git a/docs/source/getting_started/installation/gpu-cuda.md b/docs/source/getting_started/installation/gpu-cuda.md index 9e65fb9cb0cd9..7ea10bb8b59ff 100644 --- a/docs/source/getting_started/installation/gpu-cuda.md +++ b/docs/source/getting_started/installation/gpu-cuda.md @@ -1,6 +1,6 @@ (installation-cuda)= -# Installation with CUDA +# Installation for CUDA vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries. diff --git a/docs/source/getting_started/installation/gpu-rocm.md b/docs/source/getting_started/installation/gpu-rocm.md index 6d01efbbf8828..796911d7305a6 100644 --- a/docs/source/getting_started/installation/gpu-rocm.md +++ b/docs/source/getting_started/installation/gpu-rocm.md @@ -1,6 +1,6 @@ (installation-rocm)= -# Installation with ROCm +# Installation for ROCm vLLM supports AMD GPUs with ROCm 6.2. 
diff --git a/docs/source/getting_started/installation/hpu-gaudi.md b/docs/source/getting_started/installation/hpu-gaudi.md index 37d1621170152..94de169f51a73 100644 --- a/docs/source/getting_started/installation/hpu-gaudi.md +++ b/docs/source/getting_started/installation/hpu-gaudi.md @@ -1,6 +1,6 @@ (installation-gaudi)= -# Installation with Intel® Gaudi® AI Accelerators +# Installation for Intel® Gaudi® This README provides instructions on running vLLM with Intel Gaudi devices. diff --git a/docs/source/getting_started/installation/index.md b/docs/source/getting_started/installation/index.md index 760a9594a971a..83de1aff409b2 100644 --- a/docs/source/getting_started/installation/index.md +++ b/docs/source/getting_started/installation/index.md @@ -2,6 +2,8 @@ # Installation +vLLM supports the following hardware platforms: + ```{toctree} :maxdepth: 1 diff --git a/docs/source/getting_started/installation/neuron.md b/docs/source/getting_started/installation/neuron.md index baaeeb9f53a10..431f90537f543 100644 --- a/docs/source/getting_started/installation/neuron.md +++ b/docs/source/getting_started/installation/neuron.md @@ -1,6 +1,6 @@ (installation-neuron)= -# Installation with Neuron +# Installation for Neuron vLLM 0.3.3 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK with continuous batching. Paged Attention and Chunked Prefill are currently in development and will be available soon. diff --git a/docs/source/getting_started/installation/openvino.md b/docs/source/getting_started/installation/openvino.md index 687cfc98f0d6c..60f95fd1c4250 100644 --- a/docs/source/getting_started/installation/openvino.md +++ b/docs/source/getting_started/installation/openvino.md @@ -1,6 +1,6 @@ (installation-openvino)= -# Installation with OpenVINO +# Installation for OpenVINO vLLM powered by OpenVINO supports all LLM models from [vLLM supported models list](#supported-models) and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support, as well as on both integrated and discrete Intel® GPUs ([the list of supported GPUs](https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html#gpu)). OpenVINO vLLM backend supports the following advanced vLLM features: diff --git a/docs/source/getting_started/installation/tpu.md b/docs/source/getting_started/installation/tpu.md index 4d3ac541c90ce..bc93c44fead30 100644 --- a/docs/source/getting_started/installation/tpu.md +++ b/docs/source/getting_started/installation/tpu.md @@ -1,6 +1,6 @@ (installation-tpu)= -# Installation with TPU +# Installation for TPUs Tensor Processing Units (TPUs) are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs diff --git a/docs/source/getting_started/installation/xpu.md b/docs/source/getting_started/installation/xpu.md index 9554ae4b7fb44..be4e3b9bd1bc5 100644 --- a/docs/source/getting_started/installation/xpu.md +++ b/docs/source/getting_started/installation/xpu.md @@ -1,6 +1,6 @@ (installation-xpu)= -# Installation with XPU +# Installation for XPUs vLLM initially supports basic model inferencing and serving on Intel GPU platform. 
From b8a7b04d77c2b3b7a30db0374dfdeb282c0cbe4d Mon Sep 17 00:00:00 2001 From: DarkLight1337 Date: Tue, 31 Dec 2024 05:45:29 +0000 Subject: [PATCH 3/4] Reorder navbar Signed-off-by: DarkLight1337 --- docs/source/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/index.md b/docs/source/index.md index 12a985d48022e..aa37fffec89ef 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -60,8 +60,8 @@ For more information, check out the following: getting_started/installation/index getting_started/quickstart -getting_started/troubleshooting getting_started/examples/examples_index +getting_started/troubleshooting getting_started/faq ``` From 7686f35ba538bc8e69a0c5b80c813cdb87a743b8 Mon Sep 17 00:00:00 2001 From: DarkLight1337 Date: Tue, 31 Dec 2024 05:47:30 +0000 Subject: [PATCH 4/4] Update link Signed-off-by: DarkLight1337 --- docs/source/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/index.md b/docs/source/index.md index aa37fffec89ef..f390474978790 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -50,7 +50,7 @@ For more information, check out the following: - [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention) - [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023) - [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al. -- {ref}`vLLM Meetups `. +- [vLLM Meetups](#meetups) ## Documentation