[Usage]: Running Tensor Parallel on TPUs on Ray Cluster #12058

Open · 1 task done
BabyChouSr opened this issue Jan 14, 2025 · 5 comments
Labels: ray (anything related with ray), usage (How to use vllm)

Comments

BabyChouSr commented Jan 14, 2025

Your current environment

The output of `python collect_env.py`:
(test_hf_qwen pid=17527, ip=10.130.4.26) Environment Information:
(test_hf_qwen pid=17527, ip=10.130.4.26) Collecting environment information...
(test_hf_qwen pid=17527, ip=10.130.4.26) PyTorch version: 2.6.0.dev20241126+cpu
(test_hf_qwen pid=17527, ip=10.130.4.26) Is debug build: False
(test_hf_qwen pid=17527, ip=10.130.4.26) CUDA used to build PyTorch: None
(test_hf_qwen pid=17527, ip=10.130.4.26) ROCM used to build PyTorch: N/A
(test_hf_qwen pid=17527, ip=10.130.4.26) 
(test_hf_qwen pid=17527, ip=10.130.4.26) OS: Ubuntu 22.04.4 LTS (x86_64)
(test_hf_qwen pid=17527, ip=10.130.4.26) GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
(test_hf_qwen pid=17527, ip=10.130.4.26) Clang version: 14.0.0-1ubuntu1.1
(test_hf_qwen pid=17527, ip=10.130.4.26) CMake version: version 3.31.2
(test_hf_qwen pid=17527, ip=10.130.4.26) Libc version: glibc-2.35
(test_hf_qwen pid=17527, ip=10.130.4.26) 
(test_hf_qwen pid=17527, ip=10.130.4.26) Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
(test_hf_qwen pid=17527, ip=10.130.4.26) Python platform: Linux-5.19.0-1022-gcp-x86_64-with-glibc2.35
(test_hf_qwen pid=17527, ip=10.130.4.26) Is CUDA available: False
(test_hf_qwen pid=17527, ip=10.130.4.26) CUDA runtime version: No CUDA
(test_hf_qwen pid=17527, ip=10.130.4.26) CUDA_MODULE_LOADING set to: N/A
(test_hf_qwen pid=17527, ip=10.130.4.26) GPU models and configuration: No CUDA
(test_hf_qwen pid=17527, ip=10.130.4.26) Nvidia driver version: No CUDA
(test_hf_qwen pid=17527, ip=10.130.4.26) cuDNN version: No CUDA
(test_hf_qwen pid=17527, ip=10.130.4.26) HIP runtime version: N/A
(test_hf_qwen pid=17527, ip=10.130.4.26) MIOpen runtime version: N/A
(test_hf_qwen pid=17527, ip=10.130.4.26) Is XNNPACK available: True
(test_hf_qwen pid=17527, ip=10.130.4.26) 
(test_hf_qwen pid=17527, ip=10.130.4.26) CPU:
(test_hf_qwen pid=17527, ip=10.130.4.26) Architecture:                    x86_64
(test_hf_qwen pid=17527, ip=10.130.4.26) CPU op-mode(s):                  32-bit, 64-bit
(test_hf_qwen pid=17527, ip=10.130.4.26) Address sizes:                   48 bits physical, 48 bits virtual
(test_hf_qwen pid=17527, ip=10.130.4.26) Byte Order:                      Little Endian
(test_hf_qwen pid=17527, ip=10.130.4.26) CPU(s):                          240
(test_hf_qwen pid=17527, ip=10.130.4.26) On-line CPU(s) list:             0-239
(test_hf_qwen pid=17527, ip=10.130.4.26) Vendor ID:                       AuthenticAMD
(test_hf_qwen pid=17527, ip=10.130.4.26) Model name:                      AMD EPYC 7B12
(test_hf_qwen pid=17527, ip=10.130.4.26) CPU family:                      23
(test_hf_qwen pid=17527, ip=10.130.4.26) Model:                           49
(test_hf_qwen pid=17527, ip=10.130.4.26) Thread(s) per core:              2
(test_hf_qwen pid=17527, ip=10.130.4.26) Core(s) per socket:              60
(test_hf_qwen pid=17527, ip=10.130.4.26) Socket(s):                       2
(test_hf_qwen pid=17527, ip=10.130.4.26) Stepping:                        0
(test_hf_qwen pid=17527, ip=10.130.4.26) BogoMIPS:                        4499.99
(test_hf_qwen pid=17527, ip=10.130.4.26) Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save umip rdpid
(test_hf_qwen pid=17527, ip=10.130.4.26) Hypervisor vendor:               KVM
(test_hf_qwen pid=17527, ip=10.130.4.26) Virtualization type:             full
(test_hf_qwen pid=17527, ip=10.130.4.26) L1d cache:                       3.8 MiB (120 instances)
(test_hf_qwen pid=17527, ip=10.130.4.26) L1i cache:                       3.8 MiB (120 instances)
(test_hf_qwen pid=17527, ip=10.130.4.26) L2 cache:                        60 MiB (120 instances)
(test_hf_qwen pid=17527, ip=10.130.4.26) L3 cache:                        480 MiB (30 instances)
(test_hf_qwen pid=17527, ip=10.130.4.26) NUMA node(s):                    2
(test_hf_qwen pid=17527, ip=10.130.4.26) NUMA node0 CPU(s):               0-59,120-179
(test_hf_qwen pid=17527, ip=10.130.4.26) NUMA node1 CPU(s):               60-119,180-239
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Itlb multihit:     Not affected
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability L1tf:              Not affected
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Mds:               Not affected
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Meltdown:          Not affected
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Mmio stale data:   Not affected
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Retbleed:          Mitigation; untrained return thunk; SMT enabled with STIBP protection
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Srbds:             Not affected
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Tsx async abort:   Not affected
(test_hf_qwen pid=17527, ip=10.130.4.26) 
(test_hf_qwen pid=17527, ip=10.130.4.26) Versions of relevant libraries:
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] mypy-extensions==1.0.0
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] numpy==1.26.4
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-cublas-cu12==12.4.5.8
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-cuda-cupti-cu12==12.4.127
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-cuda-nvrtc-cu12==12.4.127
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-cuda-runtime-cu12==12.4.127
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-cudnn-cu12==9.1.0.70
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-cufft-cu12==11.2.1.3
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-curand-cu12==10.3.5.147
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-cusolver-cu12==11.6.1.9
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-cusparse-cu12==12.3.1.170
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-nccl-cu12==2.21.5
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-nvjitlink-cu12==12.4.127
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-nvtx-cu12==12.4.127
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] pyzmq==26.2.0
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] torch==2.6.0.dev20241126+cpu
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] torch-xla==2.6.0+git39e67b5
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] torchvision==0.20.0.dev20241126+cpu
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] transformers==4.47.1
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] triton==3.1.0
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] numpy                     1.26.4                   pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] pyzmq                     26.2.0                   pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] torch                     2.6.0.dev20241126+cpu          pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] torch-xla                 2.6.0+git39e67b5          pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] torchvision               0.20.0.dev20241126+cpu          pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] transformers              4.47.1                   pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] triton                    3.1.0                    pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) ROCM Version: Could not collect
(test_hf_qwen pid=17527, ip=10.130.4.26) Neuron SDK Version: N/A
(test_hf_qwen pid=17527, ip=10.130.4.26) vLLM Version: N/A (dev)
(test_hf_qwen pid=17527, ip=10.130.4.26) vLLM Build Flags:
(test_hf_qwen pid=17527, ip=10.130.4.26) CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
(test_hf_qwen pid=17527, ip=10.130.4.26) GPU Topology:
(test_hf_qwen pid=17527, ip=10.130.4.26) Could not collect
(test_hf_qwen pid=17527, ip=10.130.4.26) 
(test_hf_qwen pid=17527, ip=10.130.4.26) LD_LIBRARY_PATH=/home/ray/anaconda3/lib/python3.11/site-packages/cv2/../../lib64:/home/ray/anaconda3/lib/python3.11/site-packages/cv2/../../lib64::/usr/lib/x86_64-linux-gnu/:/home/ray/anaconda3/lib
(test_hf_qwen pid=17527, ip=10.130.4.26) OMP_NUM_THREADS=1
(test_hf_qwen pid=17527, ip=10.130.4.26) CUDA_VISIBLE_DEVICES=
(test_hf_qwen pid=17527, ip=10.130.4.26) CUDA_VISIBLE_DEVICES=
(test_hf_qwen pid=17527, ip=10.130.4.26) TORCHINDUCTOR_COMPILE_THREADS=1
(test_hf_qwen pid=17527, ip=10.130.4.26) TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_ray

How would you like to use vllm

I want to run tensor-parallel inference on TPUs in a Ray cluster. The Ray cluster picks up the accelerators we need, but when vLLM then tries to initialize against the Ray cluster it doesn't know about them, so it doesn't reuse the TPUs the cluster has already reserved. I was wondering how people would implement this? Thanks!

Code:

import ray
from vllm import LLM

@ray.remote(resources={"TPU": 4, "TPU-v4-8-head": 1})
def test():
    # Request 4 TPU chips from the Ray cluster, then build a TP=4 engine inside the task
    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enforce_eager=True, max_model_len=8192, tensor_parallel_size=4)

Error:

ValueError: Current node has no TPU available. current_node_resource={'CPU': 118.0, 'memory': 328490448486.0, 'object_store_memory': 32641751449.0, 'accelerator_type:TPU-V4': 1.0, 'node:10.130.1.88': 1.0, 'ray-marin-us-central2-worker-04980fad-tpu': 1.0}. vLLM engine cannot start without TPU. Make sure you have at least 1 TPU available in a node current_node_id='725ec1668b90a28af1ac27bb19e1f13cbf6e6430dbde48340481a93e' current_ip='10.130.1.88'.
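For reference, here is a minimal debugging sketch (using only Ray's public runtime-context APIs; the resource request just mirrors the repro above) that prints what the scheduler actually granted the task and which resource keys it knows about:

import ray

ray.init(address="auto")  # attach to the existing Ray cluster

@ray.remote(resources={"TPU": 4, "TPU-v4-8-head": 1})
def show_task_view():
    ctx = ray.get_runtime_context()
    # Chip IDs Ray assigned to this task under the 'TPU' resource key
    print("Accelerator IDs:", ctx.get_accelerator_ids())
    # Resource totals the scheduler knows about; check whether a 'TPU' key is present
    print("Cluster resources:", ray.cluster_resources())
    print("Available resources:", ray.available_resources())

# If the task stays pending forever, the cluster never registered a 'TPU' resource at all
ray.get(show_task_view.remote())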

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
BabyChouSr added the usage (How to use vllm) label on Jan 14, 2025
robertgshaw2-redhat (Collaborator) commented:

How did you install vLLM?

BabyChouSr (Author) commented Jan 14, 2025

Thanks for your quick reply!

Since #11695 wasn't merged into 0.6.6.post1 yet, I use a bit of a hack to install requirements-tpu.txt manually in my Docker image. Here are the Docker steps:

ARG VLLM_VERSION=0.6.6.post1

# Fetch the vLLM source for the pinned release
RUN sudo apt update && sudo apt install unzip -y
RUN sudo mkdir -p /opt/vllm
RUN sudo chown -R $(whoami) /opt/vllm
RUN cd /opt/vllm && curl -sLO "https://github.com/vllm-project/vllm/archive/refs/tags/v${VLLM_VERSION}.zip" && unzip v${VLLM_VERSION}.zip

WORKDIR /opt/vllm/vllm-${VLLM_VERSION}

# Replace the default torch/torch-xla with TPU-compatible nightlies
RUN pip uninstall torch torch-xla -y
RUN sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev -y
RUN pip install -r requirements-common.txt
RUN pip install "cmake>=3.26" ninja packaging "setuptools-scm>=8" wheel jinja2
RUN pip install --no-cache-dir "torch_xla[tpu]@https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.6.0.dev20241126-cp311-cp311-linux_x86_64.whl" -f https://storage.googleapis.com/libtpu-releases/index.html
RUN pip install torchvision==0.20.0.dev20241126+cpu torch==2.6.0.dev20241126+cpu --extra-index-url https://download.pytorch.org/whl/nightly/cpu
RUN pip install jax==0.4.36.dev20241122 jaxlib==0.4.36.dev20241122 -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html

# Build vLLM against the TPU backend
RUN VLLM_TARGET_DEVICE="tpu" python3 setup.py develop
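A quick sanity check after the build (just a sketch using torch_xla's device API; it has to run on the TPU VM itself, not during the Docker build) to confirm the TPU runtime is reachable from Python:

# Post-build check: confirm the XLA/TPU runtime is reachable from this environment
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                      # raises if no XLA device is available
print("XLA device:", device)
print(torch.ones(2, 2, device=device) + 1)    # run a tiny computation on the TPU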

ruisearch42 added the ray (anything related with ray) label on Jan 15, 2025
ruisearch42 (Collaborator) commented Jan 15, 2025

Looks like Ray recognized 'accelerator_type:TPU-V4', but the 'TPU' resource count was somehow not auto-detected correctly. Maybe try debugging like this: #10155 (comment)
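One way to run that kind of check (a sketch using Ray's public ray.nodes() API; the linked comment may suggest a different approach) is to dump the resources registered on every node:

import ray

ray.init(address="auto")  # attach to the running cluster from any of its nodes

# Print the resources Ray registered on each alive node. A TPU worker that shows
# 'accelerator_type:TPU-V4' but no 'TPU' key is one where chip auto-detection failed.
for node in ray.nodes():
    if node["Alive"]:
        print(node["NodeManagerAddress"], node["Resources"])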

BabyChouSr (Author) commented:

Thanks for the help @ruisearch42, and hope you've been doing well! Here are some extra things that might help us debug. In the Ray remote function itself, I added the following:

print("TPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["TPU"]))
print("TPU_VISIBLE_CHIPS: {}".format(os.environ["TPU_VISIBLE_CHIPS"]))

For the first line, I get 0,1,2,3 as expected. For the second line, I get a KeyError because TPU_VISIBLE_CHIPS is not set as an environment variable.
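A small follow-up sketch of the same check done more defensively (same resource request as above; the variable-name filter is just an assumption), dumping every TPU-related environment variable in case Ray exports the chip assignment under a different name:

import os
import ray

ray.init(address="auto")

@ray.remote(resources={"TPU": 4, "TPU-v4-8-head": 1})
def dump_tpu_env():
    # IDs Ray assigned under the 'TPU' resource key
    print("Accelerator IDs:", ray.get_runtime_context().get_accelerator_ids())
    # Every env var mentioning TPU, since the chip assignment may be exported
    # under a name other than TPU_VISIBLE_CHIPS
    print({k: v for k, v in os.environ.items() if "TPU" in k.upper()})

ray.get(dump_tpu_env.remote())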

BabyChouSr (Author) commented:

Another update: I manually spun up a new v4-8 instance (without Ray). Running vllm serve Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 4 works in that case. So it seems like this fails because the instance is trying to attach to the existing Ray cluster, which somehow is not picking up the TPUs correctly.
