[Bug]: example/openai_chat_completion_client_with_tools.py not working #11903
Comments
You should use the tool calling chat template in …
You're right, thanks! Still the same Internal Server Error 500, though.
`INFO 01-09 08:49:54 api_server.py:640] Using supplied chat template: ...`
`DEBUG 01-09 08:50:54 client.py:165] Heartbeat successful.`
Do you get any error logs when the internal server error occurs? If not, try passing …
Sadly not:
DEBUG 01-10 00:24:24 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
I don't know whether the uvicorn logging would tell us anything useful, and I don't know how to get that log output. Do you have a server with a GPU running Docker to reproduce this? If you have a HuggingFace account (and token), you should be able to reproduce it with the docker command mentioned above and the official example script.
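For context, the request below is roughly what the official example script sends (a minimal sketch only: the base URL assumes the docker command above is running on localhost, and the `get_current_weather` tool schema and prompt are illustrative, not copied from the example):

```python
# Minimal sketch of a tool-calling request against the vLLM OpenAI server.
# Assumptions: server reachable at http://localhost:8000/v1, served model name
# "llama3.1-8b-spaetzle-v90"; the tool definition is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather like in Dallas?"}]

# This is the kind of call that comes back as 500 Internal Server Error.
response = client.chat.completions.create(
    model="llama3.1-8b-spaetzle-v90",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message)
```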
Your current environment
The output of `python collect_env.py`
Model Input Dumps
vLLM:
$ docker container stop vllm; docker container rm vllm; docker run --name vllm --runtime nvidia -e "VLLM_LOGGING_LEVEL=DEBUG" -e "NVIDIA_VISIBLE_DEVICES=GPU-8cba8394-b5d6-1e92-6658-bb6efc08abff,GPU-c05c3905-fdd9-34a3-f6c0-1437beb91c7d" -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxx" --ipc=host -p 8000:8000 vllm/vllm-openai --gpu-memory-utilization 0.95 --model cstr/llama3.1-8b-spaetzle-v90 --served-model-name llama3.1-8b-spaetzle-v90 --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser hermes
INFO 01-09 07:32:00 api_server.py:712] vLLM API server version 0.6.6.post1
INFO 01-09 07:32:00 api_server.py:713] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=True, tool_call_parser='hermes', tool_parser_plugin='', model='cstr/llama3.1-8b-spaetzle-v90', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama3.1-8b-spaetzle-v90'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
DEBUG 01-09 07:32:00 __init__.py:60] No plugins found.
DEBUG 01-09 07:32:00 api_server.py:180] Multiprocessing frontend to use ipc:///tmp/29b13bff-e5f0-4028-8a35-f5fce0df81c1 for IPC Path.
INFO 01-09 07:32:00 api_server.py:199] Started engine process with PID 76
DEBUG 01-09 07:32:08 __init__.py:60] No plugins found.
INFO 01-09 07:32:12 config.py:510] This model supports multiple tasks: {'embed', 'reward', 'generate', 'score', 'classify'}. Defaulting to 'generate'.
INFO 01-09 07:32:12 config.py:1310] Defaulting to use mp for distributed inference
WARNING 01-09 07:32:12 arg_utils.py:1103] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 01-09 07:32:12 config.py:1458] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 01-09 07:32:18 config.py:510] This model supports multiple tasks: {'generate', 'reward', 'embed', 'classify', 'score'}. Defaulting to 'generate'.
INFO 01-09 07:32:18 config.py:1310] Defaulting to use mp for distributed inference
WARNING 01-09 07:32:18 arg_utils.py:1103] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 01-09 07:32:18 config.py:1458] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 01-09 07:32:18 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='cstr/llama3.1-8b-spaetzle-v90', speculative_config=None, tokenizer='cstr/llama3.1-8b-spaetzle-v90', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=llama3.1-8b-spaetzle-v90, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
WARNING 01-09 07:32:19 multiproc_worker_utils.py:312] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 01-09 07:32:19 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=348) INFO 01-09 07:32:21 selector.py:120] Using Flash Attention backend.
(VllmWorkerProcess pid=348) INFO 01-09 07:32:21 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
INFO 01-09 07:32:21 selector.py:120] Using Flash Attention backend.
DEBUG 01-09 07:32:22 parallel_state.py:959] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:38131 backend=nccl
(VllmWorkerProcess pid=348) DEBUG 01-09 07:32:22 parallel_state.py:959] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:38131 backend=nccl
INFO 01-09 07:32:22 utils.py:918] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=348) INFO 01-09 07:32:22 utils.py:918] Found nccl from library libnccl.so.2
INFO 01-09 07:32:22 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=348) INFO 01-09 07:32:22 pynccl.py:69] vLLM is using nccl==2.21.5
DEBUG 01-09 07:32:23 client.py:186] Waiting for output from MQLLMEngine.
INFO 01-09 07:32:23 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
DEBUG 01-09 07:32:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:32:33 client.py:186] Waiting for output from MQLLMEngine.
INFO 01-09 07:32:39 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=348) INFO 01-09 07:32:39 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
DEBUG 01-09 07:32:40 shm_broadcast.py:215] Binding to tcp://127.0.0.1:49403
INFO 01-09 07:32:40 shm_broadcast.py:255] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_bd2e25f8'), local_subscribe_port=49403, remote_subscribe_port=None)
(VllmWorkerProcess pid=348) DEBUG 01-09 07:32:40 shm_broadcast.py:279] Connecting to tcp://127.0.0.1:49403
INFO 01-09 07:32:40 model_runner.py:1094] Starting to load model cstr/llama3.1-8b-spaetzle-v90...
(VllmWorkerProcess pid=348) INFO 01-09 07:32:40 model_runner.py:1094] Starting to load model cstr/llama3.1-8b-spaetzle-v90...
DEBUG 01-09 07:32:40 decorators.py:105] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(VllmWorkerProcess pid=348) DEBUG 01-09 07:32:40 decorators.py:105] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 01-09 07:32:40 config.py:3285] enabled custom ops: Counter({'rms_norm': 65, 'silu_and_mul': 32, 'rotary_embedding': 1})
DEBUG 01-09 07:32:40 config.py:3287] disabled custom ops: Counter()
(VllmWorkerProcess pid=348) DEBUG 01-09 07:32:40 config.py:3285] enabled custom ops: Counter({'rms_norm': 65, 'silu_and_mul': 32, 'rotary_embedding': 1})
(VllmWorkerProcess pid=348) DEBUG 01-09 07:32:40 config.py:3287] disabled custom ops: Counter()
INFO 01-09 07:32:40 weight_utils.py:251] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/17 [00:00<?, ?it/s]
(VllmWorkerProcess pid=348) INFO 01-09 07:32:40 weight_utils.py:251] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 6% Completed | 1/17 [00:00<00:05, 2.77it/s]
Loading safetensors checkpoint shards: 12% Completed | 2/17 [00:00<00:05, 2.64it/s]
Loading safetensors checkpoint shards: 18% Completed | 3/17 [00:01<00:05, 2.41it/s]
Loading safetensors checkpoint shards: 24% Completed | 4/17 [00:01<00:05, 2.34it/s]
Loading safetensors checkpoint shards: 29% Completed | 5/17 [00:02<00:05, 2.33it/s]
Loading safetensors checkpoint shards: 35% Completed | 6/17 [00:02<00:04, 2.44it/s]
DEBUG 01-09 07:32:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:32:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:32:43 client.py:186] Waiting for output from MQLLMEngine.
Loading safetensors checkpoint shards: 41% Completed | 7/17 [00:02<00:04, 2.45it/s]
Loading safetensors checkpoint shards: 47% Completed | 8/17 [00:03<00:03, 2.81it/s]
DEBUG 01-09 07:32:44 utils.py:156] Loaded weight lm_head.weight with shape torch.Size([64128, 4096])
Loading safetensors checkpoint shards: 53% Completed | 9/17 [00:03<00:02, 3.18it/s]
(VllmWorkerProcess pid=348) DEBUG 01-09 07:32:44 utils.py:156] Loaded weight lm_head.weight with shape torch.Size([64128, 4096])
Loading safetensors checkpoint shards: 59% Completed | 10/17 [00:03<00:02, 3.04it/s]
Loading safetensors checkpoint shards: 65% Completed | 11/17 [00:04<00:02, 2.88it/s]
Loading safetensors checkpoint shards: 71% Completed | 12/17 [00:04<00:01, 2.81it/s]
Loading safetensors checkpoint shards: 76% Completed | 13/17 [00:04<00:01, 2.59it/s]
Loading safetensors checkpoint shards: 88% Completed | 15/17 [00:05<00:00, 3.22it/s]
Loading safetensors checkpoint shards: 94% Completed | 16/17 [00:05<00:00, 2.94it/s]
Loading safetensors checkpoint shards: 100% Completed | 17/17 [00:06<00:00, 2.79it/s]
Loading safetensors checkpoint shards: 100% Completed | 17/17 [00:06<00:00, 2.74it/s]
INFO 01-09 07:32:47 model_runner.py:1099] Loading model weights took 7.5122 GB
(VllmWorkerProcess pid=348) INFO 01-09 07:32:48 model_runner.py:1099] Loading model weights took 7.5122 GB
DEBUG 01-09 07:32:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:32:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:32:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:32:53 client.py:186] Waiting for output from MQLLMEngine.
INFO 01-09 07:32:55 worker.py:241] Memory profiling takes 6.84 seconds
INFO 01-09 07:32:55 worker.py:241] the current vLLM instance can use total_gpu_memory (47.54GiB) x gpu_memory_utilization (0.95) = 45.16GiB
INFO 01-09 07:32:55 worker.py:241] model weights take 7.51GiB; non_torch_memory takes 0.68GiB; PyTorch activation peak memory takes 1.19GiB; the rest of the memory reserved for KV Cache is 35.78GiB.
(VllmWorkerProcess pid=348) INFO 01-09 07:32:55 worker.py:241] Memory profiling takes 6.95 seconds
(VllmWorkerProcess pid=348) INFO 01-09 07:32:55 worker.py:241] the current vLLM instance can use total_gpu_memory (47.54GiB) x gpu_memory_utilization (0.95) = 45.16GiB
(VllmWorkerProcess pid=348) INFO 01-09 07:32:55 worker.py:241] model weights take 7.51GiB; non_torch_memory takes 0.64GiB; PyTorch activation peak memory takes 0.24GiB; the rest of the memory reserved for KV Cache is 36.76GiB.
INFO 01-09 07:32:55 distributed_gpu_executor.py:57] # GPU blocks: 36643, # CPU blocks: 4096
INFO 01-09 07:32:55 distributed_gpu_executor.py:61] Maximum concurrency for 131072 tokens per request: 4.47x
INFO 01-09 07:32:59 model_runner.py:1415] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 3%|▎ | 1/35 [00:00<00:26, 1.30it/s]
(VllmWorkerProcess pid=348) INFO 01-09 07:33:00 model_runner.py:1415] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 14%|█▍ | 5/35 [00:04<00:25, 1.18it/s]
DEBUG 01-09 07:33:03 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:03 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:03 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:03 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:03 client.py:186] Waiting for output from MQLLMEngine.
Capturing CUDA graph shapes: 49%|████▊ | 17/35 [00:14<00:15, 1.19it/s]
DEBUG 01-09 07:33:13 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:13 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:13 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:13 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:13 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:13 client.py:186] Waiting for output from MQLLMEngine.
Capturing CUDA graph shapes: 80%|████████ | 28/35 [00:23<00:05, 1.19it/s]
DEBUG 01-09 07:33:23 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:23 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:23 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:23 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:23 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:23 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:23 client.py:186] Waiting for output from MQLLMEngine.
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:30<00:00, 1.15it/s]
INFO 01-09 07:33:29 custom_all_reduce.py:224] Registering 2275 cuda graph addresses
DEBUG 01-09 07:33:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:43 client.py:186] Waiting for output from MQLLMEngine.
(VllmWorkerProcess pid=348) INFO 01-09 07:33:45 custom_all_reduce.py:224] Registering 2275 cuda graph addresses
(VllmWorkerProcess pid=348) INFO 01-09 07:33:45 model_runner.py:1535] Graph capturing finished in 45 secs, took 0.98 GiB
INFO 01-09 07:33:45 model_runner.py:1535] Graph capturing finished in 46 secs, took 0.98 GiB
INFO 01-09 07:33:45 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 57.23 seconds
DEBUG 01-09 07:33:45 engine.py:130] Starting Startup Loop.
DEBUG 01-09 07:33:45 engine.py:132] Starting Engine Loop.
DEBUG 01-09 07:33:46 api_server.py:262] vLLM to use /tmp/tmpc23ajfua as PROMETHEUS_MULTIPROC_DIR
INFO 01-09 07:33:46 api_server.py:640] Using supplied chat template:
INFO 01-09 07:33:46 api_server.py:640] None
INFO 01-09 07:33:46 serving_chat.py:73] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
INFO 01-09 07:33:46 launcher.py:19] Available routes are:
INFO 01-09 07:33:46 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 01-09 07:33:46 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 01-09 07:33:46 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 01-09 07:33:46 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 01-09 07:33:46 launcher.py:27] Route: /health, Methods: GET
INFO 01-09 07:33:46 launcher.py:27] Route: /tokenize, Methods: POST
INFO 01-09 07:33:46 launcher.py:27] Route: /detokenize, Methods: POST
INFO 01-09 07:33:46 launcher.py:27] Route: /v1/models, Methods: GET
INFO 01-09 07:33:46 launcher.py:27] Route: /version, Methods: GET
INFO 01-09 07:33:46 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 01-09 07:33:46 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 01-09 07:33:46 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 01-09 07:33:46 launcher.py:27] Route: /pooling, Methods: POST
INFO 01-09 07:33:46 launcher.py:27] Route: /score, Methods: POST
INFO 01-09 07:33:46 launcher.py:27] Route: /v1/score, Methods: POST
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:55 client.py:165] Heartbeat successful.
DEBUG 01-09 07:33:55 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
DEBUG 01-09 07:33:56 engine.py:190] Waiting for new requests in engine loop.
INFO: 192.168.20.118:57735 - "GET /v1/models HTTP/1.1" 200 OK
INFO 01-09 07:35:30 chat_utils.py:333] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO: 192.168.20.118:57735 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO: 192.168.20.118:57735 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO: 192.168.20.118:57735 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:36 client.py:165] Heartbeat successful.
Python script:
$ python vLLM_OpenAI_Compatible_Tool.py
Traceback (most recent call last):
File "C:\Users\Username\Documents\Coding\AI\vLLM_OpenAI_Compatible_Tool.py", line 63, in
chat_completion = client.chat.completions.create(messages=messages,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\openai_utils_utils.py", line 274, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\openai\resources\chat\completions.py", line 742, in create
return self._post(
^^^^^^^^^^^
File "C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\openai_base_client.py", line 1277, in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\openai_base_client.py", line 954, in request
return self._request(
^^^^^^^^^^^^^^
File "C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\openai_base_client.py", line 1043, in _request
return self._retry_request(
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\openai_base_client.py", line 1092, in _retry_request
return self._request(
^^^^^^^^^^^^^^
File "C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\openai_base_client.py", line 1043, in _request
return self._retry_request(
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\openai_base_client.py", line 1092, in _retry_request
return self._request(
^^^^^^^^^^^^^^
File "C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\openai_base_client.py", line 1058, in _request
raise self._make_status_error_from_response(err.response) from None
openai.InternalServerError: Error code: 500
🐛 Describe the bug
When I run the vLLM Docker container, I can't get tool calling to work. Not even the official example from this repo (example/openai_chat_completion_client_with_tools.py) works, so I'm sure there must be an issue. The normal chat completions endpoint, with streaming and without tools, works fine.
Reproduce with the following command:
docker container stop vllm; docker container rm vllm; docker run --name vllm --runtime nvidia -e "NVIDIA_VISIBLE_DEVICES=GPU-XXXXXXXXXXXXXXXXXXXXXXXXXX,GPU-XXXXXXXXXXXXXXXXXXXXXXXXXX" -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxx" --ipc=host -p 8000:8000 vllm/vllm-openai --gpu-memory-utilization 0.95 --model cstr/llama3.1-8b-spaetzle-v90 --tensor-parallel-size 2 --served-model-name llama3.1-8b-spaetzle-v90 --enable-auto-tool-choice --tool-call-parser hermes
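For contrast, a plain request without tools against the same server succeeds, for example this streaming sketch (the base URL and model name mirror the command above and are assumptions about the local setup):

```python
# Sketch of a non-tool streaming request that works against the same server,
# in contrast to the tool-calling request, which returns a 500.
# Assumptions: server at http://localhost:8000/v1, model "llama3.1-8b-spaetzle-v90".
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="llama3.1-8b-spaetzle-v90",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    stream=True,
)
for chunk in stream:
    # Print each streamed delta as it arrives.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```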
Before submitting a new issue...