[Bug]: example/openai_chat_completion_client_with_tools.py not working #11903
Comments
You should use the tool calling chat template in …
You're right, thanks! Still the same Internal Server Error 500, though.
`INFO 01-09 08:49:54 api_server.py:640] Using supplied chat template: ...`
`DEBUG 01-09 08:50:54 client.py:165] Heartbeat successful.`
Do you get any error logs when the internal server error occurs? If not, try passing …
Sadly not:
DEBUG 01-10 00:24:24 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
I don't know whether the uvicorn logging would tell us anything useful, and I don't know how to get that log output. Do you have a server with a GPU running Docker to reproduce this? If you have a HuggingFace account (and token), you should be able to reproduce it with the docker command mentioned above and the official example script.
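For context, the request below is roughly what the official example script sends (a minimal sketch only: the base URL assumes the docker command above is running on localhost, and the `get_current_weather` tool schema and prompt are illustrative, not copied from the example):

```python
# Minimal sketch of a tool-calling request against the vLLM OpenAI server.
# Assumptions: server reachable at http://localhost:8000/v1, served model name
# "llama3.1-8b-spaetzle-v90"; the tool definition is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather like in Dallas?"}]

# This is the kind of call that comes back as 500 Internal Server Error.
response = client.chat.completions.create(
    model="llama3.1-8b-spaetzle-v90",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message)
```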
Your current environment
The output of `python collect_env.py`
Model Input Dumps
vLLM:
$ docker container stop vllm; docker container rm vllm; docker run --name vllm --runtime nvidia -e "VLLM_LOGGING_LEVEL=DEBUG" -e "NVIDIA_VISIBLE_DEVICES=GPU-8cba8394-b5d6-1e92-6658-bb6efc08abff,GPU-c05c3905-fdd9-34a3-f6c0-1437beb91c7d" -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxx" --ipc=host -p 8000:8000 vllm/vllm-openai --gpu-memory-utilization 0.95 --model cstr/llama3.1-8b-spaetzle-v90 --served-model-name llama3.1-8b-spaetzle-v90 --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser hermes
INFO 01-09 07:32:00 api_server.py:712] vLLM API server version 0.6.6.post1
INFO 01-09 07:32:00 api_server.py:713] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=True, tool_call_parser='hermes', tool_parser_plugin='', model='cstr/llama3.1-8b-spaetzle-v90', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama3.1-8b-spaetzle-v90'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
DEBUG 01-09 07:32:00 __init__.py:60] No plugins found.
DEBUG 01-09 07:32:00 api_server.py:180] Multiprocessing frontend to use ipc:///tmp/29b13bff-e5f0-4028-8a35-f5fce0df81c1 for IPC Path.
INFO 01-09 07:32:00 api_server.py:199] Started engine process with PID 76
DEBUG 01-09 07:32:08 __init__.py:60] No plugins found.
INFO 01-09 07:32:12 config.py:510] This model supports multiple tasks: {'embed', 'reward', 'generate', 'score', 'classify'}. Defaulting to 'generate'.
INFO 01-09 07:32:12 config.py:1310] Defaulting to use mp for distributed inference
WARNING 01-09 07:32:12 arg_utils.py:1103] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 01-09 07:32:12 config.py:1458] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 01-09 07:32:18 config.py:510] This model supports multiple tasks: {'generate', 'reward', 'embed', 'classify', 'score'}. Defaulting to 'generate'.
INFO 01-09 07:32:18 config.py:1310] Defaulting to use mp for distributed inference
WARNING 01-09 07:32:18 arg_utils.py:1103] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 01-09 07:32:18 config.py:1458] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 01-09 07:32:18 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='cstr/llama3.1-8b-spaetzle-v90', speculative_config=None, tokenizer='cstr/llama3.1-8b-spaetzle-v90', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=llama3.1-8b-spaetzle-v90, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
WARNING 01-09 07:32:19 multiproc_worker_utils.py:312] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 01-09 07:32:19 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=348) INFO 01-09 07:32:21 selector.py:120] Using Flash Attention backend.
(VllmWorkerProcess pid=348) INFO 01-09 07:32:21 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
INFO 01-09 07:32:21 selector.py:120] Using Flash Attention backend.
DEBUG 01-09 07:32:22 parallel_state.py:959] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:38131 backend=nccl
(VllmWorkerProcess pid=348) DEBUG 01-09 07:32:22 parallel_state.py:959] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:38131 backend=nccl
INFO 01-09 07:32:22 utils.py:918] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=348) INFO 01-09 07:32:22 utils.py:918] Found nccl from library libnccl.so.2
INFO 01-09 07:32:22 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=348) INFO 01-09 07:32:22 pynccl.py:69] vLLM is using nccl==2.21.5
DEBUG 01-09 07:32:23 client.py:186] Waiting for output from MQLLMEngine.
INFO 01-09 07:32:23 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
DEBUG 01-09 07:32:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:32:33 client.py:186] Waiting for output from MQLLMEngine.
INFO 01-09 07:32:39 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=348) INFO 01-09 07:32:39 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
DEBUG 01-09 07:32:40 shm_broadcast.py:215] Binding to tcp://127.0.0.1:49403
INFO 01-09 07:32:40 shm_broadcast.py:255] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_bd2e25f8'), local_subscribe_port=49403, remote_subscribe_port=None)
(VllmWorkerProcess pid=348) DEBUG 01-09 07:32:40 shm_broadcast.py:279] Connecting to tcp://127.0.0.1:49403
INFO 01-09 07:32:40 model_runner.py:1094] Starting to load model cstr/llama3.1-8b-spaetzle-v90...
(VllmWorkerProcess pid=348) INFO 01-09 07:32:40 model_runner.py:1094] Starting to load model cstr/llama3.1-8b-spaetzle-v90...
DEBUG 01-09 07:32:40 decorators.py:105] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(VllmWorkerProcess pid=348) DEBUG 01-09 07:32:40 decorators.py:105] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 01-09 07:32:40 config.py:3285] enabled custom ops: Counter({'rms_norm': 65, 'silu_and_mul': 32, 'rotary_embedding': 1})
DEBUG 01-09 07:32:40 config.py:3287] disabled custom ops: Counter()
(VllmWorkerProcess pid=348) DEBUG 01-09 07:32:40 config.py:3285] enabled custom ops: Counter({'rms_norm': 65, 'silu_and_mul': 32, 'rotary_embedding': 1})
(VllmWorkerProcess pid=348) DEBUG 01-09 07:32:40 config.py:3287] disabled custom ops: Counter()
INFO 01-09 07:32:40 weight_utils.py:251] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/17 [00:00<?, ?it/s]
(VllmWorkerProcess pid=348) INFO 01-09 07:32:40 weight_utils.py:251] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 6% Completed | 1/17 [00:00<00:05, 2.77it/s]
Loading safetensors checkpoint shards: 12% Completed | 2/17 [00:00<00:05, 2.64it/s]
Loading safetensors checkpoint shards: 18% Completed | 3/17 [00:01<00:05, 2.41it/s]
Loading safetensors checkpoint shards: 24% Completed | 4/17 [00:01<00:05, 2.34it/s]
Loading safetensors checkpoint shards: 29% Completed | 5/17 [00:02<00:05, 2.33it/s]
Loading safetensors checkpoint shards: 35% Completed | 6/17 [00:02<00:04, 2.44it/s]
DEBUG 01-09 07:32:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:32:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:32:43 client.py:186] Waiting for output from MQLLMEngine.
Loading safetensors checkpoint shards: 41% Completed | 7/17 [00:02<00:04, 2.45it/s]
Loading safetensors checkpoint shards: 47% Completed | 8/17 [00:03<00:03, 2.81it/s]
DEBUG 01-09 07:32:44 utils.py:156] Loaded weight lm_head.weight with shape torch.Size([64128, 4096])
Loading safetensors checkpoint shards: 53% Completed | 9/17 [00:03<00:02, 3.18it/s]
(VllmWorkerProcess pid=348) DEBUG 01-09 07:32:44 utils.py:156] Loaded weight lm_head.weight with shape torch.Size([64128, 4096])
Loading safetensors checkpoint shards: 59% Completed | 10/17 [00:03<00:02, 3.04it/s]
Loading safetensors checkpoint shards: 65% Completed | 11/17 [00:04<00:02, 2.88it/s]
Loading safetensors checkpoint shards: 71% Completed | 12/17 [00:04<00:01, 2.81it/s]
Loading safetensors checkpoint shards: 76% Completed | 13/17 [00:04<00:01, 2.59it/s]
Loading safetensors checkpoint shards: 88% Completed | 15/17 [00:05<00:00, 3.22it/s]
Loading safetensors checkpoint shards: 94% Completed | 16/17 [00:05<00:00, 2.94it/s]
Loading safetensors checkpoint shards: 100% Completed | 17/17 [00:06<00:00, 2.79it/s]
Loading safetensors checkpoint shards: 100% Completed | 17/17 [00:06<00:00, 2.74it/s]
INFO 01-09 07:32:47 model_runner.py:1099] Loading model weights took 7.5122 GB
(VllmWorkerProcess pid=348) INFO 01-09 07:32:48 model_runner.py:1099] Loading model weights took 7.5122 GB
DEBUG 01-09 07:32:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:32:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:32:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:32:53 client.py:186] Waiting for output from MQLLMEngine.
INFO 01-09 07:32:55 worker.py:241] Memory profiling takes 6.84 seconds
INFO 01-09 07:32:55 worker.py:241] the current vLLM instance can use total_gpu_memory (47.54GiB) x gpu_memory_utilization (0.95) = 45.16GiB
INFO 01-09 07:32:55 worker.py:241] model weights take 7.51GiB; non_torch_memory takes 0.68GiB; PyTorch activation peak memory takes 1.19GiB; the rest of the memory reserved for KV Cache is 35.78GiB.
(VllmWorkerProcess pid=348) INFO 01-09 07:32:55 worker.py:241] Memory profiling takes 6.95 seconds
(VllmWorkerProcess pid=348) INFO 01-09 07:32:55 worker.py:241] the current vLLM instance can use total_gpu_memory (47.54GiB) x gpu_memory_utilization (0.95) = 45.16GiB
(VllmWorkerProcess pid=348) INFO 01-09 07:32:55 worker.py:241] model weights take 7.51GiB; non_torch_memory takes 0.64GiB; PyTorch activation peak memory takes 0.24GiB; the rest of the memory reserved for KV Cache is 36.76GiB.
INFO 01-09 07:32:55 distributed_gpu_executor.py:57] # GPU blocks: 36643, # CPU blocks: 4096
INFO 01-09 07:32:55 distributed_gpu_executor.py:61] Maximum concurrency for 131072 tokens per request: 4.47x
INFO 01-09 07:32:59 model_runner.py:1415] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 3%|▎ | 1/35 [00:00<00:26, 1.30it/s]
(VllmWorkerProcess pid=348) INFO 01-09 07:33:00 model_runner.py:1415] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 14%|█▍ | 5/35 [00:04<00:25, 1.18it/s]
DEBUG 01-09 07:33:03 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:03 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:03 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:03 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:03 client.py:186] Waiting for output from MQLLMEngine.
Capturing CUDA graph shapes: 49%|████▊ | 17/35 [00:14<00:15, 1.19it/s]
DEBUG 01-09 07:33:13 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:13 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:13 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:13 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:13 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:13 client.py:186] Waiting for output from MQLLMEngine.
Capturing CUDA graph shapes: 80%|████████ | 28/35 [00:23<00:05, 1.19it/s]
DEBUG 01-09 07:33:23 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:23 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:23 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:23 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:23 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:23 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:23 client.py:186] Waiting for output from MQLLMEngine.
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:30<00:00, 1.15it/s]
INFO 01-09 07:33:29 custom_all_reduce.py:224] Registering 2275 cuda graph addresses
DEBUG 01-09 07:33:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:43 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:43 client.py:186] Waiting for output from MQLLMEngine.
(VllmWorkerProcess pid=348) INFO 01-09 07:33:45 custom_all_reduce.py:224] Registering 2275 cuda graph addresses
(VllmWorkerProcess pid=348) INFO 01-09 07:33:45 model_runner.py:1535] Graph capturing finished in 45 secs, took 0.98 GiB
INFO 01-09 07:33:45 model_runner.py:1535] Graph capturing finished in 46 secs, took 0.98 GiB
INFO 01-09 07:33:45 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 57.23 seconds
DEBUG 01-09 07:33:45 engine.py:130] Starting Startup Loop.
DEBUG 01-09 07:33:45 engine.py:132] Starting Engine Loop.
DEBUG 01-09 07:33:46 api_server.py:262] vLLM to use /tmp/tmpc23ajfua as PROMETHEUS_MULTIPROC_DIR
INFO 01-09 07:33:46 api_server.py:640] Using supplied chat template:
INFO 01-09 07:33:46 api_server.py:640] None
INFO 01-09 07:33:46 serving_chat.py:73] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
INFO 01-09 07:33:46 launcher.py:19] Available routes are:
INFO 01-09 07:33:46 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 01-09 07:33:46 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 01-09 07:33:46 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 01-09 07:33:46 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 01-09 07:33:46 launcher.py:27] Route: /health, Methods: GET
INFO 01-09 07:33:46 launcher.py:27] Route: /tokenize, Methods: POST
INFO 01-09 07:33:46 launcher.py:27] Route: /detokenize, Methods: POST
INFO 01-09 07:33:46 launcher.py:27] Route: /v1/models, Methods: GET
INFO 01-09 07:33:46 launcher.py:27] Route: /version, Methods: GET
INFO 01-09 07:33:46 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 01-09 07:33:46 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 01-09 07:33:46 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 01-09 07:33:46 launcher.py:27] Route: /pooling, Methods: POST
INFO 01-09 07:33:46 launcher.py:27] Route: /score, Methods: POST
INFO 01-09 07:33:46 launcher.py:27] Route: /v1/score, Methods: POST
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:53 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:33:55 client.py:165] Heartbeat successful.
DEBUG 01-09 07:33:55 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
DEBUG 01-09 07:33:56 engine.py:190] Waiting for new requests in engine loop.
INFO: 192.168.20.118:57735 - "GET /v1/models HTTP/1.1" 200 OK
INFO 01-09 07:35:30 chat_utils.py:333] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO: 192.168.20.118:57735 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO: 192.168.20.118:57735 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO: 192.168.20.118:57735 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:33 client.py:186] Waiting for output from MQLLMEngine.
DEBUG 01-09 07:35:36 client.py:165] Heartbeat successful.
Python script:
$ python vLLM_OpenAI_Compatible_Tool.py
Traceback (most recent call last):
File "C:\Users\Username\Documents\Coding\AI\vLLM_OpenAI_Compatible_Tool.py", line 63, in
chat_completion = client.chat.completions.create(messages=messages,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\openai_utils_utils.py", line 274, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\openai\resources\chat\completions.py", line 742, in create
return self._post(
^^^^^^^^^^^
File "C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\openai_base_client.py", line 1277, in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\openai_base_client.py", line 954, in request
return self._request(
^^^^^^^^^^^^^^
File "C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\openai_base_client.py", line 1043, in _request
return self._retry_request(
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\openai_base_client.py", line 1092, in _retry_request
return self._request(
^^^^^^^^^^^^^^
File "C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\openai_base_client.py", line 1043, in _request
return self._retry_request(
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\openai_base_client.py", line 1092, in _retry_request
return self._request(
^^^^^^^^^^^^^^
File "C:\Users\Username\AppData\Local\Programs\Python\Python312\Lib\site-packages\openai_base_client.py", line 1058, in _request
raise self._make_status_error_from_response(err.response) from None
openai.InternalServerError: Error code: 500
🐛 Describe the bug
When I run the vLLM Docker container, I can't get tool calling to work. Not even the official example from this repo (example/openai_chat_completion_client_with_tools.py) works, so I'm sure there must be an issue. The normal chat completions endpoint, with streaming and without tools, works fine.
Reproduce with the following command:
docker container stop vllm; docker container rm vllm; docker run --name vllm --runtime nvidia -e "NVIDIA_VISIBLE_DEVICES=GPU-XXXXXXXXXXXXXXXXXXXXXXXXXX,GPU-XXXXXXXXXXXXXXXXXXXXXXXXXX" -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxx" --ipc=host -p 8000:8000 vllm/vllm-openai --gpu-memory-utilization 0.95 --model cstr/llama3.1-8b-spaetzle-v90 --tensor-parallel-size 2 --served-model-name llama3.1-8b-spaetzle-v90 --enable-auto-tool-choice --tool-call-parser hermes
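For contrast, a plain request without tools against the same server succeeds, for example this streaming sketch (the base URL and model name mirror the command above and are assumptions about the local setup):

```python
# Sketch of a non-tool streaming request that works against the same server,
# in contrast to the tool-calling request, which returns a 500.
# Assumptions: server at http://localhost:8000/v1, model "llama3.1-8b-spaetzle-v90".
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="llama3.1-8b-spaetzle-v90",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    stream=True,
)
for chunk in stream:
    # Print each streamed delta as it arrives.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```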
Before submitting a new issue...