Upstream merge 24 12 09 #314

Merged 81 commits on Dec 9, 2024

Changes from 1 commit

Commits (81)
b45f0d7
[Misc][LoRA] Move the implementation of lora bias to punica.py (#10829)
jeejeelee Dec 2, 2024
519cc6c
[Misc][XPU] Avoid torch compile for XPU platform (#10747)
yma11 Dec 2, 2024
9b14d97
Fix openvino on GPU (#10793)
janimo Dec 2, 2024
4c05edb
[Model] Add TP and BNB quantization support to LlavaMultiModalProject…
Isotr0py Dec 2, 2024
4433195
[Bugfix] Prevent benchmark_throughput.py from using duplicated random…
mgoin Dec 3, 2024
d746268
[Model] support bitsandbytes quantization with minicpm model (#10842)
zixuanzhang226 Dec 3, 2024
a4cf256
[Bugfix] Fix QKVParallelLinearWithShardedLora bias bug (#10844)
jeejeelee Dec 3, 2024
21fe7b4
[core][distributed] add pynccl broadcast (#10843)
youkaichao Dec 3, 2024
dc5ce86
[torch.compile] remove compilation_context and simplify code (#10838)
youkaichao Dec 3, 2024
ef51831
[Doc] Add github links for source code references (#10672)
russellb Dec 3, 2024
3257d44
[Misc] Remove deprecated names (#10817)
DarkLight1337 Dec 3, 2024
9323a31
[Core][Performance] Add XGrammar support for guided decoding and set …
aarnphm Dec 3, 2024
f6084f6
[Speculative Decoding] Move indices to device before filtering output…
zhengy001 Dec 3, 2024
3bc94ca
[V1] VLM - Run the mm_mapper preprocessor in the frontend process (#1…
alexm-redhat Dec 3, 2024
2f2cdc7
[MISC][XPU] quick fix for XPU CI (#10859)
yma11 Dec 3, 2024
7090c27
[Bugfix] Only require XGrammar on x86 (#10865)
mgoin Dec 3, 2024
7c32b68
[Frontend] correctly record prefill and decode time metrics (#10853)
tomeras91 Dec 3, 2024
a061fe6
[Build][Bugfix] Using the correct type hint (#10866)
gshtras Dec 3, 2024
381ac93
[Benchmark] Benchmark structured output with datasets (#10557)
xuechendi Dec 4, 2024
d2bd88b
[CI/Build] Replace mean with torch.all in test_pynccl.py (#10876)
tlrmchlsmth Dec 4, 2024
b5b647b
Drop ROCm load format check (#10767)
wangxiyuan Dec 4, 2024
fa2dea6
[ci/build] Change queue name for Release jobs (#10875)
khluu Dec 4, 2024
c9ca4fc
[ci/build] Job to build and push release image (#10877)
khluu Dec 4, 2024
8db957e
[bugfix] fixed parameter “n” when set parameter “bestof” > 1 (#10854)
o2363286 Dec 4, 2024
c92acb9
[ci/build] Update vLLM postmerge ECR repo (#10887)
khluu Dec 4, 2024
01d079f
[LoRA] Change lora_tokenizers capacity (#10796)
xyang16 Dec 4, 2024
10398b4
[Model] Consolidate ViTs attention implementation without mask (#10893)
Isotr0py Dec 4, 2024
82eb5ea
Benchmark serving structured output (#10880)
xuechendi Dec 4, 2024
e4c34c2
[CI/Build] improve python-only dev setup (#9621)
dtrifiro Dec 4, 2024
2a56e12
[V1] Fix when max_model_len is not divisible by block_size (#10903)
WoosukKwon Dec 5, 2024
7883c2b
[benchmark] Make H100 benchmark optional (#10908)
khluu Dec 5, 2024
8d370e9
[Bugfix] Fallback to outlines for complex json schemas (#10899)
mgoin Dec 5, 2024
aa39a8e
[Doc] Create a new "Usage" section (#10827)
DarkLight1337 Dec 5, 2024
1f958a7
[Bugfix] Fix BNB loader target_modules (#10720)
jeejeelee Dec 5, 2024
39c89e7
[Misc] Update llama 3.2 template to support system prompt with images…
tjohnson31415 Dec 5, 2024
571da8f
[Misc][LoRA] Clean up the function interface of Punica (#10917)
jeejeelee Dec 5, 2024
998eeaf
[CI/Build] Bump test transformers version (#10106)
Isotr0py Dec 5, 2024
a430652
[Misc][Gaudi] Avoid torch.compile and enable lazy collectives (#10897)
kzawora-intel Dec 5, 2024
9743d64
[ci][build] add tests for python only compilation (#10915)
youkaichao Dec 5, 2024
db87eb6
[torch.compile] use size tuning for specific sizes (#10933)
youkaichao Dec 6, 2024
b031a45
[torch.compile] add logging for compilation time (#10941)
youkaichao Dec 6, 2024
222f5b0
[CI/Build] Fix broken multimodal test (#10950)
DarkLight1337 Dec 6, 2024
a1887f2
[torch.compile] fix deprecated code (#10948)
youkaichao Dec 6, 2024
8b59631
[Core] Support Lark grammars for XGrammar (#10870)
mgoin Dec 6, 2024
7406274
[Doc] add KubeAI to serving integrations (#10837)
samos123 Dec 6, 2024
c05cfb6
[misc] fix typo (#10960)
youkaichao Dec 6, 2024
dcdc3fa
[ci] fix broken tests (#10956)
youkaichao Dec 6, 2024
69d357b
[Core] Cleanup startup logging a bit (#10961)
russellb Dec 7, 2024
acf092d
[Bugfix] Fix test-pipeline.yaml (#10973)
jeejeelee Dec 7, 2024
955fa95
[3/N] Support and implement merged input processor for LLaVA model (#…
DarkLight1337 Dec 7, 2024
f13cf9a
[Build] Fix for the Wswitch-bool clang warning (#10060)
gshtras Dec 7, 2024
b26b4cd
[Misc][LoRA] Refactor and clean MergedQKVParallelLinearWithLora imple…
Isotr0py Dec 7, 2024
bf0e382
[Model] Composite weight loading for multimodal Qwen2 (#10944)
DarkLight1337 Dec 7, 2024
1c768fe
[Doc] Explicitly state that InternVL 2.5 is supported (#10978)
DarkLight1337 Dec 7, 2024
39e227c
[Model] Update multi-modal processor to support Mantis(LLaVA) model (…
DarkLight1337 Dec 7, 2024
c889d58
[Doc] Explicitly state that PP isn't compatible with speculative deco…
DarkLight1337 Dec 7, 2024
78029b3
[BugFix][Kernel]: fix illegal memory access in causal_conv1d when con…
xffxff Dec 7, 2024
1b62745
[core][executor] simplify instance id (#10976)
youkaichao Dec 7, 2024
7be15d9
[core][misc] remove use_dummy driver for _run_workers (#10920)
youkaichao Dec 7, 2024
fd57d2b
[torch.compile] allow candidate compile sizes (#10984)
youkaichao Dec 8, 2024
a11f326
[V1] Initial support of multimodal models for V1 re-arch (#10699)
ywang96 Dec 8, 2024
43b05fa
[torch.compile][misc] fix comments (#10993)
youkaichao Dec 8, 2024
46004e8
[misc] clean up and unify logging (#10999)
youkaichao Dec 9, 2024
af7c4a9
[Doc][V1] Add V1 support column for multimodal models (#10998)
ywang96 Dec 9, 2024
d1c2e15
[torch.compile] add dynamo time tracking (#11005)
youkaichao Dec 9, 2024
c690357
[V1] Fix Detokenizer loading in `AsyncLLM` (#10997)
ywang96 Dec 9, 2024
e691b26
[Core] Require xgrammar >= 0.1.6 (#11021)
russellb Dec 9, 2024
aea2fc3
[Platform] Move `async output` check to platform (#10768)
wangxiyuan Dec 9, 2024
25b79d9
[V1] Input Batch Relocation (#10962)
varun-sundar-rabindranath Dec 9, 2024
edc4fa3
[ci/build] Recompile CI dependencies list with Python 3.12 (#11013)
khluu Dec 9, 2024
3b61cb4
[V1] Further reduce CPU overheads in flash-attn (#10989)
WoosukKwon Dec 9, 2024
ca87149
[Misc][LoRA] Abstract PunicaWrapper (#10955)
jeejeelee Dec 9, 2024
a811dd6
[Model] merged input processor for Phi-3-Vision models (#10977)
Isotr0py Dec 9, 2024
cbcbdb1
[Bugfix][Hardware][Gaudi] Bump vllm_hpu_extension version (#11028)
kzawora-intel Dec 9, 2024
1a2f8fb
[v1] fix use compile sizes (#11000)
youkaichao Dec 9, 2024
9c6459e
[Neuron] Upgrade neuron to 2.20.2 (#11016)
xendo Dec 9, 2024
b63ba84
[ROCm][bugfix] speculative decoding worker class (#11035)
gshtras Dec 9, 2024
7c61516
Merge remote-tracking branch 'upstream/main' into develop
gshtras Dec 9, 2024
401a541
format
gshtras Dec 9, 2024
c9f5c24
Merge remote-tracking branch 'origin/main' into upstream_merge_24_12_09
gshtras Dec 9, 2024
c324ea8
Merge remote-tracking branch 'upstream/main' into upstream_merge_24_1…
gshtras Dec 9, 2024
[V1] Fix Detokenizer loading in AsyncLLM (vllm-project#10997)
Signed-off-by: Roger Wang <ywang@roblox.com>
ywang96 authored Dec 9, 2024
commit c690357928fd2812f450bfb0c3629a816f5e9a55
7 changes: 6 additions & 1 deletion vllm/v1/engine/async_llm.py
@@ -65,7 +65,12 @@ def __init__(
             input_registry)

         # Detokenizer (converts EngineCoreOutputs --> RequestOutput).
-        self.detokenizer = Detokenizer(vllm_config.model_config.tokenizer)
+        self.detokenizer = Detokenizer(
+            tokenizer_name=vllm_config.model_config.tokenizer,
+            tokenizer_mode=vllm_config.model_config.tokenizer_mode,
+            trust_remote_code=vllm_config.model_config.trust_remote_code,
+            revision=vllm_config.model_config.tokenizer_revision,
+        )

         # EngineCore (starts the engine in background process).
         self.engine_core = EngineCoreClient.make_client(
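
For context, the bug this commit fixes is that only the tokenizer name was forwarded to the Detokenizer, so non-default model-config settings such as tokenizer_mode, trust_remote_code, and tokenizer_revision were silently dropped. The following is a minimal sketch of that pattern; ModelConfig and Detokenizer here are simplified hypothetical stand-ins, not the real vLLM classes.

# Hedged sketch: hypothetical stand-ins for vLLM's model config and detokenizer.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfig:  # placeholder for vllm_config.model_config
    tokenizer: str
    tokenizer_mode: str = "auto"
    trust_remote_code: bool = False
    tokenizer_revision: Optional[str] = None


class Detokenizer:  # placeholder, mirrors the keyword arguments used in the diff
    def __init__(self, tokenizer_name: str, tokenizer_mode: str = "auto",
                 trust_remote_code: bool = False,
                 revision: Optional[str] = None):
        self.tokenizer_name = tokenizer_name
        self.tokenizer_mode = tokenizer_mode
        self.trust_remote_code = trust_remote_code
        self.revision = revision


model_config = ModelConfig(tokenizer="my-org/custom-model",   # hypothetical values
                           tokenizer_mode="slow",
                           trust_remote_code=True,
                           tokenizer_revision="abc123")

# Before the fix: only the name was forwarded, so the other settings
# fell back to their defaults.
old = Detokenizer(model_config.tokenizer)
assert old.tokenizer_mode == "auto"  # user's "slow" setting is lost

# After the fix: every tokenizer-related field is forwarded explicitly.
new = Detokenizer(
    tokenizer_name=model_config.tokenizer,
    tokenizer_mode=model_config.tokenizer_mode,
    trust_remote_code=model_config.trust_remote_code,
    revision=model_config.tokenizer_revision,
)
assert new.tokenizer_mode == "slow"

The design point is simply that constructor arguments derived from configuration should be passed through explicitly rather than relying on downstream defaults, which is what the one-line-to-six-line change in the diff accomplishes.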