[MFM-20250115] Merge from ROCm/main to llama_fp8 #360

Merged 537 commits from ROCm/main into llama_fp8 on Jan 15, 2025

Commits (537)
196c34b
[Misc] Move weights mapper (#11443)
jeejeelee Dec 24, 2024
409475a
[Bugfix] Fix issues in CPU build Dockerfile. Fixes #9182 (#11435)
terrytangyuan Dec 24, 2024
3f3e92e
[Model] Automatic conversion of classification and reward models (#11…
DarkLight1337 Dec 24, 2024
9832e55
[V1] Unify VLLM_ENABLE_V1_MULTIPROCESSING handling in RayExecutor (#1…
ruisearch42 Dec 25, 2024
fc60166
[Misc] Update disaggregation benchmark scripts and test logs (#11456)
Jeffwan Dec 25, 2024
b689ada
[Frontend] Enable decord to load video from base64 (#11492)
DarkLight1337 Dec 25, 2024
6ad909f
[Doc] Improve GitHub links (#11491)
DarkLight1337 Dec 25, 2024
51a624b
[Misc] Move some multimodal utils to modality-specific modules (#11494)
DarkLight1337 Dec 26, 2024
dbeac95
Mypy checking for vllm/compilation (#11496)
lucas-tucker Dec 26, 2024
aa25985
[Misc][LoRA] Fix LoRA weight mapper (#11495)
jeejeelee Dec 26, 2024
7492a36
[Doc] Add `QVQ` and `QwQ` to the list of supported models (#11509)
ywang96 Dec 26, 2024
dcb1a94
[V1] Adding min tokens/repetition/presence/frequence penalties to V1 …
sroy745 Dec 26, 2024
f57ee56
[Model] Modify MolmoForCausalLM MLP (#11510)
jeejeelee Dec 26, 2024
eec906d
[Misc] Add placeholder module (#11501)
DarkLight1337 Dec 26, 2024
b85a977
[Doc] Add video example to openai client for multimodal (#11521)
Isotr0py Dec 26, 2024
720b10f
[1/N] API Server (Remove Proxy) (#11529)
robertgshaw2-redhat Dec 26, 2024
2072924
[Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quanti…
mgoin Dec 26, 2024
55fb97f
[2/N] API Server: Avoid ulimit footgun (#11530)
robertgshaw2-redhat Dec 26, 2024
f49777b
Deepseek v3 (#11502)
simon-mo Dec 27, 2024
82d24f7
[Docs] Document Deepseek V3 support (#11535)
simon-mo Dec 27, 2024
0c0c201
Update openai_compatible_server.md (#11536)
robertgshaw2-redhat Dec 27, 2024
371d04d
[V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling (#11394)
WoosukKwon Dec 27, 2024
81b979f
[V1] Fix yapf (#11538)
WoosukKwon Dec 27, 2024
46d4359
[CI] Fix broken CI (#11543)
robertgshaw2-redhat Dec 27, 2024
eb881ed
[misc] fix typing (#11540)
youkaichao Dec 27, 2024
1b875a0
[V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly (…
robertgshaw2-redhat Dec 27, 2024
2339d59
[BugFix] Fix quantization for all other methods (#11547)
robertgshaw2-redhat Dec 27, 2024
6c6f7fe
[Platform] Move model arch check to platform (#11503)
MengqingCao Dec 27, 2024
d003f3e
Update deploying_with_k8s.md with AMD ROCm GPU example (#11465)
AlexHe99 Dec 27, 2024
2c9b8ea
[Bugfix] Fix TeleChat2ForCausalLM weights mapper (#11546)
jeejeelee Dec 27, 2024
7af553e
[Misc] Abstract the logic for reading and writing media content (#11527)
DarkLight1337 Dec 27, 2024
5ce4627
[Doc] Add xgrammar in doc (#11549)
Chen-0210 Dec 27, 2024
1014180
[VLM] Support caching in merged multi-modal processor (#11396)
DarkLight1337 Dec 27, 2024
55509c2
[MODEL] LoRA support for Jamba model (#11209)
ErezSC42 Dec 27, 2024
0240402
[Misc]Add BNB quantization for MolmoForCausalLM (#11551)
jeejeelee Dec 27, 2024
dde1fa1
[Misc] Improve BNB loader to handle mixture of sharded and merged wei…
Isotr0py Dec 27, 2024
ac79799
[Bugfix] Fix for ROCM compressed tensor support (#11561)
selalipop Dec 27, 2024
a607312
[Doc] Update mllama example based on official doc (#11567)
heheda12345 Dec 28, 2024
df04dff
[V1] [4/N] API Server: ZMQ/MP Utilities (#11541)
robertgshaw2-redhat Dec 28, 2024
b5cbe8e
[Bugfix] Last token measurement fix (#11376)
rajveerb Dec 28, 2024
d34be24
[Model] Support InternLM2 Reward models (#11571)
Isotr0py Dec 28, 2024
b7dcc00
[Model] Remove hardcoded image tokens ids from Pixtral (#11582)
ywang96 Dec 28, 2024
59d6bb4
[Hardware][AMD]: Replace HIPCC version with more precise ROCm version…
hj-wei Dec 28, 2024
42bb201
[V1][Minor] Set pin_memory=False for token_ids_cpu tensor (#11581)
WoosukKwon Dec 28, 2024
d427e5c
[Doc] Minor documentation fixes (#11580)
DarkLight1337 Dec 28, 2024
328841d
[bugfix] interleaving sliding window for cohere2 model (#11583)
youkaichao Dec 28, 2024
4fb8e32
[V1] [5/N] API Server: unify `Detokenizer` and `EngineCore` input (#…
robertgshaw2-redhat Dec 28, 2024
32b4c63
[Doc] Convert list tables to MyST (#11594)
DarkLight1337 Dec 29, 2024
dba4d9d
[v1][bugfix] fix cudagraph with inplace buffer assignment (#11596)
youkaichao Dec 29, 2024
faef77c
[Misc] KV cache transfer connector registry (#11481)
KuntaiDu Dec 29, 2024
0aa38d1
Remove print statement in DeepseekScalingRotaryEmbedding (#11604)
mgoin Dec 29, 2024
3682e33
[v1] fix compilation cache (#11598)
youkaichao Dec 30, 2024
628ec6c
[Docker] bump up neuron sdk v2.21 (#11593)
liangfu Dec 30, 2024
970d6d0
[Build][Kernel] Update CUTLASS to v3.6.0 (#11607)
tlrmchlsmth Dec 30, 2024
5dbf854
[CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels (#11618)
bigPYJ1151 Dec 30, 2024
b12e87f
[platforms] enable platform plugins (#11602)
youkaichao Dec 30, 2024
8d9b672
[VLM] Abstract out multi-modal data parsing in merged processor (#11620)
DarkLight1337 Dec 30, 2024
5886aa4
[V1] [6/N] API Server: Better Shutdown (#11586)
robertgshaw2-redhat Dec 30, 2024
36e7670
[Bugfix] Validate and concatenate image embeddings in MiniCPMVBaseMod…
whyiug Dec 30, 2024
ccb1aab
[benchmark] Remove dependency for H100 benchmark step (#11572)
khluu Dec 30, 2024
a2a40bc
[Model][LoRA]LoRA support added for MolmoForCausalLM (#11439)
ayylemao Dec 31, 2024
74fa1d1
[Bugfix] Fix OpenAI parallel sampling when using xgrammar (#11637)
mgoin Dec 31, 2024
82c49d3
[Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) (#6909)
JohnGiorgi Dec 31, 2024
2c57188
[Bugfix] Move the _touch(computed_blocks) call in the allocate_slots …
sakunkun Dec 31, 2024
8c3230d
[V1] Simpify vision block hash for prefix caching by removing offset …
heheda12345 Dec 31, 2024
e7c7c5e
[V1][VLM] V1 support for selected single-image models. (#11632)
ywang96 Dec 31, 2024
0c6f998
[Benchmark] Add benchmark script for CPU offloading (#11533)
ApostaC Jan 1, 2025
4db72e5
[Bugfix][Refactor] Unify model management in frontend (#11660)
joerunde Jan 1, 2025
365801f
[VLM] Add max-count checking in data parser for single image models (…
DarkLight1337 Jan 1, 2025
11d8a09
[Misc] Optimize Qwen2-VL LoRA test (#11663)
jeejeelee Jan 1, 2025
f962f42
[Misc] Replace space with - in the file names (#11667)
houseroad Jan 1, 2025
6d70198
[Doc] Fix typo (#11666)
serihiro Jan 1, 2025
7300144
[V1] Implement Cascade Attention (#11635)
WoosukKwon Jan 1, 2025
a115ac4
[VLM] Move supported limits and max tokens to merged multi-modal proc…
DarkLight1337 Jan 1, 2025
23c1b10
[VLM][Bugfix] Multi-modal processor compatible with V1 multi-input (#…
DarkLight1337 Jan 2, 2025
b6087a6
[mypy] Pass type checking in vllm/inputs (#11680)
CloseChoice Jan 2, 2025
8c38ee7
[VLM] Merged multi-modal processor for LLaVA-NeXT (#11682)
DarkLight1337 Jan 2, 2025
84c35c3
According to vllm.EngineArgs, the name should be distributed_executor…
chunyang-wen Jan 2, 2025
2f38518
[Bugfix] Free cross attention block table for preempted-for-recompute…
kathyyu-google Jan 2, 2025
b55ed6e
[V1][Minor] Optimize token_ids_cpu copy (#11692)
WoosukKwon Jan 2, 2025
187e329
[Bugfix] Change kv scaling factor by param json on nvidia gpu (#11688)
bjmsong Jan 2, 2025
5dba257
Resolve race conditions in Marlin kernel (#11493)
wchen61 Jan 2, 2025
68d3780
[Misc] Minimum requirements for SageMaker compatibility (#11576)
nathan-az Jan 2, 2025
2f1e8e8
Update default max_num_batch_tokens for chunked prefill (#11694)
SachinVarghese Jan 3, 2025
07064cb
[Bugfix] Check chain_speculative_sampling before calling it (#11673)
houseroad Jan 3, 2025
fd3a62a
[perf-benchmark] Fix dependency for steps in benchmark pipeline (#11710)
khluu Jan 3, 2025
e1a5c2f
[Model] Whisper model implementation (#11280)
aurickq Jan 3, 2025
80c751e
[V1] Simplify Shutdown (#11659)
robertgshaw2-redhat Jan 3, 2025
61fed92
[Bugfix] Fix ColumnParallelLinearWithLoRA slice (#11708)
zinccat Jan 3, 2025
1543914
[V1] Improve TP>1 Error Handling + Stack Trace (#11721)
robertgshaw2-redhat Jan 3, 2025
a655eb3
[Misc]Add BNB quantization for Qwen2VL (#11719)
jeejeelee Jan 3, 2025
bf0d97d
Update requirements-tpu.txt to support python 3.9 and 3.11 (#11695)
mgoin Jan 3, 2025
ad0d567
[V1] Chore: cruft removal (#11724)
robertgshaw2-redhat Jan 3, 2025
e5d7ed0
[V1] log GPU blocks num for MultiprocExecutor (#11656)
WangErXiao Jan 4, 2025
9c93636
Update tool_calling.md (#11701)
Bryce1010 Jan 4, 2025
d1d4939
Update bnb.md with example for OpenAI (#11718)
bet0x Jan 4, 2025
fbf2564
[V1] Add `RayExecutor` support for `AsyncLLM` (api server) (#11712)
jikunshang Jan 4, 2025
d91457d
[V1] Add kv cache utils tests. (#11513)
xcnick Jan 4, 2025
300acb8
[Core][Bugfix] Use correct device to initialize GPU data during CUDA-…
yanburman Jan 4, 2025
eed11eb
[VLM] Merged multi-modal processors for LLaVA-NeXT-Video and LLaVA-On…
DarkLight1337 Jan 4, 2025
ba214df
[Bugfix] Fix precision error in LLaVA-NeXT (#11735)
DarkLight1337 Jan 4, 2025
65c0892
[Model] Remove unnecessary weight initialization logic (#11736)
DarkLight1337 Jan 4, 2025
4783143
[Bugfix][V1] Fix test_kv_cache_utils.py (#11738)
jeejeelee Jan 4, 2025
4068f4b
[MISC] Replace c10::optional with std::optional (#11730)
houseroad Jan 5, 2025
635b897
[distributed] remove pynccl's redundant stream (#11744)
cennn Jan 5, 2025
eba1717
fix: [doc] fix typo (#11751)
RuixiangMa Jan 5, 2025
33fc1e2
[Frontend] Improve `StreamingResponse` Exception Handling (#11752)
robertgshaw2-redhat Jan 5, 2025
9e764e7
[distributed] remove pynccl's redundant change_state (#11749)
cennn Jan 6, 2025
402d378
[Doc] [1/N] Reorganize Getting Started section (#11645)
DarkLight1337 Jan 6, 2025
408e560
[Bugfix] Remove block size constraint (#11723)
comaniac Jan 6, 2025
06bfb51
[V1] Add BlockTable class (#11693)
WoosukKwon Jan 6, 2025
f8fcca1
[Misc] Fix typo for valid_tool_parses (#11753)
ruisearch42 Jan 6, 2025
022c5c6
[V1] Refactor get_executor_cls (#11754)
ruisearch42 Jan 6, 2025
9c74971
[mypy] Forward pass function type hints in lora (#11740)
lucas-tucker Jan 6, 2025
2a622d7
k8s-config: Update the secret to use stringData (#11679)
surajssd Jan 6, 2025
996357e
[VLM] Separate out profiling-related logic (#11746)
DarkLight1337 Jan 6, 2025
ee77fdb
[Doc][2/N] Reorganize Models and Usage sections (#11755)
DarkLight1337 Jan 6, 2025
9279b9f
[Bugfix] Fix max image size for LLaVA-Onevision (#11769)
ywang96 Jan 6, 2025
4ca5d40
[doc] explain how to add interleaving sliding window support (#11771)
youkaichao Jan 6, 2025
32c9eff
[Bugfix][V1] Fix molmo text-only inputs (#11676)
jeejeelee Jan 6, 2025
e20c92b
[Kernel] Move attn_type to Attention.__init__() (#11690)
heheda12345 Jan 6, 2025
4773c29
Merge remote-tracking branch 'upstream/main'
gshtras Jan 6, 2025
267c1a1
format
gshtras Jan 6, 2025
91b361a
[V1] Extend beyond image modality and support mixed-modality inferenc…
ywang96 Jan 6, 2025
2053351
deepseek overflow fix (#349)
Concurrensee Jan 6, 2025
08fb75c
[Bugfix] Fix LLaVA-NeXT feature size precision error (for real) (#11772)
DarkLight1337 Jan 7, 2025
d0169e1
[Model] Future-proof Qwen2-Audio multi-modal processor (#11776)
DarkLight1337 Jan 7, 2025
d93d2d7
[XPU] Make pp group initilized for pipeline-parallelism (#11648)
ys950902 Jan 7, 2025
8ceffbf
[Doc][3/N] Reorganize Serving section (#11766)
DarkLight1337 Jan 7, 2025
b278557
[Kernel][LoRA]Punica prefill kernels fusion (#11234)
jeejeelee Jan 7, 2025
0f3f3c8
[Bugfix] Update attention interface in `Whisper` (#11784)
ywang96 Jan 7, 2025
898cdf0
[CI] Fix neuron CI and run offline tests (#11779)
liangfu Jan 7, 2025
e512f76
fix init error for MessageQueue when n_local_reader is zero (#11768)
XiaobingSuper Jan 7, 2025
ce1917f
[Doc] Create a vulnerability management team (#9925)
russellb Jan 7, 2025
1e4ce29
[CI][CPU] adding build number to docker image name (#11788)
zhouyuan Jan 7, 2025
8082ad7
[V1][Doc] Update V1 support for `LLaVa-NeXT-Video` (#11798)
ywang96 Jan 7, 2025
8f37be3
[Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calcula…
DarkLight1337 Jan 7, 2025
869e829
[doc] add doc to explain how to use uv (#11773)
youkaichao Jan 7, 2025
2de197b
[V1] Support audio language models on V1 (#11733)
ywang96 Jan 7, 2025
d9fa1c0
[doc] update how pip can install nightly wheels (#11806)
youkaichao Jan 7, 2025
c0efe92
[Doc] Add note to `gte-Qwen2` models (#11808)
DarkLight1337 Jan 7, 2025
869579a
[optimization] remove python function call for custom op (#11750)
youkaichao Jan 7, 2025
c994223
[Bugfix] update the prefix for qwen2 (#11795)
jiangjiadi Jan 7, 2025
973f5dc
[Doc]Add documentation for using EAGLE in vLLM (#11417)
sroy745 Jan 7, 2025
97067c0
Merge branch 'main' into upstream_merge_25_1_6
gshtras Jan 8, 2025
a4e2b26
[Bugfix] Significant performance drop on CPUs with --num-scheduler-st…
DamonFool Jan 8, 2025
5950f55
[Doc] Group examples into categories (#11782)
hmellor Jan 8, 2025
91445c7
[Bugfix] Fix image input for Pixtral-HF (#11741)
DarkLight1337 Jan 8, 2025
4d29e91
[Misc] sort torch profiler table by kernel timing (#11813)
divakar-amd Jan 8, 2025
dc71af0
Remove the duplicate imports of MultiModalKwargs and PlaceholderRange…
WangErXiao Jan 8, 2025
b640b19
Fixed docker build for ppc64le (#11518)
npanpaliya Jan 8, 2025
f4923cb
[OpenVINO] Fixed Docker.openvino build (#11732)
ilya-lavrenov Jan 8, 2025
f645eb6
[Bugfix] Add checks for LoRA and CPU offload (#11810)
jeejeelee Jan 8, 2025
259abd8
[Docs] reorganize sponsorship page (#11639)
simon-mo Jan 8, 2025
ef68eb2
[Bug] Fix pickling of `ModelConfig` when RunAI Model Streamer is used…
DarkLight1337 Jan 8, 2025
889e662
[misc] improve memory profiling (#11809)
youkaichao Jan 8, 2025
ad9f1aa
[doc] update wheels url (#11830)
youkaichao Jan 8, 2025
a1b2b86
[Docs] Update sponsor name: 'Novita' to 'Novita AI' (#11833)
simon-mo Jan 8, 2025
cfd3219
[Hardware][Apple] Native support for macOS Apple Silicon (#11696)
wallashss Jan 8, 2025
f121411
[torch.compile] consider relevant code in compilation cache (#11614)
youkaichao Jan 8, 2025
2a0596b
[VLM] Reorganize profiling/processing-related code (#11812)
DarkLight1337 Jan 8, 2025
aba8d6e
[Doc] Move examples into categories (#11840)
hmellor Jan 8, 2025
6cd40a5
[Doc][4/N] Reorganize API Reference (#11843)
DarkLight1337 Jan 8, 2025
2f70249
[CI/Build][Bugfix] Fix CPU CI image clean up (#11836)
bigPYJ1151 Jan 8, 2025
88e020d
Merge pull request #350 from ROCm/upstream_merge_25_1_6
gshtras Jan 8, 2025
78f4590
[Bugfix][XPU] fix silu_and_mul (#11823)
yma11 Jan 8, 2025
ca47e17
[Misc] Move some model utils into vision file (#11848)
DarkLight1337 Jan 8, 2025
5984499
[Doc] Expand Multimodal API Reference (#11852)
DarkLight1337 Jan 8, 2025
47de882
[Misc]add some explanations for BlockHashType (#11847)
WangErXiao Jan 8, 2025
56fe4c2
[TPU][Quantization] TPU `W8A8` (#11785)
robertgshaw2-redhat Jan 8, 2025
526de82
[Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup f…
rasmith Jan 8, 2025
3db0caf
[Docs] Add Google Cloud Meetup (#11864)
simon-mo Jan 8, 2025
c040f0e
Revert nccl changes (#351)
gshtras Jan 8, 2025
615e4a5
[CI] Turn on basic correctness tests for V1 (#10864)
tlrmchlsmth Jan 9, 2025
1fe554b
treat do_lower_case in the same way as the sentence-transformers libr…
maxdebayser Jan 9, 2025
730e959
[Doc] Recommend uv and python 3.12 for quickstart guide (#11849)
mgoin Jan 9, 2025
d848800
[Misc] Move `print_*_once` from utils to logger (#11298)
DarkLight1337 Jan 9, 2025
a732900
[Doc] Intended links Python multiprocessing library (#11878)
guspan-tanadi Jan 9, 2025
310aca8
[perf]fix current stream (#11870)
youkaichao Jan 9, 2025
0bd1ff4
[Bugfix] Override dunder methods of placeholder modules (#11882)
DarkLight1337 Jan 9, 2025
1d967ac
[Bugfix] fix beam search input errors and latency benchmark script (#…
yeqcharlotte Jan 9, 2025
65097ca
[Doc] Add model development API Reference (#11884)
DarkLight1337 Jan 9, 2025
405eb8e
[platform] Allow platform specify attention backend (#11609)
wangxiyuan Jan 9, 2025
bd82872
[ci]try to fix flaky multi-step tests (#11894)
youkaichao Jan 9, 2025
9a22834
[Misc] Provide correct Pixtral-HF chat template (#11891)
DarkLight1337 Jan 9, 2025
3efdd2b
fp8 support (#352)
Concurrensee Jan 9, 2025
36f5303
[Docs] Add Modal to deployment frameworks (#11907)
charlesfrye Jan 9, 2025
c3cf54d
[Doc][5/N] Move Community and API Reference to the bottom (#11896)
DarkLight1337 Jan 10, 2025
b844b99
[VLM] Enable tokenized inputs for merged multi-modal processor (#11900)
DarkLight1337 Jan 10, 2025
3de2b1e
[Doc] Show default pooling method in a table (#11904)
DarkLight1337 Jan 10, 2025
cf5f000
[torch.compile] Hide KV cache behind torch.compile boundary (#11677)
heheda12345 Jan 10, 2025
ac2f3f7
[Bugfix] Validate lora adapters to avoid crashing server (#11727)
joerunde Jan 10, 2025
61af633
[BUGFIX] Fix `UnspecifiedPlatform` package name (#11916)
jikunshang Jan 10, 2025
d53575a
[ci] fix gh200 tests (#11919)
youkaichao Jan 10, 2025
d907be7
[misc] remove python function call for custom activation op (#11885)
cennn Jan 10, 2025
ef725fe
[platform] support pytorch custom op pluggable (#11328)
wangxiyuan Jan 10, 2025
d85c47d
Replace "online inference" with "online serving" (#11923)
hmellor Jan 10, 2025
241ad7b
[ci] Fix sampler tests (#11922)
youkaichao Jan 10, 2025
12664dd
[Doc] [1/N] Initial guide for merged multi-modal processor (#11925)
DarkLight1337 Jan 10, 2025
20410b2
[platform] support custom torch.compile backend key (#11318)
wangxiyuan Jan 10, 2025
482cdc4
[Doc] Rename offline inference examples (#11927)
hmellor Jan 10, 2025
f33e033
[Docs] Fix docstring in `get_ip` function (#11932)
KuntaiDu Jan 10, 2025
5959564
Doc fix in `benchmark_long_document_qa_throughput.py` (#11933)
KuntaiDu Jan 10, 2025
aa1e77a
[Hardware][CPU] Support MOE models on x86 CPU (#11831)
bigPYJ1151 Jan 10, 2025
46fa98c
[Misc] Clean up debug code in Deepseek-V3 (#11930)
Isotr0py Jan 10, 2025
8a57940
[Misc] Update benchmark_prefix_caching.py fixed example usage (#11920)
remimin Jan 10, 2025
d45cbe7
[Bugfix] Check that number of images matches number of <|image|> toke…
tjohnson31415 Jan 10, 2025
c9f09a4
[mypy] Fix mypy warnings in api_server.py (#11941)
frreiss Jan 11, 2025
899136b
[ci] fix broken distributed-tests-4-gpus (#11937)
youkaichao Jan 11, 2025
2118d05
[Bugfix][SpecDecode] Adjust Eagle model architecture to align with in…
llsj14 Jan 11, 2025
c32a7c7
[Bugfix] fused_experts_impl wrong compute type for float32 (#11921)
shaochangxu Jan 11, 2025
7a3a83e
[CI/Build] Move model-specific multi-modal processing tests (#11934)
DarkLight1337 Jan 11, 2025
a991f7d
[Doc] Basic guide for writing unit tests for new models (#11951)
DarkLight1337 Jan 11, 2025
d697dc0
[Bugfix] Fix RobertaModel loading (#11940)
NickLucche Jan 11, 2025
4b657d3
[Model] Add cogagent model support vLLM (#11742)
sixsixcoder Jan 11, 2025
b25cfab
[V1] Avoid sending text prompt to core engine (#11963)
ywang96 Jan 12, 2025
43f3d9e
[CI/Build] Add markdown linter (#11857)
rafvasq Jan 12, 2025
f967e51
[Model] Initialize support for Deepseek-VL2 models (#11578)
Isotr0py Jan 12, 2025
8bddb73
[Hardware][CPU] Multi-LoRA implementation for the CPU backend (#11100)
Akshat-Tripathi Jan 12, 2025
263a870
[Hardware][TPU] workaround fix for MoE on TPU (#11764)
avshalomman Jan 12, 2025
9597a09
[V1][Core][1/n] Logging and Metrics (#11962)
robertgshaw2-redhat Jan 12, 2025
d14e98d
[Model] Support GGUF models newly added in `transformers` 4.46.0 (#9685)
Isotr0py Jan 13, 2025
619ae26
[V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (#11973)
robertgshaw2-redhat Jan 13, 2025
f7b3ba8
[MISC] fix typo in kv transfer send recv test (#11983)
yyccli Jan 13, 2025
9dd02d8
[Bug] Fix usage of `.transpose()` and `.view()` consecutively. (#11979)
liaoyanqing666 Jan 13, 2025
80ea3af
[CI][Spec Decode] fix: broken test for EAGLE model (#11972)
llsj14 Jan 13, 2025
cf6bbcb
[Misc] Fix Deepseek V2 fp8 kv-scale remapping (#11947)
Concurrensee Jan 13, 2025
c3f05b0
[Misc]Minor Changes about Worker (#11555)
noemotiovon Jan 13, 2025
89ce62a
[platform] add ray_device_key (#11948)
youkaichao Jan 13, 2025
5340a30
Fix Max Token ID for Qwen-VL-Chat (#11980)
alex-jw-brooks Jan 13, 2025
0f8cafe
[Kernel] unified_attention for Attention.forward (#11967)
heheda12345 Jan 13, 2025
cd82499
[Doc][V1] Update model implementation guide for V1 support (#11998)
ywang96 Jan 13, 2025
e8c23ff
[Doc] Organise installation documentation into categories and tabs (#…
hmellor Jan 13, 2025
458e63a
[platform] add device_control env var (#12009)
youkaichao Jan 13, 2025
a7d5968
[Platform] Move get_punica_wrapper() function to Platform (#11516)
shen-shanshan Jan 13, 2025
c6db213
bugfix: Fix signature mismatch in benchmark's `get_tokenizer` functio…
e1ijah1 Jan 13, 2025
ce53f46
Merge remote-tracking branch 'upstream/main'
gshtras Jan 13, 2025
5a51290
Using list
gshtras Jan 13, 2025
079750e
Revert "[misc] improve memory profiling (#11809)"
gshtras Jan 13, 2025
113274a
Multi-lingual P3L (#356)
Alexei-V-Ivanov-AMD Jan 13, 2025
043c93d
Trying to make scales work with compileable attention
gshtras Jan 13, 2025
16f8680
Docs lint
gshtras Jan 14, 2025
eb4abfd
Merge remote-tracking branch 'origin/main' into upstream_merge_25_01_13
gshtras Jan 14, 2025
5976f48
Merge pull request #358 from ROCm/upstream_merge_25_01_13
gshtras Jan 14, 2025
7b8c3be
Merge remote-tracking branch 'origin/main' into main-to-llama-fp8
vllmellm Jan 15, 2025
ed572dd
Merge remote-tracking branch 'origin/main' into main-to-llama-fp8
vllmellm Jan 15, 2025
36999a2
Merge remote-tracking branch 'origin/llama_fp8_12062024' into main-to…
vllmellm Jan 15, 2025
02962b6
linter formatting bug fixes
vllmellm Jan 15, 2025
7c05f3e
inherit config file updates under fused_moe from main branch.
vllmellm Jan 15, 2025
af684f9
match tests for the MOE layers with main.
vllmellm Jan 15, 2025
Files changed
24 changes: 24 additions & 0 deletions .buildkite/generate_index.py
@@ -0,0 +1,24 @@
import argparse
import os

template = """<!DOCTYPE html>
<html>
<body>
<h1>Links for vLLM</h1/>
<a href="../{wheel_html_escaped}">{wheel}</a><br/>
</body>
</html>
"""

parser = argparse.ArgumentParser()
parser.add_argument("--wheel", help="The wheel path.", required=True)
args = parser.parse_args()

filename = os.path.basename(args.wheel)

with open("index.html", "w") as f:
    print(f"Generated index.html for {args.wheel}")
    # cloudfront requires escaping the '+' character
    f.write(
        template.format(wheel=filename,
                        wheel_html_escaped=filename.replace("+", "%2B")))
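
Taken on its own, this script can be sanity-checked locally. A minimal sketch of such an invocation follows; the wheel filename is purely illustrative, not a real artifact name.

# Hypothetical local run: generate an index page for a single wheel.
# In CI, --wheel would point at the artifact produced by the build step.
python3 .buildkite/generate_index.py \
    --wheel dist/vllm-0.0.0+cu121-cp38-abi3-linux_x86_64.whl
# Writes index.html with '+' escaped as %2B in the href, since CloudFront
# requires the '+' character to be percent-encoded.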
16 changes: 11 additions & 5 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -1,5 +1,6 @@
steps:
- label: "Wait for container to be ready"
key: wait-for-container-image
agents:
queue: A100
plugins:
@@ -10,18 +11,17 @@ steps:
command:
- sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh

- wait

- label: "A100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: A100
depends_on: wait-for-container-image
plugins:
- kubernetes:
podSpec:
priorityClassName: perf-benchmark
containers:
- image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
- image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
resources:
@@ -49,9 +49,10 @@ steps:
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H200
depends_on: wait-for-container-image
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
@@ -65,13 +66,18 @@ steps:
- VLLM_USAGE_SOURCE
- HF_TOKEN

#- block: "Run H100 Benchmark"
#key: block-h100
#depends_on: ~

- label: "H100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H100
depends_on: wait-for-container-image
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
4 changes: 2 additions & 2 deletions .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
@@ -1,6 +1,6 @@
#!/bin/sh
TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-test-repo:pull" | jq -r .token)
URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-test-repo/manifests/$BUILDKITE_COMMIT"
TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-postmerge-repo:pull" | jq -r .token)
URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-postmerge-repo/manifests/$BUILDKITE_COMMIT"

TIMEOUT_SECONDS=10

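
This change only re-points the registry from vllm-ci-test-repo to vllm-ci-postmerge-repo; the wait logic itself is untouched. As a rough sketch of the polling pattern the rest of this script implements (assumed shape, reconstructed from the variables above rather than quoted from the file):

# Assumed polling loop: retry until the manifest for this commit exists.
while true; do
  # An HTTP 200 on the manifest URL means the image has been pushed.
  STATUS=$(curl -s -o /dev/null -w '%{http_code}' \
      -H "Authorization: Bearer $TOKEN" "$URL")
  [ "$STATUS" = "200" ] && break
  sleep "$TIMEOUT_SECONDS"
done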
48 changes: 46 additions & 2 deletions .buildkite/release-pipeline.yaml
@@ -1,7 +1,7 @@
steps:
- label: "Build wheel - CUDA 12.1"
agents:
queue: cpu_queue
queue: cpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain ."
- "mkdir artifacts"
@@ -18,11 +18,55 @@ steps:
- label: "Build wheel - CUDA 11.8"
# depends_on: block-build-cu118-wheel
agents:
queue: cpu_queue
queue: cpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
- "bash .buildkite/upload-wheels.sh"
env:
DOCKER_BUILDKIT: "1"

- block: "Build release image"
depends_on: ~
key: block-release-image-build

- label: "Build release image"
depends_on: block-release-image-build
agents:
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"

- label: "Build and publish TPU release image"
depends_on: ~
if: build.env("NIGHTLY") == "1"
agents:
queue: tpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --tag vllm/vllm-tpu:nightly --tag vllm/vllm-tpu:$BUILDKITE_COMMIT --progress plain -f Dockerfile.tpu ."
- "docker push vllm/vllm-tpu:nightly"
- "docker push vllm/vllm-tpu:$BUILDKITE_COMMIT"
plugins:
- docker-login#v3.0.0:
username: vllm
password-env: DOCKERHUB_TOKEN
env:
DOCKER_BUILDKIT: "1"

- block: "Build CPU release image"
key: block-cpu-release-image-build
depends_on: ~

- label: "Build and publish CPU release image"
depends_on: block-cpu-release-image-build
agents:
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$RELEASE_VERSION --progress plain -f Dockerfile.cpu ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$RELEASE_VERSION"
env:
DOCKER_BUILDKIT: "1"
37 changes: 20 additions & 17 deletions .buildkite/run-cpu-test.sh
@@ -9,63 +9,60 @@ CORE_RANGE=${CORE_RANGE:-48-95}
NUMA_NODE=${NUMA_NODE:-1}

# Try building the docker image
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build -t cpu-test -f Dockerfile.cpu .
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-avx2 -f Dockerfile.cpu .
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build -t cpu-test-"$BUILDKITE_BUILD_NUMBER" -f Dockerfile.cpu .
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2 -f Dockerfile.cpu .

# Setup cleanup
remove_docker_container() { docker rm -f cpu-test-"$NUMA_NODE" cpu-test-avx2-"$NUMA_NODE" || true; }
remove_docker_container() { set -e; docker rm -f cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image, setting --shm-size=4g for tensor parallel.
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-avx2-"$NUMA_NODE" cpu-test-avx2
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2

function cpu_tests() {
set -e
export NUMA_NODE=$2

# offline inference
docker exec cpu-test-avx2-"$NUMA_NODE" bash -c "
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" bash -c "
set -e
python3 examples/offline_inference.py"
python3 examples/offline_inference/basic.py"

# Run basic model test
docker exec cpu-test-"$NUMA_NODE" bash -c "
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
pip install pytest pytest-asyncio \
decord einops librosa peft Pillow sentence-transformers soundfile \
transformers_stream_generator matplotlib datamodel_code_generator
pip install torchvision --index-url https://download.pytorch.org/whl/cpu
pip install -r vllm/requirements-test.txt
pytest -v -s tests/models/decoder_only/language -m cpu_model
pytest -v -s tests/models/embedding/language -m cpu_model
pytest -v -s tests/models/encoder_decoder/language -m cpu_model
pytest -v -s tests/models/decoder_only/audio_language -m cpu_model
pytest -v -s tests/models/decoder_only/vision_language -m cpu_model"

# Run compressed-tensor test
docker exec cpu-test-"$NUMA_NODE" bash -c "
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_static_setup \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynamic_per_token"

# Run AWQ test
docker exec cpu-test-"$NUMA_NODE" bash -c "
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/quantization/test_ipex_quant.py"

# Run chunked-prefill and prefix-cache test
docker exec cpu-test-"$NUMA_NODE" bash -c "
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
pytest -s -v -k cpu_model \
tests/basic_correctness/test_chunked_prefill.py"

# online inference
docker exec cpu-test-"$NUMA_NODE" bash -c "
# online serving
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
export VLLM_CPU_KVCACHE_SPACE=10
export VLLM_CPU_OMP_THREADS_BIND=$1
@@ -78,6 +75,12 @@ function cpu_tests() {
--num-prompts 20 \
--endpoint /v1/completions \
--tokenizer facebook/opt-125m"

# Run multi-lora tests
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/lora/test_qwen2vl.py"
}

# All of CPU tests are expected to be finished less than 25 mins.
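
The hunk ends before the call site of cpu_tests. In this script the function is presumably exported and run under the 25-minute cap mentioned in the comment above; a sketch of that assumed call site:

# Assumed call site (not shown in the hunk): enforce the 25-minute budget.
export -f cpu_tests
timeout 25m bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"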
28 changes: 28 additions & 0 deletions .buildkite/run-gh200-test.sh
@@ -0,0 +1,28 @@
#!/bin/bash

# This script build the GH200 docker image and run the offline inference inside the container.
# It serves a sanity check for compilation and basic model usage.
set -ex

# Skip the new torch installation during build since we are using the specified version for arm64 in the Dockerfile
python3 use_existing_torch.py

# Try building the docker image
DOCKER_BUILDKIT=1 docker build . \
  --target vllm-openai \
  --platform "linux/arm64" \
  -t gh200-test \
  --build-arg max_jobs=66 \
  --build-arg nvcc_threads=2 \
  --build-arg torch_cuda_arch_list="9.0+PTX" \
  --build-arg vllm_fa_cmake_gpu_arches="90-real"

# Setup cleanup
remove_docker_container() { docker rm -f gh200-test || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image and test offline inference
docker run --name gh200-test --gpus=all --entrypoint="" gh200-test bash -c '
python3 examples/offline_inference/basic.py
'
2 changes: 1 addition & 1 deletion .buildkite/run-hpu-test.sh
@@ -13,4 +13,4 @@ trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference.py
docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference/basic.py
53 changes: 27 additions & 26 deletions .buildkite/run-neuron-test.sh
@@ -3,6 +3,18 @@
# This script build the Neuron docker image and run the API server inside the container.
# It serves a sanity check for compilation and basic model usage.
set -e
set -v

image_name="neuron/vllm-ci"
container_name="neuron_$(tr -dc A-Za-z0-9 < /dev/urandom | head -c 10; echo)"

HF_CACHE="$(realpath ~)/huggingface"
mkdir -p "${HF_CACHE}"
HF_MOUNT="/root/.cache/huggingface"

NEURON_COMPILE_CACHE_URL="$(realpath ~)/neuron_compile_cache"
mkdir -p "${NEURON_COMPILE_CACHE_URL}"
NEURON_COMPILE_CACHE_MOUNT="/root/.cache/neuron_compile_cache"

# Try building the docker image
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com
@@ -13,41 +25,30 @@ if [ -f /tmp/neuron-docker-build-timestamp ]; then
last_build=$(cat /tmp/neuron-docker-build-timestamp)
current_time=$(date +%s)
if [ $((current_time - last_build)) -gt 86400 ]; then
docker image prune -f
docker system prune -f
rm -rf "${HF_MOUNT:?}/*"
rm -rf "${NEURON_COMPILE_CACHE_MOUNT:?}/*"
echo "$current_time" > /tmp/neuron-docker-build-timestamp
fi
else
date "+%s" > /tmp/neuron-docker-build-timestamp
fi

docker build -t neuron -f Dockerfile.neuron .
docker build -t "${image_name}" -f Dockerfile.neuron .

# Setup cleanup
remove_docker_container() { docker rm -f neuron || true; }
remove_docker_container() {
  docker image rm -f "${image_name}" || true;
}
trap remove_docker_container EXIT
remove_docker_container

# Run the image
docker run --device=/dev/neuron0 --device=/dev/neuron1 --network host --name neuron neuron python3 -m vllm.entrypoints.api_server \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --max-num-seqs 8 --max-model-len 128 --block-size 128 --device neuron --tensor-parallel-size 2 &

# Wait for the server to start
wait_for_server_to_start() {
timeout=300
counter=0

while [ "$(curl -s -o /dev/null -w '%{http_code}' localhost:8000/health)" != "200" ]; do
sleep 1
counter=$((counter + 1))
if [ $counter -ge $timeout ]; then
echo "Timeout after $timeout seconds"
break
fi
done
}
wait_for_server_to_start

# Test a simple prompt
curl -X POST -H "Content-Type: application/json" \
localhost:8000/generate \
-d '{"prompt": "San Francisco is a"}'
docker run --rm -it --device=/dev/neuron0 --device=/dev/neuron1 --network host \
  -v "${HF_CACHE}:${HF_MOUNT}" \
  -e "HF_HOME=${HF_MOUNT}" \
  -v "${NEURON_COMPILE_CACHE_URL}:${NEURON_COMPILE_CACHE_MOUNT}" \
  -e "NEURON_COMPILE_CACHE_URL=${NEURON_COMPILE_CACHE_MOUNT}" \
  --name "${container_name}" \
  ${image_name} \
  /bin/bash -c "python3 /workspace/vllm/examples/offline_inference/neuron.py"
2 changes: 1 addition & 1 deletion .buildkite/run-openvino-test.sh
@@ -13,4 +13,4 @@ trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/examples/offline_inference.py
docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/examples/offline_inference/basic.py
11 changes: 10 additions & 1 deletion .buildkite/run-tpu-test.sh
@@ -14,4 +14,13 @@ remove_docker_container
# For HF_TOKEN.
source /etc/environment
# Run a simple end-to-end example.
docker run --privileged --net host --shm-size=16G -it -e "HF_TOKEN=$HF_TOKEN" --name tpu-test vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git && python3 -m pip install pytest && python3 -m pip install lm_eval[api]==0.4.4 && pytest -v -s /workspace/vllm/tests/entrypoints/openai/test_accuracy.py && pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py && python3 /workspace/vllm/tests/tpu/test_compilation.py && python3 /workspace/vllm/examples/offline_inference_tpu.py"
docker run --privileged --net host --shm-size=16G -it \
  -e "HF_TOKEN=$HF_TOKEN" --name tpu-test \
  vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git \
  && python3 -m pip install pytest \
  && python3 -m pip install lm_eval[api]==0.4.4 \
  && pytest -v -s /workspace/vllm/tests/entrypoints/openai/test_accuracy.py \
  && pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py \
  && python3 /workspace/vllm/tests/tpu/test_compilation.py \
  && python3 /workspace/vllm/tests/tpu/test_quantization_accuracy.py \
  && python3 /workspace/vllm/examples/offline_inference/tpu.py"
7 changes: 5 additions & 2 deletions .buildkite/run-xpu-test.sh
@@ -12,5 +12,8 @@ remove_docker_container() { docker rm -f xpu-test || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --network host --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test python3 examples/offline_inference.py
# Run the image and test offline inference/tensor parallel
docker run --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test sh -c '
  python3 examples/offline_inference/basic.py
  python3 examples/offline_inference/cli.py -tp 2
'