Upstream merge 25 1 6 #350

Merged: 202 commits, Jan 8, 2025
Changes shown below are from 1 commit.

Commits (202)
efbce85
[misc] Layerwise profile updates (#10242)
varun-sundar-rabindranath Dec 16, 2024
551603f
[core] overhaul memory profiling and fix backward compatibility (#10511)
youkaichao Dec 16, 2024
35ffa68
[Docs] hint to enable use of GPU performance counters in profiling to…
bk-TurbaAI Dec 16, 2024
c301616
[ci][tests] add gh200 tests (#11244)
youkaichao Dec 16, 2024
88a412e
[torch.compile] fast inductor (#11108)
youkaichao Dec 17, 2024
35bae11
fix gh200 tests on main (#11246)
youkaichao Dec 17, 2024
0064f69
[CI] Add test case with JSON schema using references + use xgrammar b…
mgoin Dec 17, 2024
66d4b16
[Frontend] Add OpenAI API support for input_audio (#11027)
kylehh Dec 17, 2024
59c9b6e
[V1][VLM] Proper memory profiling for image language models (#11210)
ywang96 Dec 17, 2024
e88db68
[Platform] platform agnostic for EngineArgs initialization (#11225)
wangxiyuan Dec 17, 2024
2bfdbf2
[V1][Core] Use weakref.finalize instead of atexit (#11242)
tlrmchlsmth Dec 17, 2024
02222a0
[Misc] Kernel Benchmark for `RMSNorm` (#11241)
ywang96 Dec 17, 2024
f9ecbb1
[Misc] Allow passing logits_soft_cap for xformers backend (#11252)
Isotr0py Dec 17, 2024
2d1b9ba
[Bugfix] Fix request cancellation without polling (#11190)
joerunde Dec 17, 2024
c77eb8a
[Bugfix] Set temperature=0.7 in test_guided_choice_chat (#11264)
mgoin Dec 18, 2024
bf8717e
[V1] Prefix caching for vision language models (#11187)
comaniac Dec 18, 2024
866fa45
[Bugfix] Restore support for larger block sizes (#11259)
kzawora-intel Dec 18, 2024
8b79f9e
[Bugfix] Fix guided decoding with tokenizer mode mistral (#11046)
wallashss Dec 18, 2024
f04e407
[MISC][XPU]update ipex link for CI fix (#11278)
yma11 Dec 18, 2024
60508ff
[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995)
dsikka Dec 18, 2024
996aa70
[Bugfix] Fix broken phi3-v mm_processor_kwargs tests (#11263)
Isotr0py Dec 18, 2024
362cff1
[CI][Misc] Remove Github Action Release Workflow (#11274)
simon-mo Dec 18, 2024
f954fe0
[FIX] update openai version (#11287)
jikunshang Dec 18, 2024
ca5f54a
[Bugfix] fix minicpmv test (#11304)
joerunde Dec 18, 2024
fdea8ec
[V1] VLM - enable processor cache by default (#11305)
alexm-redhat Dec 18, 2024
5a9da2e
[Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2…
tlrmchlsmth Dec 19, 2024
17ca964
[Model] IBM Granite 3.1 (#11307)
tjohnson31415 Dec 19, 2024
a30482f
[CI] Expand test_guided_generate to test all backends (#11313)
mgoin Dec 19, 2024
c6b0a7d
[V1] Simplify prefix caching logic by removing `num_evictable_compute…
heheda12345 Dec 19, 2024
6142ef0
[VLM] Merged multimodal processor for Qwen2-Audio (#11303)
DarkLight1337 Dec 19, 2024
8936316
[Kernel] Refactor Cutlass c3x (#10049)
varun-sundar-rabindranath Dec 19, 2024
f26c4ae
[Misc] Optimize ray worker initialization time (#11275)
ruisearch42 Dec 19, 2024
9835673
[misc] benchmark_throughput : Add LoRA (#11267)
varun-sundar-rabindranath Dec 19, 2024
5aef498
[Feature] Add load generation config from model (#11164)
liuyanyi Dec 19, 2024
a0f7d53
[Bugfix] Cleanup Pixtral HF code (#11333)
DarkLight1337 Dec 19, 2024
6c7f881
[Model] Add JambaForSequenceClassification model (#10860)
yecohn Dec 19, 2024
7379b3d
[V1] Fix multimodal profiling for `Molmo` (#11325)
ywang96 Dec 19, 2024
e24113a
[Model] Refactor Qwen2-VL to use merged multimodal processor (#11258)
Isotr0py Dec 19, 2024
cdf22af
[Misc] Clean up and consolidate LRUCache (#11339)
DarkLight1337 Dec 19, 2024
276738c
[Bugfix] Fix broken CPU compressed-tensors test (#11338)
Isotr0py Dec 19, 2024
e461c26
[Misc] Remove unused vllm/block.py (#11336)
Ghjk94522 Dec 19, 2024
a985f7a
[CI] Adding CPU docker pipeline (#11261)
zhouyuan Dec 19, 2024
48edab8
[Bugfix][Hardware][POWERPC] Fix auto dtype failure in case of POWER10…
Akashcodes732 Dec 20, 2024
7801f56
[ci][gh200] dockerfile clean up (#11351)
youkaichao Dec 20, 2024
b880ffb
[Misc] Add tqdm progress bar during graph capture (#11349)
mgoin Dec 20, 2024
86c2d8f
[Bugfix] Fix spec decoding when seed is none in a batch (#10863)
wallashss Dec 20, 2024
c954f21
[misc] add early error message for custom ops (#11355)
youkaichao Dec 20, 2024
1ecc645
[doc] backward compatibility for 0.6.4 (#11359)
youkaichao Dec 20, 2024
04139ad
[V1] Fix profiling for models with merged input processor (#11370)
ywang96 Dec 20, 2024
7c7aa37
[CI/Build] fix pre-compiled wheel install for exact tag (#11373)
dtrifiro Dec 20, 2024
995f562
[Core] Loading model from S3 using RunAI Model Streamer as optional l…
omer-dayan Dec 20, 2024
d573aea
[Bugfix] Don't log OpenAI field aliases as ignored (#11378)
mgoin Dec 20, 2024
5d2248d
[doc] explain nccl requirements for rlhf (#11381)
youkaichao Dec 20, 2024
47a0b61
Add ray[default] to wget to run distributed inference out of box (#11…
Jeffwan Dec 20, 2024
dd2b563
[V1][Bugfix] Skip hashing empty or None mm_data (#11386)
WoosukKwon Dec 21, 2024
51ff216
[Bugfix] update should_ignore_layer (#11354)
horheynm Dec 21, 2024
584f0ae
[V1] Make AsyncLLMEngine v1-v0 opaque (#11383)
rickyyx Dec 21, 2024
c2d1b07
[Bugfix] Fix issues for `Pixtral-Large-Instruct-2411` (#11393)
ywang96 Dec 21, 2024
29c7489
[CI] Fix flaky entrypoint tests (#11403)
ywang96 Dec 22, 2024
4a91397
[cd][release] add pypi index for every commit and nightly build (#11404)
youkaichao Dec 22, 2024
72d9c31
[cd][release] fix race conditions (#11407)
youkaichao Dec 22, 2024
f1d1bf6
[Bugfix] Fix fully sharded LoRAs with Mixtral (#11390)
n1hility Dec 22, 2024
048fc57
[CI] Unboock H100 Benchmark (#11419)
simon-mo Dec 22, 2024
f30581c
[misc][perf] remove old code (#11425)
youkaichao Dec 23, 2024
e51719a
mypy type checking for vllm/worker (#11418)
lucas-tucker Dec 23, 2024
5bfb30a
[Bugfix] Fix CFGGuide and use outlines for grammars that can't conver…
mgoin Dec 23, 2024
2e72668
[Bugfix] torch nightly version in ROCm installation guide (#11423)
terrytangyuan Dec 23, 2024
b866cdb
[Misc] Add assertion and helpful message for marlin24 compressed mode…
dsikka Dec 23, 2024
8cef6e0
[Misc] add w8a8 asym models (#11075)
dsikka Dec 23, 2024
63afbe9
[CI] Expand OpenAI test_chat.py guided decoding tests (#11048)
mgoin Dec 23, 2024
60fb4f3
[Bugfix] Add kv cache scales to gemma2.py (#11269)
mgoin Dec 23, 2024
94d545a
[Doc] Fix typo in the help message of '--guided-decoding-backend' (#1…
yansh97 Dec 23, 2024
32aa205
[Docs] Convert rST to MyST (Markdown) (#11145)
rafvasq Dec 23, 2024
a491d6f
[V1] TP Ray executor (#11107)
ruisearch42 Dec 23, 2024
4f074fb
[Misc]Suppress irrelevant exception stack trace information when CUDA…
shiquan1988 Dec 24, 2024
9edca6b
[Frontend] Online Pooling API (#11457)
DarkLight1337 Dec 24, 2024
b1b1038
[Bugfix] Fix Qwen2-VL LoRA weight loading (#11430)
jeejeelee Dec 24, 2024
7a5286c
[Bugfix][Hardware][CPU] Fix CPU `input_positions` creation for text-o…
Isotr0py Dec 24, 2024
461cde2
[OpenVINO] Fixed installation conflicts (#11458)
ilya-lavrenov Dec 24, 2024
5c79632
[attn][tiny fix] fix attn backend in MultiHeadAttention (#11463)
MengqingCao Dec 24, 2024
196c34b
[Misc] Move weights mapper (#11443)
jeejeelee Dec 24, 2024
409475a
[Bugfix] Fix issues in CPU build Dockerfile. Fixes #9182 (#11435)
terrytangyuan Dec 24, 2024
3f3e92e
[Model] Automatic conversion of classification and reward models (#11…
DarkLight1337 Dec 24, 2024
9832e55
[V1] Unify VLLM_ENABLE_V1_MULTIPROCESSING handling in RayExecutor (#1…
ruisearch42 Dec 25, 2024
fc60166
[Misc] Update disaggregation benchmark scripts and test logs (#11456)
Jeffwan Dec 25, 2024
b689ada
[Frontend] Enable decord to load video from base64 (#11492)
DarkLight1337 Dec 25, 2024
6ad909f
[Doc] Improve GitHub links (#11491)
DarkLight1337 Dec 25, 2024
51a624b
[Misc] Move some multimodal utils to modality-specific modules (#11494)
DarkLight1337 Dec 26, 2024
dbeac95
Mypy checking for vllm/compilation (#11496)
lucas-tucker Dec 26, 2024
aa25985
[Misc][LoRA] Fix LoRA weight mapper (#11495)
jeejeelee Dec 26, 2024
7492a36
[Doc] Add `QVQ` and `QwQ` to the list of supported models (#11509)
ywang96 Dec 26, 2024
dcb1a94
[V1] Adding min tokens/repetition/presence/frequence penalties to V1 …
sroy745 Dec 26, 2024
f57ee56
[Model] Modify MolmoForCausalLM MLP (#11510)
jeejeelee Dec 26, 2024
eec906d
[Misc] Add placeholder module (#11501)
DarkLight1337 Dec 26, 2024
b85a977
[Doc] Add video example to openai client for multimodal (#11521)
Isotr0py Dec 26, 2024
720b10f
[1/N] API Server (Remove Proxy) (#11529)
robertgshaw2-redhat Dec 26, 2024
2072924
[Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quanti…
mgoin Dec 26, 2024
55fb97f
[2/N] API Server: Avoid ulimit footgun (#11530)
robertgshaw2-redhat Dec 26, 2024
f49777b
Deepseek v3 (#11502)
simon-mo Dec 27, 2024
82d24f7
[Docs] Document Deepseek V3 support (#11535)
simon-mo Dec 27, 2024
0c0c201
Update openai_compatible_server.md (#11536)
robertgshaw2-redhat Dec 27, 2024
371d04d
[V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling (#11394)
WoosukKwon Dec 27, 2024
81b979f
[V1] Fix yapf (#11538)
WoosukKwon Dec 27, 2024
46d4359
[CI] Fix broken CI (#11543)
robertgshaw2-redhat Dec 27, 2024
eb881ed
[misc] fix typing (#11540)
youkaichao Dec 27, 2024
1b875a0
[V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly (…
robertgshaw2-redhat Dec 27, 2024
2339d59
[BugFix] Fix quantization for all other methods (#11547)
robertgshaw2-redhat Dec 27, 2024
6c6f7fe
[Platform] Move model arch check to platform (#11503)
MengqingCao Dec 27, 2024
d003f3e
Update deploying_with_k8s.md with AMD ROCm GPU example (#11465)
AlexHe99 Dec 27, 2024
2c9b8ea
[Bugfix] Fix TeleChat2ForCausalLM weights mapper (#11546)
jeejeelee Dec 27, 2024
7af553e
[Misc] Abstract the logic for reading and writing media content (#11527)
DarkLight1337 Dec 27, 2024
5ce4627
[Doc] Add xgrammar in doc (#11549)
Chen-0210 Dec 27, 2024
1014180
[VLM] Support caching in merged multi-modal processor (#11396)
DarkLight1337 Dec 27, 2024
55509c2
[MODEL] LoRA support for Jamba model (#11209)
ErezSC42 Dec 27, 2024
0240402
[Misc]Add BNB quantization for MolmoForCausalLM (#11551)
jeejeelee Dec 27, 2024
dde1fa1
[Misc] Improve BNB loader to handle mixture of sharded and merged wei…
Isotr0py Dec 27, 2024
ac79799
[Bugfix] Fix for ROCM compressed tensor support (#11561)
selalipop Dec 27, 2024
a607312
[Doc] Update mllama example based on official doc (#11567)
heheda12345 Dec 28, 2024
df04dff
[V1] [4/N] API Server: ZMQ/MP Utilities (#11541)
robertgshaw2-redhat Dec 28, 2024
b5cbe8e
[Bugfix] Last token measurement fix (#11376)
rajveerb Dec 28, 2024
d34be24
[Model] Support InternLM2 Reward models (#11571)
Isotr0py Dec 28, 2024
b7dcc00
[Model] Remove hardcoded image tokens ids from Pixtral (#11582)
ywang96 Dec 28, 2024
59d6bb4
[Hardware][AMD]: Replace HIPCC version with more precise ROCm version…
hj-wei Dec 28, 2024
42bb201
[V1][Minor] Set pin_memory=False for token_ids_cpu tensor (#11581)
WoosukKwon Dec 28, 2024
d427e5c
[Doc] Minor documentation fixes (#11580)
DarkLight1337 Dec 28, 2024
328841d
[bugfix] interleaving sliding window for cohere2 model (#11583)
youkaichao Dec 28, 2024
4fb8e32
[V1] [5/N] API Server: unify `Detokenizer` and `EngineCore` input (#…
robertgshaw2-redhat Dec 28, 2024
32b4c63
[Doc] Convert list tables to MyST (#11594)
DarkLight1337 Dec 29, 2024
dba4d9d
[v1][bugfix] fix cudagraph with inplace buffer assignment (#11596)
youkaichao Dec 29, 2024
faef77c
[Misc] KV cache transfer connector registry (#11481)
KuntaiDu Dec 29, 2024
0aa38d1
Remove print statement in DeepseekScalingRotaryEmbedding (#11604)
mgoin Dec 29, 2024
3682e33
[v1] fix compilation cache (#11598)
youkaichao Dec 30, 2024
628ec6c
[Docker] bump up neuron sdk v2.21 (#11593)
liangfu Dec 30, 2024
970d6d0
[Build][Kernel] Update CUTLASS to v3.6.0 (#11607)
tlrmchlsmth Dec 30, 2024
5dbf854
[CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels (#11618)
bigPYJ1151 Dec 30, 2024
b12e87f
[platforms] enable platform plugins (#11602)
youkaichao Dec 30, 2024
8d9b672
[VLM] Abstract out multi-modal data parsing in merged processor (#11620)
DarkLight1337 Dec 30, 2024
5886aa4
[V1] [6/N] API Server: Better Shutdown (#11586)
robertgshaw2-redhat Dec 30, 2024
36e7670
[Bugfix] Validate and concatenate image embeddings in MiniCPMVBaseMod…
whyiug Dec 30, 2024
ccb1aab
[benchmark] Remove dependency for H100 benchmark step (#11572)
khluu Dec 30, 2024
a2a40bc
[Model][LoRA]LoRA support added for MolmoForCausalLM (#11439)
ayylemao Dec 31, 2024
74fa1d1
[Bugfix] Fix OpenAI parallel sampling when using xgrammar (#11637)
mgoin Dec 31, 2024
82c49d3
[Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) (#6909)
JohnGiorgi Dec 31, 2024
2c57188
[Bugfix] Move the _touch(computed_blocks) call in the allocate_slots …
sakunkun Dec 31, 2024
8c3230d
[V1] Simpify vision block hash for prefix caching by removing offset …
heheda12345 Dec 31, 2024
e7c7c5e
[V1][VLM] V1 support for selected single-image models. (#11632)
ywang96 Dec 31, 2024
0c6f998
[Benchmark] Add benchmark script for CPU offloading (#11533)
ApostaC Jan 1, 2025
4db72e5
[Bugfix][Refactor] Unify model management in frontend (#11660)
joerunde Jan 1, 2025
365801f
[VLM] Add max-count checking in data parser for single image models (…
DarkLight1337 Jan 1, 2025
11d8a09
[Misc] Optimize Qwen2-VL LoRA test (#11663)
jeejeelee Jan 1, 2025
f962f42
[Misc] Replace space with - in the file names (#11667)
houseroad Jan 1, 2025
6d70198
[Doc] Fix typo (#11666)
serihiro Jan 1, 2025
7300144
[V1] Implement Cascade Attention (#11635)
WoosukKwon Jan 1, 2025
a115ac4
[VLM] Move supported limits and max tokens to merged multi-modal proc…
DarkLight1337 Jan 1, 2025
23c1b10
[VLM][Bugfix] Multi-modal processor compatible with V1 multi-input (#…
DarkLight1337 Jan 2, 2025
b6087a6
[mypy] Pass type checking in vllm/inputs (#11680)
CloseChoice Jan 2, 2025
8c38ee7
[VLM] Merged multi-modal processor for LLaVA-NeXT (#11682)
DarkLight1337 Jan 2, 2025
84c35c3
According to vllm.EngineArgs, the name should be distributed_executor…
chunyang-wen Jan 2, 2025
2f38518
[Bugfix] Free cross attention block table for preempted-for-recompute…
kathyyu-google Jan 2, 2025
b55ed6e
[V1][Minor] Optimize token_ids_cpu copy (#11692)
WoosukKwon Jan 2, 2025
187e329
[Bugfix] Change kv scaling factor by param json on nvidia gpu (#11688)
bjmsong Jan 2, 2025
5dba257
Resolve race conditions in Marlin kernel (#11493)
wchen61 Jan 2, 2025
68d3780
[Misc] Minimum requirements for SageMaker compatibility (#11576)
nathan-az Jan 2, 2025
2f1e8e8
Update default max_num_batch_tokens for chunked prefill (#11694)
SachinVarghese Jan 3, 2025
07064cb
[Bugfix] Check chain_speculative_sampling before calling it (#11673)
houseroad Jan 3, 2025
fd3a62a
[perf-benchmark] Fix dependency for steps in benchmark pipeline (#11710)
khluu Jan 3, 2025
e1a5c2f
[Model] Whisper model implementation (#11280)
aurickq Jan 3, 2025
80c751e
[V1] Simplify Shutdown (#11659)
robertgshaw2-redhat Jan 3, 2025
61fed92
[Bugfix] Fix ColumnParallelLinearWithLoRA slice (#11708)
zinccat Jan 3, 2025
1543914
[V1] Improve TP>1 Error Handling + Stack Trace (#11721)
robertgshaw2-redhat Jan 3, 2025
a655eb3
[Misc]Add BNB quantization for Qwen2VL (#11719)
jeejeelee Jan 3, 2025
bf0d97d
Update requirements-tpu.txt to support python 3.9 and 3.11 (#11695)
mgoin Jan 3, 2025
ad0d567
[V1] Chore: cruft removal (#11724)
robertgshaw2-redhat Jan 3, 2025
e5d7ed0
[V1] log GPU blocks num for MultiprocExecutor (#11656)
WangErXiao Jan 4, 2025
9c93636
Update tool_calling.md (#11701)
Bryce1010 Jan 4, 2025
d1d4939
Update bnb.md with example for OpenAI (#11718)
bet0x Jan 4, 2025
fbf2564
[V1] Add `RayExecutor` support for `AsyncLLM` (api server) (#11712)
jikunshang Jan 4, 2025
d91457d
[V1] Add kv cache utils tests. (#11513)
xcnick Jan 4, 2025
300acb8
[Core][Bugfix] Use correct device to initialize GPU data during CUDA-…
yanburman Jan 4, 2025
eed11eb
[VLM] Merged multi-modal processors for LLaVA-NeXT-Video and LLaVA-On…
DarkLight1337 Jan 4, 2025
ba214df
[Bugfix] Fix precision error in LLaVA-NeXT (#11735)
DarkLight1337 Jan 4, 2025
65c0892
[Model] Remove unnecessary weight initialization logic (#11736)
DarkLight1337 Jan 4, 2025
4783143
[Bugfix][V1] Fix test_kv_cache_utils.py (#11738)
jeejeelee Jan 4, 2025
4068f4b
[MISC] Replace c10::optional with std::optional (#11730)
houseroad Jan 5, 2025
635b897
[distributed] remove pynccl's redundant stream (#11744)
cennn Jan 5, 2025
eba1717
fix: [doc] fix typo (#11751)
RuixiangMa Jan 5, 2025
33fc1e2
[Frontend] Improve `StreamingResponse` Exception Handling (#11752)
robertgshaw2-redhat Jan 5, 2025
9e764e7
[distributed] remove pynccl's redundant change_state (#11749)
cennn Jan 6, 2025
402d378
[Doc] [1/N] Reorganize Getting Started section (#11645)
DarkLight1337 Jan 6, 2025
408e560
[Bugfix] Remove block size constraint (#11723)
comaniac Jan 6, 2025
06bfb51
[V1] Add BlockTable class (#11693)
WoosukKwon Jan 6, 2025
f8fcca1
[Misc] Fix typo for valid_tool_parses (#11753)
ruisearch42 Jan 6, 2025
022c5c6
[V1] Refactor get_executor_cls (#11754)
ruisearch42 Jan 6, 2025
9c74971
[mypy] Forward pass function type hints in lora (#11740)
lucas-tucker Jan 6, 2025
2a622d7
k8s-config: Update the secret to use stringData (#11679)
surajssd Jan 6, 2025
996357e
[VLM] Separate out profiling-related logic (#11746)
DarkLight1337 Jan 6, 2025
ee77fdb
[Doc][2/N] Reorganize Models and Usage sections (#11755)
DarkLight1337 Jan 6, 2025
9279b9f
[Bugfix] Fix max image size for LLaVA-Onevision (#11769)
ywang96 Jan 6, 2025
4ca5d40
[doc] explain how to add interleaving sliding window support (#11771)
youkaichao Jan 6, 2025
4773c29
Merge remote-tracking branch 'upstream/main'
gshtras Jan 6, 2025
267c1a1
format
gshtras Jan 6, 2025
97067c0
Merge branch 'main' into upstream_merge_25_1_6
gshtras Jan 8, 2025
[torch.compile] fast inductor (vllm-project#11108)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
youkaichao and tlrmchlsmth authored Dec 17, 2024
commit 88a412ed3d964de3443c42a6a35108115ee0ad25
vllm/compilation/backends.py: 213 changes (210 additions, 3 deletions)
@@ -1,6 +1,10 @@
import ast
import copy
import dataclasses
import os
import pprint
import time
from collections import defaultdict
from contextlib import ExitStack
from typing import Any, Callable, Dict, List, Optional, Sequence, Set, Tuple
from unittest.mock import patch
@@ -21,6 +25,122 @@
logger = init_logger(__name__)


class InductorHashCache:
    """
    Disk format: a Python list of tuples, each tuple is
    (runtime_shape, graph_index, hash_str)
    We use list of tuple for readability.

    In-memory format: a defaultdict of dict, where the key is
    runtime_shape, and the value is a dict of graph_index to hash_str.

    The data is essentially `Dict[Optional[int], Dict[int, str]]`,
    we don't use json here because json doesn't support int as key.

    TODO: better off-the-shelf solution to serialize the data?
    """

    def __init__(self, cache_dir: str, disabled: bool = False):
        self.cache: defaultdict = defaultdict(dict)
        self.disabled = disabled
        self.cache_dir = cache_dir
        self.cache_file_path = os.path.join(cache_dir,
                                            "inductor_hash_cache.py")
        if disabled:
            return
        # set flags so that Inductor and Triton store their cache
        # in the cache_dir, then users only need to copy the cache_dir
        # to another machine to reuse the cache.
        inductor_cache = os.path.join(cache_dir, "inductor_cache")
        os.makedirs(inductor_cache, exist_ok=True)
        os.environ["TORCHINDUCTOR_CACHE_DIR"] = inductor_cache
        triton_cache = os.path.join(cache_dir, "triton_cache")
        os.makedirs(triton_cache, exist_ok=True)
        os.environ["TRITON_CACHE_DIR"] = triton_cache
        if os.path.exists(self.cache_file_path):
            with open(self.cache_file_path) as f:
                self.deserialize(f.read())

    def deserialize(self, data: str):
        # we use ast.literal_eval to parse the data
        # because it is a safe way to parse Python literals.
        # do not use eval(), it is unsafe.
        list_data = ast.literal_eval(data)
        for runtime_shape, graph_index, hash_str in list_data:
            self.cache[runtime_shape][graph_index] = hash_str

    def serialize(self) -> str:
        data = []
        for runtime_shape, graph_index_to_hash_str in self.cache.items():
            for graph_index, hash_str in graph_index_to_hash_str.items():
                data.append((runtime_shape, graph_index, hash_str))
        printer = pprint.PrettyPrinter(indent=4)
        return printer.pformat(data)

    def save_to_file(self):
        if self.disabled:
            return
        with open(self.cache_file_path, "w") as f:
            f.write(self.serialize())

    def __contains__(self, key: Tuple[Optional[int], int]) -> bool:
        if self.disabled:
            return False
        runtime_shape, graph_index = key
        return runtime_shape in self.cache and graph_index in self.cache[
            runtime_shape]

    def __getitem__(self, key: Tuple[Optional[int], int]) -> str:
        if self.disabled:
            raise KeyError("cannot read from disabled cache")
        runtime_shape, graph_index = key
        return self.cache[runtime_shape][graph_index]

    def __setitem__(self, key: Tuple[Optional[int], int], value: str):
        # setitem for disabled cache is fine, because we
        # don't actually write to the disk
        runtime_shape, graph_index = key
        self.cache[runtime_shape][graph_index] = value
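
# --- Illustrative sketch, not part of this commit ---------------------------
# The docstring above describes the on-disk format (a Python list literal of
# (runtime_shape, graph_index, hash_str) tuples) and the in-memory defaultdict
# keyed by runtime_shape. The helper below is a hypothetical round-trip example
# added for clarity only; the hash strings are made-up placeholders.
def _example_inductor_hash_cache_round_trip():
    import tempfile

    cache_dir = tempfile.mkdtemp()
    cache = InductorHashCache(cache_dir)
    # key is (runtime_shape, graph_index); None denotes the general shape
    cache[(None, 0)] = "fx_hash_for_general_shape"
    cache[(8, 0)] = "fx_hash_for_batchsize_8"
    cache.save_to_file()

    # a fresh instance pointed at the same directory reloads the entries
    # from inductor_hash_cache.py via ast.literal_eval
    reloaded = InductorHashCache(cache_dir)
    assert (8, 0) in reloaded
    assert reloaded[(None, 0)] == "fx_hash_for_general_shape"
# -----------------------------------------------------------------------------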


class AlwaysHitShapeEnv:
    """
    Why do we need this class:

    For normal `torch.compile` usage, every compilation will have
    one Dynamo bytecode compilation and one Inductor compilation.
    The Inductor compilation happens under the context of the
    Dynamo bytecode compilation, and that context is used to
    determine the dynamic shape information, etc.

    For our use case, we only run Dynamo bytecode compilation once,
    and run Inductor compilation multiple times with different shapes
    plus a general shape. The compilation for specific shapes happens
    outside of the context of the Dynamo bytecode compilation. At that
    time, we don't have shape environment to provide to Inductor, and
    it will fail the Inductor code cache lookup.

    By providing a dummy shape environment that always hits, we can
    make the Inductor code cache lookup always hit, and we can
    compile the graph for different shapes as needed.

    The following dummy methods are obtained by trial-and-error
    until it works.
    """

    def __init__(self) -> None:
        self.guards: List[Any] = []

    def evaluate_guards_expression(self, *args, **kwargs):
        return True

    def get_pruned_guards(self, *args, **kwargs):
        return []

    def produce_guards_expression(self, *args, **kwargs):
        return ""


def wrap_inductor(graph,
                  example_inputs,
                  additional_inductor_config,
@@ -55,9 +175,93 @@ def wrap_inductor(graph,
    # inductor can inplace modify the graph, so we need to copy it
    # see https://github.com/pytorch/pytorch/issues/138980
    graph = copy.deepcopy(graph)
    compiled_graph = compile_fx(graph,
                                example_inputs,
                                config_patches=current_config)

    cache_data = compilation_config.inductor_hash_cache
    if (runtime_shape, graph_index) in cache_data:
        # we compiled this graph before
        # so we can directly lookup the compiled graph via hash
        hash_str = cache_data[(runtime_shape, graph_index)]
        if graph_index == 0:
            # adds some info logging for the first graph
            logger.info(
                "Directly lookup the graph for shape %s from the cache",
                str(runtime_shape))  # noqa
        logger.debug(
            "directly lookup the %s-th graph for shape %s via hash %s",
            graph_index, str(runtime_shape), hash_str)
        from torch._inductor.codecache import FxGraphCache
        with patch("torch._inductor.codecache.FxGraphCache._get_shape_env",
                   lambda *args, **kwargs: AlwaysHitShapeEnv()):
            inductor_compiled_graph = FxGraphCache._lookup_graph(
                hash_str, example_inputs, True, False)
            assert inductor_compiled_graph is not None, (
                "Inductor cache lookup failed. Please remove"
                f"the cache file {compilation_config.inductor_hash_cache.cache_file_path} and try again."  # noqa
            )

        # Inductor calling convention (function signature):
        # f(list) -> tuple
        # Dynamo calling convention (function signature):
        # f(*args) -> Any

        # need to know if the graph returns a tuple
        from torch._inductor.compile_fx import graph_returns_tuple
        returns_tuple = graph_returns_tuple(graph)

        # this is the graph we return to Dynamo to run
        def compiled_graph(*args):
            # convert args to list
            list_args = list(args)
            graph_output = inductor_compiled_graph(list_args)
            # unpack the tuple if needed
            if returns_tuple:
                return graph_output
            else:
                return graph_output[0]
    else:
        # it's the first time we compile this graph
        # the assumption is that we don't have nested Inductor compilation.
        # compiled_fx_graph_hash will only be called once, and we can hook
        # it to get the hash of the compiled graph directly.
        from torch._inductor.codecache import compiled_fx_graph_hash

        def hijack_compiled_fx_graph_hash(*args, **kwargs):
            out = compiled_fx_graph_hash(*args, **kwargs)
            # store the hash in the cache
            nonlocal cache_data
            cache_data[(runtime_shape, graph_index)] = out[0]
            if graph_index == 0:
                # adds some info logging for the first graph
                logger.info("Cache the graph of shape %s for later use",
                            str(runtime_shape))
            logger.debug("store the %s-th graph for shape %s via hash %s",
                         graph_index, str(runtime_shape), out[0])
            return out

        def _check_can_cache(*args, **kwargs):
            # no error means it can be cached.
            # Inductor refuses to cache the graph outside of Dynamo
            # tracing context, and also disables caching for graphs
            # with high-order ops.
            # For vLLM, in either case, we want to cache the graph.
            # see https://github.com/pytorch/pytorch/blob/9f5ebf3fc609105a74eab4ccc24932d6353ff566/torch/_inductor/codecache.py#L1221  # noqa
            return

        def _get_shape_env():
            return AlwaysHitShapeEnv()

        with patch(  # for hijacking the hash of the compiled graph
                "torch._inductor.codecache.compiled_fx_graph_hash",
                hijack_compiled_fx_graph_hash), \
             patch(  # for providing a dummy shape environment
                 "torch._inductor.codecache.FxGraphCache._get_shape_env",
                 _get_shape_env), \
             patch(  # for forcing the graph to be cached
                 "torch._inductor.codecache.FxGraphCache._check_can_cache",
                 _check_can_cache):
            compiled_graph = compile_fx(graph,
                                        example_inputs,
                                        config_patches=current_config)

    # after compiling the last graph, record the end time
    if graph_index == num_graphs - 1:
@@ -457,6 +661,9 @@ def __call__(self, *args) -> Any:

        # finished compilations for all required shapes
        if self.is_last_graph and not self.to_be_compiled_sizes:

            # save the hash of the inductor graph for the next run
            self.compilation_config.inductor_hash_cache.save_to_file()
            end_monitoring_torch_compile(self.vllm_config)

        if not entry.use_cudagraph: