[MFM-20250115] Merge from ROCm/main to llama_fp8 #360

Merged: 537 commits into ROCm:llama_fp8_12062024 on Jan 15, 2025

@tjtanaa commented Jan 15, 2025

This PR merges ROCm/main into llama_fp8.

Notes:
The test cases in tests/kernels/test_moe.py have been updated to match the main branch. Unlike the versions on the llama-fp8 branch, however, they do not take the MOE_SHUFFLE environment variable into account. If shuffling is required, the test cases will need to be reverted to the llama-fp8 branch versions.
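
For reviewers weighing the note above, here is a minimal sketch of how a test helper could branch on the MOE_SHUFFLE setting. It is illustrative only: the helper names `moe_shuffle_enabled` and `select_moe_test_variant`, the accepted truthy values, and the default of "off" are all assumptions, not taken from either branch, and the real variable name or semantics in llama-fp8 may differ.

```python
import os
from typing import Mapping, Optional

# Assumption: the flag is a boolean-like string; the branch may parse it differently.
_TRUTHY = {"1", "true", "yes", "on"}


def moe_shuffle_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the (assumed) MOE_SHUFFLE flag is switched on.

    `env` defaults to the process environment; passing a dict makes the
    helper easy to exercise without mutating os.environ.
    """
    env = os.environ if env is None else env
    value = env.get("MOE_SHUFFLE", "0").strip().lower()
    return value in _TRUTHY


def select_moe_test_variant(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick which (hypothetical) test variant applies for the current setting."""
    return "shuffled" if moe_shuffle_enabled(env) else "unshuffled"
```

A test module could call `select_moe_test_variant()` once at import time and skip or parametrize cases accordingly, which would let one file cover both configurations instead of reverting to the older version.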

jeejeelee and others added 30 commits December 24, 2024 13:05
…sampler (vllm-project#10681)

Signed-off-by: Sourashis Roy <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
…zation (vllm-project#11523)

Signed-off-by: mgoin <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Co-authored-by: simon-mo <[email protected]>
Co-authored-by: HandH1998 <[email protected]>
Signed-off-by: mgoin <[email protected]>
Co-authored-by: mgoin <[email protected]>
Co-authored-by: robertgshaw2-neuralmagic <[email protected]>
noemotiovon and others added 21 commits January 13, 2025 15:47
* Committing the *multilingual* P3L test.

* Created a *multi-lingual* P3L test.

* Making ruff happy.

* .

* Added a reference to the language-scripture Confluence table.

* Typo fixing.

* Harmonizing naming.

* Fixing comments in the header.

---------

Co-authored-by: Alexei V. Ivanov <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
* [Bugfix][V1] Fix molmo text-only inputs (vllm-project#11676)

Signed-off-by: Jee Jee Li <[email protected]>

* [Kernel] Move attn_type to Attention.__init__() (vllm-project#11690)

Signed-off-by: Chen Zhang <[email protected]>

* [V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision (vllm-project#11685)

Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>

* [Bugfix] Fix LLaVA-NeXT feature size precision error (for real) (vllm-project#11772)

Signed-off-by: DarkLight1337 <[email protected]>

* [Model] Future-proof Qwen2-Audio multi-modal processor (vllm-project#11776)

Signed-off-by: DarkLight1337 <[email protected]>

* [XPU] Make pp group initialized for pipeline-parallelism (vllm-project#11648)

Signed-off-by: yisheng <[email protected]>

* [Doc][3/N] Reorganize Serving section (vllm-project#11766)

Signed-off-by: DarkLight1337 <[email protected]>

* [Kernel][LoRA]Punica prefill  kernels fusion (vllm-project#11234)

Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Abatom <[email protected]>
Co-authored-by: Zhonghua Deng <[email protected]>

* [Bugfix] Update attention interface in `Whisper` (vllm-project#11784)

Signed-off-by: Roger Wang <[email protected]>

* [CI] Fix neuron CI and run offline tests (vllm-project#11779)

Signed-off-by: Liangfu Chen <[email protected]>

* fix init error for MessageQueue when n_local_reader is zero (vllm-project#11768)

* [Doc] Create a vulnerability management team (vllm-project#9925)

Signed-off-by: Russell Bryant <[email protected]>

* [CI][CPU] adding build number to docker image name (vllm-project#11788)

Signed-off-by: Yuan Zhou <[email protected]>

* [V1][Doc] Update V1 support for `LLaVa-NeXT-Video` (vllm-project#11798)

Signed-off-by: Roger Wang <[email protected]>

* [Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calculation (vllm-project#11800)

Signed-off-by: DarkLight1337 <[email protected]>

* [doc] add doc to explain how to use uv (vllm-project#11773)

Signed-off-by: youkaichao <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [V1] Support audio language models on V1 (vllm-project#11733)

Signed-off-by: Roger Wang <[email protected]>

* [doc] update how pip can install nightly wheels (vllm-project#11806)

Signed-off-by: youkaichao <[email protected]>

* [Doc] Add note to `gte-Qwen2` models (vllm-project#11808)

Signed-off-by: DarkLight1337 <[email protected]>

* [optimization] remove python function call for custom op (vllm-project#11750)

Signed-off-by: youkaichao <[email protected]>

* [Bugfix] update the prefix for qwen2 (vllm-project#11795)

Co-authored-by: jiadi.jjd <[email protected]>

* [Doc]Add documentation for using EAGLE in vLLM (vllm-project#11417)

Signed-off-by: Sourashis Roy <[email protected]>

* [Bugfix] Significant performance drop on CPUs with --num-scheduler-steps > 1 (vllm-project#11794)

* [Doc] Group examples into categories (vllm-project#11782)

Signed-off-by: Harry Mellor <[email protected]>

* [Bugfix] Fix image input for Pixtral-HF (vllm-project#11741)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc] sort torch profiler table by kernel timing (vllm-project#11813)

* Remove the duplicate imports of MultiModalKwargs and PlaceholderRange… (vllm-project#11824)

* Fixed docker build for ppc64le (vllm-project#11518)

Signed-off-by: Nishidha Panpaliya <[email protected]>

* [OpenVINO] Fixed Docker.openvino build (vllm-project#11732)

Signed-off-by: Ilya Lavrenov <[email protected]>

* [Bugfix] Add checks for LoRA and CPU offload (vllm-project#11810)

Signed-off-by: Jee Jee Li <[email protected]>

* [Docs] reorganize sponsorship page (vllm-project#11639)

Signed-off-by: simon-mo <[email protected]>

* [Bug] Fix pickling of `ModelConfig` when RunAI Model Streamer is used (vllm-project#11825)

Signed-off-by: DarkLight1337 <[email protected]>

* [misc] improve memory profiling (vllm-project#11809)

Signed-off-by: youkaichao <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [doc] update wheels url (vllm-project#11830)

Signed-off-by: youkaichao <[email protected]>

* [Docs] Update sponsor name: 'Novita' to 'Novita AI' (vllm-project#11833)

* [Hardware][Apple] Native support for macOS Apple Silicon (vllm-project#11696)

Signed-off-by: Wallas Santos <[email protected]>
Co-authored-by: Michael Goin <[email protected]>

* [torch.compile] consider relevant code in compilation cache (vllm-project#11614)

Signed-off-by: youkaichao <[email protected]>

* [VLM] Reorganize profiling/processing-related code (vllm-project#11812)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Move examples into categories (vllm-project#11840)

Signed-off-by: Harry Mellor <[email protected]>

* [Doc][4/N] Reorganize API Reference (vllm-project#11843)

Signed-off-by: DarkLight1337 <[email protected]>

* [CI/Build][Bugfix] Fix CPU CI image clean up (vllm-project#11836)

Signed-off-by: jiang1.li <[email protected]>

* [Bugfix][XPU] fix silu_and_mul (vllm-project#11823)

Signed-off-by: yan ma <[email protected]>

* [Misc] Move some model utils into vision file (vllm-project#11848)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Expand Multimodal API Reference (vllm-project#11852)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc]add some explanations for BlockHashType (vllm-project#11847)

* [TPU][Quantization] TPU `W8A8` (vllm-project#11785)

Co-authored-by: Woosuk Kwon <[email protected]>

* [Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup for int8 models (vllm-project#11698)

Signed-off-by: Randall Smith <[email protected]>

* [Docs] Add Google Cloud Meetup (vllm-project#11864)

* [CI] Turn on basic correctness tests for V1 (vllm-project#10864)

* treat do_lower_case in the same way as the sentence-transformers library (vllm-project#11815)

Signed-off-by: Max de Bayser <[email protected]>

* [Doc] Recommend uv and python 3.12 for quickstart guide (vllm-project#11849)

Signed-off-by: mgoin <[email protected]>

* [Misc] Move `print_*_once` from utils to logger (vllm-project#11298)

Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Maxime Fournioux <[email protected]>
Co-authored-by: Maxime Fournioux <[email protected]>

* [Doc] Intended links Python multiprocessing library (vllm-project#11878)

* [perf]fix current stream (vllm-project#11870)

Signed-off-by: youkaichao <[email protected]>

* [Bugfix] Override dunder methods of placeholder modules (vllm-project#11882)

Signed-off-by: DarkLight1337 <[email protected]>

* [Bugfix] fix beam search input errors and latency benchmark script (vllm-project#11875)

Signed-off-by: Ye Qi <[email protected]>
Co-authored-by: yeq <[email protected]>

* [Doc] Add model development API Reference (vllm-project#11884)

Signed-off-by: DarkLight1337 <[email protected]>

* [platform] Allow platform specify attention backend (vllm-project#11609)

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: Mengqing Cao <[email protected]>
Co-authored-by: Mengqing Cao <[email protected]>

* [ci]try to fix flaky multi-step tests (vllm-project#11894)

Signed-off-by: youkaichao <[email protected]>

* [Misc] Provide correct Pixtral-HF chat template (vllm-project#11891)

Signed-off-by: DarkLight1337 <[email protected]>

* [Docs] Add Modal to deployment frameworks (vllm-project#11907)

* [Doc][5/N] Move Community and API Reference to the bottom (vllm-project#11896)

Signed-off-by: DarkLight1337 <[email protected]>
Co-authored-by: Simon Mo <[email protected]>

* [VLM] Enable tokenized inputs for merged multi-modal processor (vllm-project#11900)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Show default pooling method in a table (vllm-project#11904)

Signed-off-by: DarkLight1337 <[email protected]>

* [torch.compile] Hide KV cache behind torch.compile boundary (vllm-project#11677)

Signed-off-by: Chen Zhang <[email protected]>

* [Bugfix] Validate lora adapters to avoid crashing server (vllm-project#11727)

Signed-off-by: Joe Runde <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>

* [BUGFIX] Fix `UnspecifiedPlatform` package name (vllm-project#11916)

Signed-off-by: Kunshang Ji <[email protected]>

* [ci] fix gh200 tests (vllm-project#11919)

Signed-off-by: youkaichao <[email protected]>

* [misc] remove python function call for custom activation op (vllm-project#11885)

Co-authored-by: youkaichao <[email protected]>

* [platform] support pytorch custom op pluggable (vllm-project#11328)

Signed-off-by: wangxiyuan <[email protected]>

* Replace "online inference" with "online serving" (vllm-project#11923)

Signed-off-by: Harry Mellor <[email protected]>

* [ci] Fix sampler tests (vllm-project#11922)

Signed-off-by: youkaichao <[email protected]>

* [Doc] [1/N] Initial guide for merged multi-modal processor (vllm-project#11925)

Signed-off-by: DarkLight1337 <[email protected]>

* [platform] support custom torch.compile backend key (vllm-project#11318)

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Co-authored-by: youkaichao <[email protected]>

* [Doc] Rename offline inference examples (vllm-project#11927)

Signed-off-by: Harry Mellor <[email protected]>

* [Docs] Fix docstring in `get_ip` function (vllm-project#11932)

Signed-off-by: Kuntai Du <[email protected]>

* Doc fix in `benchmark_long_document_qa_throughput.py` (vllm-project#11933)

Signed-off-by: Kuntai Du <[email protected]>

* [Hardware][CPU] Support MOE models on x86 CPU (vllm-project#11831)

Signed-off-by: jiang1.li <[email protected]>

* [Misc] Clean up debug code in Deepseek-V3 (vllm-project#11930)

Signed-off-by: Isotr0py <[email protected]>

* [Misc] Update benchmark_prefix_caching.py fixed example usage (vllm-project#11920)

Signed-off-by: Ren MinMin <[email protected]>
Co-authored-by: Ren MinMin <[email protected]>

* [Bugfix] Check that number of images matches number of <|image|> tokens with mllama (vllm-project#11939)

Signed-off-by: Travis Johnson <[email protected]>

* [mypy] Fix mypy warnings in api_server.py (vllm-project#11941)

Signed-off-by: Fred Reiss <[email protected]>

* [ci] fix broken distributed-tests-4-gpus (vllm-project#11937)

Signed-off-by: youkaichao <[email protected]>

* [Bugfix][SpecDecode] Adjust Eagle model architecture to align with intended design (vllm-project#11672)

Signed-off-by: Sungjae Lee <[email protected]>

* [Bugfix] fused_experts_impl wrong compute type for float32 (vllm-project#11921)

Signed-off-by: shaochangxu.scx <[email protected]>
Co-authored-by: shaochangxu.scx <[email protected]>

* [CI/Build] Move model-specific multi-modal processing tests (vllm-project#11934)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Basic guide for writing unit tests for new models (vllm-project#11951)

Signed-off-by: DarkLight1337 <[email protected]>

* [Bugfix] Fix RobertaModel loading (vllm-project#11940)

Signed-off-by: NickLucche <[email protected]>

* [Model] Add cogagent model support vLLM (vllm-project#11742)

Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* [V1] Avoid sending text prompt to core engine (vllm-project#11963)

Signed-off-by: Roger Wang <[email protected]>

* [CI/Build] Add markdown linter (vllm-project#11857)

Signed-off-by: Rafael Vasquez <[email protected]>

* [Model] Initialize support for Deepseek-VL2 models (vllm-project#11578)

Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [Hardware][CPU] Multi-LoRA implementation for the CPU backend (vllm-project#11100)

Signed-off-by: Akshat Tripathi <[email protected]>
Signed-off-by: Oleg Mosalov <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Co-authored-by: Oleg Mosalov <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* [Hardware][TPU] workaround fix for MoE on TPU (vllm-project#11764)

* [V1][Core][1/n] Logging and Metrics (vllm-project#11962)

Signed-off-by: [email protected] <[email protected]>

* [Model] Support GGUF models newly added in `transformers` 4.46.0 (vllm-project#9685)

Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (vllm-project#11973)

Signed-off-by: [email protected] <[email protected]>

* [MISC] fix typo in kv transfer send recv test (vllm-project#11983)

* [Bug] Fix usage of `.transpose()` and `.view()` consecutively. (vllm-project#11979)

* [CI][Spec Decode] fix: broken test for EAGLE model (vllm-project#11972)

Signed-off-by: Sungjae Lee <[email protected]>

* [Misc] Fix Deepseek V2 fp8 kv-scale remapping (vllm-project#11947)

Signed-off-by: Yida Wu <[email protected]>

* [Misc]Minor Changes about Worker (vllm-project#11555)

Signed-off-by: Chenguang Li <[email protected]>

* [platform] add ray_device_key (vllm-project#11948)

Signed-off-by: youkaichao <[email protected]>

* Fix Max Token ID for Qwen-VL-Chat (vllm-project#11980)

Signed-off-by: Alex-Brooks <[email protected]>

* [Kernel] unified_attention for Attention.forward (vllm-project#11967)

Signed-off-by: Chen Zhang <[email protected]>

* [Doc][V1] Update model implementation guide for V1 support (vllm-project#11998)

Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [Doc] Organise installation documentation into categories and tabs (vllm-project#11935)

Signed-off-by: Harry Mellor <[email protected]>

* [platform] add device_control env var (vllm-project#12009)

Signed-off-by: youkaichao <[email protected]>

* [Platform] Move get_punica_wrapper() function to Platform (vllm-project#11516)

Signed-off-by: Shanshan Shen <[email protected]>

* bugfix: Fix signature mismatch in benchmark's `get_tokenizer` function (vllm-project#11982)

Signed-off-by: elijah <[email protected]>

* Using list

* Revert "[misc] improve memory profiling (vllm-project#11809)"

This reverts commit 889e662.

* Trying to make scales work with compileable attention

* Docs lint

---------

Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: yisheng <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Liangfu Chen <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Yuan Zhou <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Sourashis Roy <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: Nishidha Panpaliya <[email protected]>
Signed-off-by: Ilya Lavrenov <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: jiang1.li <[email protected]>
Signed-off-by: yan ma <[email protected]>
Signed-off-by: Randall Smith <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: Maxime Fournioux <[email protected]>
Signed-off-by: Ye Qi <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: Mengqing Cao <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Kunshang Ji <[email protected]>
Signed-off-by: Kuntai Du <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Ren MinMin <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Signed-off-by: Fred Reiss <[email protected]>
Signed-off-by: Sungjae Lee <[email protected]>
Signed-off-by: shaochangxu.scx <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: Rafael Vasquez <[email protected]>
Signed-off-by: Akshat Tripathi <[email protected]>
Signed-off-by: Oleg Mosalov <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Yida Wu <[email protected]>
Signed-off-by: Chenguang Li <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Shanshan Shen <[email protected]>
Signed-off-by: elijah <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>
Co-authored-by: YiSheng5 <[email protected]>
Co-authored-by: Zhonghua Deng <[email protected]>
Co-authored-by: Liangfu Chen <[email protected]>
Co-authored-by: XiaobingZhang <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Yuan <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: jiangjiadi <[email protected]>
Co-authored-by: jiadi.jjd <[email protected]>
Co-authored-by: sroy745 <[email protected]>
Co-authored-by: Jie Fu (傅杰) <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Divakar Verma <[email protected]>
Co-authored-by: WangErXiao <[email protected]>
Co-authored-by: Nishidha <[email protected]>
Co-authored-by: Ilya Lavrenov <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Wallas Henrique <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: Yan Ma <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: rasmith <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Maximilien de Bayser <[email protected]>
Co-authored-by: Maxime Fournioux <[email protected]>
Co-authored-by: Guspan Tanadi <[email protected]>
Co-authored-by: Ye (Charlotte) Qi <[email protected]>
Co-authored-by: yeq <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>
Co-authored-by: Mengqing Cao <[email protected]>
Co-authored-by: Charles Frye <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Co-authored-by: cennn <[email protected]>
Co-authored-by: Kuntai Du <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: minmin <[email protected]>
Co-authored-by: Ren MinMin <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Co-authored-by: Fred Reiss <[email protected]>
Co-authored-by: Sungjae Lee <[email protected]>
Co-authored-by: shaochangxu <[email protected]>
Co-authored-by: shaochangxu.scx <[email protected]>
Co-authored-by: Nicolò Lucchesi <[email protected]>
Co-authored-by: sixgod <[email protected]>
Co-authored-by: Rafael Vasquez <[email protected]>
Co-authored-by: Akshat Tripathi <[email protected]>
Co-authored-by: Oleg Mosalov <[email protected]>
Co-authored-by: Avshalom Manevich <[email protected]>
Co-authored-by: Yangcheng Li <[email protected]>
Co-authored-by: Siyuan Li <[email protected]>
Co-authored-by: Concurrensee <[email protected]>
Co-authored-by: Chenguang Li <[email protected]>
Co-authored-by: Alex Brooks <[email protected]>
Co-authored-by: Shanshan Shen <[email protected]>
Co-authored-by: elijah <[email protected]>
@tjtanaa marked this pull request as ready for review January 15, 2025 06:09
@tjtanaa marked this pull request as draft January 15, 2025 06:09
@tjtanaa marked this pull request as ready for review January 15, 2025 07:56
@tjtanaa marked this pull request as draft January 15, 2025 07:59
@tjtanaa marked this pull request as ready for review January 15, 2025 08:04

@hongxiayang left a comment

thanks!

@hongxiayang merged commit d9385b4 into ROCm:llama_fp8_12062024 Jan 15, 2025
11 checks passed