
feat(vllm-tensorizer): Update vllm-tensorizer cloned repository, build with vllm-flash-attn, other optimizations #72

Draft · wants to merge 9 commits into main
Conversation

@sangstar (Contributor) commented Jun 17, 2024

vllm-tensorizer hasn't been updated since vLLM formally adopted tensorizer model loading. This PR updates the build to target the most recent vLLM commit, which includes sharded tensorizer support, along with fixes needed to build vLLM against recent changes to its source code. These include:

  • Building vLLM's wheel from vLLM's source code proper, rather than CoreWeave's vLLM fork, reflecting vLLM's official adoption of tensorizer (see the build sketch after this list)
  • Accommodating vLLM's adoption of CMake for its build system
  • Updating xformers to 0.0.26.post1
  • Building vLLM's own fork of flash-attn (vllm-flash-attn) from source, now that vLLM formally depends on it
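For reference, a minimal shell sketch of the kind of wheel build these bullets describe, assuming a CUDA-capable build environment; the clone URL, commit pin, and wheel output path are illustrative assumptions, not necessarily what the Dockerfile in this PR does:

```sh
# Hypothetical sketch, not the actual Dockerfile steps: build vLLM's wheel
# from upstream source (CMake-based build) with the pinned xformers version.
git clone https://github.com/vllm-project/vllm.git
cd vllm
# git checkout <commit-with-sharded-tensorizer-support>  # pin the target commit (placeholder)
pip install --upgrade pip setuptools wheel ninja cmake
pip install xformers==0.0.26.post1
pip wheel --no-build-isolation --no-deps -w /wheels .
```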

sangstar and others added 9 commits June 14, 2024 10:04

build(vllm-tensorizer): Compile `vllm-flash-attn` from source

vLLM replaced its usage of the regular `flash-attn` library with its own
`vllm-flash-attn` fork, which, as of right now, is straightforward to
compile. This change compiles it from source for compatibility with the
`ml-containers/torch` base images.

[skip ci]
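A minimal sketch of what building the fork from source might look like; the repository URL and wheel output path are assumptions rather than the exact steps used in the image build:

```sh
# Hypothetical sketch: build vLLM's flash-attn fork (vllm-flash-attn) from source
# so the resulting wheel is compiled against the torch in the ml-containers/torch base image.
git clone https://github.com/vllm-project/flash-attention.git vllm-flash-attn
cd vllm-flash-attn
pip wheel --no-build-isolation --no-deps -w /wheels .
```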