
[CI/Build] Adds Modal runners for performance benchmark #11239

Open · wants to merge 3 commits into main
Conversation


@erik-dunteman erik-dunteman commented Dec 16, 2024

This PR improves the performance benchmark by:

  • decreasing overall GPU spend by scaling to zero when not in use
  • allowing concurrent jobs to run

We do this by moving from single always-on GPU agents to CPU-based runners that use the Modal client to spawn GPUs on demand. Currently this covers A100 and H100.

Structure

launch-modal-runner.py is our new job script. It builds and launches the following image onto the targeted GPU, then executes the bash command just as is done in the current docker plugin setup.

BASE_IMG = f"public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:{BUILDKITE_COMMIT}"
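The moving parts above can be sketched roughly as follows. This is a hypothetical outline, not the PR's actual launch-modal-runner.py: the app name, GPU type, timeout, and script path are all assumptions, while modal.App, modal.Image.from_registry, and @app.function are the Modal client APIs.

```python
# Rough sketch of what launch-modal-runner.py might look like (hypothetical;
# names, paths, and parameters are assumptions, not the PR's actual code).
import os
import subprocess

BUILDKITE_COMMIT = os.environ.get("BUILDKITE_COMMIT", "deadbeef")
BASE_IMG = f"public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:{BUILDKITE_COMMIT}"

try:
    import modal

    app = modal.App("vllm-perf-benchmark")  # hypothetical app name
    image = modal.Image.from_registry(BASE_IMG)

    # GPU time is billed only while the function runs, which is what gives
    # the scale-to-zero behavior; concurrent jobs get separate containers.
    @app.function(gpu="H100", image=image, timeout=3600)
    def run_benchmark() -> int:
        # Execute the same bash entrypoint the docker plugin runs today
        # (script path is an assumption).
        return subprocess.call(["bash", "run-performance-benchmarks.sh"])
except ImportError:
    modal = None  # Modal client not installed; the sketch only shows the shape.
```

A CPU Buildkite agent would then trigger run_benchmark remotely (e.g. via `modal run`), so the agent itself never holds a GPU.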

Admin changes required:

  • change the A100 and H100 Buildkite queues to CPU agents
  • add modal_token_id and modal_token_secret to the Buildkite agent secrets
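As a sketch of the second admin item, assuming the secrets surface as environment variables on the agent (the BUILDKITE_SECRET_* names here are hypothetical), a pre-command hook could map them onto the variables the Modal client reads for credentials:

```shell
# Hypothetical Buildkite pre-command hook; the BUILDKITE_SECRET_* names are
# assumptions, but MODAL_TOKEN_ID / MODAL_TOKEN_SECRET are the environment
# variables the Modal client checks for credentials.
export MODAL_TOKEN_ID="${BUILDKITE_SECRET_MODAL_TOKEN_ID:-ak-placeholder}"
export MODAL_TOKEN_SECRET="${BUILDKITE_SECRET_MODAL_TOKEN_SECRET:-as-placeholder}"
```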


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Dec 16, 2024
@erik-dunteman
Author

@simon-mo this PR is a WIP; I need your advice on a couple of things:

General integration questions

  • is it OK to move the A100 and H100 queues to CPU-based Buildkite agents? Is there anything I could do to help set that up, or do you already have CPU agents that could be used?

Env issues - VLLM dev install

When running run-performance-benchmarks.sh in the public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:{BUILDKITE_COMMIT} image, the python3 -m vllm.entrypoints.openai.api_server commands fail with "module not found: vllm".

Running run-performance-benchmarks.sh in the image is similar to what's currently done with the docker setup, minus the /dev/shm mount, so perhaps you're mounting the library in?

I've tried installing vLLM manually, both in the image build step (with a pip install from git source) and at runtime (as you see in the current state of this PR). Logs from running the current setup: https://gist.github.com/erik-dunteman/f75f0733ac6a78de73d25220a4a3f58a

Would love your guidance on what's missing to get vLLM installed, ideally in the build step to keep billable GPU time down, but at runtime if needed.
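One direction for the build-step install, sketched as a hypothetical extra image layer. The /vllm-workspace path is an assumption about where the CI image keeps the source tree, and VLLM_USE_PRECOMPILED=1 is vLLM's flag for an editable install that reuses already-built kernels rather than recompiling:

```dockerfile
# Hypothetical layer on top of the CI image; the source path and flag usage
# are assumptions to verify against the actual image layout.
ARG BUILDKITE_COMMIT
FROM public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:${BUILDKITE_COMMIT}

# Editable install at build time so billable GPU time is not spent on it;
# VLLM_USE_PRECOMPILED avoids recompiling kernels the image already carries.
RUN VLLM_USE_PRECOMPILED=1 pip install -e /vllm-workspace
```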

@erik-dunteman
Author

Will address ruff and the other checks once I get the vLLM install issue above sorted and confirm the scripts run as expected.
