feat: add some preset test for vllm #694

Open · wants to merge 9 commits into base: main
4 changes: 2 additions & 2 deletions .github/e2e-preset-configs.json
@@ -70,15 +70,15 @@
"name": "phi-3-mini-4k-instruct",
"node-count": 1,
"node-vm-size": "Standard_NC6s_v3",
"node-osdisk-size": 50,
"node-osdisk-size": 100,
"OSS": true,
"loads_adapter": false
},
{
"name": "phi-3-mini-128k-instruct",
"node-count": 1,
"node-vm-size": "Standard_NC6s_v3",
"node-osdisk-size": 50,
"node-osdisk-size": 100,
"OSS": true,
"loads_adapter": false
},
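The OS-disk bump from 50 GB to 100 GB presumably leaves room for the larger vLLM preset images. A quick way to sanity-check which presets request the bigger disk — a sketch assuming the `jq` CLI, written so it does not depend on the file's top-level layout:

```bash
# Sketch: list preset names that now request a 100 GB OS disk.
# '..' walks the whole JSON tree, so we avoid hard-coding the
# top-level structure of e2e-preset-configs.json.
jq -r '.. | objects | select(."node-osdisk-size" == 100) | .name' \
  .github/e2e-preset-configs.json
```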
38 changes: 29 additions & 9 deletions .github/workflows/e2e-preset-test.yml
@@ -15,12 +15,17 @@ on:
type: boolean
default: false
description: "Test all Phi models for E2E"
test-on-vllm:
type: boolean
default: false
description: "Test on VLLM runtime"

env:
GO_VERSION: "1.22"
BRANCH_NAME: ${{ github.head_ref || github.ref_name}}
FORCE_RUN_ALL: ${{ github.event_name == 'workflow_dispatch' && github.event.inputs.force-run-all == 'true' }}
FORCE_RUN_ALL_PHI: ${{ github.event_name == 'workflow_dispatch' && github.event.inputs.force-run-all-phi-models == 'true' }}
RUNTIME: ${{ (github.event_name == 'workflow_dispatch' && github.event.inputs.test-on-vllm == 'true') && 'vllm' || 'hf' }}
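GitHub Actions has no ternary operator, so `RUNTIME` relies on the `cond && 'vllm' || 'hf'` idiom: because `'vllm'` is truthy, `'hf'` is only selected when the condition is false. A rough shell equivalent of the expression, with illustrative variable names that are not part of the workflow:

```bash
# Rough local equivalent of the RUNTIME expression above.
# GITHUB_EVENT_NAME / INPUT_TEST_ON_VLLM stand in for the workflow's
# github.event_name and inputs.test-on-vllm.
if [[ "$GITHUB_EVENT_NAME" == "workflow_dispatch" && "$INPUT_TEST_ON_VLLM" == "true" ]]; then
  RUNTIME="vllm"
else
  RUNTIME="hf"
fi
echo "RUNTIME=$RUNTIME"
```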

permissions:
id-token: write
@@ -229,10 +234,10 @@ jobs:

- name: Replace IP and Deploy Resource to K8s
run: |
sed -i "s/MASTER_ADDR_HERE/${{ steps.get_ip.outputs.SERVICE_IP }}/g" presets/test/manifests/${{ matrix.model.name }}/${{ matrix.model.name }}.yaml
sed -i "s/TAG_HERE/${{ matrix.model.tag }}/g" presets/test/manifests/${{ matrix.model.name }}/${{ matrix.model.name }}.yaml
sed -i "s/REPO_HERE/${{ secrets.ACR_AMRT_USERNAME }}/g" presets/test/manifests/${{ matrix.model.name }}/${{ matrix.model.name }}.yaml
kubectl apply -f presets/test/manifests/${{ matrix.model.name }}/${{ matrix.model.name }}.yaml
sed -i "s/MASTER_ADDR_HERE/${{ steps.get_ip.outputs.SERVICE_IP }}/g" presets/test/manifests/${{ matrix.model.name }}/${{ matrix.model.name }}_${{ env.RUNTIME }}.yaml
sed -i "s/TAG_HERE/${{ matrix.model.tag }}/g" presets/test/manifests/${{ matrix.model.name }}/${{ matrix.model.name }}_${{ env.RUNTIME }}.yaml
sed -i "s/REPO_HERE/${{ secrets.ACR_AMRT_USERNAME }}/g" presets/test/manifests/${{ matrix.model.name }}/${{ matrix.model.name }}_${{ env.RUNTIME }}.yaml
kubectl apply -f presets/test/manifests/${{ matrix.model.name }}/${{ matrix.model.name }}_${{ env.RUNTIME }}.yaml
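The step now selects a per-runtime manifest named `<model>_<runtime>.yaml`. To inspect a rendered manifest locally before a real deploy, something like the following works — a sketch with placeholder values (`10.0.0.4`, `0.0.7`, `myregistry` are illustrative, and `kubectl` is assumed to be installed):

```bash
# Sketch: render a runtime-specific manifest with placeholder values,
# then validate it client-side without touching the cluster.
MODEL=falcon-7b
RUNTIME=vllm
MANIFEST="presets/test/manifests/${MODEL}/${MODEL}_${RUNTIME}.yaml"
sed -e "s/MASTER_ADDR_HERE/10.0.0.4/g" \
    -e "s/TAG_HERE/0.0.7/g" \
    -e "s/REPO_HERE/myregistry/g" \
    "$MANIFEST" | kubectl apply --dry-run=client -f -
```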

- name: Wait for Resource to be ready
run: |
@@ -243,14 +248,10 @@
run: |
POD_NAME=$(kubectl get pods -l app=${{ matrix.model.name }} -o jsonpath="{.items[0].metadata.name}")
kubectl logs $POD_NAME | grep "Adapter added:" | grep "${{ matrix.model.expected_adapter }}" || (echo "Adapter not loaded or incorrect adapter loaded" && exit 1)

- name: Test home endpoint
run: |
curl http://${{ steps.get_ip.outputs.SERVICE_IP }}:80/

- name: Test healthz endpoint
run: |
curl http://${{ steps.get_ip.outputs.SERVICE_IP }}:80/healthz
curl http://${{ steps.get_ip.outputs.SERVICE_IP }}:80/health
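The probe path changes from `/healthz` to `/health`, matching the route the vLLM manifests below probe. For flaky startups, a polling variant of this check could look like the following sketch (`SERVICE_IP` is a placeholder; it assumes the endpoint returns HTTP 200 once the model is loaded):

```bash
# Sketch: poll /health until the server is up instead of failing on
# the first curl.
SERVICE_IP=10.0.0.4
for _ in $(seq 1 60); do
  curl -fsS "http://${SERVICE_IP}:80/health" >/dev/null && { echo ready; break; }
  sleep 10
done
```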

- name: Test inference endpoint
run: |
@@ -291,6 +292,25 @@ jobs:
}
}' \
http://${{ steps.get_ip.outputs.SERVICE_IP }}:80/generate
elif [[ "${{ env.RUNTIME }}" == *"vllm"* ]]; then
echo "Testing inference for ${{ matrix.model.name }}"
curl -X POST \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d '{
"model": "test",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
]
}' \
http://${{ steps.get_ip.outputs.SERVICE_IP }}:80/v1/chat/completions
else
echo "Testing inference for ${{ matrix.model.name }}"
curl -X POST \
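One detail worth noting about the vLLM branch above: the request's `"model": "test"` must match the `--served-model-name test` flag passed in the vLLM manifests below. A hedged way to confirm the mapping, assuming `inference_api.py` exposes vLLM's OpenAI-compatible `/v1/models` route unchanged:

```bash
# Sketch: verify the served model name matches what the tests send.
# Assumes the OpenAI-compatible /v1/models route is exposed.
SERVICE_IP=10.0.0.4
curl -s "http://${SERVICE_IP}:80/v1/models" | grep -Eq '"id": ?"test"' \
  || echo "served model name mismatch"
```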
2 changes: 1 addition & 1 deletion .github/workflows/kind-cluster/determine_models.py
@@ -21,7 +21,7 @@ def read_yaml(file_path):
YAML_PR = read_yaml(supp_models_yaml)
# Format: {falcon-7b : {model_name:falcon-7b, type:text-generation, version: #, tag: #}}
MODELS = {model['name']: model for model in YAML_PR['models']}
KAITO_REPO_URL = "https://github.com/kaito-repo/kaito.git"
KAITO_REPO_URL = "https://github.com/kaito-project/kaito.git"

def set_multiline_output(name, value):
with open(os.environ['GITHUB_OUTPUT'], 'a') as fh:
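`determine_models.py` keys every entry of `supported_models.yaml` (shown next) by `name`. The same lookup can be done from the shell — a sketch assuming the mikefarah `yq` CLI is installed:

```bash
# Sketch: fetch a model's image tag the same way determine_models.py
# does in Python (requires the Go/mikefarah build of yq).
yq '.models[] | select(.name == "falcon-7b").tag' \
  presets/models/supported_models.yaml
# prints: 0.0.7  (per this PR's tag bump)
```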
61 changes: 39 additions & 22 deletions presets/models/supported_models.yaml
@@ -32,15 +32,16 @@ models:
# Falcon
- name: falcon-7b
type: text-generation
version: https://huggingface.co/tiiuae/falcon-7b/commit/898df1396f35e447d5fe44e0a3ccaaaa69f30d36
version: https://huggingface.co/tiiuae/falcon-7b/commit/ec89142b67d748a1865ea4451372db8313ada0d8
runtime: tfs
tag: 0.0.6
tag: 0.0.7
- name: falcon-7b-instruct
type: text-generation
version: https://huggingface.co/tiiuae/falcon-7b-instruct/commit/cf4b3c42ce2fdfe24f753f0f0d179202fea59c99
version: https://huggingface.co/tiiuae/falcon-7b-instruct/commit/8782b5c5d8c9290412416618f36a133653e85285
runtime: tfs
tag: 0.0.6
tag: 0.0.7
# Tag history:
# 0.0.7 - Support VLLM runtime
# 0.0.6 - Add Logging & Metrics Server
# 0.0.5 - Tuning and Adapters
# 0.0.4 - Adjust default model params (#310)
@@ -49,15 +50,16 @@
# 0.0.1 - Initial Release
- name: falcon-40b
type: text-generation
version: https://huggingface.co/tiiuae/falcon-40b/commit/4a70170c215b36a3cce4b4253f6d0612bb7d4146
version: https://huggingface.co/tiiuae/falcon-40b/commit/05ab2ee8d6b593bdbab17d728de5c028a7a94d83
runtime: tfs
tag: 0.0.7
tag: 0.0.8
- name: falcon-40b-instruct
type: text-generation
version: https://huggingface.co/tiiuae/falcon-40b-instruct/commit/ecb78d97ac356d098e79f0db222c9ce7c5d9ee5f
runtime: tfs
tag: 0.0.7
tag: 0.0.8
# Tag history for 40b models:
# 0.0.8 - Support VLLM runtime
# 0.0.7 - Add Logging & Metrics Server
# 0.0.6 - Tuning and Adapters
# 0.0.5 - Adjust default model params (#310)
@@ -69,15 +71,16 @@
# Mistral
- name: mistral-7b
type: text-generation
version: https://huggingface.co/mistralai/Mistral-7B-v0.3/commit/c882233d224d27b727b3d9299b12a9aab9dda6f7
version: https://huggingface.co/mistralai/Mistral-7B-v0.3/commit/d8cadc02ac76bd617a919d50b092e59d2d110aff
runtime: tfs
tag: 0.0.7
tag: 0.0.8
- name: mistral-7b-instruct
type: text-generation
version: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3/commit/0417f4babd26db0b5ed07c1d0bc85658ab526ea3
version: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3/commit/e0bc86c23ce5aae1db576c8cca6f06f1f73af2db
runtime: tfs
tag: 0.0.7
tag: 0.0.8
# Tag history:
# 0.0.8 - Support VLLM runtime
# 0.0.7 - Add Logging & Metrics Server
# 0.0.6 - Update model version and Address missing weights files fix
# 0.0.5 - Tuning and Adapters
@@ -89,10 +92,11 @@
# Phi-2
- name: phi-2
type: text-generation
version: https://huggingface.co/microsoft/phi-2/commit/b10c3eba545ad279e7208ee3a5d644566f001670
version: https://huggingface.co/microsoft/phi-2/commit/ef382358ec9e382308935a992d908de099b64c23
runtime: tfs
tag: 0.0.5
tag: 0.0.6
# Tag history:
# 0.0.6 - Support VLLM runtime
# 0.0.5 - Add Logging & Metrics Server
# 0.0.4 - Tuning and Adapters
# 0.0.3 - Adjust default model params (#310)
@@ -102,36 +106,49 @@
# Phi-3
- name: phi-3-mini-4k-instruct
type: text-generation
version: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/commit/d269012bea6fbe38ce7752c8940fea010eea3383
version: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/commit/0a67737cc96d2554230f90338b163bc6380a2a85
runtime: tfs
tag: 0.0.2
tag: 0.0.3
# Tag history:
# 0.0.3 - Support VLLM runtime
# 0.0.2 - Add Logging & Metrics Server
# 0.0.1 - Initial Release

- name: phi-3-mini-128k-instruct
type: text-generation
version: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/commit/5be6479b4bc06a081e8f4c6ece294241ccd32dec
version: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/commit/a90b62ae09941edff87a90ced39ba5807e6b2ade
runtime: tfs
tag: 0.0.2
tag: 0.0.3
# Tag history:
# 0.0.3 - Support VLLM runtime
# 0.0.2 - Add Logging & Metrics Server
# 0.0.1 - Initial Release

- name: phi-3-medium-4k-instruct
type: text-generation
version: https://huggingface.co/microsoft/Phi-3-medium-4k-instruct/commit/d194e4e74ffad5a5e193e26af25bcfc80c7f1ffc
version: https://huggingface.co/microsoft/Phi-3-medium-4k-instruct/commit/ae004ae82eb6eddc32906dfacb1d6dfea8f91996
runtime: tfs
tag: 0.0.2
tag: 0.0.3
# Tag history:
# 0.0.3 - Support VLLM runtime
# 0.0.2 - Add Logging & Metrics Server
# 0.0.1 - Initial Release

- name: phi-3-medium-128k-instruct
type: text-generation
version: https://huggingface.co/microsoft/Phi-3-medium-128k-instruct/commit/cae1d42b5577398fd1be9f0746052562ae552886
version: https://huggingface.co/microsoft/Phi-3-medium-128k-instruct/commit/fa7d2aa4f5ea69b2e36b20d050cdae79c9bfbb3f
runtime: tfs
tag: 0.0.2
tag: 0.0.3
# Tag history:
# 0.0.3 - Support VLLM runtime
# 0.0.2 - Add Logging & Metrics Server
# 0.0.1 - Initial Release

- name: phi-3.5-mini-instruct
type: text-generation
version: https://huggingface.co/microsoft/Phi-3.5-mini-instruct/commit/af0dfb8029e8a74545d0736d30cb6b58d2f0f3f0
runtime: tfs
tag: 0.0.1
# Tag history:
# 0.0.1 - New Model! Support VLLM Runtime

55 changes: 55 additions & 0 deletions presets/test/manifests/falcon-7b/falcon-7b_vllm.yaml
@@ -0,0 +1,55 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: falcon-7b
spec:
replicas: 1
selector:
matchLabels:
app: falcon
template:
metadata:
labels:
app: falcon
spec:
containers:
- name: falcon-container
image: REPO_HERE.azurecr.io/falcon-7b:TAG_HERE
command:
- /bin/sh
- -c
- python3 /workspace/vllm/inference_api.py --served-model-name test --dtype bfloat16 --chat-template /workspace/chat_templates/falcon-instruct.jinja
resources:
requests:
nvidia.com/gpu: 2
limits:
nvidia.com/gpu: 2 # Requesting 2 GPUs
livenessProbe:
httpGet:
path: /health
port: 5000
initialDelaySeconds: 600 # 10 Min
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 5000
initialDelaySeconds: 30
periodSeconds: 10
volumeMounts:
- name: dshm
mountPath: /dev/shm
volumes:
- name: dshm
emptyDir:
medium: Memory
tolerations:
- effect: NoSchedule
key: sku
operator: Equal
value: gpu
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
nodeSelector:
pool: falcon7b
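The container serves on port 5000 (the probe port) while the workflow curls the model on port 80, so a Service must sit in front of this Deployment. The PR does not show it; a hypothetical equivalent could be generated with:

```bash
# Sketch: a LoadBalancer Service mapping port 80 to the container's 5000.
# Hypothetical; the actual Service used by the e2e tests may be defined
# elsewhere in the repo.
kubectl expose deployment falcon-7b --name=falcon-7b \
  --type=LoadBalancer --port=80 --target-port=5000 \
  --dry-run=client -o yaml
```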
55 changes: 55 additions & 0 deletions presets/test/manifests/phi-3-medium-128k-instruct/phi-3-medium-128k-instruct_vllm.yaml
@@ -0,0 +1,55 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: phi-3-medium-128k-instruct
spec:
replicas: 1
selector:
matchLabels:
app: phi-3-medium-128k-instruct
template:
metadata:
labels:
app: phi-3-medium-128k-instruct
spec:
containers:
- name: phi-3-medium-128k-instruct-container
image: REPO_HERE.azurecr.io/phi-3-medium-128k-instruct:TAG_HERE
command:
- /bin/sh
- -c
- python3 /workspace/vllm/inference_api.py --served-model-name test --dtype float16
resources:
requests:
nvidia.com/gpu: 1
limits:
nvidia.com/gpu: 1 # Requesting 1 GPU
livenessProbe:
httpGet:
path: /health
port: 5000
initialDelaySeconds: 600 # 10 Min
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 5000
initialDelaySeconds: 30
periodSeconds: 10
volumeMounts:
- name: dshm
mountPath: /dev/shm
volumes:
- name: dshm
emptyDir:
medium: Memory
tolerations:
- effect: NoSchedule
key: sku
operator: Equal
value: gpu
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
nodeSelector:
pool: phi3medium12
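Both vLLM manifests mount a memory-backed `emptyDir` over `/dev/shm`: vLLM workers communicate through shared memory (especially with multi-GPU tensor parallelism, as in the 2-GPU falcon-7b manifest), and the container default of 64 MiB is easily exhausted. A quick in-pod check, as a sketch reusing the pod label from the manifest above:

```bash
# Sketch: confirm the enlarged /dev/shm inside the running pod.
POD=$(kubectl get pods -l app=phi-3-medium-128k-instruct \
  -o jsonpath='{.items[0].metadata.name}')
kubectl exec "$POD" -- df -h /dev/shm
```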