
SuperLaser

⚠️ Not yet ready for primetime ⚠️

SuperLaser provides a suite of tools and scripts for deploying LLMs onto RunPod's serverless infrastructure. Deployments run a containerized vLLM engine at runtime, giving memory-efficient, high-performance inference.

Features

  • Scalable Deployment: Easily scale your LLM inference tasks with vLLM and RunPod serverless capabilities.
  • Cost-Effective: Optimize GPU and hardware usage with features such as tensor parallelism.
  • OpenAI-Compatible API: Use the OpenAI client for chat, completion, and streaming requests.

Install

pip install superlaser

Before you begin, ensure you have:

  • A RunPod account.

RunPod Config

The first step is to obtain an API key from RunPod. In your account's console, go to the Settings section and click on API Keys.

After obtaining a key, set it as an environment variable:

export RUNPOD_API_KEY=<YOUR-API-KEY>

Configure Template

Before spinning up a serverless endpoint, let's first configure a template to pass to the endpoint during staging. The template lets you set vLLM's Docker image, the model, and the container's and volume's disk space:

import os
from superlaser import RunpodHandler as runpod

api_key = os.environ.get("RUNPOD_API_KEY")

template_data = runpod.set_template(
    serverless="true",                                      # Deploy as a serverless template
    template_name="superlaser-inf",                         # Give a name to your template
    container_image="runpod/worker-vllm:0.3.1-cuda12.1.0",  # Docker image stub
    model_name="mistralai/Mistral-7B-v0.1",                 # Hugging Face model stub
    max_model_length=340,                                   # Maximum number of tokens the engine handles per request
    container_disk=15,                                      # Container disk size (GB)
    volume_disk=15,                                         # Volume disk size (GB)
)

Create Template on RunPod

template = runpod(api_key, data=template_data)
print(template().text)
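
The printed response should include your new template's ID, which the next step needs. Here's a minimal sketch of capturing it, assuming the response body is JSON and exposes the ID under an "id" key (the field name is an assumption; inspect the printed payload for the exact key):

import json

response = template()  # same call as above
template_id = json.loads(response.text).get("id")  # "id" is an assumed key; check the actual response
print(template_id)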

Configure Endpoint

After your template is created, the call returns a data dictionary that includes your template ID. We'll pass this template ID when configuring the serverless endpoint in the section below:

endpoint_data = runpod.set_endpoint(
    gpu_ids="AMPERE_24", # options for gpuIds are "AMPERE_16,AMPERE_24,AMPERE_48,AMPERE_80,ADA_24"
    idle_timeout=5,
    name="vllm_endpoint",
    scaler_type="QUEUE_DELAY",
    scaler_value=1,
    template_id="template-id",  # Template ID returned when the template was created
    workers_max=1,
    workers_min=0,
)

Start Endpoint on RunPod

endpoint = runpod(api_key, data=endpoint_data)
print(endpoint().text)
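
As with the template, the endpoint ID can be pulled out of the response for use in the next section. Again, this is a sketch that assumes a JSON body with an "id" field; verify against the printed output:

import json

response = endpoint()  # same call as above
endpoint_id = json.loads(response.text).get("id")  # "id" is an assumed key; check the actual response
print(endpoint_id)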

Call Endpoint

After your endpoint is staged, it will return a dictionary with your endpoint ID. Pass this endpoint ID to the OpenAI client and start making API requests!

from openai import OpenAI

endpoint_id = "your-endpoint-id"

client = OpenAI(
    api_key=api_key,
    base_url=f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1",
)
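
As a quick sanity check, you can list the models the endpoint serves. This is standard OpenAI client usage and assumes the vLLM worker exposes the models route:

# List models served by the endpoint (assumes the worker implements the /models route)
for model in client.models.list():
    print(model.id)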

Chat w/ Streaming

stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "To be or not to be"}],
    temperature=0,
    max_tokens=100,
    stream=True,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
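
Chat w/o Streaming

If you don't need token-by-token output, the same client works without streaming and returns the full reply in one response. This is standard OpenAI client usage and assumes the worker also handles non-streaming requests:

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "To be or not to be"}],
    temperature=0,
    max_tokens=100,
)

print(response.choices[0].message.content)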

Completion w/ Streaming

stream = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    prompt="To be or not to be",
    temperature=0,
    max_tokens=100,
    stream=True,
)

for response in stream:
    print(response.choices[0].text or "", end="", flush=True)
