Realign is a tool to test and evaluate your LLM agents blazingly fast. We provide you with the tools to make 100s of LLM calls in parallel, with ANY model.
- Simple Tweet Bot
- Generate 10 Tweets in Parallel (Async)
- Using Config Files
- Set up Evaluators
- Using Realign Evaluators
Let's start by creating a simple Twitter (X?) bot:
```python
from realign import llm_messages_call

def tweetbot(prompt: str) -> str:
    new_message = llm_messages_call(
        messages=[{
            "role": "user",
            "content": prompt
        }],
    )
    print('\nTweet:\n\n', new_message.content, '\n\n')
    return new_message.content

tweetbot("Write a tweet about best practices for prompt engineering")
```
You should get something that looks like this:
```
Created model router for openai/gpt-4o-mini
Successfully called openai/gpt-4o-mini in 1.151 seconds.

Tweet:

🔍✨ Best practices for prompt engineering:

1️⃣ Be specific: Clearly define your objectives.
2️⃣ Iterative testing: Refine prompts based on outputs.
3️⃣ Context is key: Provide relevant background info.
4️⃣ Experiment with formats: Questions, completion, or examples.
5️⃣ Use constraints: Limit scope for focused responses.

#PromptEngineering #AI #MachineLearning
```
Great! The logs start by telling you which model you are using. The model router protects you from rate limits and handles retries.

You may have noticed that the default model is `openai/gpt-4o-mini`. To change it, simply pass a new model using the `model=` parameter:
```python
new_message = llm_messages_call(
    model='anthropic/claude-3-5-sonnet-20240620',
    messages=[{
        "role": "user",
        "content": prompt
    }],
)
```
Realign leverages LiteLLM to help you plug in any model with a single line of code using the `<provider>/<model>` format. You can find their 100+ supported models here.
Now let's assume you want to create 10 tweets and see which one looks best. You could do this by calling `llm_messages_call` 10 times. However, this would be slow. Let's try it:
```python
from realign import llm_messages_call
import time

def tweet(prompt, i):
    new_message = llm_messages_call(
        messages=[{
            "role": "user",
            "content": prompt
        }],
    )
    print(f'\nTweet {i+1}:\n\n', new_message.content, '\n\n')

def main(prompt):
    for i in range(10):
        tweet(prompt, i)

start_time = time.time()
main("Write a tweet about best practices for prompt engineering")
print(f'Total time: {time.time() - start_time:.2f} seconds')
```
You should see something like:
```
Created model router for openai/gpt-4o-mini
Successfully called openai/gpt-4o-mini in 1.256 seconds.
Successfully called openai/gpt-4o-mini in 1.445 seconds.
Successfully called openai/gpt-4o-mini in 1.449 seconds.
Successfully called openai/gpt-4o-mini in 1.448 seconds.
Successfully called openai/gpt-4o-mini in 1.552 seconds.
Successfully called openai/gpt-4o-mini in 1.554 seconds.
Successfully called openai/gpt-4o-mini in 1.554 seconds.
Successfully called openai/gpt-4o-mini in 1.600 seconds.
Successfully called openai/gpt-4o-mini in 1.633 seconds.
Successfully called openai/gpt-4o-mini in 1.906 seconds.

Tweet 1:

🚀✨ Best practices for prompt engineering:

1. **Be Clear

... 10 tweets

Total time: 15.58 seconds
```
It takes about 15 seconds because the API calls happen sequentially. To run these API calls in parallel, you can use the `allm_messages_call` method and the `run_async` utility.
```python
from realign import allm_messages_call, run_async
import time

async def tweet(prompt, i):
    new_message = await allm_messages_call(
        messages=[{
            "role": "user",
            "content": prompt
        }],
    )
    print(f'\nTweet {i+1}:\n\n', new_message.content, '\n\n')

def main(prompt):
    tasks = []
    for i in range(10):
        tasks.append(tweet(prompt, i))

    # run the tasks in parallel
    run_async(tasks)

start_time = time.time()
main("Write a tweet about best practices for prompt engineering")
print(f'Total time: {time.time() - start_time:.2f} seconds')
```
You should see something like:
```
Created model router for openai/gpt-4o-mini
openai/gpt-4o-mini Processing requests...
Successfully called openai/gpt-4o-mini in 1.191 seconds.
Successfully called openai/gpt-4o-mini in 1.374 seconds.
Successfully called openai/gpt-4o-mini in 1.446 seconds.
Successfully called openai/gpt-4o-mini in 1.476 seconds.
Successfully called openai/gpt-4o-mini in 1.478 seconds.
Successfully called openai/gpt-4o-mini in 1.575 seconds.
Successfully called openai/gpt-4o-mini in 1.662 seconds.
Successfully called openai/gpt-4o-mini in 1.944 seconds.
Successfully called openai/gpt-4o-mini in 2.383 seconds.
Successfully called openai/gpt-4o-mini in 2.467 seconds.

Tweet 1:

✨ Mastering prompt engineering? Here are some best practices! ✨

... 10 tweets

Total time: 2.47 seconds
```
The total time is now the time taken by the longest API call instead of the sum of all the calls. This lets you iterate much faster!
Here's a summary of the main changes compared to the synchronous run:

- Change `tweet` from `def` to `async def`.
- Change `llm_messages_call` to `allm_messages_call`, its async version, and add the `await` keyword before it.
- In the for loop, add the outputs of the `tweet` calls to a `tasks` list. This gives us our 10 scheduled tasks. These tasks are called coroutines in Python and are similar to promises in JavaScript. More info on how coroutines work here.
- Finally, use the `run_async` utility, which takes a single list of coroutines and waits until they are all complete. If `tweet` returned something, `run_async` would return a list of the return values (see the sketch below).
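For intuition, here is one way a `run_async`-style utility could be built on top of `asyncio.gather`. This is a minimal sketch of the idea, not Realign's actual implementation:

```python
import asyncio

def run_async_sketch(coroutines):
    """Drive a list of coroutines concurrently and collect their results."""
    async def gather_all():
        # asyncio.gather runs the coroutines concurrently and
        # returns their results in the original order
        return await asyncio.gather(*coroutines)

    # start an event loop, run everything to completion, return the results
    return asyncio.run(gather_all())
```

Because `asyncio.gather` preserves input order, such a utility can hand you back a list of return values that lines up with the list of coroutines you passed in.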
You might want the tweet to follow certain rules and patterns. To do this, we can use a prompt template:
```python
from realign import allm_messages_call, run_async

async def tweet(prompt, i):
    new_message = await allm_messages_call(
        messages=[{
            "role": "user",
            "content": prompt
        }],
    )
    print(f'\nTweet {i+1}:\n\n', new_message.content, '\n\n')

def main(prompt):
    tasks = []
    for i in range(10):
        tasks.append(tweet(prompt, i))

    # run the tasks in parallel
    run_async(tasks)

prompt = "Write a tweet about best practices for prompt engineering"

template = f'''
As an AI language model, your task is to create a high-quality tweet. Follow these guidelines:

1. Length: Keep the tweet within 280 characters, including spaces and punctuation.
2. Purpose: Clearly define the tweet's purpose (e.g., informing, entertaining, inspiring, or promoting).
3. Audience: Consider the target audience and tailor the language and content accordingly.
4. Tone: Maintain a consistent tone and don't sound salesy or disingenuous. Be based and direct.
5. Content: Ensure the tweet is:
   - Relevant and timely
   - Accurate and factual (if sharing information)
   - Original and engaging
   - Clear and concise
6. Structure:
   - Start with a hook or attention-grabbing element
   - Present the main idea or message clearly
   - End with a thought-provoking insight (if appropriate)
7. Hashtags: Include 1-2 relevant hashtags, if applicable, to increase discoverability.
8. Proofread: Ensure there are no spelling or grammatical errors.

After generating the tweet, review it to ensure it meets these criteria and make any necessary adjustments to optimize its quality and potential engagement.

Here is your instruction:
{prompt}
'''

main(template)
```
However, this is somewhat difficult to maintain in a file that should ideally contain only your code! Realign uses YAML configuration files to keep track of all settings that are stateless (things that don't depend on the application's runtime state). This can include things like `model`, `system_prompt`, `template`, `temperature`, etc.
Let's use a config file for this example. Create a `config.yaml` file in your directory, and paste the following:
```yaml
llm_agents:
  tweetbot:
    model: openai/gpt-4o-mini
    template: |
      As an AI language model, your task is to create a high-quality tweet. Follow these guidelines:

      1. Length: Keep the tweet within 280 characters, including spaces and punctuation.
      2. Purpose: Clearly define the tweet's purpose (e.g., informing, entertaining, inspiring, or promoting).
      3. Audience: Consider the target audience and tailor the language and content accordingly.
      4. Tone: Maintain a consistent tone and don't sound salesy or disingenuous. Be based and direct.
      5. Content: Ensure the tweet is:
         - Relevant and timely
         - Accurate and factual (if sharing information)
         - Original and engaging
         - Clear and concise
      6. Structure:
         - Start with a hook or attention-grabbing element
         - Present the main idea or message clearly
         - End with a thought-provoking insight (if appropriate)
      7. Hashtags: Include 1-2 relevant hashtags, if applicable, to increase discoverability.
      8. Proofread: Ensure there are no spelling or grammatical errors.

      After generating the tweet, review it to ensure it meets these criteria and make any necessary adjustments to optimize its quality and potential engagement.

      Here is your instruction:
      {{prompt}}
```
Some notes on the YAML file:

- The key `tweetbot` is the agent name. This is referenced in the code using the `agent_name` param.
- The `model` field specifies the model in `<provider>/<model>` format.
- The `template` field is a Jinja template (a string with double curly braces for variables, like `{{var}}`). The template is rendered using a Python dictionary that maps each variable name to the string rendered in its place. This dictionary is passed into `allm_messages_call` using the `template_params` field; see the sketch below this list.
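To make the rendering step concrete, here is what it looks like with the `jinja2` package directly. This is shown purely for illustration; Realign performs this step for you when you pass `template_params`:

```python
from jinja2 import Template

template = Template("Here is your instruction:\n{{ prompt }}")

# the template_params dict maps each variable name to its rendered value
rendered = template.render({"prompt": "Write a tweet about prompt engineering"})

print(rendered)
# Here is your instruction:
# Write a tweet about prompt engineering
```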
You can find the full specification for the YAML config here.
We can use this agent in our code by modifying the `allm_messages_call` params:
```python
async def tweet(prompt, i):
    new_message = await allm_messages_call(
        agent_name='tweetbot',
        template_params={'prompt': prompt}
    )
    print(f'\nTweet {i+1}:\n\n', new_message.content, '\n\n')
```
Finally, we just call `main` with our prompt:

```python
main("Write a tweet about best practices for prompt engineering")
```
Running this will use the config file settings to run your agent. To change the agent's behavior, edit its config directly. This lets you quickly play with various settings, including:
- model (string)
- hyperparams (model-specific dict)
- system_prompt (string)
- template (string)
- template_params (key-value map)
- json_mode (bool)
- role ('assistant' / 'user')
For example, we can set the temperature to 1 by adding a `hyperparams` section:
```yaml
llm_agents:
  tweetbot:
    model: openai/gpt-4o-mini
    hyperparams:
      temperature: 1.0
    template: |
      As an AI language model, ...
```
Now that you have a tweet agent that can generate tweets quickly, you might want to evaluate the responses to make sure the quality is high. To do this, we can set up evaluators.
But what is an evaluator?
An Evaluator is a function which scores your app's output and checks if the score is within a target range.
There are a few ways to implement evaluators:

- Python evaluator: a simple (or complex) Python function
- Local model evaluator: uses a model on your local device
- LLM evaluator: uses an LLM to generate a score for your application
In Realign, ANY Python function can become an evaluator if it is decorated with `@evaluator` (or `@aevaluator` if the function is async). A Realign evaluator's kwargs, repeat, targets, and other stateless settings can be configured in the YAML file, using the function name as the key.
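As a toy illustration of that pairing, a hypothetical `word_count` evaluator (not part of this guide's example) would be configured under its own function name, using the same `checker`/`target`/`asserts` keys you'll see later in this guide:

```python
from realign.evaluators import evaluator

@evaluator
def word_count(text: str) -> int:
    # the score is simply the number of words
    return len(text.split())
```

```yaml
evaluators:
  word_count:  # key matches the function name
    checker: numrange(value, target)
    target: (0, 50]
    asserts: on
```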
Let's build 4 evaluators:

- Tweet should be under 280 characters (simple Python function)
- Tweet should not be hateful (Hugging Face model)
- Tweet should be of positive sentiment (Hugging Face model)
- Tweet should score highly on specified criteria (LLM call)
To evaluate this metric, we simply measure the tweet's length with Python's `len()` and assert that we are within the character limit. Pretty simple.
```python
def tweet_length_checker(tweet_text):
    # Evaluation = assert that score(output) is within the target range
    num_chars = len(tweet_text)  # score(output)

    # assert that the score is within the target range
    assert 0 < num_chars <= 280

    # evaluation result
    return num_chars
```
For this metric, we can use one of Hugging Face's text classification models. First, install the dependencies:

```
pip install transformers datasets evaluate accelerate torch
```

Then paste the following:
```python
from transformers import pipeline
import torch

def hate_speech_detection(tweet_text):
    # Evaluation = assert that score(output) is within the target range
    # score = hate / nothate
    # target = nothate

    # Check if MPS is available (Mac Apple Silicon)
    if torch.backends.mps.is_available():
        device = torch.device("mps")
        print("Using MPS (Metal) device for GPU acceleration")
    else:
        device = torch.device("cpu")
        print("MPS not available, using CPU")

    # Create the pipeline with the specified device
    pipe = pipeline(task='text-classification',
                    model='facebook/roberta-hate-speech-dynabench-r4-target',
                    device=device)

    # get the response
    response = pipe(tweet_text)

    assert response[0]['label'] == 'nothate'

    return response[0]['label']
```
The pipeline will return something like:

```
[{'label': 'nothate', 'score': 0.9998}]
```
Similar to the last example, we can use a Hugging Face pipeline:
```python
from transformers import pipeline
import torch

def sentiment_detection(tweet_text):
    # Evaluation = assert that score(output) is within the target range
    # score = positive / neutral / negative
    # target = [positive, neutral]

    # Check if MPS is available (Mac Apple Silicon)
    if torch.backends.mps.is_available():
        device = torch.device("mps")
        print("Using MPS (Metal) device for GPU acceleration")
    else:
        device = torch.device("cpu")
        print("MPS not available, using CPU")

    # Create the pipeline with the specified device
    pipe = pipeline(task='text-classification',
                    model='cardiffnlp/twitter-roberta-base-sentiment-latest',
                    device=device)

    # get the response
    response = pipe(tweet_text)

    assert response[0]['label'] in ['positive', 'neutral']

    return response[0]
```
Feeling icky about all the code duplication? In Realign, we can pull all the static/stateless settings out into the config file. This helps you cleanly manage your evaluators. Moreover, Realign provides built-in configurations and evaluators.
In Realign, any Python function can be converted to an evaluator using the `@evaluator` decorator. If your function is async, you can use the `@aevaluator` decorator.
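For example, an async evaluator can await an LLM call inside its body. Here is a hypothetical sketch, assuming `aevaluator` is importable from `realign.evaluators` alongside `evaluator`:

```python
from realign import allm_messages_call
from realign.evaluators import aevaluator

@aevaluator
async def tweet_is_english(tweet_text):
    # hypothetical async evaluator: ask an LLM whether the tweet is in English
    message = await allm_messages_call(
        messages=[{
            "role": "user",
            "content": f"Answer yes or no: is this tweet in English?\n\n{tweet_text}"
        }],
    )
    return 'yes' in message.content.lower()
```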
In our case, Realign's `@evaluator` can help you create a wrapper around a base implementation of the Hugging Face pipeline, abstracting out any hyperparams you like.
We'll do this in 3 steps:

1. Create the base evaluator with the right keyword args (in this case, `task` and `model`)
2. Create a config file and add the Hugging Face evaluator settings
3. Use these evaluators in your code

Let's begin.
1. Create the base evaluator with the right keyword args (in this case, `task` and `model`)
```python
from transformers import pipeline
import torch
from realign.evaluators import evaluator

@evaluator
def hf_pipeline(tweet_text, task=None, model=None):
    assert task and model  # task and model should be specified

    # Check if MPS is available (Mac Apple Silicon)
    if torch.backends.mps.is_available():
        device = torch.device("mps")
        print("Using MPS (Metal) device for GPU acceleration")
    else:
        device = torch.device("cpu")
        print("MPS not available, using CPU")

    # Create the pipeline with the specified device
    pipe = pipeline(task=task,
                    model=model,
                    device=device)

    # get the response
    print(f'\n\nRunning hf_pipeline with task {task} and model {model}\n\n')
    response = pipe(tweet_text)

    return response[0]
```
2. Create a config file and add the Hugging Face evaluator settings

For the config, let's create a file in the same directory called `config.yaml`, and paste the following:
```yaml
evaluators:
  hf_hate_speech:
    wraps: hf_pipeline
    model: facebook/roberta-hate-speech-dynabench-r4-target
    task: text-classification
    checker: value['label'] in target
    target: [nothate]
    asserts: on

  hf_sentiment_classifier:
    wraps: hf_pipeline
    task: text-classification
    model: cardiffnlp/twitter-roberta-base-sentiment-latest
    checker: value['label'] in target
    target: [positive, neutral]
    asserts: on
```
3. Use these evaluators in your code
We can update our tweet function as follows:
```python
async def tweet(prompt, i):
    new_message = await allm_messages_call(
        agent_name='tweetbot',
        template_params={'prompt': prompt}
    )

    # evaluate for hate speech
    evaluators['hf_hate_speech'](new_message.content)

    # evaluate for positive sentiment
    evaluators['hf_sentiment_classifier'](new_message.content)
```
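Because both evaluators are configured with `asserts: on`, a tweet that falls outside the target range should raise rather than pass silently. Assuming those asserts surface as an `AssertionError` (as in the hand-rolled versions above), you could filter out failing tweets like this:

```python
async def tweet(prompt, i):
    new_message = await allm_messages_call(
        agent_name='tweetbot',
        template_params={'prompt': prompt}
    )

    try:
        # each call raises if the tweet falls outside its target range
        evaluators['hf_hate_speech'](new_message.content)
        evaluators['hf_sentiment_classifier'](new_message.content)
    except AssertionError:
        return None  # drop tweets that fail either check

    return new_message.content
```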
Using the YAML file helps you tweak different hyperparameters and settings easily without changing your code. For example, you can change the model or task, or adjust the target classes. This is a convenient way to adjust hyperparams for various wrappers. Of course, you can also create your own wrappers manually.
Let's quickly switch to configs for our `tweet_length_checker` evaluator as well:
```python
@evaluator
def tweet_length_checker(text: str) -> int:
    # the score is simply the character count; the target
    # range check now lives in the config below
    return len(text)
```
```yaml
evaluators:
  # ... other evaluators

  tweet_length_checker:
    checker: numrange(value, target)
    target: (0, 280]
    asserts: on
```
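Here, `(0, 280]` is interval notation: the lower bound is exclusive and the upper bound is inclusive. The check the config expresses is the same one we wrote by hand earlier:

```python
def within_target(num_chars: int) -> bool:
    # (0, 280] means: strictly greater than 0, at most 280
    return 0 < num_chars <= 280
```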
LLMs have great language and reasoning skills, and so we can use them to evaluate the responses of another LLM.
We can use Realign's `allm_rating_json` evaluator, which is already implemented for you. You can check out its configuration here and its implementation here. TL;DR: it uses the criteria to evaluate the messages, repeats the evaluation 3 times, and aggregates the scores and explanations.
```yaml
evaluators:
  tweet_judge:
    wraps: allm_rating_json
    criteria: |
      1. Make sure sentences are concise and don't use flowery language.
      2. It shouldn't sound too salesy or promotional.
      3. Don't use too many adjectives or adverbs.
      4. Don't make general claims.
      5. Don't start with a question.
```
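To wire the judge into the bot, call it like the other evaluators. Since `allm_rating_json` is async (note the `a` prefix), the sketch below assumes the wrapped `tweet_judge` needs to be awaited:

```python
async def tweet(prompt, i):
    new_message = await allm_messages_call(
        agent_name='tweetbot',
        template_params={'prompt': prompt}
    )

    # judge the tweet against the criteria in config.yaml
    rating = await evaluators['tweet_judge'](new_message.content)

    print(f'\nTweet {i+1}:\n\n', new_message.content, '\n\nJudge rating:', rating, '\n')
```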