This repository contains a Streamlit application designed to evaluate the performance of various Large Language Models (LLMs). The application allows users to upload text prompts, query the configured LLMs with these prompts, and then score the LLM responses for intent, correctness, and relevancy. The results are then stored in Google BigQuery, and visualized within the app.
```
├── judgement_model_app.py   # Streamlit application entry point
├── src
│   ├── streamlit_app.py     # Main Streamlit application code
│   ├── llm_scoring.py       # Functions to score LLM responses
│   ├── llm_query.py         # Functions to query LLM models
│   ├── file_manager.py      # Functions to manage prompt files
│   ├── bigquery_manager.py  # Functions to interact with Google BigQuery
│   └── config_manager.py    # Functions to load configurations and initialize Vertex AI
├── config
│   ├── config.yaml          # YAML file containing application configurations
│   └── key.json             # Service account key
├── static
│   ├── styling.css          # Custom CSS for the Streamlit application
│   └── new.png              # Custom background image
├── prompts.txt              # Prompt file, one prompt per line
└── README.md
```
`judgement_model_app.py` serves as the main entry point for the Streamlit application. It imports the necessary modules and calls the `main` function from `streamlit_app.py`.
`src/streamlit_app.py` contains the main logic for the Streamlit application. Here's a breakdown of its functionalities:
- UI Setup: Sets up the user interface using Streamlit, including the file uploader, evaluation button, and result display.
- Configuration Loading: Loads configurations such as model names, BigQuery settings, and the judgement model using `config_manager`.
- Prompt Management: Handles uploading and appending prompts to a `prompts.txt` file.
- LLM Interaction:
  - Uses `llm_query.py` to query the LLMs with the provided prompts.
  - Uses `llm_scoring.py` to score the responses on intent, correctness, and relevancy.
- Data Storage: Utilizes `bigquery_manager.py` to save the results to Google BigQuery.
- Visualization: Queries BigQuery data and displays results as tables and charts using `altair`.
- Concurrent Execution: Employs `concurrent.futures` for parallel execution of querying and scoring tasks.
- CSS Styling: Applies custom CSS to enhance the user interface.
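The concurrent querying-and-scoring pattern can be sketched roughly as follows. Note that `query_and_score` is an illustrative placeholder standing in for the real LLM call and judge scoring, not the app's actual API:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def query_and_score(model, prompt):
    """Placeholder for querying one model and scoring its response."""
    response = f"response from {model} to: {prompt}"  # stand-in for an LLM call
    scores = {"intent": 1, "correctness": 1, "relevancy": 1}  # stand-in for judge scores
    return {"model": model, "prompt": prompt, "response": response, **scores}


def evaluate(models, prompts):
    """Run every (model, prompt) pair concurrently and collect the results."""
    results = []
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(query_and_score, m, p)
                   for m in models for p in prompts]
        for future in as_completed(futures):
            results.append(future.result())
    return results
```

A thread pool is a reasonable fit here because the work is I/O-bound (waiting on remote LLM responses), so threads overlap the network latency without needing multiple processes.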
`src/llm_scoring.py` contains functions to score the LLM responses on intent, correctness, and relevancy. Each scoring function:

- Takes the model name, query, and the LLM response as inputs.
- Constructs a prompt for a judgement LLM to score the response.
- Returns the score for each criterion.
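As an illustration of this judge-prompt pattern, a scoring function along these lines might build a prompt for the judgement model and parse a numeric score out of its reply. The prompt wording and the 1–5 scale are assumptions, and the actual judgement-model call is left out:

```python
def build_judge_prompt(criterion, query, response):
    """Construct the prompt sent to the judgement LLM for one criterion."""
    return (
        f"You are a judge evaluating an LLM response for {criterion}.\n"
        f"User query: {query}\n"
        f"Model response: {response}\n"
        "Reply with a single integer score from 1 (worst) to 5 (best)."
    )


def parse_score(judge_reply, default=0):
    """Extract the first integer in the judge's reply; fall back to a default."""
    for token in judge_reply.split():
        if token.strip(".").isdigit():
            return int(token.strip("."))
    return default
```

Parsing defensively matters here: judgement models often wrap the score in extra text ("Score: 4."), so the parser scans tokens rather than assuming a bare number.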
`src/llm_query.py` provides functions to query a Google LLM with a given prompt. It includes:

- A function `query_google_llm` that takes a prompt and a model name as input and returns the response from the LLM, along with associated metadata.
`src/file_manager.py` manages reading and writing prompts to a text file. It contains:

- `read_prompts_from_file()`: Reads prompts from the `prompts.txt` file.
- `append_prompts_to_file()`: Appends new prompts to the `prompts.txt` file, avoiding duplicates.
`src/bigquery_manager.py` handles interactions with Google BigQuery:

- `load_responses_to_bigquery`: Uploads the responses and scores to a specified BigQuery table.
- `query_bigquery_results`: Queries results from the specified BigQuery table.
`src/config_manager.py` manages the application's configuration:

- `read_yaml`: Reads a YAML file and extracts the configurations.
- `initialize_vertex_ai`: Initializes Vertex AI with the service account credentials.
- `load_config`: Loads the configuration file, initializes Vertex AI, and returns both the configurations and the credentials.
`config/config.yaml` is the configuration file used by `config_manager.py`. It specifies the LLM models to evaluate, the project and dataset details for BigQuery, and other application settings.
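For illustration, the file might look something like the following. The key names and values here are assumptions based on the description above, not the repository's real schema — check `config_manager.py` for the keys it actually reads:

```yaml
# Hypothetical config.yaml layout
models:
  - gemini-1.0-pro
  - text-bison
judgement_model: gemini-1.0-pro
bigquery:
  project_id: my-gcp-project
  dataset: llm_eval
  table: responses
```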
`config/key.json` holds the service account key for authentication with Google Cloud services. Note: this file must be kept secure and never committed or exposed publicly.
`prompts.txt` holds the prompts used for LLM evaluation, one prompt per line.
`static/styling.css` contains the CSS styles used to customize the appearance of the Streamlit application.

`static/new.png` is the custom background image for the Streamlit application.
- Clone the Repository:
```shell
git clone <repository_url>
cd <repository_name>
```
- Install Dependencies:

  ```shell
  pip install -r requirements.txt
  ```

  Note: make sure the Google Cloud SDK is configured with a user account that has BigQuery access as well as Vertex AI access to invoke the generative models.
- Set up Configuration:
  - Ensure that the `config` directory contains `config.yaml` and `key.json`.
  - Update `config.yaml` with your project details, including the BigQuery and model configurations.
  - Place your service account key file at `config/key.json`.
- Run the Streamlit App:
  ```shell
  streamlit run judgement_model_app.py
  ```
- Access the Application: Open your web browser and navigate to the URL provided by Streamlit (usually `http://localhost:8501`).
- Upload Prompts:
  - Prepare a text file with one prompt per line (e.g., `prompts.txt`, or use the `static/prompts.txt`).
  - Use the file uploader in the "Step 1: Upload a Prompt File" section to upload the file.
- Evaluate:
- Click the "Evaluate" button in the "Step 2: Hit Evaluate" section.
- The application will query the configured LLMs with the uploaded prompts.
- It will then score the LLM responses and save the results to BigQuery.
- View Results:
- After evaluation is complete, the application will display the results using charts and a table.
- Charts will show the prompt count and accuracy scores of each model.
- Multiple Model Support: Allows evaluation of multiple LLMs simultaneously.
- Parallel Processing: Uses multi-threading to efficiently query and score responses.
- Data Persistence: Stores evaluation results in BigQuery for analysis.
- Interactive Visualization: Provides insightful charts to compare model performance.
- Customizable Styling: Uses a CSS file for styling the application.
- Customizable Background: Uses an image as the application background.
Feel free to fork this repository and submit pull requests to improve the application.