newscorp-ghfb/judgement-model
LLM Model Evaluation Tool

This repository contains a Streamlit application designed to evaluate the performance of various Large Language Models (LLMs). The application allows users to upload text prompts, query the configured LLMs with those prompts, and then score the LLM responses for intent, correctness, and relevancy. The results are stored in Google BigQuery and visualized within the app.

Repository Structure

├── judgement_model_app.py         # Entry point for the Streamlit application
├── src
│   ├── streamlit_app.py           # Main Streamlit application logic
│   ├── llm_scoring.py             # Functions to score LLM responses
│   ├── llm_query.py               # Functions to query LLM models
│   ├── file_manager.py            # Functions to manage prompt files
│   ├── bigquery_manager.py        # Functions to interact with Google BigQuery
│   └── config_manager.py          # Functions to load configurations and initialize Vertex AI
├── config
│   ├── config.yaml                # YAML file containing application configurations
│   └── key.json                   # Service account key
├── static
│   ├── styling.css                # Custom CSS for the Streamlit application
│   ├── new.png                    # Custom background image
│   └── prompts.txt                # Prompt file, one prompt per line
└── README.md

Script Descriptions

judgement_model_app.py

This script serves as the main entry point for the Streamlit application. It imports necessary modules and calls the main function from streamlit_app.py.

src/streamlit_app.py

This script contains the main logic for the Streamlit application. Here's a breakdown of its functionalities:

  • UI Setup: It sets up the user interface using Streamlit, including file upload, evaluation button, and result display.
  • Configuration Loading: Loads configurations such as model names, BigQuery settings, and judgement model using config_manager.
  • Prompt Management: Handles uploading and appending prompts to a prompts.txt file.
  • LLM Interaction:
    • Uses llm_query.py to query the LLMs with the provided prompts.
    • Uses llm_scoring.py to score the responses on intent, correctness, and relevancy.
  • Data Storage: Utilizes bigquery_manager.py to save the results to Google BigQuery.
  • Visualization: Queries BigQuery data and displays results as tables and charts using Altair.
  • Concurrent Execution: Employs concurrent.futures for parallel execution of querying and scoring tasks.
  • CSS Styling: Applies custom CSS to enhance the user interface.
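
The parallel querying step described above can be sketched as follows. This is a minimal illustration of the concurrent.futures pattern, not the repository's actual code; query_model is a hypothetical stand-in for the real llm_query call.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the real llm_query call
def query_model(model, prompt):
    return f"{model}: {prompt}"

models = ["model-a", "model-b"]
prompts = ["What is BigQuery?", "What is Vertex AI?"]

# Fan out one task per (model, prompt) pair and collect results in order
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(query_model, m, p) for m in models for p in prompts]
    results = [f.result() for f in futures]
```

Because LLM calls are network-bound, a thread pool like this lets several prompts be in flight at once rather than waiting on each response serially.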

src/llm_scoring.py

This script contains functions to score the LLM responses based on intent, correctness, and relevancy. Each scoring function:

  • Takes the model name, query, and the LLM response as inputs.
  • Constructs a prompt for a judgement LLM to score the response.
  • Returns the score for each criterion.
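
The judgement-prompt construction step might look like the sketch below. The exact template and score scale are assumptions; the repository's actual prompt wording may differ.

```python
def build_scoring_prompt(model_name, query, response, criterion):
    """Build a judgement prompt asking an LLM judge to score a response.

    Hypothetical format -- the repository's actual template may differ.
    """
    return (
        f"You are an impartial judge. Score the response below for {criterion} "
        f"on a scale of 1-5. Reply with the number only.\n"
        f"Model under test: {model_name}\n"
        f"Question: {query}\n"
        f"Response: {response}\n"
        f"Score:"
    )

prompt = build_scoring_prompt("model-a", "What is 2+2?", "4", "correctness")
```

One such prompt is built per criterion (intent, correctness, relevancy) and sent to the configured judgement model.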

src/llm_query.py

This script provides functions to query a Google LLM with a given prompt. It includes:

  • A function query_google_llm that takes a prompt and a model name as input and returns the LLM's response along with associated metadata.
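
The shape of such a wrapper can be sketched as below. The generate parameter stands in for the real Vertex AI generation call, and the metadata fields are assumptions; they illustrate the (response, metadata) return contract rather than the actual implementation.

```python
import time

def query_google_llm(prompt, model_name, generate=None):
    """Hypothetical wrapper: call an LLM and return (response_text, metadata).

    `generate` is a stand-in for the real Vertex AI generation call.
    """
    if generate is None:
        generate = lambda p: f"(stub response to: {p})"  # placeholder backend
    start = time.time()
    text = generate(prompt)
    metadata = {"model": model_name, "latency_s": round(time.time() - start, 3)}
    return text, metadata

text, meta = query_google_llm("Hello", "gemini-test")
```

Returning metadata alongside the text makes it easy to log latency and model identity with each BigQuery row later.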

src/file_manager.py

This script manages reading and writing prompts to a text file. It contains:

  • read_prompts_from_file(): Reads prompts from the prompts.txt file.
  • append_prompts_to_file(): Appends new prompts to the prompts.txt file, avoiding duplicates.
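
A minimal sketch of these two helpers, assuming the dedup-on-append behavior described above (actual signatures may differ):

```python
import tempfile
from pathlib import Path

def read_prompts_from_file(path):
    """Read one prompt per line, skipping blank lines."""
    path = Path(path)
    if not path.exists():
        return []
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]

def append_prompts_to_file(new_prompts, path):
    """Append prompts to the file, skipping any that are already present."""
    path = Path(path)
    seen = set(read_prompts_from_file(path))
    with path.open("a") as f:
        for p in new_prompts:
            if p and p not in seen:
                f.write(p + "\n")
                seen.add(p)

tmp = Path(tempfile.mkdtemp()) / "prompts.txt"
append_prompts_to_file(["What is BigQuery?", "What is BigQuery?"], tmp)
```

Deduplicating on append keeps repeated uploads of the same prompt file from inflating the evaluation set.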

src/bigquery_manager.py

This script handles interactions with Google BigQuery:

  • load_responses_to_bigquery: Uploads the responses and scores to a specified BigQuery table.
  • query_bigquery_results: Queries results from the specified BigQuery table.
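
Before loading, each evaluation result has to be shaped into a row matching the BigQuery table schema. The column names below are assumptions for illustration, not the repository's actual schema:

```python
# Hypothetical row shape for the BigQuery load -- actual schema is assumed
def build_row(prompt, model, response, scores):
    return {
        "prompt": prompt,
        "model": model,
        "response": response,
        "intent_score": scores["intent"],
        "correctness_score": scores["correctness"],
        "relevancy_score": scores["relevancy"],
    }

row = build_row("What is 2+2?", "model-a", "4",
                {"intent": 5, "correctness": 5, "relevancy": 4})
```

A list of such dicts can then be handed to the BigQuery client's JSON load path in one batch per evaluation run.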

src/config_manager.py

This script manages the application's configuration:

  • read_yaml: Reads a YAML file and extracts the configurations.
  • initialize_vertex_ai: Initializes Vertex AI with the service account credentials.
  • load_config: Loads the configuration file, initializes Vertex AI, and returns both configurations and credentials.

config/config.yaml

This is the configuration file used by config_manager.py. It specifies the LLM models to evaluate, the project and dataset details for BigQuery, and other application settings.
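
A hypothetical layout for this file is shown below; the exact keys and model names are assumptions based on the settings the README describes (models under evaluation, judgement model, BigQuery target, Vertex AI init):

```yaml
# Hypothetical config.yaml layout -- key names are assumptions
models:
  - gemini-1.0-pro
  - gemini-1.5-flash
judgement_model: gemini-1.5-pro
bigquery:
  project_id: my-gcp-project
  dataset: llm_eval
  table: model_scores
vertex_ai:
  location: us-central1
  key_path: config/key.json
```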

config/key.json

This file holds the service account key for authentication with Google Cloud services. Note: This file should be kept secure and not exposed publicly.

static/prompts.txt

This file holds the prompts used for LLM evaluation, with one prompt per line.

static/styling.css

This file contains the CSS styles used to customize the appearance of the Streamlit application.

static/new.png

This file contains the custom background image for the Streamlit application.

Steps to Run the Streamlit Application

  1. Clone the Repository:
    git clone <repository_url>
    cd <repository_name>
  2. Install Dependencies:
    pip install -r requirements.txt
    Note: ensure the Google Cloud SDK is configured with a user account that has BigQuery access and Vertex AI access to invoke the generative models.
  3. Set up Configuration:
    • Ensure that the config directory contains config.yaml and key.json.
    • Update the config.yaml with your project details, including BigQuery and model configurations.
    • Place your service account key file at config/key.json.
  4. Run the Streamlit App:
    streamlit run judgement_model_app.py
  5. Access the Application: Open your web browser and navigate to the URL provided by Streamlit (usually http://localhost:8501).

How to Use the Application

  1. Upload Prompts:
    • Prepare a text file with one prompt per line (e.g., prompts.txt, or use the provided static/prompts.txt).
    • Use the file uploader in the "Step 1: Upload a Prompt File" section to upload the file.
  2. Evaluate:
    • Click the "Evaluate" button in the "Step 2: Hit Evaluate" section.
    • The application will query the configured LLMs with the uploaded prompts.
    • It will then score the LLM responses and save the results to BigQuery.
  3. View Results:
    • After evaluation is complete, the application will display the results using charts and a table.
    • Charts will show the prompt count and accuracy scores for each model.

Key Features

  • Multiple Model Support: Allows evaluation of multiple LLMs simultaneously.
  • Parallel Processing: Uses multi-threading to efficiently query and score responses.
  • Data Persistence: Stores evaluation results in BigQuery for analysis.
  • Interactive Visualization: Provides insightful charts to compare model performance.
  • Customizable Styling: Uses a CSS file for styling the application.
  • Customizable Background: Uses an image as the application background.

Contributing

Feel free to fork this repository and submit pull requests to improve the application.
