alekswael/saiphipaca-RAG

David and Goliath: Domain-specific instruction fine-tuning of a lightweight LLM (phi-1.5) on synthetic data for use in RAG applications

This repository contains the code for instruction-tuning microsoft/phi-1_5 on the Alpaca dataset and on the Synthetic Academic Instruct (SAI) dataset to create two models: alekswael/phipaca and alekswael/saiphipaca. It also includes the code for generating the SAI dataset and the Alpaca subset, as well as a benchmark for testing the models' performance on a Retrieval Augmented Generation (RAG) task.
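
The two fine-tuned models are LoRA adapters on top of the base phi-1.5 weights. As a minimal inference sketch, assuming the adapters are available under the Hugging Face Hub IDs above and that an Alpaca-style prompt format applies (both are assumptions, not confirmed by this README):

    # Hypothetical sketch: load the base model and attach one of the
    # fine-tuned PEFT adapters. The Hub IDs and the Alpaca-style prompt
    # format are assumptions based on the description above.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")
    base = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5")
    model = PeftModel.from_pretrained(base, "alekswael/phipaca")

    prompt = "### Instruction:\nWhat is instruction tuning?\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=100)
    print(tokenizer.decode(output[0], skip_special_tokens=True))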

Abstract from Wael & Baskakovs (2023):

The development of large language models (LLMs) like OpenAI's GPT-4 has yielded incredible results for NLP tasks. Despite their impressive capabilities, the scale of these models poses challenges in terms of computational resources and accessibility. This study addresses the issue of scale in LLMs by exploring the viability of smaller language models (SLMs) in achieving comparable performance to their larger counterparts. In this context, we instruction-tune a small 1.3B parameter model (Microsoft’s phi-1.5) using QLoRA on two datasets: 1) a subset of the Alpaca dataset for task-independent instruction-following abilities, and 2) a generated Synthetic Academic Instruct dataset for task-dependent performance. The fine-tuned models, named phipaca and saiphipaca, are then evaluated against the base phi-1.5 model in a Retrieval Augmented Generation (RAG) task. Our evaluation, based on cosine similarity with outputs from a state-of-the-art model (gpt-3.5-turbo) and an inspection of model outputs, reveals that while the base model performs well, the fine-tuned models exhibit limitations, such as false information and hallucinations, suggesting a need for refinement in training methodologies.

Project Structure

  • data/: Contains the datasets used for training the models. It includes the SAI dataset, Alpaca dataset, and benchmark data.
  • models/: Contains the saved model checkpoints. It includes the phipaca checkpoint (alpaca_peft) and the saiphipaca checkpoint (synthetic_peft).
  • results/: Contains the results of the model training and evaluation. It includes chunk histograms, cosine similarity plots, and CSV files with results for different tasks.
  • src/: Contains the source code for the project. It includes code for benchmark scores, data generation, and model training.
  • run_tests.sh: A shell script to run the RAG benchmarks.
  • requirements.txt: Contains the Python dependencies required for this project.

Installation

Be sure to have Python installed on your system before proceeding.

To create a Python virtual environment and install the required packages, open a bash terminal and run the following script from the repository root:

bash setup.sh

Usage

Data generation

To generate the SAI dataset, run the following script:

bash data_generation.sh
  • This script runs the scripts in the data_generation folder in the correct order; a sketch of the core generation step follows below.
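
The generation logic itself lives in src/data_generation (synthetic_data_generator.py and related scripts). Purely for orientation, here is a hedged sketch of how synthetic instruction pairs can be generated with an OpenAI-style chat API; the prompt, model name, and output handling are illustrative assumptions, not the scripts' actual logic:

    # Hypothetical sketch of synthetic instruction-data generation with an
    # OpenAI-style chat API. The prompt, model name, and output handling
    # are illustrative assumptions, not the actual logic of the scripts.
    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def generate_pair(topic: str) -> dict:
        """Request one instruction/response pair about an academic topic."""
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": (
                    f"Write one instruction about the academic topic '{topic}' "
                    "and a correct response. Return JSON with the keys "
                    "'instruction' and 'output'."
                ),
            }],
        )
        return json.loads(response.choices[0].message.content)

    pairs = [generate_pair(t) for t in ["transformer models", "peer review"]]
    with open("sai_pairs.json", "w") as f:
        json.dump(pairs, f, indent=2)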

Model training

The models were trained on Google Colab, using an NVIDIA Tesla T4 GPU. The Jupyter notebooks are available in the model_training folder; be sure to modify them to use your own paths. A condensed sketch of the training setup follows below.
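
For orientation, here is a condensed sketch of a QLoRA setup for phi-1.5 using transformers, peft, and bitsandbytes. The hyperparameters, target modules, and dataset handling are illustrative assumptions rather than the notebooks' exact values:

    # Condensed QLoRA sketch for phi-1.5. Hyperparameters, target modules,
    # and dataset handling are illustrative assumptions, not the notebooks'
    # exact values.
    import torch
    from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
                              TrainingArguments)
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Load the base model quantised to 4-bit NF4 (the "Q" in QLoRA).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/phi-1_5", quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)

    # Only the low-rank adapter weights are trained; the quantised base
    # model stays frozen. The target modules here are an assumption.
    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)

    args = TrainingArguments(
        output_dir="models/alpaca_peft/checkpoints",
        per_device_train_batch_size=4,
        num_train_epochs=3,
        fp16=True,
    )
    # A transformers.Trainer would then be run with a tokenised
    # instruction/response dataset, as done in the notebooks.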

RAG benchmarking

To test the models' capabilities for RAG, run the following script:

bash run_tests.sh
  • This script runs all RAG benchmark tests; the results are written to the results folder. It creates cosine_similarity.png, which contains two bar plots of the models' performance on the single-PDF and joint-PDF tasks. The scoring approach is sketched below.
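
The underlying scoring idea is to embed each model answer alongside the gpt-3.5-turbo reference answer and compare the two with cosine similarity. A minimal sketch, assuming a sentence-transformers embedder (the model choice is illustrative, not necessarily what cosine_similarity_benchmark_data.py uses):

    # Minimal sketch of the scoring idea: embed a model answer and the
    # gpt-3.5-turbo reference answer, then compare with cosine similarity.
    # The sentence-transformers model choice is an illustrative assumption.
    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def score(candidate: str, reference: str) -> float:
        """Cosine similarity between two answers in embedding space."""
        vectors = embedder.encode([candidate, reference])
        return float(cosine_similarity([vectors[0]], [vectors[1]])[0][0])

    print(score(
        "The authors instruction-tune a 1.3B parameter model.",
        "A small 1.3B model is fine-tuned for instruction following.",
    ))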

License

The project is licensed under the MIT license.

Full repository structure

├── data
│   ├── alpaca_dataset
│   │   ├── dataset_test.json
│   │   └── dataset_train.json
│   ├── benchmark_data
│   │   ├── all_questions_data
│   │   │   ├── benchmark_data.csv
│   │   │   └── benchmark_data.xlsx
│   │   ├── joint_paper
│   │   │   ├── joint_paper_data.csv
│   │   │   ├── joint_paper_data.xlsx
│   │   │   └── joint_paper.pdf
│   │   └── single_paper
│   │       ├── single_paper_data.csv
│   │       ├── single_paper_data.xlsx
│   │       └── single_paper.pdf
│   └── SAI_dataset
│       ├── dataset_test.json
│       ├── dataset_train.json
│       └── SAI_dataset_0312.csv
├── LICENSE
├── models
│   ├── alpaca_peft
│   │   ├── checkpoints
│   │   │   └── checkpoint-1000
│   │   │       ├── adapter_config.json
│   │   │       ├── adapter_model.safetensors
│   │   │       ├── optimizer.pt
│   │   │       ├── README.md
│   │   │       ├── rng_state.pth
│   │   │       ├── scheduler.pt
│   │   │       ├── trainer_state.json
│   │   │       └── training_args.bin
│   │   ├── loss_epochs.png
│   │   └── loss_steps.png
│   └── synthetic_peft
│       ├── checkpoints
│       │   └── checkpoint-267
│       │       ├── adapter_config.json
│       │       ├── adapter_model.safetensors
│       │       ├── optimizer.pt
│       │       ├── README.md
│       │       ├── rng_state.pth
│       │       ├── scheduler.pt
│       │       ├── trainer_state.json
│       │       └── training_args.bin
│       ├── loss_epochs.png
│       └── loss_steps.png
├── README.md
├── requirements.txt
├── results
│   ├── chunk_histograms
│   │   ├── joint_paper_histogram_phi-1_5.png
│   │   ├── joint_paper_histogram_phipaca.png
│   │   ├── joint_paper_histogram_saiphipaca.png
│   │   ├── single_paper_histogram_phi-1_5.png
│   │   ├── single_paper_histogram_phipaca.png
│   │   └── single_paper_histogram_saiphipaca.png
│   ├── cosine_similarity
│   │   ├── cosine_similarity.png
│   │   ├── joint_paper_results_phi-1_5.csv
│   │   ├── joint_paper_results_phipaca.csv
│   │   ├── joint_paper_results_saiphipaca.csv
│   │   ├── single_paper_results_phi-1_5.csv
│   │   ├── single_paper_results_phipaca.csv
│   │   └── single_paper_results_saiphipaca.csv
│   ├── joint_paper_results_phi-1_5.csv
│   ├── joint_paper_results_phipaca.csv
│   ├── joint_paper_results_saiphipaca.csv
│   ├── single_paper_results_phi-1_5.csv
│   ├── single_paper_results_phipaca.csv
│   └── single_paper_results_saiphipaca.csv
├── run_tests.sh
└── src
    ├── benchmark_scores
    │   ├── cosine_similarity_benchmark_data.py
    │   ├── RAG_test_GPT.py
    │   ├── RAG_test.py
    │   └── results_plots.py
    ├── data_generation
    │   ├── cosine_similarity_training_data.py
    │   ├── data_fix.py
    │   ├── synthetic_data_generator.py
    │   └── synthetic_data_prep.py
    └── model_training
        ├── phipaca_train.ipynb
        └── saiphipaca_train.ipynb
