Fine-tuning an LLM using a Generic Workflow and Best Practices with PyTorch

mddunlap924/PyTorch-LLM

PyTorch Workflow for Large Language Models (LLM)

Use this repository as a basic framework for fine-tuning Large Language Models (LLMs) with PyTorch.

Introduction  •  Getting Started  •  Generic Workflow  •  Use Case  •  Deep Learning Techniques  •  Issues  •  TODOs

Introduction

This workflow helps you get accustomed to LLM project structure and to PyTorch for custom model creation, showcasing multi-class classification on a public dataset with an LLM from the Hugging Face Hub.

Workflow Advantages

Key advantages of this workflow not commonly found elsewhere include:

  • PyTorch Models: It employs a custom PyTorch class for LLM fine-tuning, allowing custom layers, activation functions, layer freezing, model heads, loss functions, etc. through a PyTorch Module, unlike typical HuggingFace Tasks.
  • Python Modules and Directory Structure: The organized directory structure supports Python modules and config files for versatility, inspired by Joel Grus' presentation on Jupyter Notebooks.
  • Configuration Files for Input Parameters: For script execution via CLI or Cron scheduling, configuration files enable flexible pipeline variations and automated execution.
  • Updated PyTorch and LLM Packages: This workflow uses up-to-date PyTorch and NLP packages, reflecting open-source advancements made since ChatGPT's late-2022 release.
  • Integrated Feature Set: The repository provides a comprehensive feature set for quick pipeline development and modification.

NOTE: This workflow can be adapted for many PyTorch deep learning applications, not just LLMs.
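As a sketch of the configuration-file approach described above, a single YAML file can drive the whole pipeline. The keys below are illustrative only, not the repository's actual schema:

```python
import yaml  # PyYAML

# Illustrative configuration; the repository's real schema may differ
CONFIG_TEXT = """
model_name: distilbert-base-uncased
num_folds: 5
epochs: 3
lr: 2.0e-5
batch_size: 16
"""

def load_config(text: str) -> dict:
    """Parse a YAML configuration string into a dict of pipeline parameters."""
    return yaml.safe_load(text)

cfg = load_config(CONFIG_TEXT)
print(cfg["model_name"], cfg["num_folds"])
```

Because each experiment is just a different YAML file, variations can be queued up and launched from the CLI or a Cron job without touching the code.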

Getting Started

To understand this workflow, proceed with the use case in the following order:

Review this EDA - Jupyter Notebook for a brief exploration of the CFPB data, featuring model features, target distributions, text tokens count, and data reduction.

Use this notebook to train a model via a single configuration file, with supplementary pre-training tasks and further analysis techniques for model selection.

This script offers robust long-term training routines across various configuration files and can be paired with this bash shell script for full automation of model development and experiments, ideal for prolonged runs and allowing your computer to work autonomously.

Generic Workflow

The pseudocode provided below guides this repository and outlines a cross-validation training process using PyTorch.

INPUT: YAML config. file
OUTPUT: Model checkpoints, training log

1. Load YAML config.
2. Create cross-validation (C.V.) data folds
3. Loop over each data fold:
  A.) Training module
    * Dataloader with custom preprocessing and collator
    * Train a custom PyTorch model
    * Standard PyTorch training loop with: save checkpoints, log training metrics, etc.
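Steps 2 and 3 above can be sketched in plain Python as follows (a minimal round-robin fold splitter for illustration; the repository may instead use something like scikit-learn's KFold utilities):

```python
def kfold_indices(n_samples: int, n_folds: int):
    """Yield (train_idx, val_idx) pairs for simple round-robin K-fold C.V."""
    folds = [[] for _ in range(n_folds)]
    for i in range(n_samples):
        folds[i % n_folds].append(i)  # assign samples to folds in turn
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = [i for i in range(n_samples) if i % n_folds != k]
        yield train_idx, val_idx

# loop over each data fold (step 3)
for fold, (train_idx, val_idx) in enumerate(kfold_indices(10, 5)):
    print(f"fold {fold}: train={len(train_idx)} val={len(val_idx)}")
```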

The standard PyTorch training loop, shown below, is used here. Additional modifications to improve model performance and training/inference speed are also implemented in the loop.

# loop through batches
for inputs, labels in data_loader:

    # move inputs and labels to the compute device
    inputs = inputs.to(device)
    labels = labels.to(device)

    # forward/backward passes and weight updates
    with torch.set_grad_enabled(True):

        # forward pass
        preds = model(inputs)
        loss  = criterion(preds, labels)

        # backward pass
        loss.backward()

        # weights update
        optimizer.step()
        optimizer.zero_grad()
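A matching evaluation loop (a sketch; the names mirror the training loop above) disables gradient tracking for speed and memory savings:

```python
import torch

def evaluate(model, data_loader, criterion, device):
    """Run one validation pass; return (mean loss, accuracy)."""
    model.eval()
    total_loss, correct, count = 0.0, 0, 0
    with torch.no_grad():  # no gradients needed during inference
        for inputs, labels in data_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            preds = model(inputs)
            total_loss += criterion(preds, labels).item() * labels.size(0)
            correct += (preds.argmax(dim=1) == labels).sum().item()
            count += labels.size(0)
    return total_loss / count, correct / count
```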

Use Case

The NLP dataset used here is obtained from The Consumer Financial Protection Bureau (CFPB), available on Kaggle, featuring consumer complaints about financial providers.

Model Training Objective

We're performing multi-class classification on this dataset, where the five product categories represent the target variable, and three source variables are used as input for the LLM model.

  • NOTE: The input variables used in this example include unstructured text and categorical variables, showcasing how to combine mixed data types for LLM fine-tuning; selecting these variables to maximize predictive performance was not the primary focus.
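One way to combine mixed input types in a custom PyTorch module (a hypothetical sketch, not the repository's exact architecture) is to concatenate the text encoder's pooled output with an embedding of the categorical variable before the classification head:

```python
import torch
import torch.nn as nn

class MixedInputClassifier(nn.Module):
    """Toy classifier combining a text encoding with a categorical feature.

    `text_encoder` stands in for an LLM backbone (e.g. a Hugging Face model);
    here it is any module mapping token ids to a fixed-size vector.
    """

    def __init__(self, text_encoder, text_dim, n_categories, cat_dim, n_classes):
        super().__init__()
        self.text_encoder = text_encoder
        self.cat_embed = nn.Embedding(n_categories, cat_dim)
        self.head = nn.Linear(text_dim + cat_dim, n_classes)

    def forward(self, text_ids, cat_ids):
        text_vec = self.text_encoder(text_ids)      # (batch, text_dim)
        cat_vec = self.cat_embed(cat_ids)           # (batch, cat_dim)
        combined = torch.cat([text_vec, cat_vec], dim=1)
        return self.head(combined)                  # (batch, n_classes)

# stand-in encoder: token embedding + mean pooling over the sequence
class MeanPoolEncoder(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, ids):
        return self.embed(ids).mean(dim=1)

model = MixedInputClassifier(MeanPoolEncoder(100, 8), 8, 4, 3, 5)
logits = model(torch.randint(0, 100, (2, 6)), torch.randint(0, 4, (2,)))
print(logits.shape)  # torch.Size([2, 5])
```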

Metrics

The classification performance was evaluated using multi-class F1 score, precision, and recall, but other metrics could be used as well.
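For example, with scikit-learn (one option; equivalents exist in torchmetrics and elsewhere), and using toy labels rather than the CFPB data:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# toy predictions over 5 product categories (labels 0-4)
y_true = [0, 1, 2, 3, 4, 0, 1, 2]
y_pred = [0, 1, 2, 3, 3, 0, 1, 1]

# macro averaging treats all classes equally, regardless of support
f1 = f1_score(y_true, y_pred, average="macro")
precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
print(f"F1={f1:.3f} P={precision:.3f} R={recall:.3f}")
```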

Deep Learning Techniques

Below is a list of deep learning techniques and tools utilized throughout this repository.

Issues

This repository will be maintained on a best-effort basis. If you face any issues or want to make improvements, please raise an Issue or open a Pull Request. 😃

TODOs

Liked the work? Please give a star!
