Dataset Generator for Fine-Tuning

This repository contains a system for generating question-answer pairs for fine-tuning LLMs from your own data. The system leverages various modules to extract text, generate questions using a language model, and save the generated questions.

Latest Update

  • Added support for processing HTML input files.
  • Added a feature to remove duplicate and similar questions.
  • Simplified the JSONL output format cleaning process.
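The duplicate-removal step can be sketched as follows. This is a minimal illustration only: it uses a bag-of-words cosine similarity in place of the repository's actual ChromaDB embedding lookup, and the function names and threshold value are assumptions, not the project's API.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two questions (a stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def deduplicate(questions: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a question only if it is not too similar to any already-kept question."""
    kept: list[str] = []
    for q in questions:
        if all(cosine_similarity(q, k) < threshold for k in kept):
            kept.append(q)
    return kept

questions = [
    "What is fine-tuning?",
    "What is fine-tuning?",  # exact duplicate, dropped
    "How does fine-tuning work?",
]
print(deduplicate(questions))
```

In the real pipeline, the similarity would come from querying the ChromaDB collection configured in config.json, compared against `duplicate_threshold`.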

Architecture Diagram

Architecture diagram

Table of Contents

  1. Supported Inference Engines
  2. Installation
  3. Configuration
  4. Usage
  5. Prompts
  6. Contributing
  7. License

Supported Inference Engines

  1. vLLM
  2. OpenAI API
  3. Azure OpenAI API
  4. Ollama
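Supporting several engines behind one interface is typically done with a dispatch table keyed on the configured engine name. The sketch below is a hypothetical illustration; the stub functions and the `generate` helper are assumptions, and the real project would call vLLM, the OpenAI/Azure OpenAI APIs, or Ollama inside them.

```python
from typing import Callable

# Hypothetical per-engine generate functions; real implementations would
# issue API calls to the corresponding backend.
def _vllm(prompt: str) -> str: return f"[vllm] {prompt}"
def _openai(prompt: str) -> str: return f"[openai] {prompt}"
def _azure(prompt: str) -> str: return f"[azure] {prompt}"
def _ollama(prompt: str) -> str: return f"[ollama] {prompt}"

ENGINES: dict[str, Callable[[str], str]] = {
    "vllm": _vllm, "openai": _openai, "azure": _azure, "ollama": _ollama,
}

def generate(engine: str, prompt: str) -> str:
    """Dispatch a prompt to the engine named by config.json's inference_engine."""
    try:
        return ENGINES[engine](prompt)
    except KeyError:
        raise ValueError(f"Unsupported inference engine: {engine!r}")

print(generate("azure", "Generate a QA pair."))
```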

Installation

  1. Clone the repository:
     git clone https://github.com/yourusername/question-generation.git
     cd question-generation
  2. Create a virtual environment and activate it:
     python3.11 -m venv .venv
     source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
  3. Install the required dependencies:
     pip install -r requirements.txt
  4. Copy the example environment file and configure it:
     cp .env.example .env
  5. Update the .env file with your API URL and API key.

Configuration

The configuration for the model is specified in the config.json file. You can update the model name or other parameters as needed:

{
    "inference_engine": "azure",
    "model_name": "llama3.1",
    "model_max_tokens": 10000,
    "input_folder": "input_data",
    "output_folder": "generated_questions",
    "chroma_db_path": "chromadb",
    "chroma_collection_name": "questions",
    "duplicate_threshold": 0.1
}

Note that JSON does not allow inline comments, so the fields are described here instead:

  • inference_engine — which inference engine to use (vllm, openai, azure, or ollama)
  • model_name — the model to query
  • model_max_tokens — the model's maximum token limit
  • input_folder — location of the input data
  • output_folder — location for the generated questions
  • chroma_db_path — vector DB location
  • chroma_collection_name — vector DB collection name
  • duplicate_threshold — similarity threshold for duplicate checking
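A minimal sketch of loading and sanity-checking this file with the standard library, assuming the key set shown above (the `load_config` helper is an illustration, not the project's actual loader):

```python
import json
from pathlib import Path

REQUIRED_KEYS = {
    "inference_engine", "model_name", "model_max_tokens",
    "input_folder", "output_folder",
    "chroma_db_path", "chroma_collection_name", "duplicate_threshold",
}

def load_config(path: str = "config.json") -> dict:
    """Load config.json and fail fast if any expected key is missing."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    return config
```

Failing fast on a missing key surfaces configuration mistakes at startup rather than midway through a long generation run.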

Usage

  1. Place your input files in the input_data folder.

  2. To run the question generation process, execute the main.py script:

python main.py
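The generated pairs are written in JSONL format, one JSON object per line. A minimal sketch of reading such a file back, assuming `question`/`answer` field names (the exact schema is an assumption):

```python
import json

def read_jsonl(text: str) -> list[dict]:
    """Parse JSONL: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

sample = '{"question": "What is fine-tuning?", "answer": "Adapting a pretrained model to a task."}\n'
records = read_jsonl(sample)
print(records[0]["question"])
```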

Prompts

  • The system prompt for generating question-answer pairs is located in the prompts folder as generateQA-sys_prompt.txt

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any changes.

License

This project is licensed under the Apache-2.0 license. See the LICENSE file for details.