Dataset Generator for Fine-Tuning

This repository contains a system for generating question-answer pairs for fine-tuning LLMs from your own data. The system leverages various modules to extract text, generate questions using a language model, and save the generated questions.

Latest Update

  • Added support for processing HTML input files.
  • Added a feature to remove duplicate and similar questions.
  • Simplified the JSONL output format cleaning process.
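The duplicate-removal step can be sketched as follows. This is a minimal illustration only: it uses a bag-of-words cosine similarity in place of the repository's actual ChromaDB embedding lookup, and the function names and threshold value are assumptions, not the project's API.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two questions (a stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def deduplicate(questions: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a question only if it is not too similar to any already-kept question."""
    kept: list[str] = []
    for q in questions:
        if all(cosine_similarity(q, k) < threshold for k in kept):
            kept.append(q)
    return kept

questions = [
    "What is fine-tuning?",
    "What is fine-tuning?",  # exact duplicate, dropped
    "How does fine-tuning work?",
]
print(deduplicate(questions))
```

In the real pipeline, the similarity would come from querying the ChromaDB collection configured in config.json, compared against `duplicate_threshold`.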

Architecture Diagram

Architecture diagram

Table of Contents

  1. Supported Inference Engines
  2. Installation
  3. Configuration
  4. Usage
  5. Prompts
  6. Contributing
  7. License

Supported Inference Engines

  1. vLLM
  2. OpenAI API
  3. Azure OpenAI API
  4. Ollama
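Supporting several engines behind one interface is typically done with a dispatch table keyed on the configured engine name. The sketch below is a hypothetical illustration; the stub functions and the `generate` helper are assumptions, and the real project would call vLLM, the OpenAI/Azure OpenAI APIs, or Ollama inside them.

```python
from typing import Callable

# Hypothetical per-engine generate functions; real implementations would
# issue API calls to the corresponding backend.
def _vllm(prompt: str) -> str: return f"[vllm] {prompt}"
def _openai(prompt: str) -> str: return f"[openai] {prompt}"
def _azure(prompt: str) -> str: return f"[azure] {prompt}"
def _ollama(prompt: str) -> str: return f"[ollama] {prompt}"

ENGINES: dict[str, Callable[[str], str]] = {
    "vllm": _vllm, "openai": _openai, "azure": _azure, "ollama": _ollama,
}

def generate(engine: str, prompt: str) -> str:
    """Dispatch a prompt to the engine named by config.json's inference_engine."""
    try:
        return ENGINES[engine](prompt)
    except KeyError:
        raise ValueError(f"Unsupported inference engine: {engine!r}")

print(generate("azure", "Generate a QA pair."))
```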

Installation

  1. Clone the repository:
     git clone https://github.com/yourusername/question-generation.git
     cd question-generation
  2. Create a virtual environment and activate it:
     python3.11 -m venv .venv
     source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
  3. Install the required dependencies:
     pip install -r requirements.txt
  4. Copy the example environment file and configure it:
     cp .env.example .env
  5. Update the .env file with your API URL and API key.

Configuration

The configuration for the model is specified in the config.json file. You can update the model name or other parameters as needed:

{
    "inference_engine": "azure",
    "model_name": "llama3.1",
    "model_max_tokens": 10000,
    "input_folder": "input_data",
    "output_folder": "generated_questions",
    "chroma_db_path": "chromadb",
    "chroma_collection_name": "questions",
    "duplicate_threshold": 0.1
}

Note that JSON does not allow inline comments, so the fields are described here instead:

  • inference_engine — which inference engine to use (vllm, openai, azure, or ollama)
  • model_name — the model to query
  • model_max_tokens — the model's maximum token limit
  • input_folder — location of the input data
  • output_folder — location for the generated questions
  • chroma_db_path — vector DB location
  • chroma_collection_name — vector DB collection name
  • duplicate_threshold — similarity threshold for duplicate checking
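A minimal sketch of loading and sanity-checking this file with the standard library, assuming the key set shown above (the `load_config` helper is an illustration, not the project's actual loader):

```python
import json
from pathlib import Path

REQUIRED_KEYS = {
    "inference_engine", "model_name", "model_max_tokens",
    "input_folder", "output_folder",
    "chroma_db_path", "chroma_collection_name", "duplicate_threshold",
}

def load_config(path: str = "config.json") -> dict:
    """Load config.json and fail fast if any expected key is missing."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    return config
```

Failing fast on a missing key surfaces configuration mistakes at startup rather than midway through a long generation run.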

Usage

  1. Place your input files in the input_data folder.

  2. To run the question generation process, execute the main.py script:

python main.py
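The generated pairs are written in JSONL format, one JSON object per line. A minimal sketch of reading such a file back, assuming `question`/`answer` field names (the exact schema is an assumption):

```python
import json

def read_jsonl(text: str) -> list[dict]:
    """Parse JSONL: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

sample = '{"question": "What is fine-tuning?", "answer": "Adapting a pretrained model to a task."}\n'
records = read_jsonl(sample)
print(records[0]["question"])
```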

Prompts

  • The system prompt for generating question-answer pairs is located in the prompts folder as generateQA-sys_prompt.txt

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any changes.

License

This project is licensed under the Apache-2.0 license. See the LICENSE file for details.