AutoQAPairGen simplifies data preparation for fine-tuning machine learning models. It generates question-answer pairs from unstructured data (such as PDFs) based on user-defined schemas, converting that data into a structured format you can use to train or fine-tune both language and vision models.
Read more about the project here: blog
- Generates question-answer pairs from unstructured data (including PDFs) based on custom schemas.
- Chunks data without losing continuity, so large or complex documents are processed seamlessly.
- Handles embedded images and includes them in the corresponding data chunks for processing.
- Automatically generates structured question-answer pairs in line with user-defined schemas.
- Fully open-source, customizable, and easily extendable for specific use cases.
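To illustrate what "chunking without losing continuity" can mean, here is a minimal sketch using overlapping windows, where each chunk repeats the tail of the previous one. This is only one common approach and is an assumption, not necessarily how AutoQAPairGen does it; the function name `chunk_text` and the parameters `chunk_size` and `overlap` are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to `chunk_size` characters.

    Consecutive chunks share `overlap` characters, so a sentence cut at a
    chunk boundary still appears with some context in the next chunk.
    Hypothetical helper for illustration; not part of AutoQAPairGen.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Small demonstration with a 4-character window and 2 characters of overlap.
demo_chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
print(demo_chunks)
```

A larger overlap preserves more context across boundaries at the cost of more chunks to process.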
- A GPU: the system uses an open-source model, which needs a GPU to run.
- Python 3.10 or higher.
To get started, follow these steps:
git clone https://github.com/ragesh2000/AutoQAPairGen.git
cd AutoQAPairGen
conda create -n qa-gen python=3.10
conda activate qa-gen
pip install -r requirements.txt
You can run the application with the following command:
python run.py -category <text/vision>
Note: The -category argument is optional. It specifies whether you are processing data for a language model (text) or a vision model (vision). If it is not provided, the system defaults to processing data for a language model.
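The behaviour of the optional -category argument can be sketched with Python's standard `argparse` module. This is an illustration under the assumption that the argument is parsed this way; the actual contents of run.py are not shown here, and `parse_args` is a hypothetical helper.

```python
import argparse


def parse_args(argv=None):
    """Parse the optional -category argument, defaulting to 'text'.

    Hypothetical sketch; run.py may handle its arguments differently.
    """
    parser = argparse.ArgumentParser(
        description="Generate question-answer pairs from unstructured data."
    )
    parser.add_argument(
        "-category",
        choices=["text", "vision"],
        default="text",  # language-model data when the flag is omitted
        help="Target model type: 'text' (language) or 'vision'.",
    )
    return parser.parse_args(argv)


# No flag given: falls back to the language-model default.
default_args = parse_args([])
print(default_args.category)

# Flag given explicitly for a vision model.
vision_args = parse_args(["-category", "vision"])
print(vision_args.category)
```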
- Run the application
- Provide the path to the unstructured data (currently only PDF is supported)
- Provide the schema for the question-answer pairs (e.g. {'user': question, 'bot': answer})
- The application will generate the question-answer pairs based on the schema
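To make the schema step concrete, the sketch below shows how a schema like {'user': question, 'bot': answer} could map one generated pair onto user-chosen keys. It assumes the schema values act as placeholders named `question` and `answer`; the helper `apply_schema` is hypothetical and is not taken from the AutoQAPairGen code.

```python
import json


def apply_schema(schema: dict, question: str, answer: str) -> dict:
    """Fill the 'question' and 'answer' placeholders in a schema.

    Hypothetical helper for illustration. Values that are not one of the
    two placeholders are copied through unchanged.
    """
    slots = {"question": question, "answer": answer}
    return {key: slots.get(value, value) for key, value in schema.items()}


# Schema from the example above, with placeholders written as strings.
schema = {"user": "question", "bot": "answer"}

record = apply_schema(
    schema,
    question="What does AutoQAPairGen produce?",
    answer="Question-answer pairs that follow a user-defined schema.",
)
print(json.dumps(record, indent=2))
```

Changing the keys in the schema (for example to `prompt` and `completion`) changes the keys of every generated record without touching the pairs themselves.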