Python scripts that convert PDF files to text, split the text into chunks, and store vector representations of the chunks in a Chroma DB using GPT4All embeddings. A second script queries the Chroma DB with a similarity search based on user input.
- Python 3.x
- PyPDF2
- chromadb
- langchain
- Clone the repository:
git clone https://github.com/your-username/pdf-to-text-chroma-search.git
- Install the required dependencies:
pip install PyPDF2 chromadb langchain
- Place your PDF files in the input directory.
- Run the following command to convert the PDFs to text, split them into chunks, and store their vector representations in the Chroma DB:
python write_script.py
- Run the following command to load the Chroma DB and run a similarity search on your input:
python read_script.py
- Enter your query when prompted.
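
To make the chunking and similarity-search steps above more concrete, here is a minimal, dependency-free sketch of the idea. It is not the code in write_script.py or read_script.py. All function names are hypothetical, `toy_embed` is a word-count stand-in for GPT4All embeddings, and a plain Python list stands in for the Chroma DB. The chunk size and overlap values are assumptions chosen for illustration.

```python
import math
from collections import Counter


def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks that overlap.

    Hypothetical helper: the real scripts may split differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def toy_embed(text):
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_search(query, chunks, k=2):
    """Return the k chunks most similar to the query as (score, chunk) pairs."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Calling `similarity_search("chroma vector store", chunks, k=1)` on a list of chunks returns the single best-matching chunk with its score. In the actual pipeline, the embedding model and the vector store perform the embedding and ranking steps; this sketch only shows the shape of the computation.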