Chat-Any-Site

Description

A small Python project that lets you ask a chatbot questions about any website that has a sitemap, using an LLM service and a local vector store.

It uses OpenAI models (so you must have API access) and Chroma, an in-memory vector store, which holds the embeddings of the site data locally.
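
To make the idea of "storing embeddings and looking them up" concrete, here is a minimal sketch of an in-memory vector store with cosine-similarity lookup. It is not Chroma's API and not this project's code: the `InMemoryVectorStore` class and the hand-written two-dimensional vectors are hypothetical stand-ins for what Chroma and the OpenAI embedding models do in practice.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Hypothetical toy store: keeps (embedding, text) pairs in a Python list."""

    def __init__(self):
        self._items = []

    def add(self, embedding, text):
        self._items.append((list(embedding), text))

    def query(self, embedding, k=1):
        """Return the k stored texts whose embeddings are closest to the query."""
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(embedding, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


# Toy 2-d "embeddings"; real embeddings come from an embedding model.
store = InMemoryVectorStore()
store.add([1.0, 0.0], "Page about installation")
store.add([0.0, 1.0], "Page about pricing")
print(store.query([0.9, 0.1], k=1))  # ['Page about installation']
```

Everything lives in a plain list, so the data disappears when the process ends.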

Motivation

A good excuse to play with LangChain
OpenAI models, like GPT-4, were only trained on data up to Sep 2021. So, if you ask a question about anything more recent, they won't have a clue. This project lets you ask about more recent content.
Publicly available models were only trained on public data. I want a chatbot that can answer questions about sensitive or private data. This project can do exactly that, for example by passing in the sitemap of a private wiki, like Confluence.

Before You Begin

Make sure you have an OpenAI API Key

Respect any website's robots.txt and ai.txt. In fact, maybe only use this on websites that you own.
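
One way to check a robots.txt rule before fetching a page is Python's standard `urllib.robotparser` module. The sketch below is not part of this project; the `is_allowed` helper and the example rules are made up for illustration, and it does not cover ai.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, inlined so the example runs offline.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(EXAMPLE_ROBOTS_TXT, "https://example.com/docs/intro"))     # True
print(is_allowed(EXAMPLE_ROBOTS_TXT, "https://example.com/private/notes"))  # False
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` downloads the file instead of parsing an inline string.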

Setup and Installation

This project was developed using Python 3.10.

Follow these steps to install and run the project:

Clone the repository:

git clone https://github.com/mkwatson/chat_any_site.git

Change into the project directory:

cd chat_any_site

Create a virtual environment and activate it (optional, but recommended):

python3.10 -m venv env
source env/bin/activate  # On Windows, use `env\Scripts\activate`

Install the project dependencies:

pip install .

Usage

To run the command-line interface, execute the script:

chatanysite

There are three arguments you must provide:

OpenAI API Key (defaults to the OPENAI_API_KEY environment variable)
A valid sitemap.xml
OpenAI model you want to use

You can also pass some or all of them on the command line, like this:

chatanysite \
  --open-api-key=<your openai api key> \
  --site=https://<host>/sitemap.xml \
  --model=gpt-4
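
For readers curious how an option can fall back to an environment variable, here is a minimal `argparse` sketch. It is an illustration only, not this project's actual implementation: the flag names are copied from the command above, while the `build_parser` helper and its `env` parameter are hypothetical.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser whose API key option defaults to OPENAI_API_KEY."""
    env = os.environ if env is None else env  # injectable for testing
    parser = argparse.ArgumentParser(prog="chatanysite")
    parser.add_argument(
        "--open-api-key",
        default=env.get("OPENAI_API_KEY"),
        help="OpenAI API key (defaults to the OPENAI_API_KEY environment variable)",
    )
    parser.add_argument("--site", help="URL of a valid sitemap.xml")
    parser.add_argument("--model", help="OpenAI model to use")
    return parser


# The key is omitted on the command line, so the environment value is used.
args = build_parser(env={"OPENAI_API_KEY": "sk-from-env"}).parse_args(
    ["--site=https://example.com/sitemap.xml", "--model=gpt-4"]
)
print(args.open_api_key)  # sk-from-env
```

Reading the environment at parser-construction time keeps an explicit flag as the higher-priority source.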

Demo

In this demo I'm passing in the sitemap of the LangChain documentation. LangChain was released in Oct 2022. I'm also using GPT-4, which was only trained on data up to Sep 2021. So, the model, as is, does not know about LangChain. Nevertheless, I'm able to get expert responses about LangChain.

Higher quality YouTube version

Known Limitations

Because the vector data is stored locally in memory, it is transient and is lost when the program exits.
It can take a long time to download all the pages listed in the sitemap.
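
To gauge how long a run might take, it can help to count the pages a sitemap lists before downloading anything. The sketch below uses only the standard library and is not taken from this project; `extract_urls` and the example sitemap are hypothetical, and it assumes a plain `<urlset>` sitemap rather than a sitemap index file.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml):
    """Return the page URLs listed in a <urlset> sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content, inlined so the example runs offline.
EXAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/intro</loc></url>
</urlset>
"""

urls = extract_urls(EXAMPLE_SITEMAP)
print(len(urls))  # 2
```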

Next Steps

Add the ability to store the vector data in a remote persistent data store
Make a web client
What if you created a Google user and, whenever it was added as read-only to a Google Doc or a file on Google Drive, it pulled the content down so you could ask questions about it?
Probably at least one test
