The project includes three components:
- summarization-client: Angular/Clarity web application for content management, summary generation, and chat.
- summarization-server: FastAPI gateway server that manages the core application functions, including access control, the document ingestion pipeline, map-reduce summarization via LangChain, and improved RAG with the LlamaIndex Fusion Retriever.
- stt-service: Speech-to-text microservice that converts audio to text using faster-whisper, a reimplementation of OpenAI's Whisper model.
Building requires:
- Angular CLI 16.1.4
- [Python 3.10+](https://www.python.org/downloads/)
- PostgreSQL 12+
vLLM is a popular open-source LLM inference engine. To run an open-source LLM on vLLM in OpenAI-compatible mode, make sure an A100 (40GB) GPU is available at the OS level and CUDA 12.1 is installed, then run the following commands to expose the LLM service at http://localhost:8010/v1:
# (Optional) Create a new conda environment.
conda create -n vllm-env python=3.9 -y
conda activate vllm-env
# Install vLLM with CUDA 12.1.
pip install vllm
# Serve the zephyr-7b-alpha LLM
python -m vllm.entrypoints.openai.api_server --model HuggingFaceH4/zephyr-7b-alpha --port 8010 --enforce-eager
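To confirm the server is answering in OpenAI-compatible mode, you can send it a chat-completion request. A minimal smoke-test sketch using only the standard library (the URL and model name match the command above; the helper names are illustrative):

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8010/v1/chat/completions"  # endpoint started above

def build_chat_request(prompt: str, model: str = "HuggingFaceH4/zephyr-7b-alpha") -> dict:
    """Build an OpenAI-compatible chat-completion payload for the vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

def query_vllm(prompt: str) -> str:
    """POST the payload to the local vLLM server and return the reply text."""
    data = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        VLLM_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(query_vllm("Summarize vLLM in one sentence."))
```

If the server is up, the script prints the model's reply; a connection error means the serve command above is not running.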
The vector store is implemented with the pgvector extension of PostgreSQL (12+).
$ cd summarization-server/pgvector
$ docker compose up -d
The `docker-compose.yaml` file defines the PostgreSQL configuration, which you can customize to your preferences.
Alternatively, you can execute the `run_pgvector.sh` script to pull and launch a PostgreSQL + pgvector Docker container. Once up and running, the database engine is available at localhost:5432.
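If you manage the schema yourself rather than letting the server create it, the database needs the `vector` extension and an embeddings table whose dimension matches the embedding model. A hypothetical sketch that generates such DDL (the table layout and column names here are illustrative, not the server's actual schema; `table` and `dim` should match `PG_TABLE` and `PG_VECTOR_DIM` in the server config):

```python
def embeddings_ddl(table: str = "embeddings", dim: int = 4096) -> str:
    """Return illustrative DDL for a pgvector-backed embeddings table.

    dim must equal the embedding model's vector dimension
    (PG_VECTOR_DIM in summarization-server/src/config/config.yaml).
    """
    return (
        "CREATE EXTENSION IF NOT EXISTS vector;\n"
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        "    id BIGSERIAL PRIMARY KEY,\n"
        "    text TEXT,\n"
        "    metadata JSONB,\n"
        f"    embedding VECTOR({dim})\n"
        ");"
    )

print(embeddings_ddl())
```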
# clone the repo
$ git clone https://github.com/vmware/summarize-and-chat
# install summarization-client
$ cd summarization-client
$ npm install
# install summarization-server
$ cd summarization-server
$ python3 -m venv .venv # create a virtual environment
$ source .venv/bin/activate # windows: .venv\Scripts\activate
$ pip install -r requirements.txt
# install stt-service
$ cd stt-service
$ python3 -m venv .venv # create a virtual environment
$ source .venv/bin/activate # windows: .venv\Scripts\activate
$ pip install -r requirements.txt
You need to set the following required variables in the `summarization-client/src/environments/environment.ts` file to run the summarization-client locally.
export const environment: Env = {
// This section is required
production: false,
// Summarization service URL
serviceUrl: "http://localhost:8000",
// Okta authentication server
ssoIssuer: "https://your-org.okta.com/oauth2/default",
// Okta client ID
ssoClientId: 'your-okta-client-id',
// Login redirect URL
redirectUrl: 'http://localhost:4200/login/'
};
To configure environment-specific settings for dev, staging, and production, edit the corresponding files in the summarization-client/src/environments folder.
You need to set the following required variables in the `summarization-server/src/config/config.yaml` file to run the summarization-server locally.
- Set up Okta configuration
okta:
OKTA_AUTH_URL: "Okta auth URL"
OKTA_CLIENT_ID: "Okta client ID"
OKTA_ENDPOINTS: [ 'admin' ]
- Set up LLM configuration
llm:
LLM_API: "your LLM API server" # e.g. "https://api.openai.com/v1"
AUTH_KEY: "your API key"
QA_MODEL: "default QA model" # e.g. "mistralai/Mixtral-8x7B-Instruct-v0.1"
QA_MODEL_MAX_TOKEN_LIMIT: "max token limit for QA model" # e.g. 30000
EMBEDDING_MODEL: "embedding model" # e.g. "Salesforce/SFR-Embedding-Mistral"
VECTOR_DIM: "embedding model vector dimension" # e.g. 4096
SIMIL_TOP_K: 10 # Retrieve TOP_K most similar docs from the PGVector store
RERANK_ENABLED: True
RERANK_MODEL: "BAAI/bge-reranker-large" # re-ranking model
RERANK_TOP_N: 5 # Rerank and pick the 5 most similar docs
MAX_COMPLETION: "max completion tokens per query" # e.g. 700
CHUNK_SIZE: "default chunk size" # 512
CHUNK_OVERLAP: "default chunk overlap" # 20
NUM_QUERIES: "default number of queries" # 3
LLM_BATCH_SIZE: "batch size for LLM" # 5
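CHUNK_SIZE and CHUNK_OVERLAP control how documents are split before embedding; overlapping chunks keep content that straddles a boundary retrievable from either side. The real pipeline chunks by tokens via LlamaIndex, but the sliding-window idea can be sketched as (function name and list-of-strings input are illustrative):

```python
def split_with_overlap(
    tokens: list[str], chunk_size: int = 512, chunk_overlap: int = 20
) -> list[list[str]]:
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # advance by size minus overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]
```

With the defaults, each 512-token chunk repeats the last 20 tokens of its predecessor.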
You also need to list the LLMs available for the summarization task in the summarization-server/src/config/models.json file.
{
"models": [
{
"name": "meta-llama/Meta-Llama-3-70B-Instruct",
"display_name": "LLAMA 3 - 70B",
"max_token": 6500
},
{
"name": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"display_name": "LLAMA 3.1 - 70B",
"max_token": 128000
},
{
"name": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"display_name": "Mixtral - 8x7B",
"max_token": 30000
},
{
"name": "mistralai/Mistral-7B-Instruct-v0.2",
"display_name": "Mistral - 7B",
"max_token": 30000
}
]
}
- Set up Database configuration
database:
PG_HOST: "Database host" #"localhost"
PG_PORT: 5432
PG_USER: DB_USER
PG_PASSWD: DB_PASSWORD
PG_DATABASE: "your database name" # e.g. summarizer
PG_TABLE: "pgvector embedding table" #embeddings
PG_VECTOR_DIM: "your embedding model vector dimension" # match the vector dimension of the embedding model
- Set up server configuration
server:
HOST: "0.0.0.0"
PORT: 5000
NUM_WORKERS: 1
PDF_READER: pypdf # default PDF parser
FILE_PATH: "../data"
RELOAD: False
- If you want to enable the speech-to-text function, set the stt configuration in the summarization-server/src/config/config.yaml file.
stt:
STT_API: "http://localhost:9000/api/v1" # STT-server URL
AUTH_KEY: "your STT api auth key if the auth is enabled"
- If you want to enable email notifications, set the email server configuration in the summarization-server/src/config/config.yaml file.
email:
SMTP_SERVER: "your smtp server"
SMTP_SENDER: "your sender email"
- If you are an individual user running the code on your local machine, the default settings work; no configuration is needed.
- If you are an organization user deploying the code to a server, we recommend setting the following required variables (and optional ones as needed) in the stt-service/config/config.yaml file to run the stt-service.
- Set the following required auth variables if you enable authentication.
auth:
ENABLED: True
AUTH_URL: "your api auth url"
CACHE_TIMEOUT: 86400 # 1 day
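CACHE_TIMEOUT controls how long a validated credential is cached before AUTH_URL is consulted again. A minimal TTL-cache sketch of that idea (illustrative only; the service's actual cache implementation may differ):

```python
import time

class TTLCache:
    """Cache auth results for `timeout` seconds (cf. CACHE_TIMEOUT above)."""

    def __init__(self, timeout: float = 86400):
        self.timeout = timeout
        self._store: dict[str, tuple[float, object]] = {}

    def set(self, key: str, value: object) -> None:
        self._store[key] = (time.monotonic(), value)

    def get(self, key: str):
        item = self._store.get(key)
        if item is None:
            return None
        stored_at, value = item
        if time.monotonic() - stored_at > self.timeout:
            del self._store[key]  # expired: force revalidation against AUTH_URL
            return None
        return value
```

With the default of 86400 seconds, each token is revalidated at most once per day.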
- Set the model variables if you want to use a different model or run on a GPU device.
model:
MODEL_SIZE: "small"
COMPUTE_TYPE: "int8"
DEVICE: "cpu" # "cuda" if on GPU
DEVICE_INDEX: 1
- Set the server variables
server:
HOST: "0.0.0.0"
PORT: 9000
SERVER_WORKERS: 1
MAX_WORKS: 3
RELOAD: False
DEVICE_INDEX: 1
CPU_THREADS: 1
NUM_WORKERS: 1
FILE_PATH: "file_path same as summarization-server"
SUMMARIZATION_SERVER: "summarization-server URL for notification" #"http://localhost:8000"
AUDIO_SIZE_LIMITE: "audio file size limit" # e.g. 50*1024*1024 (50 MiB)
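The size limit of 50*1024*1024 bytes is 50 MiB. A hypothetical pre-upload check showing how a client might use this value (the function name is illustrative):

```python
AUDIO_SIZE_LIMIT = 50 * 1024 * 1024  # 50 MiB, matching the example value above

def audio_upload_allowed(size_bytes: int, limit: int = AUDIO_SIZE_LIMIT) -> bool:
    """Return True if an audio file of size_bytes may be sent to the stt-service."""
    return 0 < size_bytes <= limit
```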
After installation and configuration, you can run the Summarize-and-chat application as follows:
# run summarization-client
$ cd summarization-client
$ ng serve
# run summarization-server
$ cd summarization-server
$ uvicorn main:app --reload
# run stt-service
$ cd stt-service
$ uvicorn main:app --reload
Open http://localhost:4200 in your browser; you can now use the full set of Summarize-and-Chat application functions.