The project includes three components:
- summarization-client: Angular/Clarity web application for content management, summary generation, and chat.
- summarization-server: FastAPI gateway server that manages the core application functions, including access control, the document ingestion pipeline, map-reduce summarization via LangChain, and improved RAG with the LlamaIndex Fusion Retriever.
- stt-service: Speech-to-text microservice that converts audio to text using faster-whisper, a reimplementation of OpenAI's Whisper model.
Building requires:
- Angular CLI 16.1.4
- [Python 3.10+](https://www.python.org/downloads/)
- PostgreSQL 12+
vLLM is a popular open-source LLM inference engine. To run an open-source LLM on vLLM in OpenAI-compatible mode, make sure an A100 (40GB) GPU is available at the OS level and CUDA 12.1 is installed, then run the following commands to expose the LLM service at http://localhost:8010/v1:
# (Optional) Create a new conda environment.
conda create -n vllm-env python=3.9 -y
conda activate vllm-env
# Install vLLM with CUDA 12.1.
pip install vllm
# Serve the zephyr-7b-alpha LLM
python -m vllm.entrypoints.openai.api_server --model HuggingFaceH4/zephyr-7b-alpha --port 8010 --enforce-eager
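To confirm the server is answering in OpenAI-compatible mode, you can send it a chat-completion request. A minimal smoke-test sketch using only the standard library (the URL and model name match the command above; the helper names are illustrative):

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8010/v1/chat/completions"  # endpoint started above

def build_chat_request(prompt: str, model: str = "HuggingFaceH4/zephyr-7b-alpha") -> dict:
    """Build an OpenAI-compatible chat-completion payload for the vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

def query_vllm(prompt: str) -> str:
    """POST the payload to the local vLLM server and return the reply text."""
    data = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        VLLM_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(query_vllm("Summarize vLLM in one sentence."))
```

If the server is up, the script prints the model's reply; a connection error means the serve command above is not running.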
The vector store is implemented with the pgvector extension of PostgreSQL (12+).
$ cd summarization-server/pgvector
$ docker compose up -d
The `docker-compose.yaml` file defines the PostgreSQL configuration, which you can customize to your preferences.
Alternatively, you can execute the `run_pgvector.sh` script to pull and launch a PostgreSQL + pgvector Docker container. Once up and running, the database engine is available at localhost:5432.
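If you manage the schema yourself rather than letting the server create it, the database needs the `vector` extension and an embeddings table whose dimension matches the embedding model. A hypothetical sketch that generates such DDL (the table layout and column names here are illustrative, not the server's actual schema; `table` and `dim` should match `PG_TABLE` and `PG_VECTOR_DIM` in the server config):

```python
def embeddings_ddl(table: str = "embeddings", dim: int = 4096) -> str:
    """Return illustrative DDL for a pgvector-backed embeddings table.

    dim must equal the embedding model's vector dimension
    (PG_VECTOR_DIM in summarization-server/src/config/config.yaml).
    """
    return (
        "CREATE EXTENSION IF NOT EXISTS vector;\n"
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        "    id BIGSERIAL PRIMARY KEY,\n"
        "    text TEXT,\n"
        "    metadata JSONB,\n"
        f"    embedding VECTOR({dim})\n"
        ");"
    )

print(embeddings_ddl())
```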
# clone the repo
$ git clone https://github.com/vmware/summarize-and-chat
# install summarization-client
$ cd summarization-client
$ npm install
# install summarization-server
$ cd summarization-server
$ python3 -m venv .venv # create a virtual environment
$ source .venv/bin/activate # windows: .venv\Scripts\activate
$ pip install -r requirements.txt
# install stt-service
$ cd stt-service
$ python3 -m venv .venv # create a virtual environment
$ source .venv/bin/activate # windows: .venv\Scripts\activate
$ pip install -r requirements.txt
You need to set the following required variables in the `summarization-client/src/environments/environment.ts` file to run the summarization-client locally.
export const environment: Env = {
// This section is required
production: false,
// Summarization service URL
serviceUrl: "http://localhost:8000",
// Okta authentication server
ssoIssuer: "https://your-org.okta.com/oauth2/default",
// Okta client ID
ssoClientId: 'your-okta-client-id',
// Login redirect URL
redirectUrl: 'http://localhost:4200/login/'
};
To configure environment-specific settings for dev, staging, and production, edit the corresponding files in the summarization-client/src/environments folder.
You need to set the following required variables in the `summarization-server/src/config/config.yaml` file to run the summarization-server locally.
- Set up Okta configuration
okta:
OKTA_AUTH_URL: "Okta auth URL"
OKTA_CLIENT_ID: "Okta client ID"
OKTA_ENDPOINTS: [ 'admin' ]
- Set up LLM configuration
llm:
LLM_API: "your LLM API server" # e.g. "https://api.openai.com/v1"
AUTH_KEY: "your API key"
QA_MODEL: "default QA model" # e.g. "mistralai/Mixtral-8x7B-Instruct-v0.1"
QA_MODEL_MAX_TOKEN_LIMIT: "max token limit for QA model" # e.g. 30000
EMBEDDING_MODEL: "embedding model" # e.g. "Salesforce/SFR-Embedding-Mistral"
VECTOR_DIM: "embedding model vector dimension" # e.g. 4096
SIMIL_TOP_K: 10 # Retrieve TOP_K most similar docs from the PGVector store
RERANK_ENABLED: True
RERANK_MODEL: "BAAI/bge-reranker-large" # re-ranking model
RERANK_TOP_N: 5 # Rerank and pick the 5 most similar docs
MAX_COMPLETION: "max completion tokens per query" # e.g. 700
CHUNK_SIZE: "default chunk size" # 512
CHUNK_OVERLAP: "default chunk overlap" # 20
NUM_QUERIES: "default number of queries" # 3
LLM_BATCH_SIZE: "batch size for LLM" # 5
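CHUNK_SIZE and CHUNK_OVERLAP control how documents are split before embedding; overlapping chunks keep content that straddles a boundary retrievable from either side. The real pipeline chunks by tokens via LlamaIndex, but the sliding-window idea can be sketched as (function name and list-of-strings input are illustrative):

```python
def split_with_overlap(
    tokens: list[str], chunk_size: int = 512, chunk_overlap: int = 20
) -> list[list[str]]:
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # advance by size minus overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]
```

With the defaults, each 512-token chunk repeats the last 20 tokens of its predecessor.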
You also need to list the LLMs available for the summarization task in the summarization-server/src/config/models.json file.
{
"models": [
{
"name": "meta-llama/Meta-Llama-3-70B-Instruct",
"display_name": "LLAMA 3 - 70B",
"max_token": 6500
},
{
"name": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"display_name": "LLAMA 3.1 - 70B",
"max_token": 128000
},
{
"name": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"display_name": "Mixtral - 8x7B",
"max_token": 30000
},
{
"name": "mistralai/Mistral-7B-Instruct-v0.2",
"display_name": "Mistral - 7B",
"max_token": 30000
}
]
}
- Set up Database configuration
database:
PG_HOST: "Database host" #"localhost"
PG_PORT: 5432
PG_USER: DB_USER
PG_PASSWD: DB_PASSWORD
PG_DATABASE: "your database name" # e.g. summarizer
PG_TABLE: "pgvector embedding table" #embeddings
PG_VECTOR_DIM: "your embedding model vector dimension" # match the vector dimension of the embedding model
- Set up server configuration
server:
HOST: "0.0.0.0"
PORT: 5000
NUM_WORKERS: 1
PDF_READER: pypdf # default PDF parser
FILE_PATH: "../data"
RELOAD: False
- If you want to enable the speech-to-text function, set the stt configuration in the summarization-server/src/config/config.yaml file.
stt:
STT_API: "http://localhost:9000/api/v1" # STT-server URL
AUTH_KEY: "your STT api auth key if the auth is enabled"
- If you want to enable email notifications, set the email server configuration in the summarization-server/src/config/config.yaml file.
email:
SMTP_SERVER: "your smtp server"
SMTP_SENDER: "your sender email"
- If you are an individual user running the code on your local machine, the default settings work; no configuration is needed.
- If you are an organization user deploying the code to a server, we recommend setting the following required variables (and optional ones as needed) in the stt-service/config/config.yaml file to run the stt-service.
- Set the following required auth variables if you enable authentication.
auth:
ENABLED: True
AUTH_URL: "your api auth url"
CACHE_TIMEOUT: 86400 # 1 day
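CACHE_TIMEOUT controls how long a validated credential is cached before AUTH_URL is consulted again. A minimal TTL-cache sketch of that idea (illustrative only; the service's actual cache implementation may differ):

```python
import time

class TTLCache:
    """Cache auth results for `timeout` seconds (cf. CACHE_TIMEOUT above)."""

    def __init__(self, timeout: float = 86400):
        self.timeout = timeout
        self._store: dict[str, tuple[float, object]] = {}

    def set(self, key: str, value: object) -> None:
        self._store[key] = (time.monotonic(), value)

    def get(self, key: str):
        item = self._store.get(key)
        if item is None:
            return None
        stored_at, value = item
        if time.monotonic() - stored_at > self.timeout:
            del self._store[key]  # expired: force revalidation against AUTH_URL
            return None
        return value
```

With the default of 86400 seconds, each token is revalidated at most once per day.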
- Set the model variables if you want to use a different model or run on a GPU device.
model:
MODEL_SIZE: "small"
COMPUTE_TYPE: "int8"
DEVICE: "cpu" # "cuda" if on GPU
DEVICE_INDEX: 1
- Set the server variables
server:
HOST: "0.0.0.0"
PORT: 9000
SERVER_WORKERS: 1
MAX_WORKS: 3
RELOAD: False
DEVICE_INDEX: 1
CPU_THREADS: 1
NUM_WORKERS: 1
FILE_PATH: "file_path same as summarization-server"
SUMMARIZATION_SERVER: "summarization-server URL for notification" #"http://localhost:8000"
AUDIO_SIZE_LIMITE: "audio file size limit" # e.g. 50*1024*1024 (50 MiB)
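The size limit of 50*1024*1024 bytes is 50 MiB. A hypothetical pre-upload check showing how a client might use this value (the function name is illustrative):

```python
AUDIO_SIZE_LIMIT = 50 * 1024 * 1024  # 50 MiB, matching the example value above

def audio_upload_allowed(size_bytes: int, limit: int = AUDIO_SIZE_LIMIT) -> bool:
    """Return True if an audio file of size_bytes may be sent to the stt-service."""
    return 0 < size_bytes <= limit
```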
After installation and configuration, you can run the Summarize-and-chat application as follows:
# run summarization-client
$ cd summarization-client
$ ng serve
# run summarization-server
$ cd summarization-server
$ uvicorn main:app --reload
# run stt-service
$ cd stt-service
$ uvicorn main:app --reload
Open http://localhost:4200 in your browser; you can now use the full set of Summarize-and-Chat application functions.