Summarize-and-Chat project build and development setup

The project includes three components:

  • summarization-client: Angular/Clarity web application for content management, summary generation, and chat.
  • summarization-server: FastAPI gateway server that manages core application functions, including access control, the document ingestion pipeline, map-reduce summarization provided by LangChain, and improved RAG with the LlamaIndex Fusion Retriever.
  • stt-service: Speech-to-text microservice that converts audio to text using faster-whisper, a reimplementation of OpenAI's Whisper model.

Tools used

Building requires:

  • Node.js and npm (for the summarization-client)
  • Angular CLI (to build and serve the client)
  • Python 3 with pip and venv (for the summarization-server and stt-service)
  • Docker and Docker Compose (to run PGVector)

Before You Start

Running an LLM inference engine on vLLM

vLLM is a popular open-source LLM inference engine. To run an open-source LLM on vLLM in OpenAI-compatible mode, make sure you have an A100 (40GB) GPU available at the OS level and CUDA 12.1 installed. Then run the following commands to make the LLM service available at http://localhost:8010/v1:

      # (Optional) Create a new conda environment.
      conda create -n vllm-env python=3.9 -y
      conda activate vllm-env

      # Install vLLM with CUDA 12.1.
      pip install vllm

      # Serve the zephyr-7b-alpha LLM
      python -m vllm.entrypoints.openai.api_server --model HuggingFaceH4/zephyr-7b-alpha --port 8010 --enforce-eager
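
Once the server is up, you can confirm the OpenAI-compatible endpoint is reachable. The snippet below is a minimal sketch, not part of the repo: it assumes the default port 8010 from the command above, no API key, and the requests library installed; it lists the served models and sends one chat completion request.

      # Minimal sketch: verify the vLLM OpenAI-compatible endpoint (assumes port 8010, no auth).
      import requests

      BASE_URL = "http://localhost:8010/v1"

      # List the models served by vLLM.
      models = requests.get(f"{BASE_URL}/models").json()
      print([m["id"] for m in models["data"]])

      # Send a simple chat completion request to the served model.
      resp = requests.post(
          f"{BASE_URL}/chat/completions",
          json={
              "model": "HuggingFaceH4/zephyr-7b-alpha",
              "messages": [{"role": "user", "content": "Say hello in one sentence."}],
              "max_tokens": 64,
          },
      )
      print(resp.json()["choices"][0]["message"]["content"])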

Running PGVector

The vector store is implemented using the PGVector extension of PostgreSQL (v12).

$ cd summarization-server/pgvector
$ docker compose up -d

The `docker-compose.yaml` file defines the PostgreSQL configuration, which you can customize according to your preferences.

Alternatively, you can execute the run_pgvector.sh script to pull and launch a PostgreSQL + PGVector Docker container. Once it is up and running, the database engine is available at localhost:5432.
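
To confirm the database is ready before starting the server, you can connect and check that the vector extension is installed. The snippet below is a minimal sketch, assuming psycopg2 is installed; the connection values are placeholders, so substitute the ones from your docker-compose.yaml (they should match the database section of config.yaml described below).

      # Minimal sketch: check that PGVector is reachable and the vector extension is installed.
      # The connection values are placeholders; use the ones from your docker-compose.yaml.
      import psycopg2

      conn = psycopg2.connect(
          host="localhost",
          port=5432,
          user="DB_USER",          # placeholder
          password="DB_PASSWORD",  # placeholder
          dbname="summarizer",     # placeholder
      )
      with conn.cursor() as cur:
          cur.execute("SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';")
          print(cur.fetchone())    # e.g. ('vector', '0.5.x') if the extension is installed
      conn.close()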

Installation

# clone the repo
$ git clone https://github.com/vmware/summarize-and-chat

# install summarization-client
$ cd summarization-client
$ npm install

# install summarization-server
$ cd summarization-server
$ python3 -m venv .venv   # create a virtual environment
$ source .venv/bin/activate    # Windows: .venv\Scripts\activate
$ pip install -r requirements.txt

# install stt-service
$ cd stt-service
$ python3 -m venv .venv   # create a virtual environment
$ source .venv/bin/activate    # Windows: .venv\Scripts\activate
$ pip install -r requirements.txt 

Configuration

summarization-client

You need to set the following required variables in the summarization-client/src/environments/environment.ts file to run the summarization-client locally.

export const environment: Env = {
  // This section is required
  production: false,
  // Summarization service URL
  serviceUrl: "http://localhost:8000",
  // Okta authentication server
  ssoIssuer: "https://your-org.okta.com/oauth2/default",
  // Okta client ID
  ssoClientId: 'your-okta-client-id',
  // Login redirect URL
  redirectUrl: 'http://localhost:4200/login/'
};

To configure specific environments (dev, staging, production), go to the summarization-client/src/environments folder and set the variables in the corresponding environment files.


summarization-server

You need to set the following required variables in the summarization-server/src/config/config.yaml file to run the summarization-server locally.

  • Set up Okta configuration
okta:
  OKTA_AUTH_URL: "Okta auth URL"
  OKTA_CLIENT_ID: "Okta client ID"
  OKTA_ENDPOINTS: [ 'admin' ]
  • Set up LLM configuration
llm:
  LLM_API: "your LLM API server" # e.g. https://api.openai.com/v1
  AUTH_KEY: "your api key"
  QA_MODEL: "default QA model" # e.g. mistralai/Mixtral-8x7B-Instruct-v0.1
  QA_MODEL_MAX_TOKEN_LIMIT: "max token limit for QA model" #30000
  EMBEDDING_MODEL: "embedding model" # "Salesforce/SFR-Embedding-Mistral"
  VECTOR_DIM: "embedding model vector dimension" # 4096 
  SIMIL_TOP_K: 10 # Retrieve TOP_K most similar docs from the PGVector store
  RERANK_ENABLED: True
  RERANK_MODEL: "BAAI/bge-reranker-large" # re-ranking model
  RERANK_TOP_N: 5 # Rerank and pick the 5 most similar docs
  MAX_COMPLETION: "max tokens of completion for each query" #700
  CHUNK_SIZE: "default chunk size" # 512
  CHUNK_OVERLAP: "default chunk overlap" # 20
  NUM_QUERIES: "default number of queries" # 3
  LLM_BATCH_SIZE: "batch size for LLM" # 5

You also need to specify the available LLMs for the summarization task in the summarization-server/src/config/models.json file.

{
    "models": [
        {
            "name": "meta-llama/Meta-Llama-3-70B-Instruct",
            "display_name": "LLAMA 3 - 70B",
            "max_token": 6500
        },
        {
            "name": "meta-llama/Meta-Llama-3.1-70B-Instruct",
            "display_name": "LLAMA 3.1 - 70B",
            "max_token": 128000
        },
        {
            "name": "mistralai/Mixtral-8x7B-Instruct-v0.1",
            "display_name": "Mixtral - 8x7B",
            "max_token": 30000
        },
        {
            "name": "mistralai/Mistral-7B-Instruct-v0.2",
            "display_name": "Mistral - 7B",
            "max_token": 30000
        }
    ]
}
  • Set up Database configuration
database:
  PG_HOST: "Database host" #"localhost"
  PG_PORT: 5432
  PG_USER: DB_USER
  PG_PASSWD: DB_PASSWORD
  PG_DATABASE: "your database name" #summarizer
  PG_TABLE: "pgvector embedding table" #embeddings
  PG_VECTOR_DIM: "your embedding model vector dimension" # match the vector dimension of the embedding model
  • Set up server configuration
server:
  HOST: "0.0.0.0"
  PORT: 5000
  NUM_WORKERS: 1
  PDF_READER: pypdf # default PDF parser
  FILE_PATH:  "../data"
  RELOAD: False
stt:
  STT_API: "http://localhost:9000/api/v1" # STT-server URL
  AUTH_KEY: "your STT api auth key if the auth is enabled"
email:
  SMTP_SERVER: "your smtp server"
  SMTP_SENDER: "your sender email"
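
As a quick sanity check of the server configuration before launch, the following minimal sketch (not part of the repo) loads config.yaml and models.json and prints the configured API endpoint, database host, and available summarization models; it assumes you run it from summarization-server/src and that PyYAML is installed.

      # Minimal sketch (not part of the repo): load and inspect the server configuration.
      # Assumes it is run from summarization-server/src and that PyYAML is installed.
      import json
      import yaml

      with open("config/config.yaml") as f:
          config = yaml.safe_load(f)

      with open("config/models.json") as f:
          models = json.load(f)["models"]

      print("LLM API:", config["llm"]["LLM_API"])
      print("Vector store:", config["database"]["PG_HOST"], config["database"]["PG_PORT"])
      for m in models:
          print(f'{m["display_name"]}: {m["name"]} (max_token={m["max_token"]})')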

stt-service

  • If you are an individual user running the code on your local machine, you can use the default settings; no configuration is needed.

  • If you are an organization user and want to deploy the code to a server, we recommend setting the following required variables (and some optional variables) in the stt-service/config/config.yaml file to run the stt-service.

  • Set the following required auth variables if you enable authentication.

auth:
  ENABLED: True
  AUTH_URL: "your api auth url"
  CACHE_TIMEOUT: 86400 #  1 day
  • Set the model variables if you want to use a different model or run on a GPU device.
model:
  MODEL_SIZE: "small"
  COMPUTE_TYPE: "int8"
  DEVICE: "cpu" # "cuda" if on GPU
  DEVICE_INDEX: 1
  • Set the server variables
server:
  HOST: "0.0.0.0"
  PORT: 9000
  SERVER_WORKERS: 1
  MAX_WORKS: 3
  RELOAD: False
  DEVICE_INDEX: 1
  CPU_THREADS: 1
  NUM_WORKERS: 1
  FILE_PATH: "file_path same as summarization-server"
  SUMMARIZATION_SERVER: "summarization-server URL for notification" #"http://localhost:8000"
  AUDIO_SIZE_LIMITE: "audio file size limit" # 50*1024*1024

Run Locally

After the installation and configuration, you can run the Summarize-and-Chat application as follows:

# run summarization-client
$ cd summarization-client
$ ng serve

# run summarization-server
$ cd summarization-server
$ uvicorn main:app --reload

# run stt-service
$ cd stt-service
$ uvicorn main:app --reload
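
After the three processes are running, a quick way to confirm they are reachable is the minimal sketch below (not part of the repo); it assumes the ports used in this guide (4200 for the client, 8000 for the summarization-server, 9000 for the stt-service as set in its config) and that the requests library is installed.

      # Minimal sketch: confirm the three local services respond.
      # Ports are taken from this guide's configuration; adjust if your services bind elsewhere.
      import requests

      services = {
          "summarization-client": "http://localhost:4200",
          "summarization-server": "http://localhost:8000",
          "stt-service": "http://localhost:9000",
      }

      for name, url in services.items():
          try:
              status = requests.get(url, timeout=5).status_code
              print(f"{name}: HTTP {status}")
          except requests.RequestException as exc:
              print(f"{name}: not reachable ({exc})")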

How to use

Open http://localhost:4200 in your browser; you can now use the full set of Summarize-and-Chat application functions.