add 2nd website (#32)
* add 2nd website

* updated for latest env vars; typo

* Poetry update

* bump dependencies

* update README

---------

Co-authored-by: Neil Smyth <[email protected]>
Co-authored-by: Valentin Yanakiev <[email protected]>
3 people authored Dec 12, 2023
1 parent f77963d commit 8a301cc
Showing 12 changed files with 945 additions and 835 deletions.
18 changes: 10 additions & 8 deletions .azure-template.env
@@ -1,18 +1,20 @@
-OPENAI_API_TYPE=azure
-OPENAI_API_BASE=https://alkemio-gpt.openai.azure.com/
-OPENAI_API_KEY=azure-openai-key
 OPENAI_API_VERSION=2023-05-15
+AZURE_OPENAI_ENDPOINT=https://alkemio-gpt.openai.azure.com
+AZURE_OPENAI_API_KEY=azure-openai-key
+LLM_DEPLOYMENT_NAME=deploy-gpt-35-turbo
+EMBEDDINGS_DEPLOYMENT_NAME=embedding
 RABBITMQ_HOST=localhost
 RABBITMQ_USER=admin
 RABBITMQ_PASSWORD=super-secure-pass
 AI_MODEL_TEMPERATURE=0.3
-AI_MODEL_NAME=gpt-35-turbo
-AI_DEPLOYMENT_NAME=deploy-gpt-35-turbo
-AI_EMBEDDINGS_DEPLOYMENT_NAME=embedding
 AI_SOURCE_WEBSITE=https://www.alkemio.org
+AI_SOURCE_WEBSITE2=https://welcome.alkem.io
 AI_LOCAL_PATH=~/alkemio/data
-AI_WEBSITE_REPO=https://github.com/alkem-io/website.git
+AI_WEBSITE_REPO=github.com/alkem-io/website.git
+AI_WEBSITE_REPO2=github.com/alkem-io/welcome-site.git
+AI_GITHUB_USER=github-user-for-website-cloning
+AI_GITHUB_PAT=github-user-for-website-cloning
 LANGCHAIN_TRACING_V2=true
 LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
 LANGCHAIN_API_KEY="langsmith-api-key"
 LANGCHAIN_PROJECT="guidance-engine"
1 change: 1 addition & 0 deletions .gitignore
@@ -5,3 +5,4 @@ azure.env
 /__pycache__/*
 /vectordb/*
 local.env
+docker-compose-local.yaml
15 changes: 0 additions & 15 deletions .openai-template.env

This file was deleted.

4 changes: 2 additions & 2 deletions Dockerfile
@@ -4,8 +4,8 @@ FROM python:3.11-slim-bookworm
 # Set the working directory in the container to /app
 WORKDIR /app

-ARG GO_VERSION=1.21.1
-ARG HUGO_VERSION=0.118.2
+ARG GO_VERSION=1.21.5
+ARG HUGO_VERSION=0.121.1
 ARG ARCHITECTURE=amd64

 # install git, go and hugo
34 changes: 19 additions & 15 deletions README.md
@@ -60,8 +60,8 @@ There is a draft implementation for the interaction language of the model (this

 ### Docker
 The following command can be used to build the container from the Docker CLI (default architecture is amd64, so `--build-arg ARCHITECTURE=arm64` for amd64 builds):
-`docker build --build-arg ARCHITECTURE=arm64 --no-cache -t alkemio/guidance-engine:v0.2.0 .`
-`docker build--no-cache -t alkemio/guidance-engine:v0.2.0 .`
+`docker build --build-arg ARCHITECTURE=arm64 --no-cache -t alkemio/guidance-engine:v0.4.0 .`
+`docker build --no-cache -t alkemio/guidance-engine:v0.2.0 .`
 The Dockerfile has some self-explanatory configuration arguments.

The following command can be used to start the container from the Docker CLI:
@@ -70,22 +70,28 @@ where `.env` based on `.azure-template.env`
 Alternatively use `docker-compose up -d`.

 with:
-- `OPENAI_API_KEY`: a valid OpenAI API key
-- `OPENAI_API_TYPE`: a valid OpenAI API type. For Azure, the value is `azure`
+- `AZURE_OPENAI_API_KEY`: a valid OpenAI API key
 - `OPENAI_API_VERSION`: a valid Azure OpenAI version. At the moment of writing, latest is `2023-05-15`
-- `OPENAI_API_BASE`: a valid Azure OpenAI base URL, e.g. `https://{your-azure-resource-name}.openai.azure.com/`
+- `AZURE_OPENAI_ENDPOINT`: a valid Azure OpenAI base URL, e.g. `https://{your-azure-resource-name}.openai.azure.com/`
 - `RABBITMQ_HOST`: the RabbitMQ host name
 - `RABBITMQ_USER`: the RabbitMQ user
 - `RABBITMQ_PASSWORD`: the RabbitMQ password
 - `AI_MODEL_TEMPERATURE`: the `temperature` of the model, use value between 0 and 1. 1 means more randomized answer, closer to 0 - a stricter one
-- `AI_MODEL_NAME`: the model name in Azure
-- `AI_DEPLOYMENT_NAME`: the AI gpt model deployment name in Azure
-- `AI_EMBEDDINGS_DEPLOYMENT_NAME`: the AI embeddings model deployment name in Azure
-- `AI_SOURCE_WEBSITE`: the URL of the website that contains the source data (for references only)
+- `LLM_DEPLOYMENT_NAME`: the AI gpt model deployment name in Azure
+- `EMBEDDINGS_DEPLOYMENT_NAME`: the AI embeddings model deployment name in Azure
+- `AI_SOURCE_WEBSITE`: the URL of the foundation website that contains the source data (for references only)
+- `AI_SOURCE_WEBSITE2`: the URL of the welcome website that contains the source data (for references only)
 - `AI_LOCAL_PATH`: local file path for storing data
-- `AI_WEBSITE_REPO`: url of the Git repository containing the website source data, based on Hugo
+- `AI_WEBSITE_REPO`: url of the Git repository containing the foundation website source data, based on Hugo - without https
+- `AI_WEBSITE_REPO2`: url of the Git repository containing the welcome website source data, based on Hugo - without https
+- `AI_GITHUB_USER` : Github user used for cloning website repos
+- `AI_GITHUB_PAT` : Personal access token for cloning website repos
+- `LANGCHAIN_TRACING_V2` : enable Langchain tracing
+- `LANGCHAIN_ENDPOINT` : Langchain tracing endpoint (e.g. "https://api.smith.langchain.com")
+- `LANGCHAIN_API_KEY` : Langchain tracing API key
+- `LANGCHAIN_PROJECT` : Langchain tracing project name (e.g. "guidance-engine")

-You can find sample values in `.azure-template.env` and `.openai-template.env`. Configure them and create `.env` file with the updated settings.
+You can find sample values in `.azure-template.env`. Configure them and create `.env` file with the updated settings.
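Everything above is injected through plain environment variables, so a misconfigured `.env` otherwise only surfaces at request time. A minimal fail-fast check at startup (a hypothetical helper, not part of this commit), assuming `python-dotenv` as already used in `config.py`:

```python
# sketch: fail fast when the .env derived from .azure-template.env is incomplete
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

REQUIRED = [
    "OPENAI_API_VERSION",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "LLM_DEPLOYMENT_NAME",
    "EMBEDDINGS_DEPLOYMENT_NAME",
    "RABBITMQ_HOST",
]
missing = [name for name in REQUIRED if not os.getenv(name)]
if missing:
    raise SystemExit(f"missing environment variables: {missing}")
```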

### Python & Poetry
The project requires Python & Poetry installed. The minimum version dependencies can be found at `pyproject.toml`.
@@ -102,9 +108,7 @@ The following tasks are still outstanding:
 - assess overall quality and performance of the model and make improvements as and when required.
 - assess the need to summarize the chat history to avoid exceeding the prompt token limit.
 - update the yaml manifest.
-- add error handling.
 - perform extensive testing, in particular in multi-user scenarios.
-- look at improvements of the ingestion. As a minimum the service engine should not consume queries whilst the ingestion is ongoing, as thatwill lead to errors.
-- look at the use of `temperature` for the `QARetrievalChain`. It is not so obvious how this is handled.
+- look at improvements of the ingestion. As a minimum the service engine should not consume queries whilst the ingestion is ongoing, as that will lead to errors.
 - look at the possibility to implement reinforcement learning.
 - return the actual LLM costs and token usage for queries.
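For that last item, langchain's OpenAI callback is one plausible starting point. A sketch only; `qa_chain` stands in for whatever chain `ai_utils.py` actually assembles:

```python
# sketch: capture token usage and cost per query with langchain's callback
from langchain.callbacks import get_openai_callback

def query_with_usage(qa_chain, question, chat_history):
    # the callback aggregates usage across all LLM calls made inside the block
    with get_openai_callback() as cb:
        result = qa_chain({"question": question, "chat_history": chat_history})
    usage = {
        "prompt_tokens": cb.prompt_tokens,
        "completion_tokens": cb.completion_tokens,
        "total_tokens": cb.total_tokens,
        "total_cost_usd": cb.total_cost,
    }
    return result, usage
```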

25 changes: 16 additions & 9 deletions ai_utils.py
@@ -1,4 +1,4 @@
-from langchain.embeddings import OpenAIEmbeddings
+from langchain.embeddings import AzureOpenAIEmbeddings
 from langchain.vectorstores import FAISS
 from langchain.llms import AzureOpenAI
 from langchain.prompts import PromptTemplate
@@ -8,7 +8,7 @@
 from langchain.chains.conversational_retrieval.prompts import QA_PROMPT
 import logging
 import def_ingest
-from config import config, website_source_path, website_generated_path, vectordb_path, local_path, generate_website, LOG_LEVEL
+from config import config, website_source_path, website_generated_path, website_source_path2, website_generated_path2, vectordb_path, local_path, generate_website, LOG_LEVEL

 import os

@@ -17,7 +17,7 @@
 # Create handlers
 c_handler = logging.StreamHandler()
-f_handler = logging.FileHandler(local_path+'/app.log')
+f_handler = logging.FileHandler(os.path.join(os.path.expanduser(local_path),'app.log'))

 c_handler.setLevel(level=getattr(logging, LOG_LEVEL))
 f_handler.setLevel(logging.ERROR)
@@ -118,12 +118,18 @@ def get_language_by_code(language_code):
     template=chat_template, input_variables=["question", "context", "chat_history"]
 )

-generic_llm = AzureOpenAI(deployment_name=os.environ["AI_DEPLOYMENT_NAME"], model_name=os.environ["AI_MODEL_NAME"],
+generic_llm = AzureOpenAI(azure_deployment=os.environ["LLM_DEPLOYMENT_NAME"],
     temperature=0, verbose=verbose_models)

 question_generator = LLMChain(llm=generic_llm, prompt=custom_question_prompt, verbose=verbose_models)

-embeddings = OpenAIEmbeddings(deployment=os.environ["AI_EMBEDDINGS_DEPLOYMENT_NAME"], chunk_size=1)
+embeddings = AzureOpenAIEmbeddings(
+    azure_deployment=config['embeddings_deployment_name'],
+    openai_api_version=config['openai_api_version'],
+    chunk_size=1
+)

 # Check if the vector database exists
 if os.path.exists(vectordb_path+"/index.pkl"):
@@ -132,19 +138,20 @@ def get_language_by_code(language_code):
     # ingest data
     if generate_website:
         def_ingest.clone_and_generate(config['website_repo'], website_generated_path, website_source_path)
-        def_ingest.mainapp(config['source_website'])
+        def_ingest.clone_and_generate(config['website_repo2'], website_generated_path2, website_source_path2)
+        def_ingest.mainapp(config['source_website'], config['source_website2'])

 vectorstore = FAISS.load_local(vectordb_path, embeddings)
 retriever = vectorstore.as_retriever()

-chat_llm = AzureChatOpenAI(deployment_name=os.environ["AI_DEPLOYMENT_NAME"],
-    model_name=os.environ["AI_MODEL_NAME"], temperature=os.environ["AI_MODEL_TEMPERATURE"],
+chat_llm = AzureChatOpenAI(azure_deployment=os.environ["LLM_DEPLOYMENT_NAME"],
+    temperature=os.environ["AI_MODEL_TEMPERATURE"],
     max_tokens=max_token_limit)

 doc_chain = load_qa_chain(generic_llm, chain_type="stuff", prompt=QA_PROMPT, verbose=verbose_models)

 def translate_answer(answer, language):
-    translate_llm = AzureChatOpenAI(deployment_name=os.environ["AI_DEPLOYMENT_NAME"], model_name=os.environ["AI_MODEL_NAME"],
+    translate_llm = AzureChatOpenAI(azure_deployment=os.environ["LLM_DEPLOYMENT_NAME"],
         temperature=0, verbose=verbose_models)
     prompt = translation_prompt.format(answer=answer, language=language)
     return translate_llm(prompt)
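The hunks above show the building blocks (`question_generator`, `retriever`, `doc_chain`) but not the chain that ties them together, which falls outside the visible diff. A minimal sketch of the usual wiring, assuming the standard langchain `ConversationalRetrievalChain` pattern rather than this repo's exact code:

```python
# sketch of the likely wiring; the actual chain construction is not shown in this diff
from langchain.chains import ConversationalRetrievalChain
from ai_utils import retriever, question_generator, doc_chain  # module-level names from the diff above

qa_chain = ConversationalRetrievalChain(
    retriever=retriever,                    # FAISS retriever over the ingested websites
    question_generator=question_generator,  # condenses follow-ups into standalone questions
    combine_docs_chain=doc_chain,           # "stuff" chain that answers over retrieved docs
    return_source_documents=True,
)
result = qa_chain({"question": "What is Alkemio?", "chat_history": []})
```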
9 changes: 5 additions & 4 deletions app.py
@@ -4,7 +4,7 @@
 import ai_utils
 import logging
 import def_ingest
-from config import config, website_source_path, website_generated_path, vectordb_path, generate_website, local_path, LOG_LEVEL
+from config import config, website_source_path, website_generated_path, website_source_path2, website_generated_path2, vectordb_path, generate_website, local_path, LOG_LEVEL

 # configure logging
 logger = logging.getLogger(__name__)
@@ -95,9 +95,10 @@ def reset(user_id):
     }
     return "Reset function executed"

-def ingest(source_url, website_repo, destination_path, source_path):
+def ingest(source_url, website_repo, destination_path, source_path, source_url2, website_repo2, destination_path2, source_path2):
     def_ingest.clone_and_generate(website_repo, destination_path, source_path)
-    def_ingest.mainapp(source_url)
+    def_ingest.clone_and_generate(website_repo2, destination_path2, source_path2)
+    def_ingest.mainapp(source_url, source_url2)

     return "Ingest function executed"
@@ -108,7 +109,7 @@ def on_request(ch, method, props, body):
     operation = message['pattern']['cmd']

     if operation == 'ingest':
-        response = ingest(config['source_website'], config['website_repo'], website_generated_path, website_source_path)
+        response = ingest(config['source_website'], config['website_repo'], website_generated_path, website_source_path, config['source_website2'], config['website_repo2'], website_generated_path2, website_source_path2)
     else:
         if user_id is None:
             response = "userId not provided"
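`on_request` is the RabbitMQ message callback; the consumer setup around it sits outside the visible diff. A sketch of how such an RPC-style handler is typically wired, assuming the `pika` client and the config keys defined in `config.py` (hypothetical wiring, not this repo's exact code):

```python
# sketch: RPC-style RabbitMQ consumer around an on_request-style callback
import json
import pika
from config import config

credentials = pika.PlainCredentials(config['rabbitmq_user'], config['rabbitmq_password'])
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host=config['rabbitmq_host'], credentials=credentials))
channel = connection.channel()
channel.queue_declare(queue=config['rabbitmqrequestqueue'])

def on_request(ch, method, props, body):
    message = json.loads(body)
    operation = message['pattern']['cmd']  # e.g. 'ingest', as dispatched above
    response = f"handled {operation}"      # stand-in for the real dispatch in app.py
    ch.basic_publish(exchange='',
                     routing_key=props.reply_to,  # reply queue set by the caller
                     properties=pika.BasicProperties(correlation_id=props.correlation_id),
                     body=json.dumps(response))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue=config['rabbitmqrequestqueue'], on_message_callback=on_request)
channel.start_consuming()
```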
11 changes: 11 additions & 0 deletions config.py
@@ -3,18 +3,29 @@
 load_dotenv()

 config = {
+    "llm_deployment_name": os.getenv('LLM_DEPLOYMENT_NAME'),
+    "embeddings_deployment_name": os.getenv('EMBEDDINGS_DEPLOYMENT_NAME'),
+    "openai_api_version": os.getenv('OPENAI_API_VERSION'),
     "rabbitmq_host": os.getenv('RABBITMQ_HOST'),
     "rabbitmq_user": os.getenv('RABBITMQ_USER'),
     "rabbitmq_password": os.getenv('RABBITMQ_PASSWORD'),
     "rabbitmqrequestqueue": "alkemio-chat-guidance",
     "source_website": os.getenv('AI_SOURCE_WEBSITE'),
     "website_repo": os.getenv('AI_WEBSITE_REPO'),
+    "source_website2": os.getenv('AI_SOURCE_WEBSITE2'),
+    "website_repo2": os.getenv('AI_WEBSITE_REPO2'),
+    "github_user": os.getenv('AI_GITHUB_USER'),
+    "github_pat": os.getenv('AI_GITHUB_PAT'),
     "local_path": os.getenv('AI_LOCAL_PATH')
 }

 local_path = config['local_path']
+github_user = config['github_user']
+github_pat = config['github_pat']
 website_source_path = local_path + '/website/source'
+website_source_path2 = local_path + '/website2/source'
 website_generated_path = local_path + '/website/generated'
+website_generated_path2 = local_path + '/website2/generated'
 vectordb_path = local_path + "/vectordb"
 generate_website = True
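Note that `AI_LOCAL_PATH` defaults to `~/alkemio/data`, which is why the logging changes elsewhere in this commit wrap `local_path` in `os.path.expanduser`: plain string concatenation leaves the `~` unexpanded. A short illustration:

```python
# why the diff switches the log file path to os.path.expanduser
import os

local_path = "~/alkemio/data"
print(local_path + "/app.log")  # '~/alkemio/data/app.log', the '~' is taken literally
print(os.path.join(os.path.expanduser(local_path), "app.log"))  # e.g. '/home/alkemio/alkemio/data/app.log'
```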
29 changes: 18 additions & 11 deletions def_ingest.py
@@ -1,6 +1,6 @@
 import os
 import logging
-from langchain.embeddings import OpenAIEmbeddings
+from langchain.embeddings import AzureOpenAIEmbeddings
 from langchain.text_splitter import RecursiveCharacterTextSplitter
 from langchain.vectorstores import FAISS
 import xml.etree.ElementTree as ET
@@ -12,14 +12,14 @@
 import shutil
 import subprocess
 import xml.etree.ElementTree as ET
-from config import local_path, website_generated_path, vectordb_path, LOG_LEVEL
+from config import config, local_path, website_generated_path, website_generated_path2, vectordb_path, website_source_path, website_source_path2, github_user, github_pat, github_pat, LOG_LEVEL

 # configure logging
 logger = logging.getLogger(__name__)

 # Create handlers
 c_handler = logging.StreamHandler()
-f_handler = logging.FileHandler(local_path+'/app.log')
+f_handler = logging.FileHandler(os.path.join(os.path.expanduser(local_path),'app.log'))

 c_handler.setLevel(level=getattr(logging, LOG_LEVEL))
 f_handler.setLevel(logging.ERROR)
@@ -64,18 +64,23 @@ def extract_urls_from_sitemap(base_directory):


 def embed_text(texts, save_loc):
-    embeddings = OpenAIEmbeddings(deployment=os.environ["AI_EMBEDDINGS_DEPLOYMENT_NAME"], chunk_size=1)
+    embeddings = AzureOpenAIEmbeddings(
+        azure_deployment=config['embeddings_deployment_name'],
+        openai_api_version=config['openai_api_version'],
+        chunk_size=1
+    )
     docsearch = FAISS.from_documents(texts, embeddings)

     docsearch.save_local(save_loc)

-def read_and_parse_html(local_source_path, source_website_url):
+def read_and_parse_html(local_source_path, source_website_url, website_generated_path):
     """
     Purpose: read the target files from disk, transform html to readable text, remove sequnetial CR and space sequences, fix the document source address
     and split into chunks.
     Args:
         local_source_path: path to directory containing local html files
         source_website_url: base url of source website
+        website_generated_path: path to directory containing generated html files
     Returns: list of parses and split doucments
     """
     # Transform
@@ -101,7 +106,7 @@ def read_and_parse_html(local_source_path, source_website_url):
     #body_text.page_content = re.sub(r'(\n ){2,}', '\n', re.sub(r'\n+', '\n', re.sub(r' +', ' ', body_text.page_content)))

     # remove the local directory from the source object
-    body_text.metadata['source'] = body_text.metadata['source'].replace(local_source_path, source_website_url)
+    body_text.metadata['source'] = body_text.metadata['source'].replace(website_generated_path, source_website_url)

     data.append(body_text)
@@ -137,7 +142,7 @@ def clone_and_generate(website_repo, destination_path, source_path):
         logger.info(f"git switch result: {result_switch.stdout}")
     else:
         # Repository doesn't exist, perform a git clone
-        clone_command = ['git', 'clone', website_repo, source_path]
+        clone_command = ['git', 'clone', "https://" + github_user + ":" + github_pat + "@" + website_repo, source_path]
         result_clone = subprocess.run(clone_command, capture_output=True, text=True)
         logger.info(f"git clone result: {result_clone.stdout}")
         result_switch = subprocess.run(git_switch_command, cwd=source_path, capture_output=True, text=True)
@@ -155,10 +160,10 @@
     logger.error(f"hugo result: {result_hugo.stdout}")


-def mainapp(source_website_url) -> None:
+def mainapp(source_website_url, source_website_url2) -> None:
     """
     Purpose:
-      ingest the trnaformed website contents into a vector database in presized chunks.
+      ingest the transformed website contents into a vector database in presized chunks.
     Args:
       source_website_url: full url of source website, used to return the proper link for the source documents.
     Returns:
@@ -169,7 +174,9 @@
     f = open(local_path+"/ingestion_output.txt", "w")

     # read and parse the files
-    texts = read_and_parse_html(website_generated_path, source_website_url)
+    # local_source_path, source_website_url, website_generated_path
+    texts = read_and_parse_html(website_source_path, source_website_url, website_generated_path)
+    texts += read_and_parse_html(website_source_path2, source_website_url2, website_generated_path2)

     # Save embeddings to vectordb
     embed_text(texts, vectordb_path)
@@ -180,4 +187,4 @@
 # only execute if this is the main program run (so not imported)
 if __name__ == "__main__":
-    mainapp(os.getenv('AI_SOURCE_WEBSITE'))
+    mainapp(os.getenv('AI_SOURCE_WEBSITE'),os.getenv('AI_SOURCE_WEBSITE2'))
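The body of `extract_urls_from_sitemap` falls outside the visible hunks. Given the `xml.etree.ElementTree` import above, a typical implementation over a Hugo-generated `sitemap.xml` might look like this (a sketch under that assumption, not the repo's actual code):

```python
# sketch: pull page URLs out of a generated sitemap.xml with ElementTree
import os
import xml.etree.ElementTree as ET

def extract_urls_from_sitemap(base_directory):
    sitemap_file = os.path.join(base_directory, "sitemap.xml")
    tree = ET.parse(sitemap_file)
    # Hugo emits the standard sitemap namespace
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in tree.getroot().findall(".//sm:loc", ns)]
```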
4 changes: 2 additions & 2 deletions docker-compose.yaml
@@ -8,10 +8,10 @@ services:
     container_name: guidance-engine
     volumes:
       - '/dev/shm:/dev/shm'
-      - ~/alkemio/data:/home/alkemio/data
+      - '~/alkemio/data:/home/alkemio/data'
     env_file:
       - .env
-    image: alkemio/guidance-engine:v0.2.0
+    image: alkemio/guidance-engine:v0.4.0
     depends_on:
       rabbitmq:
         condition: "service_healthy"