add 2nd website (#32)
* add 2nd website

* updated for latest env vars; typo

* Poetry update

* bump dependencies

* update README

---------

Co-authored-by: Neil Smyth <[email protected]>
Co-authored-by: Valentin Yanakiev <[email protected]>
3 people authored Dec 12, 2023
1 parent f77963d commit 8a301cc
Showing 12 changed files with 945 additions and 835 deletions.
18 changes: 10 additions & 8 deletions .azure-template.env
@@ -1,18 +1,20 @@
-OPENAI_API_TYPE=azure
-OPENAI_API_BASE=https://alkemio-gpt.openai.azure.com/
-OPENAI_API_KEY=azure-openai-key
 OPENAI_API_VERSION=2023-05-15
+AZURE_OPENAI_ENDPOINT=https://alkemio-gpt.openai.azure.com
+AZURE_OPENAI_API_KEY=azure-openai-key
+LLM_DEPLOYMENT_NAME=deploy-gpt-35-turbo
+EMBEDDINGS_DEPLOYMENT_NAME=embedding
 RABBITMQ_HOST=localhost
 RABBITMQ_USER=admin
 RABBITMQ_PASSWORD=super-secure-pass
 AI_MODEL_TEMPERATURE=0.3
-AI_MODEL_NAME=gpt-35-turbo
-AI_DEPLOYMENT_NAME=deploy-gpt-35-turbo
-AI_EMBEDDINGS_DEPLOYMENT_NAME=embedding
 AI_SOURCE_WEBSITE=https://www.alkemio.org
+AI_SOURCE_WEBSITE2=https://welcome.alkem.io
 AI_LOCAL_PATH=~/alkemio/data
-AI_WEBSITE_REPO=https://github.com/alkem-io/website.git
+AI_WEBSITE_REPO=github.com/alkem-io/website.git
+AI_WEBSITE_REPO2=github.com/alkem-io/welcome-site.git
+AI_GITHUB_USER=github-user-for-website-cloning
+AI_GITHUB_PAT=github-user-for-website-cloning
 LANGCHAIN_TRACING_V2=true
 LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
 LANGCHAIN_API_KEY="langsmith-api-key"
 LANGCHAIN_PROJECT="guidance-engine"
1 change: 1 addition & 0 deletions .gitignore
@@ -5,3 +5,4 @@ azure.env
 /__pycache__/*
 /vectordb/*
 local.env
+docker-compose-local.yaml
15 changes: 0 additions & 15 deletions .openai-template.env

This file was deleted.

4 changes: 2 additions & 2 deletions Dockerfile
@@ -4,8 +4,8 @@ FROM python:3.11-slim-bookworm
 # Set the working directory in the container to /app
 WORKDIR /app

-ARG GO_VERSION=1.21.1
-ARG HUGO_VERSION=0.118.2
+ARG GO_VERSION=1.21.5
+ARG HUGO_VERSION=0.121.1
 ARG ARCHITECTURE=amd64

 # install git, go and hugo
34 changes: 19 additions & 15 deletions README.md
@@ -60,8 +60,8 @@ There is a draft implementation for the interaction language of the model (this

 ### Docker
 The following command can be used to build the container from the Docker CLI (default architecture is amd64, so `--build-arg ARCHITECTURE=arm64` for amd64 builds):
-`docker build --build-arg ARCHITECTURE=arm64 --no-cache -t alkemio/guidance-engine:v0.2.0 .`
-`docker build--no-cache -t alkemio/guidance-engine:v0.2.0 .`
+`docker build --build-arg ARCHITECTURE=arm64 --no-cache -t alkemio/guidance-engine:v0.4.0 .`
+`docker build --no-cache -t alkemio/guidance-engine:v0.2.0 .`
 The Dockerfile has some self-explanatory configuration arguments.

The following command can be used to start the container from the Docker CLI:
@@ -70,22 +70,28 @@ where `.env` based on `.azure-template.env`
 Alternatively use `docker-compose up -d`.

 with:
-- `OPENAI_API_KEY`: a valid OpenAI API key
-- `OPENAI_API_TYPE`: a valid OpenAI API type. For Azure, the value is `azure`
+- `AZURE_OPENAI_API_KEY`: a valid OpenAI API key
 - `OPENAI_API_VERSION`: a valid Azure OpenAI version. At the moment of writing, latest is `2023-05-15`
-- `OPENAI_API_BASE`: a valid Azure OpenAI base URL, e.g. `https://{your-azure-resource-name}.openai.azure.com/`
+- `AZURE_OPENAI_ENDPOINT`: a valid Azure OpenAI base URL, e.g. `https://{your-azure-resource-name}.openai.azure.com/`
 - `RABBITMQ_HOST`: the RabbitMQ host name
 - `RABBITMQ_USER`: the RabbitMQ user
 - `RABBITMQ_PASSWORD`: the RabbitMQ password
 - `AI_MODEL_TEMPERATURE`: the `temperature` of the model, use value between 0 and 1. 1 means more randomized answer, closer to 0 - a stricter one
-- `AI_MODEL_NAME`: the model name in Azure
-- `AI_DEPLOYMENT_NAME`: the AI gpt model deployment name in Azure
-- `AI_EMBEDDINGS_DEPLOYMENT_NAME`: the AI embeddings model deployment name in Azure
-- `AI_SOURCE_WEBSITE`: the URL of the website that contains the source data (for references only)
+- `LLM_DEPLOYMENT_NAME`: the AI gpt model deployment name in Azure
+- `EMBEDDINGS_DEPLOYMENT_NAME`: the AI embeddings model deployment name in Azure
+- `AI_SOURCE_WEBSITE`: the URL of the foundation website that contains the source data (for references only)
+- `AI_SOURCE_WEBSITE2`: the URL of the welcome website that contains the source data (for references only)
 - `AI_LOCAL_PATH`: local file path for storing data
-- `AI_WEBSITE_REPO`: url of the Git repository containing the website source data, based on Hugo
+- `AI_WEBSITE_REPO`: url of the Git repository containing the foundation website source data, based on Hugo - without https
+- `AI_WEBSITE_REPO2`: url of the Git repository containing the welcome website source data, based on Hugo - without https
+- `AI_GITHUB_USER` : Github user used for cloning website repos
+- `AI_GITHUB_PAT` : Personal access token for cloning website repos
+- `LANGCHAIN_TRACING_V2` : enable Langchain tracing
+- `LANGCHAIN_ENDPOINT` : Langchain tracing endpoint (e.g. "https://api.smith.langchain.com")
+- `LANGCHAIN_API_KEY` : Langchain tracing API key
+- `LANGCHAIN_PROJECT` : Langchain tracing project name (e.g. "guidance-engine")

-You can find sample values in `.azure-template.env` and `.openai-template.env`. Configure them and create `.env` file with the updated settings.
+You can find sample values in `.azure-template.env`. Configure them and create `.env` file with the updated settings.
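Everything above is injected through plain environment variables, so a misconfigured `.env` otherwise only surfaces at request time. A minimal fail-fast check at startup (a hypothetical helper, not part of this commit), assuming `python-dotenv` as already used in `config.py`:

```python
# sketch: fail fast when the .env derived from .azure-template.env is incomplete
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

REQUIRED = [
    "OPENAI_API_VERSION",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "LLM_DEPLOYMENT_NAME",
    "EMBEDDINGS_DEPLOYMENT_NAME",
    "RABBITMQ_HOST",
]
missing = [name for name in REQUIRED if not os.getenv(name)]
if missing:
    raise SystemExit(f"missing environment variables: {missing}")
```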

### Python & Poetry
The project requires Python & Poetry installed. The minimum version dependencies can be found at `pyproject.toml`.
@@ -102,9 +108,7 @@ The following tasks are still outstanding:
 - assess overall quality and performance of the model and make improvements as and when required.
 - assess the need to summarize the chat history to avoid exceeding the prompt token limit.
 - update the yaml manifest.
-- add error handling.
 - perform extensive testing, in particular in multi-user scenarios.
-- look at improvements of the ingestion. As a minimum the service engine should not consume queries whilst the ingestion is ongoing, as thatwill lead to errors.
-- look at the use of `temperature` for the `QARetrievalChain`. It is not so obvious how this is handled.
+- look at improvements of the ingestion. As a minimum the service engine should not consume queries whilst the ingestion is ongoing, as that will lead to errors.
 - look at the possibility to implement reinforcement learning.
 - return the actual LLM costs and token usage for queries.
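For that last item, langchain's OpenAI callback is one plausible starting point. A sketch only; `qa_chain` stands in for whatever chain `ai_utils.py` actually assembles:

```python
# sketch: capture token usage and cost per query with langchain's callback
from langchain.callbacks import get_openai_callback

def query_with_usage(qa_chain, question, chat_history):
    # the callback aggregates usage across all LLM calls made inside the block
    with get_openai_callback() as cb:
        result = qa_chain({"question": question, "chat_history": chat_history})
    usage = {
        "prompt_tokens": cb.prompt_tokens,
        "completion_tokens": cb.completion_tokens,
        "total_tokens": cb.total_tokens,
        "total_cost_usd": cb.total_cost,
    }
    return result, usage
```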

25 changes: 16 additions & 9 deletions ai_utils.py
@@ -1,4 +1,4 @@
-from langchain.embeddings import OpenAIEmbeddings
+from langchain.embeddings import AzureOpenAIEmbeddings
 from langchain.vectorstores import FAISS
 from langchain.llms import AzureOpenAI
 from langchain.prompts import PromptTemplate
@@ -8,7 +8,7 @@
 from langchain.chains.conversational_retrieval.prompts import QA_PROMPT
 import logging
 import def_ingest
-from config import config, website_source_path, website_generated_path, vectordb_path, local_path, generate_website, LOG_LEVEL
+from config import config, website_source_path, website_generated_path, website_source_path2, website_generated_path2, vectordb_path, local_path, generate_website, LOG_LEVEL

 import os

@@ -17,7 +17,7 @@
 # Create handlers
 c_handler = logging.StreamHandler()
-f_handler = logging.FileHandler(local_path+'/app.log')
+f_handler = logging.FileHandler(os.path.join(os.path.expanduser(local_path),'app.log'))

 c_handler.setLevel(level=getattr(logging, LOG_LEVEL))
 f_handler.setLevel(logging.ERROR)
@@ -118,12 +118,18 @@ def get_language_by_code(language_code):
     template=chat_template, input_variables=["question", "context", "chat_history"]
 )

-generic_llm = AzureOpenAI(deployment_name=os.environ["AI_DEPLOYMENT_NAME"], model_name=os.environ["AI_MODEL_NAME"],
+generic_llm = AzureOpenAI(azure_deployment=os.environ["LLM_DEPLOYMENT_NAME"],
     temperature=0, verbose=verbose_models)

 question_generator = LLMChain(llm=generic_llm, prompt=custom_question_prompt, verbose=verbose_models)

-embeddings = OpenAIEmbeddings(deployment=os.environ["AI_EMBEDDINGS_DEPLOYMENT_NAME"], chunk_size=1)
+embeddings = AzureOpenAIEmbeddings(
+    azure_deployment=config['embeddings_deployment_name'],
+    openai_api_version=config['openai_api_version'],
+    chunk_size=1
+)

 # Check if the vector database exists
 if os.path.exists(vectordb_path+"/index.pkl"):
@@ -132,19 +138,20 @@ def get_language_by_code(language_code):
     # ingest data
     if generate_website:
         def_ingest.clone_and_generate(config['website_repo'], website_generated_path, website_source_path)
-        def_ingest.mainapp(config['source_website'])
+        def_ingest.clone_and_generate(config['website_repo2'], website_generated_path2, website_source_path2)
+        def_ingest.mainapp(config['source_website'], config['source_website2'])

 vectorstore = FAISS.load_local(vectordb_path, embeddings)
 retriever = vectorstore.as_retriever()

-chat_llm = AzureChatOpenAI(deployment_name=os.environ["AI_DEPLOYMENT_NAME"],
-    model_name=os.environ["AI_MODEL_NAME"], temperature=os.environ["AI_MODEL_TEMPERATURE"],
+chat_llm = AzureChatOpenAI(azure_deployment=os.environ["LLM_DEPLOYMENT_NAME"],
+    temperature=os.environ["AI_MODEL_TEMPERATURE"],
     max_tokens=max_token_limit)

 doc_chain = load_qa_chain(generic_llm, chain_type="stuff", prompt=QA_PROMPT, verbose=verbose_models)

 def translate_answer(answer, language):
-    translate_llm = AzureChatOpenAI(deployment_name=os.environ["AI_DEPLOYMENT_NAME"], model_name=os.environ["AI_MODEL_NAME"],
+    translate_llm = AzureChatOpenAI(azure_deployment=os.environ["LLM_DEPLOYMENT_NAME"],
         temperature=0, verbose=verbose_models)
     prompt = translation_prompt.format(answer=answer, language=language)
     return translate_llm(prompt)
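The hunks above show the building blocks (`question_generator`, `retriever`, `doc_chain`) but not the chain that ties them together, which falls outside the visible diff. A minimal sketch of the usual wiring, assuming the standard langchain `ConversationalRetrievalChain` pattern rather than this repo's exact code:

```python
# sketch of the likely wiring; the actual chain construction is not shown in this diff
from langchain.chains import ConversationalRetrievalChain
from ai_utils import retriever, question_generator, doc_chain  # module-level names from the diff above

qa_chain = ConversationalRetrievalChain(
    retriever=retriever,                    # FAISS retriever over the ingested websites
    question_generator=question_generator,  # condenses follow-ups into standalone questions
    combine_docs_chain=doc_chain,           # "stuff" chain that answers over retrieved docs
    return_source_documents=True,
)
result = qa_chain({"question": "What is Alkemio?", "chat_history": []})
```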
9 changes: 5 additions & 4 deletions app.py
@@ -4,7 +4,7 @@
 import ai_utils
 import logging
 import def_ingest
-from config import config, website_source_path, website_generated_path, vectordb_path, generate_website, local_path, LOG_LEVEL
+from config import config, website_source_path, website_generated_path, website_source_path2, website_generated_path2, vectordb_path, generate_website, local_path, LOG_LEVEL

 # configure logging
 logger = logging.getLogger(__name__)
@@ -95,9 +95,10 @@ def reset(user_id):
     }
     return "Reset function executed"

-def ingest(source_url, website_repo, destination_path, source_path):
+def ingest(source_url, website_repo, destination_path, source_path, source_url2, website_repo2, destination_path2, source_path2):
     def_ingest.clone_and_generate(website_repo, destination_path, source_path)
-    def_ingest.mainapp(source_url)
+    def_ingest.clone_and_generate(website_repo2, destination_path2, source_path2)
+    def_ingest.mainapp(source_url, source_url2)

     return "Ingest function executed"
@@ -108,7 +109,7 @@ def on_request(ch, method, props, body):
     operation = message['pattern']['cmd']

     if operation == 'ingest':
-        response = ingest(config['source_website'], config['website_repo'], website_generated_path, website_source_path)
+        response = ingest(config['source_website'], config['website_repo'], website_generated_path, website_source_path, config['source_website2'], config['website_repo2'], website_generated_path2, website_source_path2)
     else:
         if user_id is None:
             response = "userId not provided"
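`on_request` is the RabbitMQ message callback; the consumer setup around it sits outside the visible diff. A sketch of how such an RPC-style handler is typically wired, assuming the `pika` client and the config keys defined in `config.py` (hypothetical wiring, not this repo's exact code):

```python
# sketch: RPC-style RabbitMQ consumer around an on_request-style callback
import json
import pika
from config import config

credentials = pika.PlainCredentials(config['rabbitmq_user'], config['rabbitmq_password'])
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host=config['rabbitmq_host'], credentials=credentials))
channel = connection.channel()
channel.queue_declare(queue=config['rabbitmqrequestqueue'])

def on_request(ch, method, props, body):
    message = json.loads(body)
    operation = message['pattern']['cmd']  # e.g. 'ingest', as dispatched above
    response = f"handled {operation}"      # stand-in for the real dispatch in app.py
    ch.basic_publish(exchange='',
                     routing_key=props.reply_to,  # reply queue set by the caller
                     properties=pika.BasicProperties(correlation_id=props.correlation_id),
                     body=json.dumps(response))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue=config['rabbitmqrequestqueue'], on_message_callback=on_request)
channel.start_consuming()
```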
11 changes: 11 additions & 0 deletions config.py
@@ -3,18 +3,29 @@
 load_dotenv()

 config = {
+    "llm_deployment_name": os.getenv('LLM_DEPLOYMENT_NAME'),
+    "embeddings_deployment_name": os.getenv('EMBEDDINGS_DEPLOYMENT_NAME'),
+    "openai_api_version": os.getenv('OPENAI_API_VERSION'),
     "rabbitmq_host": os.getenv('RABBITMQ_HOST'),
     "rabbitmq_user": os.getenv('RABBITMQ_USER'),
     "rabbitmq_password": os.getenv('RABBITMQ_PASSWORD'),
     "rabbitmqrequestqueue": "alkemio-chat-guidance",
     "source_website": os.getenv('AI_SOURCE_WEBSITE'),
     "website_repo": os.getenv('AI_WEBSITE_REPO'),
+    "source_website2": os.getenv('AI_SOURCE_WEBSITE2'),
+    "website_repo2": os.getenv('AI_WEBSITE_REPO2'),
+    "github_user": os.getenv('AI_GITHUB_USER'),
+    "github_pat": os.getenv('AI_GITHUB_PAT'),
     "local_path": os.getenv('AI_LOCAL_PATH')
 }

 local_path = config['local_path']
+github_user = config['github_user']
+github_pat = config['github_pat']
 website_source_path = local_path + '/website/source'
+website_source_path2 = local_path + '/website2/source'
 website_generated_path = local_path + '/website/generated'
+website_generated_path2 = local_path + '/website2/generated'
 vectordb_path = local_path + "/vectordb"
 generate_website = True
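Note that `AI_LOCAL_PATH` defaults to `~/alkemio/data`, which is why the logging changes elsewhere in this commit wrap `local_path` in `os.path.expanduser`: plain string concatenation leaves the `~` unexpanded. A short illustration:

```python
# why the diff switches the log file path to os.path.expanduser
import os

local_path = "~/alkemio/data"
print(local_path + "/app.log")  # '~/alkemio/data/app.log', the '~' is taken literally
print(os.path.join(os.path.expanduser(local_path), "app.log"))  # e.g. '/home/alkemio/alkemio/data/app.log'
```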
29 changes: 18 additions & 11 deletions def_ingest.py
@@ -1,6 +1,6 @@
 import os
 import logging
-from langchain.embeddings import OpenAIEmbeddings
+from langchain.embeddings import AzureOpenAIEmbeddings
 from langchain.text_splitter import RecursiveCharacterTextSplitter
 from langchain.vectorstores import FAISS
 import xml.etree.ElementTree as ET
@@ -12,14 +12,14 @@
 import shutil
 import subprocess
 import xml.etree.ElementTree as ET
-from config import local_path, website_generated_path, vectordb_path, LOG_LEVEL
+from config import config, local_path, website_generated_path, website_generated_path2, vectordb_path, website_source_path, website_source_path2, github_user, github_pat, github_pat, LOG_LEVEL

 # configure logging
 logger = logging.getLogger(__name__)

 # Create handlers
 c_handler = logging.StreamHandler()
-f_handler = logging.FileHandler(local_path+'/app.log')
+f_handler = logging.FileHandler(os.path.join(os.path.expanduser(local_path),'app.log'))

 c_handler.setLevel(level=getattr(logging, LOG_LEVEL))
 f_handler.setLevel(logging.ERROR)
@@ -64,18 +64,23 @@ def extract_urls_from_sitemap(base_directory):


 def embed_text(texts, save_loc):
-    embeddings = OpenAIEmbeddings(deployment=os.environ["AI_EMBEDDINGS_DEPLOYMENT_NAME"], chunk_size=1)
+    embeddings = AzureOpenAIEmbeddings(
+        azure_deployment=config['embeddings_deployment_name'],
+        openai_api_version=config['openai_api_version'],
+        chunk_size=1
+    )
     docsearch = FAISS.from_documents(texts, embeddings)

     docsearch.save_local(save_loc)

-def read_and_parse_html(local_source_path, source_website_url):
+def read_and_parse_html(local_source_path, source_website_url, website_generated_path):
     """
     Purpose: read the target files from disk, transform html to readable text, remove sequnetial CR and space sequences, fix the document source address
     and split into chunks.
     Args:
         local_source_path: path to directory containing local html files
         source_website_url: base url of source website
+        website_generated_path: path to directory containing generated html files
     Returns: list of parses and split doucments
     """
     # Transform
@@ -101,7 +106,7 @@ def read_and_parse_html(local_source_path, source_website_url):
     #body_text.page_content = re.sub(r'(\n ){2,}', '\n', re.sub(r'\n+', '\n', re.sub(r' +', ' ', body_text.page_content)))

     # remove the local directory from the source object
-    body_text.metadata['source'] = body_text.metadata['source'].replace(local_source_path, source_website_url)
+    body_text.metadata['source'] = body_text.metadata['source'].replace(website_generated_path, source_website_url)

     data.append(body_text)
@@ -137,7 +142,7 @@ def clone_and_generate(website_repo, destination_path, source_path):
         logger.info(f"git switch result: {result_switch.stdout}")
     else:
         # Repository doesn't exist, perform a git clone
-        clone_command = ['git', 'clone', website_repo, source_path]
+        clone_command = ['git', 'clone', "https://" + github_user + ":" + github_pat + "@" + website_repo, source_path]
         result_clone = subprocess.run(clone_command, capture_output=True, text=True)
         logger.info(f"git clone result: {result_clone.stdout}")
         result_switch = subprocess.run(git_switch_command, cwd=source_path, capture_output=True, text=True)
@@ -155,10 +160,10 @@
     logger.error(f"hugo result: {result_hugo.stdout}")


-def mainapp(source_website_url) -> None:
+def mainapp(source_website_url, source_website_url2) -> None:
     """
     Purpose:
-      ingest the trnaformed website contents into a vector database in presized chunks.
+      ingest the transformed website contents into a vector database in presized chunks.
     Args:
       source_website_url: full url of source website, used to return the proper link for the source documents.
     Returns:
@@ -169,7 +174,9 @@
     f = open(local_path+"/ingestion_output.txt", "w")

     # read and parse the files
-    texts = read_and_parse_html(website_generated_path, source_website_url)
+    # local_source_path, source_website_url, website_generated_path
+    texts = read_and_parse_html(website_source_path, source_website_url, website_generated_path)
+    texts += read_and_parse_html(website_source_path2, source_website_url2, website_generated_path2)

     # Save embeddings to vectordb
     embed_text(texts, vectordb_path)
@@ -180,4 +187,4 @@
 # only execute if this is the main program run (so not imported)
 if __name__ == "__main__":
-    mainapp(os.getenv('AI_SOURCE_WEBSITE'))
+    mainapp(os.getenv('AI_SOURCE_WEBSITE'),os.getenv('AI_SOURCE_WEBSITE2'))
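The body of `extract_urls_from_sitemap` falls outside the visible hunks. Given the `xml.etree.ElementTree` import above, a typical implementation over a Hugo-generated `sitemap.xml` might look like this (a sketch under that assumption, not the repo's actual code):

```python
# sketch: pull page URLs out of a generated sitemap.xml with ElementTree
import os
import xml.etree.ElementTree as ET

def extract_urls_from_sitemap(base_directory):
    sitemap_file = os.path.join(base_directory, "sitemap.xml")
    tree = ET.parse(sitemap_file)
    # Hugo emits the standard sitemap namespace
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in tree.getroot().findall(".//sm:loc", ns)]
```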
4 changes: 2 additions & 2 deletions docker-compose.yaml
@@ -8,10 +8,10 @@ services:
     container_name: guidance-engine
     volumes:
       - '/dev/shm:/dev/shm'
-      - ~/alkemio/data:/home/alkemio/data
+      - '~/alkemio/data:/home/alkemio/data'
     env_file:
       - .env
-    image: alkemio/guidance-engine:v0.2.0
+    image: alkemio/guidance-engine:v0.4.0
     depends_on:
       rabbitmq:
         condition: "service_healthy"