Replace local HuggingFace embeddings with OpenAI-compatible embeddings #6

Open
wants to merge 3 commits into
base: main
156 changes: 93 additions & 63 deletions document_search/document_search_langchain.ipynb
@@ -54,7 +54,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"id": "742aa343-c90c-4e4a-8099-a3fa218e256d",
"metadata": {},
"outputs": [],
@@ -65,7 +65,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"id": "2f637730",
"metadata": {},
"outputs": [],
@@ -79,8 +79,7 @@
"from langchain.chains import RetrievalQA\n",
"from langchain_community.vectorstores import FAISS\n",
"from langchain.document_loaders.pdf import PyPDFDirectoryLoader\n",
"from langchain_huggingface.embeddings import HuggingFaceEmbeddings\n",
"from langchain_openai import ChatOpenAI\n",
"from langchain_openai import ChatOpenAI, OpenAIEmbeddings\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter"
]
},
@@ -94,7 +93,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 13,
"id": "1e70d51a",
"metadata": {},
"outputs": [],
@@ -110,7 +109,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 14,
"id": "e9e0bec6-a89c-4fca-a218-c784ec18e109",
"metadata": {},
"outputs": [],
@@ -130,7 +129,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 7,
"id": "dd4e2417",
"metadata": {},
"outputs": [],
@@ -153,7 +152,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 15,
"id": "74b61e4f",
"metadata": {},
"outputs": [],
@@ -179,13 +178,13 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 28,
"id": "2553d130-5b02-4852-928f-beb7ecd05d3f",
"metadata": {},
"outputs": [],
"source": [
"GENERATOR_MODEL_NAME = \"Meta-Llama-3.1-8B-Instruct\"\n",
"EMBEDDING_MODEL_NAME = \"BAAI/bge-base-en-v1.5\""
"EMBEDDING_MODEL_NAME = \"bge-base-en-v1.5\""
]
},
{
@@ -204,7 +203,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 10,
"id": "6133a928",
"metadata": {},
"outputs": [],
@@ -222,7 +221,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 16,
"id": "00061d61",
"metadata": {},
"outputs": [
@@ -283,7 +282,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 19,
"id": "5710c72d",
"metadata": {},
"outputs": [
@@ -292,7 +291,7 @@
"output_type": "stream",
"text": [
"Number of source documents: 42\n",
"Number of text chunks: 228\n"
"Number of text chunks: 196\n"
]
}
],
@@ -319,7 +318,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": null,
"id": "24b42902-d145-4f61-80c2-334a4b1da886",
"metadata": {},
"outputs": [
@@ -332,14 +331,13 @@
}
],
"source": [
"model_kwargs = {'device': 'cuda', 'trust_remote_code': True}\n",
"encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity\n",
"\n",
"print(f\"Setting up the embeddings model...\")\n",
"embeddings = HuggingFaceEmbeddings(\n",
" model_name=EMBEDDING_MODEL_NAME,\n",
" model_kwargs=model_kwargs,\n",
" encode_kwargs=encode_kwargs,\n",
"print(f\"Setting up the embeddings model {EMBEDDING_MODEL_NAME} at {GENERATOR_BASE_URL}\")\n",
"embeddings = OpenAIEmbeddings(\n",
" model=EMBEDDING_MODEL_NAME,\n",
" # Leverage the RoBERTa tokenizer to make sure that \n",
" # the chunks stay within the 512-token context window.\n",
" tiktoken_model_name=\"roberta-base\",\n",
" tiktoken_enabled=False\n",
")"
]
},
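Review note on the embeddings cell above: the added comment says the tokenizer setting keeps chunks within the 512-token context window. As far as I can tell, `OpenAIEmbeddings` cuts long inputs according to its `embedding_ctx_length` parameter, which this diff does not set, so the limit may not be enforced the way the comment suggests. The sketch below only illustrates the windowing idea. `split_into_token_windows` is a hypothetical helper, and whitespace splitting is an assumed stand-in for the real tokenizer.

```python
from typing import Callable, List


def split_into_token_windows(
    text: str,
    max_tokens: int = 512,
    tokenize: Callable[[str], List[str]] = str.split,
) -> List[str]:
    """Split text into consecutive pieces of at most max_tokens tokens.

    Hypothetical helper for illustration only. The default tokenizer is a
    plain whitespace split, not the tokenizer of the embedding model.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = tokenize(text)
    # Walk the token list in steps of max_tokens and rejoin each slice.
    return [
        " ".join(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


# Synthetic input of 1100 whitespace-separated tokens.
sample = " ".join(f"w{i}" for i in range(1100))
windows = split_into_token_windows(sample, max_tokens=512)
window_sizes = [len(window.split()) for window in windows]
print(window_sizes)
```

If the 512-token limit matters for the served model, passing `embedding_ctx_length=512` to `OpenAIEmbeddings` may be worth testing against the endpoint.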
@@ -361,7 +359,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 30,
"id": "1048c42a",
"metadata": {},
"outputs": [],
@@ -383,7 +381,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 31,
"id": "51dc81d7-8333-41e6-9e77-47a45ee0b374",
"metadata": {},
"outputs": [
@@ -394,19 +392,22 @@
"Document 1:\n",
"\n",
"5 \n",
"Annual Report 2021–22 Vector Institute\n",
"Annual Report 2021–22Vector Institute\n",
"SPOTLIGHT ON FIVE YEARS OF AI \n",
"LEADERSHIP FOR CANADIANS \n",
"SINCE THE VECTOR INSTITUTE WAS FOUNDED IN 2017: \n",
"2,080+ \n",
"Students have graduated from \n",
"Vector-recognized AI programs and \n",
"study paths $6.2 M \n",
"study paths \n",
"$6.2 M \n",
"Scholarship funds committed to \n",
"students in AI programs 3,700+ \n",
"students in AI programs \n",
"3,700+ \n",
"Postings for AI-focused jobs and \n",
"internships ofered on Vector’s \n",
"Digital Talent Hub $103 M \n",
"Digital Talent Hub \n",
"$103 M \n",
"In research funding committed to \n",
"Vector-afliated researchers \n",
"94 \n",
@@ -415,8 +416,11 @@
"Document 2:\n",
"\n",
"26 \n",
" VECTOR SCHOLARSHIPS IN \n",
"AI ATTRACT TOP TALENT TO ONTARIO UNIVERSITIES \n",
" \n",
" \n",
"VECTOR SCHOLARSHIPS IN \n",
"AI ATTRACT TOP TALENT \n",
"TO ONTARIO UNIVERSITIES \n",
"109 \n",
"Vector Scholarships in AI awarded \n",
"34 \n",
@@ -425,54 +429,80 @@
"Universities \n",
"351 \n",
"Scholarships awarded since the \n",
"program launched in 2018 Supported with funding from the Province of Ontario, the Vector Institute Scholarship in Artifcial Intelligence (VSAI) helps Ontario universities to attract the best and brightest students to study in AI-related master’s programs. \n",
"program launched in 2018 \n",
"Supported with funding from the Province of \n",
"Ontario, the Vector Institute Scholarship in Artifcial \n",
"Intelligence (VSAI) helps Ontario universities to attract \n",
"the best and brightest students to study in AI-related \n",
"master’s programs. \n",
"Scholarship recipients connect directly with leading\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 3:\n",
"\n",
"Arrows indicate year-over-year (YoY) directional change since 2020–21 The complete Ontario AI Snapshot for 2021–22 will be available soon on the Vector Institute website at vectorinstitute.ai. \n",
"The complete Ontario AI Snapshot for 2021–22 will be available soon on the \n",
"Vector Institute website at vectorinstitute.ai. \n",
"YoY \n",
"22,458 \n",
"AI jobs created YoY \n",
"59,673 \n",
"AI jobs retained YoY \n",
"AI jobs created \n",
"YoY \n",
"59,67 3 \n",
"AI jobs retained \n",
"YoY \n",
"1,775 \n",
"New AI Master’s & study path enrolments YoY \n",
"New AI Master’s & study \n",
"path enrolments \n",
"YoY \n",
"1,007 \n",
"New AI Master’s graduates from Vector-recognized programs \n",
"New AI Master’s graduates from \n",
"Vector-recognized programs \n",
"YoY \n",
"66 \n",
"New AI-related patents fled across Canada YoY \n",
"New AI-related patents \n",
"fled across Canada \n",
"YoY \n",
"$2.86 BILLION \n",
"In AI-related VC investment * YoY \n",
"273\n",
"In AI-related VC investment* \n",
"YoY \n",
"273 \n",
"Companies invested in \n",
"the Ontario AI ecosystem \n",
"YoY \n",
"50 \n",
"Companies moved into\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 4:\n",
"\n",
"my professional and academic journey.” \n",
"Alex Cui, Vector Scholarship in AI Recipient 2021–22 \n",
"“The scholarship funding from the Vector Institute \n",
"has played an instrumental role in expanding \n",
"graduate teaching, learning, and research \n",
"opportunities in AI at Queen’s University.” \n",
"Dr. Fahim Quadir, Vice-Provost and Dean, School of \n",
"Graduate Studies & Professor of Global Developmental \n",
"Studies, Queen’s University \n",
"PRACTICAL, HANDS-ON \n",
"PROGRAMMING TO FOSTER \n",
"WORKFORCE SKILLS \n",
"AND EXPERIENCE\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 5:\n",
"\n",
"23 \n",
"RESEARCH AWARDS AND \n",
"ACHIEVEMENTS \n",
"Each year, members of Vector’s research community \n",
"are recognized for outstanding contributions to AI and machine learning felds. Highlights of 2021–22 include: \n",
"are recognized for outstanding contributions to AI and \n",
"machine learning felds. Highlights of 2021–22 include: \n",
"GLOBAL REACH OF VECTOR \n",
"RESEARCHERS AND THEIR WORK \n",
"Vector researchers published papers, gave \n",
"presentations, or led workshops at many of the top AI conferences this year, including NeurIPS, CVPR, ICLR, ICML, and ACM FAccT. \n",
"380+ Research papers presented at\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 5:\n",
"\n",
"24 \n",
"Annual Report 2021–22 Vector Institute\n",
" \n",
" \n",
" TALENT & \n",
"WORKFORCE DEVELOPMENT \n",
"Vector is helping to attract, develop, and \n",
"connect the AI-skilled workforce that will transform Ontario’s economy 1,775 \n",
"AI master’s students began their studies in \n",
"recognized AI-related programs and study paths, up 27% from last year V\n",
"ector is working with both universities and employers\n"
"presentations, or led workshops at many of the \n",
"top AI conferences this year, including NeurIPS, \n",
"CVPR, ICLR, ICML, and ACM FAccT. \n",
"380+ Research papers presented at \n",
"high-impact global \n",
"conferences and in top-\n"
]
}
],
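Review note on the retrieval output above: the `HuggingFaceEmbeddings` setup that this PR removes passed `normalize_embeddings=True` so that similarity scores behave like cosine similarity. The diff does not show whether the OpenAI-compatible endpoint returns unit-length vectors, so a quick check may be useful. The sketch below is a minimal, standard-library illustration with made-up vectors. All function names are hypothetical helpers, not part of LangChain or FAISS.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_norm(vec: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(dot(vec, vec))


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = l2_norm(vec)
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (l2_norm(a) * l2_norm(b))


def is_unit_length(vec: Sequence[float], tol: float = 1e-6) -> bool:
    """True when the vector's length is within tol of 1."""
    return abs(l2_norm(vec) - 1.0) <= tol


# Made-up vectors standing in for two embeddings.
a = [3.0, 4.0]
b = [4.0, 3.0]

# For unit-length vectors, the inner product equals the cosine similarity.
inner_of_normalized = dot(normalize(a), normalize(b))
print(inner_of_normalized, cosine_similarity(a, b))
print(is_unit_length(a), is_unit_length(normalize(a)))
```

In the notebook, something like `is_unit_length(embeddings.embed_query("test"))` could confirm whether the served model already normalizes its output.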
@@ -490,7 +520,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 32,
"id": "e26d9f46-a082-4497-8ffc-9fa3eccc2ef3",
"metadata": {},
"outputs": [
@@ -500,7 +530,7 @@
"text": [
"Result: \n",
"\n",
"The text does not provide the number of Vector Scholarships in AI awarded in 2022. It does provide the total number of Vector Scholarships in AI awarded since the program launched in 2018, which is 109.\n"
"According to the context, 109 Vector Scholarships in AI were awarded.\n"
]
}
],
@@ -525,9 +555,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "rag_dataloaders",
"display_name": "Python 3",
"language": "python",
"name": "rag_dataloaders"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -539,7 +569,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.12.5"
}
},
"nbformat": 4,