updated docs

deepaksood619 · Dec 18, 2023 · 8fb6384
1 parent a4bd2d5
commit 8fb6384

Showing 30 changed files with 1,019 additions and 40 deletions.
diff --git a/docs/ai/courses/customer-analytics-in-python/intro.md b/docs/ai/courses/customer-analytics-in-python/intro.md
@@ -96,7 +96,7 @@ It is expected that: units sold from a brand would increase if the unit price of
 
 - Linear dependency between variables
 
-df_segmentation.corr()
+ `df_segmentation.corr()`
 
 - Ranges from -1 to 1
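
As a rough illustration of the note above: pandas' `DataFrame.corr()` uses the Pearson coefficient by default, and that coefficient can be computed by hand as below. This is a minimal sketch, not the course's code; the column values (`age`, `income`, `spend`) are made-up assumptions and do not come from `df_segmentation`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, without the shared 1/n factor (it cancels).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical columns, invented for illustration only.
age = [25, 32, 47, 51, 62]
income = [30, 42, 61, 68, 80]
spend = [80, 66, 45, 40, 25]

r_pos = pearson(age, income)  # close to +1: the two rise together
r_neg = pearson(age, spend)   # close to -1: one falls as the other rises
```

Values near 0 would indicate little linear dependency between the two columns.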
 

diff --git a/docs/ai/data-science/big-data/data-preprocessing.md b/docs/ai/data-science/big-data/data-preprocessing.md
@@ -52,6 +52,15 @@ p and q are the attribute values for two data objects
 
 ![image](../../../media/Data-Preprocessing-image2.jpg)
 
+### Types
+
+1. Euclidean Distance
+2. Mahalanobis Distance
+3. Manhattan Distance
+4. Jaccard Similarity
+5. Minkowski Distance
+6. Cosine Similarity
+
 ### Euclidean Distance
 
 ![image](../../../media/Data-Preprocessing-image3.jpg)
@@ -69,6 +78,10 @@ p and q are the attribute values for two data objects
 
 ![image](../../../media/Data-Preprocessing-image5.jpg)
 
+[Cosine Similarity - GeeksforGeeks](https://www.geeksforgeeks.org/cosine-similarity/)
+
+[Cosine similarity: How does it measure the similarity, Maths behind and usage in Python | by Varun | Towards Data Science](https://towardsdatascience.com/cosine-similarity-how-does-it-measure-the-similarity-maths-behind-and-usage-in-python-50ad30aad7db)
+
 ### Similarity Between Binary Vectors
 
 ![image](../../../media/Data-Preprocessing-image6.jpg)
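
Most of the measures listed under Types can be sketched in a few lines of plain Python. The block below is an illustrative sketch only: the function names are assumptions, the Jaccard version assumes binary (0/1) vectors, and Mahalanobis distance is left out because it needs the inverse covariance matrix of the data.

```python
import math


def minkowski(p, q, r):
    """Minkowski distance of order r between two equal-length vectors."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)


def euclidean(p, q):
    # Minkowski distance with r = 2.
    return minkowski(p, q, 2)


def manhattan(p, q):
    # Minkowski distance with r = 1.
    return minkowski(p, q, 1)


def cosine_similarity(p, q):
    """Cosine of the angle between p and q (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def jaccard(p, q):
    """Jaccard similarity for binary vectors: shared 1s over positions with any 1."""
    both = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(p, q) if a == 1 or b == 1)
    return both / either
```

For example, `euclidean([0, 0], [3, 4])` is 5 and `manhattan([0, 0], [3, 4])` is 7, while `cosine_similarity` ignores vector length and only compares direction.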

diff --git a/docs/ai/data-science/datasets.md b/docs/ai/data-science/datasets.md
@@ -18,6 +18,16 @@ The ARC Corpus contains 14M unordered, science-related sentences including knowl
 
 [WikiText-103 Dataset | Papers With Code](https://paperswithcode.com/dataset/wikitext-103)
 
+### BBH - [OpenCompass](https://opencompass.org.cn/dataset-detail/BBH)
+
+A suite of 23 challenging BIG-Bench tasks, called BIG-Bench Hard (BBH). These are the tasks for which prior language model evaluations did not outperform the average human rater.
+
+### BIG-Bench
+
+The Beyond the Imitation Game Benchmark (BIG-bench) is a _collaborative_ benchmark intended to probe large language models and extrapolate their future capabilities.
+
+[GitHub - google/BIG-bench: Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models](https://github.com/google/BIG-bench)
+
 ## YCSB Workloads
 
 YCSB includes a set of core workloads that define a basic benchmark for cloud systems.

diff --git a/docs/ai/llm/llm-building.md b/docs/ai/llm/llm-building.md
@@ -15,16 +15,6 @@
 
 [Emerging Architectures for LLM Applications | Andreessen Horowitz](https://a16z.com/2023/06/20/emerging-architectures-for-llm-applications/)
 
-### RAG - retrieval-augmented generation
-
-RAG is an AI framework for retrieving facts from an external knowledge base to ground large language models (LLMs) on the most accurate, up-to-date information and to give users insight into LLMs' generative process.
-
-[Using ChatGPT to Search Enterprise Data with Pamela Fox - YouTube](https://www.youtube.com/watch?v=lj5NjKHuFlo)
-
-[What is retrieval-augmented generation? | IBM Research Blog](https://research.ibm.com/blog/retrieval-augmented-generation-RAG)
-
-[What is Retrieval-Augmented Generation (RAG)? - YouTube](https://youtu.be/T-D1OfcDW1M?si=KoUb-NXATK50d3i7)
-
 [Transformers, explained: Understand the model behind GPT, BERT, and T5 - YouTube](https://youtu.be/SZorAJ4I-sA?si=-GMfzGThDO20aGkB)
 
 - Positional encodings
@@ -69,7 +59,7 @@ RAG is an AI framework for retrieving facts from an external knowledge base to g
 - Watermarking & evasion
 - Model theft
 
-[[1hr Talk] Intro to Large Language Models - YouTube](https://www.youtube.com/watch?v=zjkBMFhNj_g)
+[1hr Talk Intro to Large Language Models - YouTube](https://www.youtube.com/watch?v=zjkBMFhNj_g)
 
 ## Dev Tools
 
@@ -108,6 +98,16 @@ chainlit run document_qa.py
 
 ### HuggingFace
 
+#### About
+
+[How to choose a Sentence Transformer from Hugging Face | Weaviate - Vector Database](https://weaviate.io/blog/how-to-choose-a-sentence-transformer-from-hugging-face)
+
+- Blue - the **dataset** it was trained on
+- Green - the **language** of the dataset
+- White or Purple - **additional details** about the model
+
+#### Models
+
 - [GitHub - huggingface/transformers: 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.](https://github.com/huggingface/transformers)
 - [Hugging Face – The AI community building the future.](https://huggingface.co/)
 

diff --git a/docs/ai/llm/rag-retrieval-augmented-generation.md b/docs/ai/llm/rag-retrieval-augmented-generation.md
@@ -0,0 +1,11 @@
+# RAG - retrieval-augmented generation
+
+RAG is an AI framework for retrieving facts from an external knowledge base to ground large language models (LLMs) on the most accurate, up-to-date information and to give users insight into LLMs' generative process.
+
+[Using ChatGPT to Search Enterprise Data with Pamela Fox - YouTube](https://www.youtube.com/watch?v=lj5NjKHuFlo)
+
+[What is retrieval-augmented generation? | IBM Research Blog](https://research.ibm.com/blog/retrieval-augmented-generation-RAG)
+
+[What is Retrieval-Augmented Generation (RAG)? - YouTube](https://youtu.be/T-D1OfcDW1M?si=KoUb-NXATK50d3i7)
+
+[**Vector Search RAG Tutorial – Combine Your Data with LLMs with Advanced Search - YouTube**](https://www.youtube.com/watch?v=JEBDfGqrAUA)
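
The retrieve-then-ground flow described in the new file can be sketched without any external service. This is a toy sketch under stated assumptions: word-count vectors stand in for a real embedding model, the knowledge base and all function names (`retrieve`, `build_prompt`) are hypothetical, and the final call to an LLM is deliberately left out.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages):
    """Ground the question in the retrieved passages before it goes to an LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical knowledge base, invented for illustration.
knowledge_base = [
    "Refunds are processed within five business days.",
    "The office is closed on public holidays.",
    "Support is available by email around the clock.",
]

question = "How long do refunds take?"
passages = retrieve(question, knowledge_base, k=1)
prompt = build_prompt(question, passages)
```

In a fuller setup the word counts would be replaced by vectors from an embedding model and a vector index, and `prompt` would be sent to the language model.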
diff --git a/docs/ai/llm/readme.md b/docs/ai/llm/readme.md
@@ -2,6 +2,7 @@
 
 - [LLM Building](ai/llm/llm-building.md)
 - [Design Patterns](ai/llm/design-patterns.md)
+- [RAG Retrieval Augmented Generation](ai/llm/rag-retrieval-augmented-generation.md)
 - [ChatGPT Prompt Engineering](ai/courses/chatgpt-prompt-eng.md)
 
 MMLU - Massive Multitask Language Understanding
@@ -19,7 +20,8 @@ Moving from information to knowledge age
 - [Meet Bard](https://bard.google.com/)
 - https://openai.com/blog/chatgpt
 - [Godmode](https://godmode.space/)
-- [OpenAI Platform](https://platform.openai.com/)
+- [**OpenAI Platform**](https://platform.openai.com/)
+  - [Embeddings - OpenAI API](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings)
 - [GPT-4](https://openai.com/research/gpt-4)
 - [It’s Time to Pay Attention to A.I. (ChatGPT and Beyond)](https://www.youtube.com/watch?v=0uQqMxXoNVs)
 - https://en.wikipedia.org/wiki/GPT-3
@@ -98,6 +100,13 @@ Moving from information to knowledge age
 - Amazon EC2 P4d/P4de instances - Powered by NVIDIA A100 Tensor Core GPUs
 - Amazon EC2 G5 instances - Powered by NVIDIA A10G Tensor Core GPUs
 
+## Models
+
+- [openai/whisper-large-v3 · Hugging Face](https://huggingface.co/openai/whisper-large-v3)
+  - Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
+  - [GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision](https://github.com/openai/whisper)
+- [sentence-transformers/all-MiniLM-L6-v2 · Hugging Face](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
+
 ## Links
 
 - [ChatGPT Prompt Engineering for Developers](ai/courses/chatgpt-prompt-eng.md)

diff --git a/docs/ai/ml-algorithms/embeddings-and-estimators.md b/docs/ai/ml-algorithms/embeddings-and-estimators.md
@@ -5,7 +5,7 @@
 
 ## Embeddings
 
-## An embedding of a vector is another vector in a smaller dimensional space
+An embedding of a vector is another vector in a lower-dimensional space.
 
 - Manage sparse data
 - Make machine learning models that use sparse data consume less memory and train faster
@@ -15,6 +15,8 @@
 
 https://www.toptal.com/machine-learning/embeddings-in-machine-learning
 
+[Embeddings | Machine Learning | Google for Developers](https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture)
+
 ## Summary of Embeddings
 
 ![image](../../media/Embeddings-&-Estimators-image1.jpg)
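
One way to picture the sparse-to-dense idea from the Embeddings section: a one-hot vector with one slot per vocabulary entry is mapped to a short dense vector, and multiplying the one-hot vector by the embedding table is the same as reading one row of the table. The sketch below assumes a hypothetical four-word vocabulary and hand-picked 2-dimensional values; a real table would be learned during training.

```python
# Hypothetical vocabulary and embedding table (values are made up, not learned).
vocab = ["cat", "dog", "car", "bus"]
embedding_table = [
    [0.9, 0.1],  # cat
    [0.8, 0.2],  # dog
    [0.1, 0.9],  # car
    [0.2, 0.8],  # bus
]


def one_hot(word):
    """Sparse representation: one slot per vocabulary entry."""
    return [1.0 if w == word else 0.0 for w in vocab]


def embed_by_matmul(word):
    """Multiply the one-hot row vector by the embedding table."""
    sparse = one_hot(word)
    dims = len(embedding_table[0])
    return [
        sum(sparse[i] * embedding_table[i][d] for i in range(len(vocab)))
        for d in range(dims)
    ]


def embed_by_lookup(word):
    """Same result without the wasted multiplications: just index the row."""
    return embedding_table[vocab.index(word)]
```

The lookup form is why embeddings cut memory and compute for sparse inputs: only the row for the active index is ever touched.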

diff --git a/docs/ai/ml-algorithms/readme.md b/docs/ai/ml-algorithms/readme.md
@@ -28,5 +28,6 @@
 - [Feature Engineering](feature-engineering)
 - [Regularization](regularization)
 - [Embedding and Estimators](embeddings-and-estimators)
+- [Vector Embeddings](ai/ml-algorithms/vector-embeddings.md)
 - [Dimensionality Reduction](dimensionality-reduction)
 - [Others](ai/ml-algorithms/others.md)
diff --git a/docs/ai/ml-algorithms/vector-embeddings.md b/docs/ai/ml-algorithms/vector-embeddings.md
@@ -0,0 +1,78 @@
+# Vector Embeddings
+
+**Vector embeddings are a way to convert words, sentences, and other data into numbers that capture their meaning and relationships.** They represent different data types as points in a multidimensional space, where similar data points are clustered closer together. These numerical representations help machines understand and process this data more effectively.
+
+[Word](https://www.elastic.co/what-is/word-embedding) and sentence embeddings are two of the most common subtypes of vector embeddings, but there are others. Some vector embeddings represent entire documents; others include image vectors designed to match visual content, user profile vectors that capture a user’s preferences, and product vectors that help identify similar products. Vector embeddings help [machine learning](https://www.elastic.co/what-is/machine-learning) algorithms find patterns in data and perform tasks such as [sentiment analysis](https://www.elastic.co/what-is/sentiment-analysis), language translation, and recommendation.
+
+![vector-embeddings](../../media/Pasted%20image%2020231216192551.png)
+
+### Types of vector embeddings
+
+#### [Word embeddings](https://www.elastic.co/what-is/word-embedding)
+
+Represent individual words as vectors. Techniques like Word2Vec, GloVe, and FastText learn word embeddings by capturing semantic relationships and contextual information from large text corpora.
+
+#### Sentence embeddings
+
+Represent entire sentences as vectors. Models like Universal Sentence Encoder (USE) and SkipThought generate embeddings that capture the overall meaning and context of the sentences.
+
+#### Document embeddings
+
+Represent documents (anything from newspaper articles and academic papers to books) as vectors. They capture the semantic information and context of the entire document. Techniques like Doc2Vec and Paragraph Vectors are designed to learn document embeddings.
+
+#### Image embeddings
+
+Represent images as vectors by capturing different visual features. Techniques like convolutional neural networks (CNNs) and pre-trained models like ResNet and VGG generate image embeddings for tasks like image classification, object detection, and image similarity.
+
+#### User embeddings
+
+Represent users in a system or platform as vectors. They capture user preferences, [behaviors](https://www.elastic.co/what-is/user-behavior-analytics), and characteristics. User embeddings can be used in everything from recommendation systems and personalized marketing to user segmentation.
+
+#### Product embeddings
+
+Represent products in ecommerce or recommendation systems as vectors. They capture a product’s attributes, features, and any other semantic information available. Algorithms can then use these embeddings to compare, recommend, and analyze products based on their vector representations.
+
+### Are embeddings and vectors the same thing?
+
+In the context of vector embeddings, yes, embeddings and vectors are the same thing. Both refer to numerical representations of data, where each data point is represented by a vector in a high-dimensional space.
+
+### Use Cases
+
+1. Recommendation systems (e.g. Netflix-style if-you-like-these-movies-you’ll-like-this-one-too)
+2. All kinds of search
+	1. Text search (like Google Search)
+	2. Image search (like Google Reverse Image Search)
+3. Chatbots and question-answering systems
+4. Data preprocessing (preparing data to be fed into a machine learning model)
+5. One-shot/zero-shot learning (i.e. machine learning models that learn from almost no training data)
+6. Fraud detection/outlier detection
+7. Typo detection and all manner of “fuzzy matching”
+8. Detecting when ML models go stale (drift)
+
+[What are vector embeddings? | A Comprehensive Vector Embeddings Guide | Elastic](https://www.elastic.co/what-is/vector-embedding)
+
+[Meet AI’s multitool: Vector embeddings | Google Cloud Blog](https://cloud.google.com/blog/topics/developers-practitioners/meet-ais-multitool-vector-embeddings)
+
+[What are Vector Embeddings | Pinecone](https://www.pinecone.io/learn/vector-embeddings/)
+
+## Text Embeddings / Transformers
+
+[GitHub - SeanLee97/AnglE: Angle-optimized Text Embeddings | 🔥 SOTA on STS and MTEB Leaderboard](https://github.com/SeanLee97/AnglE)
+
+[MTEB Leaderboard - a Hugging Face Space by mteb](https://huggingface.co/spaces/mteb/leaderboard)
+
+[OpenAI Platform](https://platform.openai.com/tokenizer)
+
+[sentence-transformers/all-MiniLM-L6-v2 · Hugging Face](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
+
+[Pretrained Models — Sentence-Transformers documentation](https://www.sbert.net/docs/pretrained_models.html)
+
+[sentence-transformers (Sentence Transformers)](https://huggingface.co/sentence-transformers)
+
+## Links
+
+[word-embedding-to-transformers](ai/nlp/word-embedding-to-transformers.md)
+
+[**Vector Embeddings Tutorial – Code Your Own AI Assistant with GPT-4 API + LangChain + NLP - YouTube**](https://www.youtube.com/watch?v=yfHHvmaMkcA&ab_channel=freeCodeCamp.org)
+
+[$0 Embeddings (OpenAI vs. free & open source) - YouTube](https://www.youtube.com/watch?v=QdDoFfkVkcw&ab_channel=RabbitHoleSyndrome)
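
The claim in the new file that similar items sit closer together in the embedding space is what makes recommendation and search work: find the nearest vector. Below is a minimal sketch with hypothetical, hand-written 3-dimensional product vectors; real embeddings would come from a trained model and have far more dimensions.

```python
import math

# Hypothetical product embeddings, invented for illustration.
product_vectors = {
    "running shoes": [0.9, 0.1, 0.0],
    "trail shoes": [0.8, 0.2, 0.1],
    "espresso maker": [0.0, 0.9, 0.3],
    "coffee grinder": [0.1, 0.8, 0.4],
}


def most_similar(name, vectors):
    """Return the other item whose vector is nearest by Euclidean distance."""
    target = vectors[name]
    others = (n for n in vectors if n != name)
    return min(others, key=lambda n: math.dist(target, vectors[n]))


recommendation = most_similar("running shoes", product_vectors)
```

With these made-up vectors the two shoe products are each other's nearest neighbours, as are the two coffee products. Cosine similarity is a common alternative to Euclidean distance for this step.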
diff --git a/docs/ai/nlp/readme.md b/docs/ai/nlp/readme.md
@@ -5,3 +5,4 @@
 - [NLTK](nltk)
 - [Chatbot / Chatops](chatbot-chatops)
 - [Chatbot SAAS](ai/nlp/chatbot-saas.md)
+- [Word Embedding to Transformers](ai/nlp/word-embedding-to-transformers.md)