Commit

updated docs
deepaksood619 committed Dec 18, 2023
1 parent a4bd2d5 commit 8fb6384
Showing 30 changed files with 1,019 additions and 40 deletions.
2 changes: 1 addition & 1 deletion docs/ai/courses/customer-analytics-in-python/intro.md
@@ -96,7 +96,7 @@ It is expected that: units sold from a brand would increase if the unit price of

- Linear dependency between variables

df_segmentation.corr()
`df_segmentation.corr()`

- Ranges from -1 to 1
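
As a rough illustration of the two bullets above, here is a minimal sketch of the Pearson coefficient, which is the default method used by pandas `DataFrame.corr()`. The helper name `pearson` and the toy columns are assumptions made for this example; they are not taken from the course data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy columns (illustration only).
age = [25, 32, 47, 51, 62]
income = [30, 42, 60, 65, 80]    # rises with age
discount = [50, 40, 30, 20, 10]  # falls as age rises

r_pos = pearson(age, income)    # strong positive linear dependency
r_neg = pearson(age, discount)  # strong negative linear dependency
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear dependency.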

13 changes: 13 additions & 0 deletions docs/ai/data-science/big-data/data-preprocessing.md
@@ -52,6 +52,15 @@ p and q are the attribute values for two data objects

![image](../../../media/Data-Preprocessing-image2.jpg)

### Types

1. Euclidean Distance
2. Mahalanobis Distance
3. Manhattan Distance
4. Jaccard Similarity
5. Minkowski Distance
6. Cosine Similarity
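
Several of the measures listed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: the function names are chosen for this example, `p` and `q` follow the notation used earlier for two data objects, and Jaccard similarity is shown in its set form. Mahalanobis distance is left out because it also needs a covariance matrix estimated from data.

```python
from math import sqrt


def euclidean(p, q):
    """Straight-line distance between p and q."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def minkowski(p, q, r):
    """Generalization: r=1 gives Manhattan, r=2 gives Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)


def cosine_similarity(p, q):
    """Cosine of the angle between p and q (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = sqrt(sum(a * a for a in p))
    norm_q = sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def jaccard_similarity(s, t):
    """Size of the intersection divided by size of the union of two sets."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)


p, q = [0, 3], [4, 0]
d_euclid = euclidean(p, q)         # 3-4-5 right triangle
d_manhattan = manhattan(p, q)
cos_pq = cosine_similarity(p, q)   # orthogonal vectors
jac = jaccard_similarity("abc", "bcd")
```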

### Euclidean Distance

![image](../../../media/Data-Preprocessing-image3.jpg)
@@ -69,6 +78,10 @@ p and q are the attribute values for two data objects

![image](../../../media/Data-Preprocessing-image5.jpg)

[Cosine Similarity - GeeksforGeeks](https://www.geeksforgeeks.org/cosine-similarity/)

[Cosine similarity: How does it measure the similarity, Maths behind and usage in Python | by Varun | Towards Data Science](https://towardsdatascience.com/cosine-similarity-how-does-it-measure-the-similarity-maths-behind-and-usage-in-python-50ad30aad7db)

### Similarity Between Binary Vectors

![image](../../../media/Data-Preprocessing-image6.jpg)
10 changes: 10 additions & 0 deletions docs/ai/data-science/datasets.md
@@ -18,6 +18,16 @@ The ARC Corpus contains 14M unordered, science-related sentences including knowl

[WikiText-103 Dataset | Papers With Code](https://paperswithcode.com/dataset/wikitext-103)

### BBH - [OpenCompass](https://opencompass.org.cn/dataset-detail/BBH)

A suite of 23 challenging BIG-Bench tasks, called BIG-Bench Hard (BBH). These are the tasks on which prior language model evaluations did not outperform the average human rater.

### BIG-Bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a _collaborative_ benchmark intended to probe large language models and extrapolate their future capabilities.

[GitHub - google/BIG-bench: Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models](https://github.com/google/BIG-bench)

## YCSB Workloads

YCSB includes a set of core workloads that define a basic benchmark for cloud systems.
22 changes: 11 additions & 11 deletions docs/ai/llm/llm-building.md
@@ -15,16 +15,6 @@

[Emerging Architectures for LLM Applications | Andreessen Horowitz](https://a16z.com/2023/06/20/emerging-architectures-for-llm-applications/)

### RAG - retrieval-augmented generation

RAG is an AI framework for retrieving facts from an external knowledge base to ground large language models (LLMs) on the most accurate, up-to-date information and to give users insight into LLMs' generative process.

[Using ChatGPT to Search Enterprise Data with Pamela Fox - YouTube](https://www.youtube.com/watch?v=lj5NjKHuFlo)

[What is retrieval-augmented generation? | IBM Research Blog](https://research.ibm.com/blog/retrieval-augmented-generation-RAG)

[What is Retrieval-Augmented Generation (RAG)? - YouTube](https://youtu.be/T-D1OfcDW1M?si=KoUb-NXATK50d3i7)

[Transformers, explained: Understand the model behind GPT, BERT, and T5 - YouTube](https://youtu.be/SZorAJ4I-sA?si=-GMfzGThDO20aGkB)

- Positional encodings
@@ -69,7 +59,7 @@ RAG is an AI framework for retrieving facts from an external knowledge base to g
- Watermarking & evasion
- Model theft

[[1hr Talk] Intro to Large Language Models - YouTube](https://www.youtube.com/watch?v=zjkBMFhNj_g)
[1hr Talk Intro to Large Language Models - YouTube](https://www.youtube.com/watch?v=zjkBMFhNj_g)

## Dev Tools

@@ -108,6 +98,16 @@ chainlit run document_qa.py

### HuggingFace

#### About

[How to choose a Sentence Transformer from Hugging Face | Weaviate - Vector Database](https://weaviate.io/blog/how-to-choose-a-sentence-transformer-from-hugging-face)

- Blue - the **dataset** it was trained on
- Green - the **language** of the dataset
- White or Purple - **additional details** about the model

#### Models

- [GitHub - huggingface/transformers: 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.](https://github.com/huggingface/transformers)
- [Hugging Face – The AI community building the future.](https://huggingface.co/)

11 changes: 11 additions & 0 deletions docs/ai/llm/rag-retrieval-augmented-generation.md
@@ -0,0 +1,11 @@
# RAG - retrieval-augmented generation

RAG is an AI framework for retrieving facts from an external knowledge base to ground large language models (LLMs) on the most accurate, up-to-date information and to give users insight into LLMs' generative process.
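
The retrieve-then-ground flow described above can be sketched as follows. This is a toy under stated assumptions: the in-memory `DOCS` list, the word-overlap scoring (a stand-in for vector search), and the prompt wording are all hypothetical, and no LLM is actually called.

```python
import re

# Hypothetical in-memory knowledge base; a real system would query a document store.
DOCS = [
    "Amazon EC2 P4d instances are powered by NVIDIA A100 Tensor Core GPUs.",
    "Amazon EC2 G5 instances are powered by NVIDIA A10G Tensor Core GPUs.",
    "Whisper is a general-purpose speech recognition model.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, docs, k=1):
    """Return the k documents sharing the most words with the question."""
    q_tokens = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(q_tokens & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, docs, k=1):
    """Ground the question in retrieved facts before it is sent to a model."""
    context = "\n".join(f"- {d}" for d in retrieve(question, docs, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )


question = "Which GPUs power EC2 G5 instances?"
prompt = build_prompt(question, DOCS)
```

Because the retrieved passages are placed in the prompt, a user can see which facts the answer was grounded on. Production systems typically replace the overlap score with embedding similarity.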

[Using ChatGPT to Search Enterprise Data with Pamela Fox - YouTube](https://www.youtube.com/watch?v=lj5NjKHuFlo)

[What is retrieval-augmented generation? | IBM Research Blog](https://research.ibm.com/blog/retrieval-augmented-generation-RAG)

[What is Retrieval-Augmented Generation (RAG)? - YouTube](https://youtu.be/T-D1OfcDW1M?si=KoUb-NXATK50d3i7)

[**Vector Search RAG Tutorial – Combine Your Data with LLMs with Advanced Search - YouTube**](https://www.youtube.com/watch?v=JEBDfGqrAUA)
11 changes: 10 additions & 1 deletion docs/ai/llm/readme.md
@@ -2,6 +2,7 @@

- [LLM Building](ai/llm/llm-building.md)
- [Design Patterns](ai/llm/design-patterns.md)
- [RAG Retrieval Augmented Generation](ai/llm/rag-retrieval-augmented-generation.md)
- [ChatGPT Prompt Engineering](ai/courses/chatgpt-prompt-eng.md)

MMLU - Massive Multitask Language Understanding
@@ -19,7 +20,8 @@ Moving from information to knowledge age
- [Meet Bard](https://bard.google.com/)
- https://openai.com/blog/chatgpt
- [Godmode](https://godmode.space/)
- [OpenAI Platform](https://platform.openai.com/)
- [**OpenAI Platform**](https://platform.openai.com/)
- [Embeddings - OpenAI API](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings)
- [GPT-4](https://openai.com/research/gpt-4)
- [It’s Time to Pay Attention to A.I. (ChatGPT and Beyond)](https://www.youtube.com/watch?v=0uQqMxXoNVs)
- https://en.wikipedia.org/wiki/GPT-3
@@ -98,6 +100,13 @@ Moving from information to knowledge age
- Amazon EC2 P4d/P4de instances - Powered by NVIDIA A100 Tensor Core GPUs
- Amazon EC2 G5 instances - Powered by NVIDIA A10G Tensor Core GPUs

## Models

- [openai/whisper-large-v3 · Hugging Face](https://huggingface.co/openai/whisper-large-v3)
- Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
- [GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision](https://github.com/openai/whisper)
- [sentence-transformers/all-MiniLM-L6-v2 · Hugging Face](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

## Links

- [ChatGPT Prompt Engineering for Developers](ai/courses/chatgpt-prompt-eng.md)
4 changes: 3 additions & 1 deletion docs/ai/ml-algorithms/embeddings-and-estimators.md
@@ -5,7 +5,7 @@

## Embeddings

## An embedding of a vector is another vector in a smaller dimensional space
An embedding of a vector is another vector in a lower-dimensional space

- Manage sparse data
- Make machine learning models that use sparse data consume less memory and train faster
@@ -15,6 +15,8 @@

https://www.toptal.com/machine-learning/embeddings-in-machine-learning

[Embeddings | Machine Learning | Google for Developers](https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture)

## Summary of Embeddings

![image](../../media/Embeddings-&-Estimators-image1.jpg)
1 change: 1 addition & 0 deletions docs/ai/ml-algorithms/readme.md
@@ -28,5 +28,6 @@
- [Feature Engineering](feature-engineering)
- [Regularization](regularization)
- [Embedding and Estimators](embeddings-and-estimators)
- [Vector Embeddings](ai/ml-algorithms/vector-embeddings.md)
- [Dimensionality Reduction](dimensionality-reduction)
- [Others](ai/ml-algorithms/others.md)
78 changes: 78 additions & 0 deletions docs/ai/ml-algorithms/vector-embeddings.md
@@ -0,0 +1,78 @@
# Vector Embeddings

**Vector embeddings are a way to convert words, sentences, and other data into numbers that capture their meaning and relationships.** They represent different data types as points in a multidimensional space, where similar data points are clustered closer together. These numerical representations help machines understand and process this data more effectively.
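
The idea that similar items sit closer together can be sketched with a handful of hand-picked vectors. The 3-dimensional values below are hypothetical and chosen only for illustration; real embeddings are learned by a model and usually have hundreds of dimensions.

```python
from math import sqrt

# Hypothetical 3-dimensional embeddings, hand-picked for illustration.
EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
    "truck": [0.2, 0.1, 0.8],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def nearest(word, embeddings):
    """Return the other word whose vector is most similar to `word`'s vector."""
    target = embeddings[word]
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(target, embeddings[w]))


neighbor_of_cat = nearest("cat", EMBEDDINGS)
neighbor_of_car = nearest("car", EMBEDDINGS)
```

With these toy values the animals pair up with each other and the vehicles pair up with each other, which is the clustering behavior the definition describes.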

[Word](https://www.elastic.co/what-is/word-embedding) and sentence embeddings are two of the most common subtypes of vector embeddings, but there are others. Some represent entire documents; others include image vectors that match up visual content, user profile vectors that capture a user’s preferences, and product vectors that help identify similar products. Vector embeddings help [machine learning](https://www.elastic.co/what-is/machine-learning) algorithms find patterns in data and perform tasks such as [sentiment analysis](https://www.elastic.co/what-is/sentiment-analysis), language translation, and recommendation.

![vector-embeddings](../../media/Pasted%20image%2020231216192551.png)

### Types of vector embeddings

#### [Word embeddings](https://www.elastic.co/what-is/word-embedding)

Represent individual words as vectors. Techniques like Word2Vec, GloVe, and FastText learn word embeddings by capturing semantic relationships and contextual information from large text corpora.

#### Sentence embeddings

Represent entire sentences as vectors. Models like Universal Sentence Encoder (USE) and SkipThought generate embeddings that capture the overall meaning and context of the sentences.

#### Document embeddings

Represent documents (anything from newspaper articles and academic papers to books) as vectors. They capture the semantic information and context of the entire document. Techniques like Doc2Vec, an implementation of the Paragraph Vector model, are designed to learn document embeddings.

#### Image embeddings

Represent images as vectors by capturing different visual features. Techniques like convolutional neural networks (CNNs) and pre-trained models like ResNet and VGG generate image embeddings for tasks like image classification, object detection, and image similarity.

#### User embeddings

Represent users in a system or platform as vectors. They capture user preferences, [behaviors](https://www.elastic.co/what-is/user-behavior-analytics), and characteristics. User embeddings can be used in everything from recommendation systems to personalized marketing as well as user segmentation.

#### Product embeddings

Represent products in ecommerce or recommendation systems as vectors. They capture a product’s attributes, features, and any other semantic information available. Algorithms can then use these embeddings to compare, recommend, and analyze products based on their vector representations.

### Are embeddings and vectors the same thing?

In the context of vector embeddings, yes, embeddings and vectors are the same thing. Both refer to numerical representations of data, where each data point is represented by a vector in a high-dimensional space.

### Use Cases

1. Recommendation systems (e.g. Netflix-style if-you-like-these-movies-you’ll-like-this-one-too)
2. All kinds of search
    1. Text search (like Google Search)
    2. Image search (like Google Reverse Image Search)
3. Chatbots and question-answering systems
4. Data preprocessing (preparing data to be fed into a machine learning model)
5. One-shot/zero-shot learning (i.e. machine learning models that learn from almost no training data)
6. Fraud detection/outlier detection
7. Typo detection and all manner of “fuzzy matching”
8. Detecting when ML models go stale (drift)
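
To make the typo-detection use case concrete, here is a minimal sketch that matches a misspelled word against a small vocabulary. It uses hand-built character-bigram count vectors as a crude stand-in for learned embeddings; the function names and the `VOCAB` list are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def bigram_vector(word):
    """Count character bigrams, e.g. 'cat' -> {'ca': 1, 'at': 1}."""
    w = word.lower()
    return Counter(w[i:i + 2] for i in range(len(w) - 1))


def cosine_similarity(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u)  # Counter returns 0 for missing keys
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def closest(word, vocabulary):
    """Return the vocabulary entry whose bigram vector is most similar to `word`."""
    vec = bigram_vector(word)
    return max(vocabulary, key=lambda w: cosine_similarity(vec, bigram_vector(w)))


# Hypothetical vocabulary for illustration.
VOCAB = ["embedding", "estimator", "regression"]
suggestion = closest("embeding", VOCAB)
```

A learned embedding model would capture meaning as well as spelling, but the matching step (embed, compare, take the nearest) keeps the same shape.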

[What are vector embeddings? | A Comprehensive Vector Embeddings Guide | Elastic](https://www.elastic.co/what-is/vector-embedding)

[Meet AI’s multitool: Vector embeddings | Google Cloud Blog](https://cloud.google.com/blog/topics/developers-practitioners/meet-ais-multitool-vector-embeddings)

[What are Vector Embeddings | Pinecone](https://www.pinecone.io/learn/vector-embeddings/)

## Text Embeddings / Transformers

[GitHub - SeanLee97/AnglE: Angle-optimized Text Embeddings | 🔥 SOTA on STS and MTEB Leaderboard](https://github.com/SeanLee97/AnglE)

[MTEB Leaderboard - a Hugging Face Space by mteb](https://huggingface.co/spaces/mteb/leaderboard)

[OpenAI Platform](https://platform.openai.com/tokenizer)

[sentence-transformers/all-MiniLM-L6-v2 · Hugging Face](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

[Pretrained Models — Sentence-Transformers documentation](https://www.sbert.net/docs/pretrained_models.html)

[sentence-transformers (Sentence Transformers)](https://huggingface.co/sentence-transformers)

## Links

[word-embedding-to-transformers](ai/nlp/word-embedding-to-transformers.md)

[**Vector Embeddings Tutorial – Code Your Own AI Assistant with GPT-4 API + LangChain + NLP - YouTube**](https://www.youtube.com/watch?v=yfHHvmaMkcA&ab_channel=freeCodeCamp.org)

[$0 Embeddings (OpenAI vs. free & open source) - YouTube](https://www.youtube.com/watch?v=QdDoFfkVkcw&ab_channel=RabbitHoleSyndrome)
1 change: 1 addition & 0 deletions docs/ai/nlp/readme.md
@@ -5,3 +5,4 @@
- [NLTK](nltk)
- [Chatbot / Chatops](chatbot-chatops)
- [Chatbot SAAS](ai/nlp/chatbot-saas.md)
- [Word Embedding to Transformers](ai/nlp/word-embedding-to-transformers.md)