updated docs
deepaksood619 committed Dec 6, 2023
1 parent 2b000d5 commit e2b3eac
Showing 74 changed files with 374 additions and 466 deletions.
4 changes: 2 additions & 2 deletions docker-compose.yaml
@@ -45,5 +45,5 @@ services:
# npm ci
# npm start
volumes:
- .:/app/
# volumes:
# - .:/app/
3 changes: 2 additions & 1 deletion docs/ai/data-science/big-data/processing-engine.md
@@ -1,6 +1,6 @@
# Processing Engine

A processing engine, sometimes called a processing framework, is responsible for performing data processing tasks (an illuminating explanation, I know). A comparison is probably the best way to understand this. Apache Hadoop is an open source software platform that also deals with "Big Data" and distributed computing. Hadoop has a processing engine, distinct from Spark, called MapReduce. MapReduce has its own particular way of optimizing tasks to be processed on multiple nodes and Spark has a different way. One of Sparks strengths is that it is a processing engine that can be used on its own, or used in place of Hadoop MapReduce, taking advantage of the other features of Hadoop.
A processing engine, sometimes called a processing framework, is responsible for performing data processing tasks. A comparison is probably the best way to understand this. Apache Hadoop is an open source software platform that also deals with "Big Data" and distributed computing. Hadoop has a processing engine, distinct from Spark, called MapReduce. MapReduce has its own particular way of optimizing tasks to be processed on multiple nodes, and Spark has a different way. One of Spark's strengths is that it is a processing engine that can be used on its own, or used in place of Hadoop MapReduce, taking advantage of the other features of Hadoop.

Processing frameworks compute over the data in the system, either by reading from non-volatile storage or as it is ingested into the system. Computing over data is the process of extracting information and insight from large quantities of individual data points.

@@ -9,6 +9,7 @@ Processing frameworks compute over the data in the system, either by reading fro
- Stream-only frameworks:
- [Apache Storm](https://www.digitalocean.com/community/tutorials/hadoop-storm-samza-spark-and-flink-big-data-frameworks-compared#apache-storm)
- [Apache Samza](https://www.digitalocean.com/community/tutorials/hadoop-storm-samza-spark-and-flink-big-data-frameworks-compared#apache-samza)
- [Keystone Real-time Stream Processing Platform | by Netflix Technology Blog | Netflix TechBlog](https://netflixtechblog.com/keystone-real-time-stream-processing-platform-a3ee651812a)
- Hybrid frameworks:
- [Apache Spark](https://www.digitalocean.com/community/tutorials/hadoop-storm-samza-spark-and-flink-big-data-frameworks-compared#apache-spark)
- [Apache Flink](https://www.digitalocean.com/community/tutorials/hadoop-storm-samza-spark-and-flink-big-data-frameworks-compared#apache-flink)
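
For a concrete sense of Spark as a processing engine, here is a minimal PySpark sketch; it assumes `pyspark` is installed and uses a hypothetical input file `data/events.txt`:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Read a hypothetical text file; each line becomes a row with a "value" column.
lines = spark.read.text("data/events.txt")

# Split lines into words and count occurrences; Spark plans and distributes
# this computation itself, independently of Hadoop MapReduce.
counts = (
    lines.select(explode(split(lines.value, " ")).alias("word"))
    .groupBy("word")
    .count()
)

counts.show(10)
spark.stop()
```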
10 changes: 5 additions & 5 deletions docs/ai/ml-algorithms/association-rules-and-apriori-algorithm.md
@@ -10,21 +10,21 @@

- **Itemset**
- A collection of one or more items
- Example: {Milk, Bread, Diaper}
- Example: `{Milk, Bread, Diaper}`
- k-itemset
- An itemset that contains k items
- **Support count (σ)**
- Frequency of occurrence of an itemset
- E.g. σ({Milk, Break, Diaper}) = 2
- E.g. `σ({Milk, Bread, Diaper}) = 2`
- **Support**
- Fraction of transactions that contain an itemset
- E.g. s({Milk, Bread, Diaper}) = 2/5
- E.g. `s({Milk, Bread, Diaper}) = 2/5`
- **Frequent Itemset**
- An itemset whose support is greater than or equal to a *minsup* threshold
- **Association Rule**
- An implication expression of the form X->Y, where X and Y are itemsets
- An implication expression of the form `X->Y`, where X and Y are itemsets
- Example
- {Milk, Diaper} -> {Beer}
- `{Milk, Diaper} -> {Beer}`
- **Rule Evaluation Metrics**
- Support (s)
- Fraction of transactions that contain both X and Y
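
A small Python sketch of the support and confidence definitions above; the five transactions are made up so that the counts quoted above (σ = 2, s = 2/5) hold:

```python
# Hypothetical market-basket transactions, chosen so that
# σ({Milk, Bread, Diaper}) = 2 and s({Milk, Bread, Diaper}) = 2/5.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """σ(X): number of transactions containing every item of X."""
    return sum(itemset <= t for t in transactions)

def support(itemset):
    """s(X): fraction of transactions containing X."""
    return support_count(itemset) / len(transactions)

def confidence(lhs, rhs):
    """c(X -> Y) = σ(X ∪ Y) / σ(X)."""
    return support_count(lhs | rhs) / support_count(lhs)

print(support_count({"Milk", "Bread", "Diaper"}))  # 2
print(support({"Milk", "Bread", "Diaper"}))        # 0.4  (= 2/5)
print(confidence({"Milk", "Diaper"}, {"Beer"}))    # 0.666...
```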
8 changes: 4 additions & 4 deletions docs/ai/ml-algorithms/predictive-analytics-1.md
@@ -10,7 +10,7 @@
## Bootstrap

- Bootstrapping is an algorithm which produces replicas of a data set by doing random sampling with replacement. This idea is essential for the random forest algorithm
- Consider a dataset Z={(x1, y1),...,(xn,yn)}
- Consider a dataset `Z={(x1, y1),...,(xn,yn)}`
- Bootstrapped dataset Z* - It is a modification of the original dataset Z, produced by random sampling with replacement.
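
A minimal numpy sketch of producing one bootstrapped replica Z* from Z by sampling with replacement (toy data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Original dataset Z = {(x1, y1), ..., (xn, yn)} — toy data for illustration.
X = rng.normal(size=(10, 3))
y = rng.integers(0, 2, size=10)

# Bootstrapped replica Z*: draw n row indices uniformly WITH replacement.
idx = rng.integers(0, len(X), size=len(X))
X_boot, y_boot = X[idx], y[idx]

# Some rows of Z appear several times in Z*, others not at all.
print(np.unique(idx).size, "of", len(X), "original rows appear in Z*")
```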

## Sampling with Replacement
@@ -55,7 +55,7 @@

## Why does bagging work?

- Model f(x) has higher predictive power than any single f^xb^(x), b=1,...,B
- Model f(x) has higher predictive power than any single `f^*b^(x), b = 1,...,B`
- In most situations, with any machine learning method at the core, the quality of such aggregated predictions will be better than that of any single prediction
- The phenomenon is based on a very general principle called the bias-variance trade-off. You can consider the training data set itself to be random.
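
As a hedged illustration of why aggregation helps, the scikit-learn sketch below compares a single decision tree with a bagged ensemble of trees on a synthetic dataset (a stand-in example, not the course's own):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; the point is only the comparison, not the exact numbers.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```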

@@ -85,12 +85,12 @@
## How to grow a random forest decision tree

- The tree is built **greedily** from top to bottom
- Select m <= p of the input variables at random as candidates for splitting
- Select `m <= p` of the input variables at random as candidates for splitting
- Each split is selected to maximize information gain (IG)

![image](../../media/Predictive-Analytics-1-image10.jpg)

- Select m <= p of the input variables at random as candidates for splitting
- Select `m <= p` of the input variables at random as candidates for splitting
- Recommendations from the inventors of Random Forests
- `m = sqrt(p)` for classification, minInstancePerNode = 1
- `m = p/3` for regression, minInstancePerNode = 5
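
These recommendations map onto common library parameters; a hedged scikit-learn sketch (parameter names are scikit-learn's, and `min_samples_leaf` is used as a rough analogue of minInstancePerNode):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

# m = sqrt(p) candidate features per split; at least 1 instance per leaf node.
clf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # m = sqrt(p) for classification
    min_samples_leaf=1,    # rough analogue of minInstancePerNode = 1
    random_state=0,
).fit(X, y)

print(clf.score(X, y))

# For regression, RandomForestRegressor(max_features=1/3, min_samples_leaf=5)
# would mirror the second recommendation (m = p/3, minInstancePerNode = 5).
```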
16 changes: 8 additions & 8 deletions docs/ai/ml-algorithms/rule-generation-and-pattern-evaluation.md
@@ -13,31 +13,31 @@ In data mining, what would be a monotonic function would be the support function
- How to efficiently generate rules from frequent itemsets?
- In general, confidence does not have an anti-monotone property
- But confidence of rules generated from the same itemset has an anti-monotone property
- e.g., L = {A,B,C,D}
- e.g., `L = {A,B,C,D}`

c (ABC -> D) >= c(AB -> CD) >= c(A -> BCD)
`c(ABC -> D) >= c(AB -> CD) >= c(A -> BCD)`

- Confidence is anti-monotone w.r.t. number of items on the RHS of the rule

Confidence(X -> Y) - Measures how often transactions Y apper in transactions that contain X
Confidence `c(X -> Y)` - Measures how often items in Y appear in transactions that contain X

c(ABC -> D) = |ABCD| / |ABC|
`c(ABC -> D) = |ABCD| / |ABC|`

Therefore

|ABCD| / |ABC| >= |ABCD| / |AB| >= |ABCD| / |A|
`|ABCD| / |ABC| >= |ABCD| / |AB| >= |ABCD| / |A|`

Since,

|ABC| <= |AB| <= |A|
`|ABC| <= |AB| <= |A|`
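
A small Python check of this anti-monotone chain on made-up transactions over items A, B, C, D (data chosen only for illustration):

```python
# Hypothetical transactions used only to check
# c(ABC -> D) >= c(AB -> CD) >= c(A -> BCD).
transactions = [
    {"A", "B", "C", "D"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A"},
    {"A", "B", "C", "D"},
]

def count(items):
    """Number of transactions containing every item of `items`."""
    return sum(set(items) <= t for t in transactions)

def confidence(lhs, rhs):
    return count(lhs + rhs) / count(lhs)

print(confidence("ABC", "D"))   # |ABCD| / |ABC| = 2/3
print(confidence("AB", "CD"))   # |ABCD| / |AB|  = 2/4
print(confidence("A", "BCD"))   # |ABCD| / |A|   = 2/5
```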

## Rule Generation for Apriori Algorithm

![image](../../media/Rule-generation-&-Pattern-Evaluation-image1.jpg)

- A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
- join (CD => AB, BD => AC) would produce the candidate rule D => ABC
- Prune rule D => ABC if its subset AB => BC does not have high confidence
- `join (CD => AB, BD => AC) would produce the candidate rule D => ABC`
- `Prune rule D => ABC if its subset AD => BC does not have high confidence`

![image](../../media/Rule-generation-&-Pattern-Evaluation-image2.jpg)
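
A hedged sketch of the join step described above: two rules over the same frequent itemset whose consequents share a prefix are merged into a candidate rule with a larger consequent (the helper `merge_rules` is invented for illustration):

```python
# Illustrative only: join(CD => AB, BD => AC) should produce D => ABC.
def merge_rules(rule1, rule2):
    """Each rule is (antecedent, consequent) as frozensets over one itemset."""
    ant1, con1 = rule1
    ant2, con2 = rule2
    new_consequent = con1 | con2
    new_antecedent = (ant1 | ant2) - new_consequent
    return new_antecedent, new_consequent

r1 = (frozenset("CD"), frozenset("AB"))
r2 = (frozenset("BD"), frozenset("AC"))
print(merge_rules(r1, r2))  # (frozenset({'D'}), frozenset({'A', 'B', 'C'}))
```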

2 changes: 1 addition & 1 deletion docs/ai/move-37/2-dynamic-programming.md
@@ -49,7 +49,7 @@ In my experience, *policy iteration* is faster than *value iteration*, as a policy

Why Discount Factor?

The idea of using discount factor is to prevent the total reward from going to infinity (because 0 <= γ <= 1). It also models the agent behavior when the agent prefers immediate rewards than rewards that are potentially received later in the future.
The idea of using discount factor is to prevent the total reward from going to infinity (because `0 <= γ <= 1`). It also models the agent behavior when the agent prefers immediate rewards than rewards that are potentially received later in the future.
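
A small numeric sketch of a discounted return with made-up rewards, showing how `gamma < 1` keeps the total bounded:

```python
# Discounted return G = sum_t gamma^t * r_t for a made-up reward sequence.
gamma = 0.9
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]

G = sum(gamma**t * r for t, r in enumerate(rewards))
print(G)  # 4.0951 — stays bounded as the horizon grows, since gamma < 1

# With an infinite stream of reward 1, the return converges to 1 / (1 - gamma).
print(1 / (1 - gamma))  # 10.0
```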

![image](../../media/2.-Dynamic-Programming-image4.jpg)

2 changes: 1 addition & 1 deletion docs/ai/numpy/data-types.md
@@ -37,7 +37,7 @@ A data type object describes interpretation of fixed block of memory correspondi
- In case of structured type, the names of fields, data type of each field and part of the memory block taken by each field.
- If data type is a subarray, its shape and data type

The byte order is decided by prefixing '<' or '>' to data type. '<' means that encoding is little-endian (least significant is stored in smallest address). '>' means that encoding is big-endian (most significant byte is stored in smallest address).
The byte order is decided by prefixing `<` or `>` to data type. `<` means that encoding is little-endian (least significant is stored in smallest address). `>` means that encoding is big-endian (most significant byte is stored in smallest address).
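
A short numpy sketch of the byte-order prefix:

```python
import numpy as np

# '<i4' = little-endian 32-bit int, '>i4' = big-endian 32-bit int.
little = np.array([1], dtype="<i4")
big = np.array([1], dtype=">i4")

print(little.tobytes())  # b'\x01\x00\x00\x00' — least significant byte first
print(big.tobytes())     # b'\x00\x00\x00\x01' — most significant byte first
```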

A dtype object is constructed using the following syntax

@@ -6,7 +6,7 @@ To solve it, we can try to apply a modification of the Self-Organizing Map (SOM)

## Arora PTAS for Euclidean TSP

The Travelling Salesman Problem (TSP) is one of the most famous problems in Computer Science, but it turns out to be NP-Hard. It's even NP-hard to approximate it to any polynomial factor in the general case! Thankfully we can do a constant (around 1.5) approximation when the distances for which we are solving the problem come from a metric. While exactly what this constant is remains open, we know we cannot have a PTAS in the general case. What's a PTAS? It's a Polynomial Time Approximation Scheme. The idea is that you give me an epsilon, I will give you a (1+epsilon) approximation algorithm whose runtime depends on epsilon but is polynomial in n. So we have runtimes like poly(n) 2^{1/epsilon} and others.
The Travelling Salesman Problem (TSP) is one of the most famous problems in Computer Science, but it turns out to be NP-Hard. It's even NP-hard to approximate it to any polynomial factor in the general case! Thankfully we can do a constant (around 1.5) approximation when the distances for which we are solving the problem come from a metric. While exactly what this constant is remains open, we know we cannot have a PTAS in the general case. What's a PTAS? It's a Polynomial Time Approximation Scheme. The idea is that you give me an epsilon, I will give you a (1+epsilon) approximation algorithm whose runtime depends on epsilon but is polynomial in n. So we have runtimes like poly(n) `2^{1/epsilon}` and others.

Sanjeev Arora discovered a PTAS for TSP when the distances come from a Euclidean space a couple of decades ago. This is very good news for Uber and the like, since their distances usually come from the plane! The idea is not hard to understand, and I plan to make the talk accessible to anyone who is comfortable with Dynamic Programming.
