Commit: updated docs

deepaksood619 committed Feb 26, 2024
1 parent 1f55750 commit 0f9c1e8
Showing 57 changed files with 713 additions and 580 deletions.
@@ -23,7 +23,7 @@ In this context, the speed at which the data is generated and processed to meet
- Viscosity - data velocity relative to timescale of event being studied
- Volatility - rate of data loss and stable lifetime of data

-![image](../../../media/Big-Data-image1.jpg)
+![image](../../media/Big-Data-image1.jpg)

### Veracity

@@ -25,7 +25,7 @@

### Discretization

-![image](../../../media/Data-Preprocessing-image1.jpg)
+![image](../../media/Data-Preprocessing-image1.jpg)

### Attribute Transformation

@@ -50,7 +50,7 @@

p and q are the attribute values for two data objects

-![image](../../../media/Data-Preprocessing-image2.jpg)
+![image](../../media/Data-Preprocessing-image2.jpg)

### Types

@@ -63,39 +63,39 @@ p and q are the attribute values for two data objects

### Euclidean Distance

-![image](../../../media/Data-Preprocessing-image3.jpg)
+![image](../../media/Data-Preprocessing-image3.jpg)

- Where n is the number of dimensions (attributes) and p~k~ and q~k~ are, respectively, the k^th^ attributes (components) of data objects p and q.
- Standardization is necessary if the attribute scales differ
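A minimal NumPy sketch of the Euclidean distance above, with optional z-score standardization; the sample height/weight values are made up for illustration:

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance between two n-dimensional points p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((p - q) ** 2)))

def standardize(X):
    """Z-score each column (attribute) of a data matrix."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[170.0, 65.0], [160.0, 80.0], [180.0, 70.0]])  # e.g. height (cm), weight (kg)
Z = standardize(X)
print(euclidean(X[0], X[1]))  # distance on raw scales
print(euclidean(Z[0], Z[1]))  # distance after standardization
```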

### Mahalanobis Distance

-![image](../../../media/Data-Preprocessing-image4.jpg)
+![image](../../media/Data-Preprocessing-image4.jpg)

- For the red points, the Euclidean distance is 14.7 while the Mahalanobis distance is 6
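The 14.7 vs. 6 values come from the dataset shown in the image, which is not reproduced here; the sketch below only illustrates the general formula, d(p, q) = sqrt((p - q)^T S^-1 (p - q)) where S is the covariance matrix of the data, on made-up correlated data:

```python
import numpy as np

def mahalanobis(p, q, data):
    """Mahalanobis distance between p and q, using the covariance of `data`."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    cov = np.cov(np.asarray(data, dtype=float), rowvar=False)  # attribute covariance matrix
    diff = p - q
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Made-up correlated 2-D data for illustration
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[3.0, 2.0], [2.0, 3.0]], size=500)
print(mahalanobis([2.0, 2.0], [-2.0, -2.0], data))  # along the correlated direction: smaller
print(mahalanobis([2.0, -2.0], [-2.0, 2.0], data))  # across it: larger Mahalanobis distance
```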

### Cosine Similarity

-![image](../../../media/Data-Preprocessing-image5.jpg)
+![image](../../media/Data-Preprocessing-image5.jpg)

[Cosine Similarity - GeeksforGeeks](https://www.geeksforgeeks.org/cosine-similarity/)

[Cosine similarity: How does it measure the similarity, Maths behind and usage in Python | by Varun | Towards Data Science](https://towardsdatascience.com/cosine-similarity-how-does-it-measure-the-similarity-maths-behind-and-usage-in-python-50ad30aad7db)
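A small sketch of cosine similarity on two hypothetical document term-frequency vectors (the counts are illustrative, not taken from the image above):

```python
import numpy as np

def cosine_similarity(p, q):
    """cos(p, q) = (p . q) / (||p|| * ||q||); 1 = same direction, 0 = orthogonal."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

# Two made-up document term-frequency vectors
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(cosine_similarity(d1, d2))  # ≈ 0.31
```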

### Similarity Between Binary Vectors

-![image](../../../media/Data-Preprocessing-image6.jpg)
+![image](../../media/Data-Preprocessing-image6.jpg)

## Correlation

- Correlation measures the linear relationship between objects
- To compute correlation, we standardize data objects, p and q, and then take their dot product

-![image](../../../media/Data-Preprocessing-image7.jpg)
+![image](../../media/Data-Preprocessing-image7.jpg)
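A brief sketch of that recipe, standardize then multiply, which reproduces the Pearson correlation (the sample vectors are made up):

```python
import numpy as np

def correlation(p, q):
    """Standardize p and q, then average the elementwise product (their dot product / n)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    ps = (p - p.mean()) / p.std()
    qs = (q - q.mean()) / q.std()
    return float(np.mean(ps * qs))

p = [1, 2, 3, 4, 5]
q = [2, 4, 5, 4, 7]
print(correlation(p, q))        # ≈ 0.87
print(np.corrcoef(p, q)[0, 1])  # same value from NumPy's built-in
```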

### Visually Evaluating Correlation

-![image](../../../media/Data-Preprocessing-image8.jpg)
+![image](../../media/Data-Preprocessing-image8.jpg)

- Scatter plots showing the similarity from -1 to 1

File renamed without changes.
@@ -9,7 +9,7 @@
- A collection of attributes describes an object
- An object is also known as a record, point, case, sample, entity, or instance

-![image](../../../media/Data-image1.jpg)
+![image](../../media/Data-image1.jpg)

## Types of Attributes

@@ -34,7 +34,7 @@ The type of an attribute depends on which of the following properties it possess
- Interval attribute: distinctness, order & addition
- Ratio attribute: all 4 properties

-![image](../../../media/Data-image2.jpg)
+![image](../../media/Data-image2.jpg)

## Discrete and Continuous Attributes

@@ -1,19 +1,12 @@
# Design of HBase

1. What is HBase

2. HBase Architecture

3. HBase Components

4. Data model

5. HBase Storage Hierarchy

6. Cross-Datacenter Replication

7. Auto Sharding and Distribution

8. Bloom Filter and Fold, Store, and Shift

## HBase is
@@ -32,7 +25,7 @@
- Stores saved as files on HDFS
- HBase utilizes ZooKeeper for distributed coordination

-![image](../../../media/Big-Data_Design-of-HBase-image1.jpg)
+![image](../../media/Big-Data_Design-of-HBase-image1.jpg)

## HBase components

@@ -81,7 +74,7 @@

## HBase Architecture

-![image](../../../media/Big-Data_Design-of-HBase-image2.jpg)
+![image](../../media/Big-Data_Design-of-HBase-image2.jpg)

## Auto Sharding and Distribution

@@ -96,6 +89,6 @@

- Bloom filters are generated when an HFile is persisted

-![image](../../../media/Big-Data_Design-of-HBase-image3.jpg)
+![image](../../media/Big-Data_Design-of-HBase-image3.jpg)

-![image](../../../media/Big-Data_Design-of-HBase-image4.jpg)
+![image](../../media/Big-Data_Design-of-HBase-image4.jpg)
@@ -28,7 +28,7 @@
- Queried using **SQL (Structured Query Language)**
- Supports joins

-![image](../../../media/Big-Data_Design-of-Key-Value-Stores-image1.jpg)
+![image](../../media/Big-Data_Design-of-Key-Value-Stores-image1.jpg)

## Mismatch with today's workloads

@@ -61,7 +61,7 @@
- Don't always support joins or have foreign keys
- Can have index tables, just like RDBMSs

-![image](../../../media/Big-Data_Design-of-Key-Value-Stores-image2.jpg)
+![image](../../media/Big-Data_Design-of-Key-Value-Stores-image2.jpg)

## Column-Oriented Storage

@@ -9,24 +9,24 @@ Zookeeper - Service for coordinating processes of distributed applications

## Classic Distributed System

-![image](../../../media/Big-Data_Design-of-Zookeeper-image1.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image1.jpg)

- Most systems like HDFS have one master and a couple of slave nodes, and these slave nodes report to the master.

## Fault Tolerant Distributed System

-![image](../../../media/Big-Data_Design-of-Zookeeper-image2.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image2.jpg)

- A real fault-tolerant distributed system has a coordination service, a master, and a backup master
- If the primary fails, the backup takes over

-![image](../../../media/Big-Data_Design-of-Zookeeper-image3.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image3.jpg)

-![image](../../../media/Big-Data_Design-of-Zookeeper-image4.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image4.jpg)

-![image](../../../media/Big-Data_Design-of-Zookeeper-image5.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image5.jpg)

-![image](../../../media/Big-Data_Design-of-Zookeeper-image6.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image6.jpg)

## What is Coordination?

@@ -80,7 +80,7 @@ Zookeeper - Service for coordinating processes of distributed applications
- **Update responses are sent when a majority of servers have persisted the change**
- **We need 2f+1 machines to tolerate f failures**

-![image](../../../media/Big-Data_Design-of-Zookeeper-image7.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image7.jpg)
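For example, with f = 2 the ensemble needs 2f + 1 = 5 servers: after 2 failures the remaining 3 still form a majority and can acknowledge updates, and because any two majorities of 5 overlap in at least one server, two conflicting updates can never both be committed.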

3. **Ordered**
- **Zookeeper stamps each update with a number**
@@ -106,7 +106,7 @@ Zookeeper - Service for coordinating processes of distributed applications
- duck is a child znode of zoo, denoted as /zoo/duck
- Unlike in a file system, "." and ".." are invalid characters in znode names

-![image](../../../media/Big-Data_Design-of-Zookeeper-image8.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image8.jpg)

## Data model - Znode - Types

@@ -118,7 +118,7 @@ Such kind of znodes remain in zookeeper until deleted. This is the default type
- An ephemeral node is deleted when the session in which it was created disconnects. Although it is tied to the client's session, it is visible to all other clients.
- An ephemeral node cannot have children, not even ephemeral children

-![image](../../../media/Big-Data_Design-of-Zookeeper-image9.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image9.jpg)
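A short client-side sketch of the two znode types using the third-party kazoo library (assuming a ZooKeeper server on 127.0.0.1:2181; /zoo/duck comes from the example above, while /zoo/goat is made up):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Persistent znode: remains until explicitly deleted (the default type)
zk.ensure_path("/zoo")
zk.create("/zoo/duck", b"quack")

# Ephemeral znode: removed automatically when this client's session ends,
# but visible to all other clients while the session is alive
zk.create("/zoo/goat", b"temporary", ephemeral=True)

print(zk.get_children("/zoo"))  # ['duck', 'goat']
zk.stop()
```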

## Architecture

@@ -142,7 +142,7 @@ Phase 1: Leader election (Paxos Algorithm)
- If the leader fails, the remaining machines hold an election, which takes 200 ms
- If a majority of the machines is not available at any point in time, the leader automatically steps down

-![image](../../../media/Big-Data_Design-of-Zookeeper-image10.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image10.jpg)

## Architecture: Phase 2

@@ -156,7 +156,7 @@ Phase 1: Leader election (Paxos Algorithm)
- The protocol for achieving consensus is atomic, like two-phase commit
- Machines write updates to disk before applying them in memory

-![image](../../../media/Big-Data_Design-of-Zookeeper-image11.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image11.jpg)

## Election in Zookeeper

@@ -168,15 +168,15 @@ Phase 1: Leader election (Paxos Algorithm)
- Gets the highest id so far (from the ZK file system), creates the next-higher id, and writes it into the ZK file system
- Elect the highest-id server as leader

-![image](../../../media/Big-Data_Design-of-Zookeeper-image12.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image12.jpg)

- Failures:
- One option: everyone monitors current master (directly or via a failure detector)
- On failure, initiate election
- Leads to a flood of elections
- Too many messages

-![image](../../../media/Big-Data_Design-of-Zookeeper-image13.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image13.jpg)

- Second option: (implemented in Zookeeper)
- Each process monitors its next-higher id process
@@ -185,7 +185,7 @@ Phase 1: Leader election (Paxos Algorithm)
- **else**
- wait for a timeout, and check your successor again

-![image](../../../media/Big-Data_Design-of-Zookeeper-image14.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image14.jpg)

- What about id conflicts? What if leader fails during election?
- To address this, Zookeeper uses a two-phase commit (run after the sequence/id) protocol to commit the leader
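A client-side sketch of leader election built on ephemeral sequential znodes, here via kazoo's Election recipe (assuming a local ZooKeeper; the path and identifier are made up, and this is one possible client implementation rather than Zookeeper's internal ZAB election):

```python
from kazoo.client import KazooClient

def lead():
    # Runs only while this process is the elected leader
    print("I am the leader now")

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# The Election recipe creates ephemeral sequential znodes under the given path,
# so a crashed leader's znode disappears and a re-election happens automatically.
election = zk.Election("/app/election", identifier="worker-1")
election.run(lead)  # blocks until this client wins the election, then calls lead()
```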
@@ -199,26 +199,26 @@ Phase 1: Leader election (Paxos Algorithm)

- If you have three nodes A, B, and C with A as the leader, and A dies, will someone become leader?

-![image](../../../media/Big-Data_Design-of-Zookeeper-image15.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image15.jpg)

-![image](../../../media/Big-Data_Design-of-Zookeeper-image16.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image16.jpg)

- If you have three nodes A, B, and C, and both A and B die, will C become the leader?

-![image](../../../media/Big-Data_Design-of-Zookeeper-image17.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image17.jpg)

-![image](../../../media/Big-Data_Design-of-Zookeeper-image18.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image18.jpg)

## Why do we need majority?

- Imagine: We have an ensemble spread over two data centres.

-![image](../../../media/Big-Data_Design-of-Zookeeper-image19.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image19.jpg)

- Imagine: the network between the data centres gets disconnected, and we did not need a majority for electing a leader
- What would happen?

-![image](../../../media/Big-Data_Design-of-Zookeeper-image20.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image20.jpg)

- Each data centre would elect its own leader, resulting in no consistency and utter chaos. That is why a majority is required.

@@ -237,19 +237,19 @@ Phase 1: Leader election (Paxos Algorithm)
- Failover is handled automatically by the client
- The application cannot remain completely agnostic of server reconnections, because operations fail during the disconnection

-![image](../../../media/Big-Data_Design-of-Zookeeper-image21.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image21.jpg)

-![image](../../../media/Big-Data_Design-of-Zookeeper-image22.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image22.jpg)

-![image](../../../media/Big-Data_Design-of-Zookeeper-image23.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image23.jpg)

-![image](../../../media/Big-Data_Design-of-Zookeeper-image24.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image24.jpg)

-![image](../../../media/Big-Data_Design-of-Zookeeper-image25.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image25.jpg)

-![image](../../../media/Big-Data_Design-of-Zookeeper-image26.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image26.jpg)

-![image](../../../media/Big-Data_Design-of-Zookeeper-image27.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image27.jpg)

## Multi Update

@@ -258,15 +258,15 @@ Phase 1: Leader election (Paxos Algorithm)
- Possible to implement transactions
- Others never observe any inconsistent state

-![image](../../../media/Big-Data_Design-of-Zookeeper-image28.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image28.jpg)
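A minimal sketch of a multi-update using kazoo's transaction API (paths and values are made up): either both set_data operations are applied or neither is, so other clients never observe a half-applied state:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/accounts/a")
zk.ensure_path("/accounts/b")

# Both updates are submitted as one atomic multi-update
tx = zk.transaction()
tx.set_data("/accounts/a", b"90")
tx.set_data("/accounts/b", b"110")
results = tx.commit()  # list of per-operation results; all succeed or all fail
print(results)
zk.stop()
```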

## Watches

- Clients get notifications when a znode changes in some way
- Watches are triggered only once
- For multiple notifications, re-register the watch

-![image](../../../media/Big-Data_Design-of-Zookeeper-image29.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image29.jpg)
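A small kazoo sketch of the one-shot watch semantics described above (the path and values are made up); note that the callback re-registers the watch to keep receiving notifications:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/config/feature-flag")

def on_change(event):
    # Triggered at most once per registration
    print("znode changed:", event.path, event.type)
    zk.get(event.path, watch=on_change)  # re-register for further notifications

data, stat = zk.get("/config/feature-flag", watch=on_change)  # read + set a watch
zk.set("/config/feature-flag", b"on")                         # fires on_change once
```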

## ACLs - Access Control Lists

@@ -298,7 +298,7 @@ Only single process may hold the lock

2. Extremely strong consistency

-![image](../../../media/Big-Data_Design-of-Zookeeper-image30.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image30.jpg)
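One way to get the "only a single process may hold the lock" guarantee from client code is kazoo's Lock recipe, sketched below (server address, lock path, and identifier are made up):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

lock = zk.Lock("/locks/resource-42", identifier="worker-1")
with lock:  # blocks until this client holds the lock; exactly one holder at a time
    print("doing critical-section work while holding the lock")
# On exit (or if the client's session dies), the lock's ephemeral znode is removed,
# so a crashed holder cannot block others forever
```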

## Katta - Lucene & more in the cloud

@@ -317,14 +317,14 @@ Katta serves large, replicated, indices as shards to serve high loads and very l

http://katta.sourceforge.net

-![image](../../../media/Big-Data_Design-of-Zookeeper-image31.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image31.jpg)

-![image](../../../media/Big-Data_Design-of-Zookeeper-image32.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image32.jpg)

-![image](../../../media/Big-Data_Design-of-Zookeeper-image33.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image33.jpg)

-![image](../../../media/Big-Data_Design-of-Zookeeper-image34.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image34.jpg)

-![image](../../../media/Big-Data_Design-of-Zookeeper-image35.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image35.jpg)

-![image](../../../media/Big-Data_Design-of-Zookeeper-image36.jpg)
+![image](../../media/Big-Data_Design-of-Zookeeper-image36.jpg)
File renamed without changes.