day by day to be a Data Scientist

roadmap

First, check this article on Medium; it will provide you with a lot of information: here

DAY 1 :

Topic: Learn fundamentals of Python:

Variables
Numbers
Strings
Lists
Dictionaries
Sets
Tuples
Control Structures
- If conditions
- For loops

Resource: Exploring The Power of python:part 1

DAY 2 :

Topic: Advanced Python:

Functions
- Lambda Functions
- Module Management (pip install)
- File Operations (read/write)
- Object-Oriented Programming
- Classes
- Objects

Resources: Exploring the power of python:part 2

Note: First complete the project on your own, then compare your work with the given solution.

Project: Personal Finance Tracker and Budget Planner

Solution: Solution of the Project

Additional resource: all the built-in functions

Day 3:

Topic: Numpy

Numpy Array Basics
Array Inspection
Array Operations
Working with Numpy Arrays
NumPy for Data Cleaning
NumPy for Statistical Analysis
Advanced NumPy Techniques

Resources: Numpy Medium

Project: MovieLens Project

Solution: Solution of the Project

Additional resource: summary about NumPy

DAY 4:

Topic: Pandas

What is Pandas
Installation and Setup
- How to install Pandas
- Setting up the environment
Pandas Data Structures
- Series: Basics and creation
- DataFrame: Basics, creation, and operations
Data Importing and Exporting
- Reading data from different sources (CSV, Excel, etc.)
- Writing data to files
Basic Data Operations
- Data selection and filtering
- Data sorting
- Handling missing values
Data Aggregation and Grouping
- Group by operations
- Aggregate functions (sum, mean, etc.)

Resource: Pandas part 1

DAY 5:

Topic: Intermediate Pandas

Advanced Data Selection
Data Transformation
Time Series Analysis in Pandas
Performance Enhancement Techniques

Resources: Pandas part 2

Project: Description of the project

Solution: Kaggle

Additional resource: time and categorical data

DAY 6 :

Topic: Matplotlib Fundamentals

Basic Plotting
Plot Types

DAY 7:

Topic: Advanced Matplotlib

Multiple Subplots
- 1.1 Creating Multiple Plots in a Single Figure
- 1.2 Combining Different Types of Plots
Advanced Features

Additional resource: Article

Additional resource: Jupyter Notebook. This file contains a wide range of techniques for better visualization.

DAY 8 :

Topic: Seaborn part 1:

scatter plots
line plots
bar plots
histograms
density plots
box plots
violin plots
heatmaps

Resource: DataCamp article

DAY 9:

Topic: Seaborn Part 2:

pair plots
joint plots
facet grids
Customizing Seaborn plots
- Changing Color Palettes
- Adjusting Figure Size
- Adding Annotations

Resource: DataCamp article

DAY 10:

Topic: Seaborn Part 3:

Today, we will explore various probability distributions and their visualizations using the Seaborn library.

Normal Distribution
Binomial Distribution
Poisson Distribution
Uniform Distribution
Logistic Distribution
Multinomial Distribution
Exponential Distribution
Chi Square Distribution
Rayleigh Distribution
Pareto Distribution
Zipf Distribution

Resource: W3Schools article

Description: here

Project: Jupyter Notebook

DAY 11:

Topic: Data Cleaning

Data Inspection.
Handling missing values.
Data Imputation

We will explore all of this through this project:

Description

Solution

DAY 12:

Topic: Statistics and Probability part 1:

Unit 1: Analyzing Categorical Data
- Topics include analyzing one categorical variable, two-way tables, distributions in two-way tables.
Unit 2: Displaying and Comparing Quantitative Data
- Covers displaying quantitative data with graphs, describing and comparing distributions, and more on data displays.
Unit 3: Summarizing Quantitative Data
- Focuses on measuring center in quantitative data, interquartile range, variance, and standard deviation.
Unit 4: Modeling Data Distributions
- Includes topics like percentiles, z-scores, density curves, and normal distributions.

Resource:

DAY 13:

Topic: Statistics and Probability part 2:

Unit 5: Exploring Bivariate Numerical Data
- Discusses scatterplots, correlation coefficients, trend lines, and regression.
Unit 6: Study Design
- Covers statistical questions, sampling methods, types of studies, and experiments.
Unit 7: Probability
- Topics include theoretical probability, set operations, experimental probability, and rules of probability.
Unit 8: Counting, Permutations, and Combinations
- Focuses on counting principle, permutations, combinations, and combinatorics.

Resource:

DAY 14:

Topic: Statistics and Probability part 3:

Unit 9: Random Variables
- Discusses discrete and continuous random variables, transforming and combining random variables, binomial and geometric distributions, and more.
Unit 10: Sampling Distributions
- Covers the concept of sampling distributions, including distributions of sample proportions and means.
Unit 11: Confidence Intervals
- Introduces confidence intervals and covers how to estimate population proportions and means.
Unit 12: Significance Tests (Hypothesis Testing)
- Explores the idea of significance tests, error probabilities, tests about population proportions and means, and more.

Resource:

DAY 15:

Topic: Statistics and Probability part 4:

Unit 13: Two-Sample Inference for the Difference Between Groups
- Focuses on comparing two proportions and two means, among other related topics.
Unit 14: Inference for Categorical Data (Chi-Square Tests)
- Discusses chi-square goodness-of-fit tests and chi-square tests for relationships.
Unit 15: Advanced Regression (Inference and Transforming)
- Covers inference about slope, nonlinear regression, and other advanced regression topics.
Unit 16: Analysis of Variance (ANOVA)
- Focuses on the analysis of variance (ANOVA) methodology.

Resource:

DAY 16:

Topic: Exploratory Data Analysis (EDA)

The main purpose of EDA is to look at the data before making any assumptions. It helps you identify obvious errors, understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.

Examine the data distribution
Handling missing values in the dataset (a common issue in most datasets)
Handling the outliers
Removing duplicate data
Encoding the categorical variables
Normalizing and Scaling
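
The checklist above can be sketched on a tiny table with pandas. This is only an illustration: the `age` and `city` columns and their values are made up, and the median fill, 1.5 * IQR rule, and min-max scaling are one reasonable choice among several.

```python
import pandas as pd

# Hypothetical toy dataset, invented for this example.
df = pd.DataFrame({
    "age": [25, 32, None, 41, 32, 120],
    "city": ["Paris", "Lyon", "Paris", None, "Lyon", "Paris"],
})

# Examine the data distribution (count, mean, quartiles, top categories).
summary = df.describe(include="all")

# Remove duplicate rows (the two "32, Lyon" rows collapse into one).
df = df.drop_duplicates()

# Handle missing values: median for numbers, most frequent value for categories.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Handle outliers with the 1.5 * IQR rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Encode the categorical variable as one-hot columns.
df = pd.get_dummies(df, columns=["city"])

# Scale the numeric column to the [0, 1] range (min-max normalization).
age_range = df["age"].max() - df["age"].min()
df["age_scaled"] = (df["age"] - df["age"].min()) / age_range
```

On real data, look at `summary` and a few plots before deciding how to treat missing values and outliers; dropping rows is not always the right call.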

Resource

Day 17:

Topic: SQL Basic Concepts

Creating Database: Learn how to create your own database.
Creating Tables and Adding Data: Understand how to create tables and insert data into them.
SELECT Clause: Learn to retrieve or fetch data from a database.
FROM Clause: Understand from which table in the database you need to select data.
WHERE Clause: Learn to form conditions based on which data have to be queried.
DELETE Statement: Understand how to perform deletion tasks.
INSERT INTO: Learn about insertion tasks.
AND and OR Operator: Know how to select data based on AND or OR conditions.
Drop and Truncate: Learn to drop a table or truncate (empty) it as needed.
NOT Operator: Understand how to select data that does not satisfy a given condition.
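
To try these statements without installing a database server, one option is Python's built-in `sqlite3` module. The sketch below is an assumption about tooling, not part of the linked resources, and the `students` table and its rows are invented. Note that SQLite has no `TRUNCATE` statement, so only `DROP TABLE` is shown.

```python
import sqlite3

# Connecting creates the database; ":memory:" keeps it in RAM only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE and INSERT INTO (hypothetical table and rows).
cur.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, grade INTEGER, city TEXT)"
)
cur.executemany(
    "INSERT INTO students (name, grade, city) VALUES (?, ?, ?)",
    [("Tom", 82, "Paris"), ("Ana", 91, "Lyon"), ("Sam", 67, "Paris"), ("Lea", 74, "Nice")],
)

# SELECT ... FROM ... WHERE combined with AND.
top_paris = cur.execute(
    "SELECT name FROM students WHERE city = 'Paris' AND grade > 80"
).fetchall()

# NOT selects the rows that do not satisfy the condition.
not_paris = cur.execute(
    "SELECT name FROM students WHERE NOT city = 'Paris' ORDER BY name"
).fetchall()

# DELETE removes only the rows that match the condition.
cur.execute("DELETE FROM students WHERE grade < 70")
remaining = cur.execute("SELECT COUNT(*) FROM students").fetchone()[0]

# DROP TABLE removes the table itself.
cur.execute("DROP TABLE students")
conn.close()
```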

Resources

DAY 18:

Topic: SQL Advanced Concepts

WITH Clause: Understanding the concept of the WITH clause and using it to name a sub-query block.
FETCH Clause: Learn to fetch the filtered data based on conditions, like fetching only the top 3 rows.
Arithmetic Operators: Use arithmetic operators for precise data filtering.
Wildcard Operators: Select exact data intelligently, like names starting or ending with 'T'.
UPDATE Statement: Learn updating data entries based on conditions.
ALTER Table: Know how to add, drop, or modify tables.
LIKE Clause: Understand pattern-based search.
BETWEEN and IN Operator: Learn to select data within a specified range (BETWEEN) or from a list of values (IN).
CASE Statement: Understand conditional queries.
EXISTS: Learn to form nested queries for filtering data that exists in another query.
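
A few of these clauses can be tried with Python's built-in `sqlite3` module, as sketched below. The `employees` table is invented for the example. SQLite uses `LIMIT` where other databases use `FETCH FIRST`, so the "top 3 rows" query is written with `LIMIT` here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
cur.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Tina", 3000), ("Omar", 4500), ("Theo", 5200), ("Nora", 6100)],
)

# LIKE with the % wildcard: names starting with 'T'.
starts_with_t = cur.execute(
    "SELECT name FROM employees WHERE name LIKE 'T%' ORDER BY name"
).fetchall()

# BETWEEN selects an inclusive range.
mid_range = cur.execute(
    "SELECT name FROM employees WHERE salary BETWEEN 4000 AND 5500 ORDER BY salary"
).fetchall()

# UPDATE with an arithmetic operator, restricted by a condition.
cur.execute("UPDATE employees SET salary = salary + 500 WHERE salary < 4000")

# ALTER TABLE adds a new column (NULL for existing rows).
cur.execute("ALTER TABLE employees ADD COLUMN department TEXT")

# WITH names a sub-query; CASE builds a conditional column.
bands = cur.execute(
    """
    WITH avg_pay AS (SELECT AVG(salary) AS avg_salary FROM employees)
    SELECT name,
           CASE WHEN salary >= avg_salary THEN 'above' ELSE 'below' END AS band
    FROM employees, avg_pay
    ORDER BY salary DESC
    LIMIT 3
    """
).fetchall()
conn.close()
```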

Resources

DAY 19:

Topic: SQL Aggregate Functions

DISTINCT Clause: Select only distinct, non-repetitive data.
Count Function: Learn to return the total count of filtered data.
Sum Function: Understand how to calculate the sum of queried data.
Average Function: Calculate the average of queried data.
Minimum Function: Learn to find the minimum value in queried data.
Maximum Function: Learn to find the maximum value in queried data.
ORDER BY: Order queried data in ascending or descending order.
GROUP BY: Group queried data by a specified column.
ALL and ANY Clause: Understand these logical operators and their boolean results.
TOP Clause: Learn to fetch a limited number of rows from a database.
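
The aggregate functions above can be combined in a single query. The sketch below again assumes Python's built-in `sqlite3` module and a made-up `orders` table. `TOP` is SQL Server syntax, so SQLite's `LIMIT` stands in for it, and `ALL`/`ANY` are left out because SQLite does not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
cur.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 20), ("ana", 30), ("bob", 50), ("bob", 10), ("cy", 40)],
)

# DISTINCT inside COUNT: how many different customers placed orders.
distinct_customers = cur.execute(
    "SELECT COUNT(DISTINCT customer) FROM orders"
).fetchone()[0]

# GROUP BY with COUNT, SUM, AVG, MIN, MAX, then ORDER BY and LIMIT.
top_two = cur.execute(
    """
    SELECT customer, COUNT(*), SUM(amount), AVG(amount), MIN(amount), MAX(amount)
    FROM orders
    GROUP BY customer
    ORDER BY SUM(amount) DESC, customer
    LIMIT 2
    """
).fetchall()
conn.close()
```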

Resources

DAY 20:

Topic: Joins in SQL

Union Clause: Understand the union of tables.
Intersection Clause: Learn to join tables at their intersection.
Aliases: Assign aliases to tables for later reference.
Cartesian Join and Self Join: Learn to pair every row of one table with every row of another, and to join a table to itself.
Inner, Left, Right, and Full Joins: Understand these four types of joins.
Division Clause: Find entities interacting with all entities of a set of different types.
Using Clause: Modify NATURAL JOIN with the USING clause for columns with the same names but different datatypes.
Combining Values: Combine aggregate and non-aggregate values using Joins and Over clause.
MINUS Operator: Understand how to use the MINUS operator for exclusion.
Joining 3 or More Tables: Learn to join and query from three or more tables.
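
A small sketch of several join types, assuming Python's built-in `sqlite3` module and two invented tables, `authors` and `books`. `MINUS` is Oracle syntax; the equivalent in SQLite and standard SQL is `EXCEPT`. Right and full joins are omitted because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript(
    """
    CREATE TABLE authors (id INTEGER, name TEXT);
    CREATE TABLE books (title TEXT, author_id INTEGER);
    INSERT INTO authors VALUES (1, 'Ana'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO books VALUES ('SQL 101', 1), ('Joins', 1), ('Pandas', 2);
    """
)

# INNER JOIN with table aliases: only authors that have books.
inner = cur.execute(
    """
    SELECT a.name, b.title
    FROM authors AS a
    INNER JOIN books AS b ON b.author_id = a.id
    ORDER BY a.name, b.title
    """
).fetchall()

# LEFT JOIN keeps every author, even those without books.
left = cur.execute(
    """
    SELECT a.name, COUNT(b.title)
    FROM authors AS a
    LEFT JOIN books AS b ON b.author_id = a.id
    GROUP BY a.name
    ORDER BY a.name
    """
).fetchall()

# Cartesian (cross) join: every author paired with every book.
cross_count = cur.execute(
    "SELECT COUNT(*) FROM authors CROSS JOIN books"
).fetchone()[0]

# EXCEPT: author ids that never appear in the books table.
without_books = cur.execute(
    "SELECT id FROM authors EXCEPT SELECT author_id FROM books"
).fetchall()
conn.close()
```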

Resources

Project: To practice, first download the dataset here, then try all the SQL queries in the README file. After that, you can check the solution here.

Don't focus on the Tableau section; we will delve into it in later lessons.

DAY 21:

Topic: Introduction to Machine Learning

Machine Learning Definition
Examples and Use Cases
- Recommendation engines (Amazon, Spotify, Netflix).
- Speech recognition software.
- Fraud detection services in banks.
- Self-driving cars and driver assistance features.
How Does Machine Learning Work?
Machine Learning vs. Deep Learning
Types of Machine Learning
- Supervised Machine Learning: Trained on labeled data sets.
- Unsupervised Machine Learning: Uses unlabeled data sets to uncover patterns.
- Semi-supervised Machine Learning: Combines labeled and unlabeled data sets.
- Reinforcement Learning: Uses trial and error in specific environments.
Machine Learning Benefits and Risks

Resources

DAY 22 :

Topic: Steps in Machine Learning Project

Define the Problem: Identify the problem you want to solve and determine if machine learning can be used to solve it.
Collect Data: Gather and clean the data that you will use to train your model. The quality of your model will depend on the quality of your data.
Explore the Data: Use data visualization and statistical methods to understand the structure and relationships within your data.
Pre-process the Data: Prepare the data for modeling by normalizing, transforming, and cleaning it as necessary.
Split the Data: Divide the data into training and test datasets to validate your model.
Choose a Model: Select a machine learning model that is appropriate for your problem and the data you have collected.
Train the Model: Use the training data to train the model, adjusting its parameters to fit the data as accurately as possible.
Evaluate the Model: Use the test data to evaluate the performance of the model and determine its accuracy.
Fine-tune the Model: Based on the results of the evaluation, fine-tune the model by adjusting its parameters and repeating the training process until the desired level of accuracy is achieved.
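
The split, train, and evaluate steps above can be sketched in a few lines with scikit-learn. This is a minimal illustration, not the linked project: the Iris dataset, the logistic regression model, and the 25% test split are all arbitrary choices made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Collect data: a small built-in dataset stands in for real data.
X, y = load_iris(return_X_y=True)

# Split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Pre-process: fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Choose and train a model.
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)

# Evaluate on data the model has not seen.
accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
print(f"Test accuracy: {accuracy:.2f}")
```

Fitting the scaler on the training set only keeps information from the test set out of the training process.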

Resources

Practical project: This project is quite popular, and it walks through all of these steps. Take your time, because these steps appear in every project.

DAY 23 :

Topic: Exploring Scikit-Learn

First, if you don't know which algorithm to use, don't worry: the Scikit-learn documentation includes a guide for choosing an estimator. We will cover:

Basic Example: A simple example using Scikit-Learn for machine learning.
Data Loading: Guidelines on data requirements and loading techniques.
Model Fitting: Instructions for fitting both supervised and unsupervised learning models.
Prediction: Methods to make predictions using different estimators.
Data Preprocessing: Techniques for standardization, normalization, binarization, and encoding categorical features.
Model Creation: Steps to create supervised and unsupervised learning estimators.
Model Evaluation: Various metrics for assessing the performance of models.
Model Tuning: Strategies for tuning models using grid search and randomized parameter optimization.

Resource: This is a PDF file that covers all of the above.

DAY 24 :

Topic: Advanced Scikit-Learn Features

Introduction to Scikit-Learn: Basic concepts and workflow.
Data Preprocessing: Techniques for preparing data for modeling.
Supervised Learning Models: Instructions for creating and using models like regression and classification.
Unsupervised Learning Models: Guides on clustering and dimensionality reduction.
Model Tuning and Evaluation: Tips on improving model performance and measuring accuracy.
Pipeline and Model Complexity: Insights into streamlining workflows and handling complex data scenarios.

Resources: This is a PDF file that covers all of the above.

Resource

DAY 25 :

Topic: Feature Engineering 1 - Handling Missing Values

Handling Missing values
- 1.1 Problems of Having Missing values
- 1.2 Understanding Types of Missing Values
- 1.3 Dealing MV Using SimpleImputer Method
- 1.4 Dealing MV Using KNN Imputer Method
Handling Categorical Values
- 2.1 One Hot Encoding
- 2.2 Label Encoding
- 2.3 Ordinal Encoding
- 2.4 Multi Label Binarizer
- 2.5 Count/Frequency Encoding
- 2.6 Target Guided Ordinal Encoding
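
As a quick preview of a few of these techniques, here is a sketch using scikit-learn's imputers and encoders. The tiny arrays are invented, and the chosen strategies (mean imputation, two neighbours, an explicit small/medium/large order) are only examples.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Numeric data with one missing value in the first column.
X_num = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0]])

# SimpleImputer: replace the gap with the column mean.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X_num)

# KNNImputer: replace the gap with the average of the 2 nearest rows.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X_num)

# One-hot encoding: one column per category (categories are sorted: green, red).
colors = np.array([["red"], ["green"], ["red"]])
one_hot = OneHotEncoder().fit_transform(colors).toarray()

# Ordinal encoding: categories mapped to integers in the order we specify.
sizes = np.array([["small"], ["large"], ["medium"]])
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]]).fit_transform(sizes)
```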

DAY 26:

Topic: Feature Engineering 2 - Feature Scaling

Feature Scaling
- 1.1 Standardization/Standard Scaler
- 1.2 Normalization/MinMax Scaler
- 1.3 Max Abs Scaler
- 1.4 Robust Scaler
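
The four scalers can be compared side by side on one column that contains an outlier. The data below is made up for this sketch; the point is how differently each scaler reacts to the value 100.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler

# One feature with an obvious outlier (hypothetical values).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Standardization: subtract the mean, divide by the standard deviation.
standard = StandardScaler().fit_transform(X)

# Min-max normalization: map the minimum to 0 and the maximum to 1.
min_max = MinMaxScaler().fit_transform(X)

# Max-abs scaling: divide by the largest absolute value.
max_abs = MaxAbsScaler().fit_transform(X)

# Robust scaling: subtract the median, divide by the interquartile range.
robust = RobustScaler().fit_transform(X)

for name, scaled in [("standard", standard), ("min_max", min_max),
                     ("max_abs", max_abs), ("robust", robust)]:
    print(name, scaled.ravel().round(2))
```

With min-max scaling the outlier squeezes the other values close to 0, while the robust scaler keeps them spread out because the median and IQR are barely affected by one extreme value.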

DAY 27:

Topic: Feature Engineering 3 - Feature Selection

Why Feature Selection Matters
Types of Feature Selection
Filter Methods
- Variance Threshold
- SelectKBest
- SelectPercentile
- GenericUnivariateSelect
Wrapper Methods
- RFE
- RFECV
- SelectFromModel
- SequentialFeatureSelector

DAY 28:

Topic: Feature Engineering 4 - Feature Transformation and Pipelines

Feature Transformation
- Understanding QQPlot and PP-Plot
- Logarithmic transformation
- Reciprocal transformation
- Square root transformation
- Exponential transformation
- Boxcox transformation
Using Pipelines to automate the FE
- What are Pipelines
- Accessing individual steps in pipeline
- Accessing Parameters in Pipeline
- Performing Grid Search with Pipeline
- Combining Transformers and Pipeline
- Visualizing the Pipeline
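
A minimal pipeline sketch with scikit-learn, covering step access, parameter names, and grid search. The step names `scale` and `clf`, the Iris dataset, and the grid of `C` values are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# A pipeline chains preprocessing and a model into a single estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=500)),
])

# Access an individual step by name.
scaler_step = pipe.named_steps["scale"]

# Parameters of each step are exposed as "<step name>__<parameter>".
params = pipe.get_params()

# Grid search over a pipeline parameter; scaling is re-fitted inside each CV fold.
search = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
best_c = search.best_params_["clf__C"]
print(best_c, round(search.best_score_, 3))
```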

DAY 29:

Topic: Understanding Linear Regression

Fundamentals of Linear Regression
Exploring the Assumptions of Linear Regression
Gradient Descent and Loss Function
Evaluation Metrics for Linear Regression
Applications of Linear Regression

DAY 30:

Topic: Understanding Multicollinearity, and Regularization Techniques

Multiple Linear Regression
Multicollinearity
Regularization Techniques
Ridge, Lasso and Elastic Net
Polynomial Regression

DAY 31:

Topic: Understanding the Logistic Regression

How does Logistic Regression work
What is a sigmoid curve
Assumptions of Logistic Regression
Cost Function of Logistic Regression
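
For reference, the sigmoid function and the log-loss (binary cross-entropy) cost fit in a few lines of plain Python. This is a sketch for intuition only; the function names are made up here, and real libraries use numerically safer implementations.

```python
from math import exp, log

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

def log_loss(y_true, y_prob):
    """Average binary cross-entropy; y_prob values must lie strictly between 0 and 1."""
    total = sum(
        y * log(p) + (1 - y) * log(1 - p)
        for y, p in zip(y_true, y_prob)
    )
    return -total / len(y_true)

# A score of 0 sits exactly on the decision boundary.
print(sigmoid(0))

# Confident, correct predictions give a small loss; wrong ones give a large loss.
good = log_loss([1, 0], [0.9, 0.1])
bad = log_loss([1, 0], [0.1, 0.9])
print(round(good, 3), round(bad, 3))
```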

DAY 32:

Topic: Understanding Decision Trees

Why do we need Decision Trees
How do Decision Trees work
How do we select a root node
Understanding Entropy, Information Gain
Solving an Example on Entropy
Understanding Gini Impurity
Solving an Example on Gini Impurity
Decision Trees for Regression
Why Decision Trees are a Greedy Approach
Understanding Pruning
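
Entropy, Gini impurity, and information gain are easy to compute by hand. The sketch below uses only the standard library; the function names and the yes/no example split are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# A balanced parent node split into two mostly pure children.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print(entropy(parent), gini(parent))
print(round(information_gain(parent, [left, right]), 4))
```

A decision tree evaluates candidate splits like this one and picks the split with the highest gain at each node, which is why it is called greedy.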

DAY 33:

Topic: Understanding Ensemble Techniques

What are Ensemble Techniques
Understanding Bagging
Understanding Boosting
Understanding Stacking

DAY 34:

Topic: Understanding Random Forests

Decision Trees Aggregation
Bagging and Variance Reduction
Feature Subspace sampling
Handling Overfitting
Out of bag error

DAY 35:

Topic: Understanding Boosting Algorithms

Concept of Boosting
Understanding AdaBoost
Solving an Example on AdaBoost
Understanding Gradient Boosting
Solving an Example on Gradient Boosting
AdaBoost vs Gradient Boosting

DAY 36:

Topic: Understanding XGBoost Algorithms

Concept of XGBoost Algorithm
Boosting Mechanism
Feature Importance Interpretation
Regularization Techniques
Flexibility and Scalability

DAY 37:

Topic: Understanding K Nearest Neighbours

How does K-Nearest Neighbours work
How is Distance Calculated
- Euclidean Distance
- Hamming Distance
- Manhattan Distance
Why is KNN a Lazy Learner
Effects of Choosing the value of K
Different ways to perform KNN
Understanding KD-Tree
Solving an Example of KD Tree
Understanding Ball Tree
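
The three distance measures and a brute-force KNN vote can be written in plain Python. This is a sketch for intuition: the helper names and the toy points are made up, and real implementations use KD-trees or ball trees to avoid comparing against every point.

```python
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(points, labels, query, k=3):
    """Brute-force KNN: vote among the k closest training points."""
    nearest = sorted(zip(points, labels), key=lambda pl: euclidean(pl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical training points in two well-separated groups.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```

KNN is called a lazy learner because there is no training step: all the work happens at prediction time, inside `knn_predict`.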

DAY 38:

Topic: Understanding Support Vector Machines

Understanding Concept of SVC
What are Support Vectors
What is Margin
Hard Margin and Soft Margin
Kernelized SVC
Types of Kernels
Understanding SVR

DAY 39:

Topic: Understanding Naive Bayes Classifiers

Why do we need Naive Bayes
Concept of how it works
Mathematical Intuition of Naive Bayes
Solving an Example on Naive Bayes
Other Bayes Classifiers
- Gaussian Naive Bayes Classifier
- Multinomial Naive Bayes Classifier
- Bernoulli Naive Bayes Classifier

DAY 40:

Topic: Understanding Clustering Techniques

How clustering is different from classification
Applications of Clustering
What are density based methods
What are Hierarchical based methods
What are partitioning methods
What are Grid Based methods
Main Requirements for Clustering Algorithms

DAY 41:

Topic: Understanding K-Means Clustering

Concept of K-Means Clustering
Math Intuition Behind K-Means
Cluster Building Process
Edge Case Scenarios of K-Means
Challenges and Improvements in K-Means

DAY 42:

Topic: Understanding Hierarchical Clustering

Concept of Hierarchical Clustering
Understanding Algorithm
Understanding Linkage Methods

DAY 43:

Topic: Understanding DBSCAN Clustering

Concept of DBSCAN
Key terms in understanding DBSCAN
Algorithm of DBSCAN

DAY 44:

Topic: Evaluation of Clustering Models

Understanding External Measures
- Rand Index
- Jaccard Coefficient
Understanding Internal Measures
- Cohesion
- Separation

DAY 45:

Topic: Understanding Curse of Dimensionality

Computational Complexity
Data Visualization Challenges

DAY 46:

Topic: Understanding Principal Component Analysis

Idea Behind PCA
What are Principal Components
Eigen Decomposition Approach
Singular Value Decomposition Approach
Why do we maximize Variance
What is Explained Variance Ratio
How to select optimal number of Principal Components
Understanding Scree plot
Issues with PCA
Understanding Kernel PCA

DAY 47:

Topic: Supervised Algorithms Revision

Regression Algorithms
1. Linear Regression
2. Polynomial Regression
Classification Algorithms
1. K-Nearest Neighbours
2. Logistic Regression
Both Classification and Regression
1. Decision Trees
2. Random Forest
3. Gradient Boosting
4. AdaBoost
5. Ridge Regression
6. Lasso Regression

DAY 48:

Topic: Unsupervised Algorithms Revision

Clustering Algorithms
1. K-Means
2. DBSCAN
3. HDBSCAN
4. Hierarchical
Dimensionality Reduction Techniques
1. PCA
2. t-SNE
3. ICA
Association Rules
1. Apriori
2. FP-growth
3. FP-Max

DAY 49:

Topic: Big Mart Sales Prediction Project Understanding

Understanding the Data

DAY 50:

Topic: EDA for Big Mart Sales

Dealing with Null Values
Data Visualization of the Numeric Columns
Feature Engineering of the Numeric Columns

DAY 51:

Topic: Data Visualization

Data Visualization of the Categorical Columns
Feature Engineering of the Categorical Columns

DAY 52:

Topic: Model Building and Evaluation

Model Selection: Choosing the right model for the problem (classification, regression, etc.).
Training and Testing: Splitting data into training and testing sets to evaluate model performance.
Evaluation Metrics: Using metrics like accuracy, precision, recall, and MSE for performance assessment.
Cross-Validation: Implementing cross-validation techniques for more reliable model evaluation.
Model Interpretability: Understanding and explaining model decisions.
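
The evaluation ideas above can be sketched with scikit-learn as follows. The breast cancer dataset and the shallow decision tree are stand-ins chosen for the example, not the Big Mart model; for a regression problem such as sales prediction, a metric like MSE would replace accuracy, precision, and recall.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Training and testing: hold out 20% of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluation metrics on the held-out test set.
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
}

# Cross-validation: 5 different train/test splits give a more reliable estimate.
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y, cv=5
)

# Interpretability: how much each feature contributed to the tree's splits.
importances = model.feature_importances_
print(metrics, cv_scores.mean().round(3))
```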

DAY 53:

Topic: Hyperparameter Tuning the Models

Hyperparameter Basics: Understanding what hyperparameters are in machine learning models.
Tuning Techniques: Introducing Grid Search, Random Search, and Bayesian Optimization.
Practical Implementation: Applying hyperparameter tuning on a sample model.
Performance Impact: Assessing how hyperparameters influence model outcomes.
Best Practices: Discussing balance in model complexity and overfitting.
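
As one possible practical implementation, the sketch below runs a random search with scikit-learn. The wine dataset, the decision tree, and the candidate values for `max_depth` and `min_samples_leaf` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Hyperparameters are set before training; here we list candidate values.
param_distributions = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_leaf": [1, 2, 5, 10],
}

# Random search tries n_iter random combinations instead of the full grid.
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=8,
    cv=5,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Limiting `max_depth` or raising `min_samples_leaf` makes the tree simpler, which is one way to trade a little training accuracy for less overfitting.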

DAY 54: History of Deep Learning

Topics:

Early Developments: Tracing the origins and initial concepts of neural networks.
Key Milestones: Highlighting major breakthroughs and influential models in deep learning.
Deep Learning Resurgence: Understanding the factors contributing to the modern rise of deep learning.
Influential Models: Overview of landmark models in deep learning history.
Future Trends: Discussing current trends and potential future developments in deep learning.

DAY 55: Deep Learning Frameworks

Topics:

Introduction to TensorFlow
Introduction to PyTorch
Comparison of Frameworks
Setting Up a Simple Neural Network in Both Frameworks

DAY 56: Introduction to Neural Networks

Topics:

Neural Network Structure: Exploring the basic architecture including neurons and layers.
Forward Propagation: Understanding how data is processed in a neural network.
Backpropagation and Training: Learning the mechanism of training neural networks.
Activation Functions: Introduction to different activation functions and their purposes.
Simple Implementation: Hands-on example of creating a basic neural network.
