- Big Data
- Hadoop
- Spark
- PySpark
- Machine Learning using PySpark
- Introduction to Machine Learning
- Supervised vs Unsupervised
- Classification vs Regression
- Data Ingestion
- Data Wrangling
- Data Preprocessing
- Model Training
- Model Validation
- Deployment
- Driver
- Executors
- Partitions
- Jobs
- Stages
- Tasks
- Resilient Distributed Datasets (RDDs)
- DataFrames as a High-Level Data Structure
- Creation of RDD
- Transformation methods
- Aggregation methods
- Actions
- Caching
- Debugging
- Loading CSV, JSON & Parquet
- Connecting to databases
- Getting data from a streaming server
- Descriptive Statistics
- Accessing subsets of data - Rows, Columns, Filters
- Handling Missing Data
- Dropping rows & columns
- Handling Duplicates
- Aggregate functions
- Merge, Join & Concatenate
- Why Preprocessing?
- Scaling Techniques
- Encoding Techniques
- Text Processing
- Dimensionality Reduction
- Vectorization of Data
- Linear Regression
- Decision Tree Regressor
- Random Forest Regressor
- GBT Regressor
- Evaluation of Regression Models
- Logistic Regression
- Decision Tree Classifier
- GBT Classifier
- Random Forest Classifier
- Naive Bayes
- Multilayer Perceptron Classifier
- Evaluation of Classification Models
- Motivation behind clustering
- KMeans
- Gaussian Mixture Model
- Latent Dirichlet Allocation
- Composite Estimators using Pipelines
- Model Selection
- Hyper-parameter Tuning
- Persisting trained models
- Deployment