Build Customized Data Science Pipeline

In industry, you often need to build a customized data science pipeline to solve your company's specific problems.

Industry Design Examples

  • Google AutoML Tables
    • From structured data to dashboard, the whole system design is very smooth

Pipeline Tools

  • Luigi
  • Airflow
  • Orchest
    • The pipelines it lets you build can mix .py files and IPython notebooks, which looks convenient
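Tools like Luigi, Airflow, and Orchest all model a workflow as a DAG of tasks executed in dependency order. The sketch below illustrates that core idea using only the standard library's `graphlib`; the task names (extract, transform, load) and their bodies are made up for illustration, not any tool's actual API.

```python
from graphlib import TopologicalSorter

# Shared state the hypothetical tasks write into.
results = {}

def extract():
    results["raw"] = [3, 1, 2]          # pretend we pulled raw rows

def transform():
    results["clean"] = sorted(results["raw"])

def load():
    results["report"] = f"{len(results['clean'])} rows loaded"

tasks = {"extract": extract, "transform": transform, "load": load}

# The DAG maps each task to the set of tasks it depends on --
# conceptually how Luigi/Airflow wire a pipeline together.
dag = {"transform": {"extract"}, "load": {"transform"}, "extract": set()}

# Run tasks in topological (dependency-respecting) order.
for name in TopologicalSorter(dag).static_order():
    tasks[name]()

print(results["report"])  # -> 3 rows loaded
```

Real pipeline tools add what this sketch omits: scheduling, retries, persistence of intermediate outputs, and distributed execution.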

Cloud Platforms

General Architecture

Core Parts

Feature Store

  • Reusable features for both online and offline usage; you can also monitor and validate the results
  • Feast is open source
    • Each feature view contains stored features, an entity (such as a primary key), and a data source
    • To join features from different feature views, you can use the shared entity
  • Tecton is a paid tool
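The entity-based join described above can be sketched in plain Python: two feature views keyed by the same entity are combined by looking up that key in each. The view names, feature names, and `get_features` helper below are hypothetical illustrations, not Feast's or Tecton's actual API.

```python
# Two hypothetical "feature views", both keyed by the same entity
# (driver_id acts like a primary key).
driver_stats = {
    101: {"avg_daily_trips": 14, "conv_rate": 0.52},
    102: {"avg_daily_trips": 9,  "conv_rate": 0.61},
}
driver_profile = {
    101: {"tenure_days": 400},
    102: {"tenure_days": 120},
}

def get_features(entity_id, *views):
    """Join features across views on the shared entity key."""
    row = {"driver_id": entity_id}
    for view in views:
        row.update(view.get(entity_id, {}))
    return row

print(get_features(101, driver_stats, driver_profile))
# -> {'driver_id': 101, 'avg_daily_trips': 14, 'conv_rate': 0.52, 'tenure_days': 400}
```

A real feature store layers point-in-time correctness, online/offline serving, and validation on top of this basic join.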

Param Tuning (HPO)

  • Hyperopt
  • Optuna
    • Quick start - shows how to tune params across multiple estimators and param spaces, with nice visualizations too
    • Optuna vs Hyperopt
      • It compares the 2 tools across several important aspects, and in every aspect Optuna appears to be better overall
      • Optuna's TPE appears better than Hyperopt's Adaptive TPE
    • Using Optuna with different models
  • Keras Tuner
    • It can be used to tune neural networks; the user interface is similar to Optuna's
    • Different types of tuners: Hyperband, Bayesian optimization, and random search
    • It also provides a tuner for sklearn models, sklearn tuner
  • FLAML
    • In some cases, FLAML can be more efficient than Optuna at param tuning, and can even deliver better testing performance in a shorter time
    • It developed 2 search algorithms (CFO, BlendSearch); CFO runs faster with higher testing performance in many cases
  • Bayesian Optimization
    • It uses information from past trials to select params for the next model
    • "This is a constrained global optimization package built upon bayesian inference and gaussian process, that attempts to find the maximum value of an unknown function in as few iterations as possible. This technique is particularly suited for optimization of high cost functions, situations where the balance between exploration and exploitation is important."
    • Example for tuning CatBoost, LightGBM, XGBoost
    • An example
      • Bayes_opt may not be faster than Hyperopt, but you can stop whenever you want and keep the current best results. It also shows the tuning progress, including which value was selected in each trial
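All of the HPO tools above share the same core loop: define an objective over a search space, run trials, and keep the best result. The pure-Python random-search sketch below shows that loop; the objective function, param names, and search ranges are toy examples, not any library's API. TPE, Bayesian optimization, and CFO differ mainly in replacing the random sampler with one that biases toward past good trials.

```python
import random

def objective(params):
    """Toy 'validation loss', minimized near learning_rate=0.1,
    n_estimators=200 (a made-up optimum for illustration)."""
    return (params["learning_rate"] - 0.1) ** 2 + (params["n_estimators"] - 200) ** 2 / 1e4

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        # Sample one candidate from the search space per trial.
        params = {
            "learning_rate": rng.uniform(0.001, 0.3),
            "n_estimators": rng.randrange(50, 500),
        }
        loss = objective(params)
        if loss < best_loss:          # keep the incumbent best
            best_params, best_loss = params, loss
    return best_params, best_loss

best_params, best_loss = random_search(200)
print(best_params, round(best_loss, 4))
```

Because the best-so-far is tracked every trial, you can stop early and still keep a usable result, which is the same property noted for bayes_opt above.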

Model Selection

  • Google Model Search
    • "The Model Search system consists of multiple trainers, a search algorithm, a transfer learning algorithm and a database to store the various evaluated models. The system runs both training and evaluation experiments for various ML models (different architectures and training techniques) in an adaptive, yet asynchronous fashion. While each trainer conducts experiments independently, all trainers share the knowledge gained from their experiments. At the beginning of every cycle, the search algorithm looks up all the completed trials and uses beam search to decide what to try next. It then invokes mutation over one of the best architectures found thus far and assigns the resulting model back to a trainer."
    • Model Search Intro
  • MLJAR is a nice AutoML tool
    • Besides EDA, model selection, and param tuning, it stacks models at the end to achieve better results
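The final stacking step can be illustrated with a minimal blending sketch: combine two base models' predictions and pick the blend weight that minimizes error on a holdout set. The models, predictions, and targets below are toy numbers, not MLJAR's implementation (which trains a meta-model over many base learners).

```python
# Holdout targets plus two base models' predictions on the same rows.
holdout_y     = [1.0, 2.0, 3.0, 4.0]
model_a_preds = [0.8, 2.2, 2.9, 4.3]   # hypothetical base model A
model_b_preds = [1.3, 1.7, 3.2, 3.8]   # hypothetical base model B

def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def blend(w):
    """Weighted average of the two base models: w*A + (1-w)*B."""
    return [w * a + (1 - w) * b for a, b in zip(model_a_preds, model_b_preds)]

# Grid-search the blend weight on the holdout set.
best_w = min((w / 100 for w in range(101)),
             key=lambda w: mse(blend(w), holdout_y))

print(best_w, mse(blend(best_w), holdout_y))
```

Because the two toy models' errors point in opposite directions, the blend beats either model alone, which is the intuition behind stacking diverse models.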

Security Threats to Machine Learning Systems