Build Customized Data Science Pipeline

In industry, you often need to build a customized data science pipeline to solve your company's specific problems.

Industry Design Examples

  • Google AutoML Tables
    • From structured data to dashboard, the whole system design is very smooth

Pipeline Tools

  • Luigi
  • Airflow
  • Orchest
    • The pipelines it lets you build can mix .py files and IPython notebooks, which looks convenient
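Tools like Luigi, Airflow, and Orchest all model a workflow as a DAG of tasks executed in dependency order. The sketch below illustrates that core idea using only the standard library's `graphlib`; the task names (extract, transform, load) and their bodies are made up for illustration, not any tool's actual API.

```python
from graphlib import TopologicalSorter

# Shared state the hypothetical tasks write into.
results = {}

def extract():
    results["raw"] = [3, 1, 2]          # pretend we pulled raw rows

def transform():
    results["clean"] = sorted(results["raw"])

def load():
    results["report"] = f"{len(results['clean'])} rows loaded"

tasks = {"extract": extract, "transform": transform, "load": load}

# The DAG maps each task to the set of tasks it depends on --
# conceptually how Luigi/Airflow wire a pipeline together.
dag = {"transform": {"extract"}, "load": {"transform"}, "extract": set()}

# Run tasks in topological (dependency-respecting) order.
for name in TopologicalSorter(dag).static_order():
    tasks[name]()

print(results["report"])  # -> 3 rows loaded
```

Real pipeline tools add what this sketch omits: scheduling, retries, persistence of intermediate outputs, and distributed execution.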

Cloud Platforms

General Architecture

Core Parts

Feature Store

  • Reusable features for both online and offline usage; you can also monitor and validate the results
  • Feast is open source
    • Each feature view contains stored features, an entity (such as a primary key), and a data source
    • To join features from different feature views, you can use the shared entity
  • Tecton is a paid tool
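The entity-based join described above can be sketched in plain Python: two feature views keyed by the same entity are combined by looking up that key in each. The view names, feature names, and `get_features` helper below are hypothetical illustrations, not Feast's or Tecton's actual API.

```python
# Two hypothetical "feature views", both keyed by the same entity
# (driver_id acts like a primary key).
driver_stats = {
    101: {"avg_daily_trips": 14, "conv_rate": 0.52},
    102: {"avg_daily_trips": 9,  "conv_rate": 0.61},
}
driver_profile = {
    101: {"tenure_days": 400},
    102: {"tenure_days": 120},
}

def get_features(entity_id, *views):
    """Join features across views on the shared entity key."""
    row = {"driver_id": entity_id}
    for view in views:
        row.update(view.get(entity_id, {}))
    return row

print(get_features(101, driver_stats, driver_profile))
# -> {'driver_id': 101, 'avg_daily_trips': 14, 'conv_rate': 0.52, 'tenure_days': 400}
```

A real feature store layers point-in-time correctness, online/offline serving, and validation on top of this basic join.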

Param Tuning (HPO)

  • Hyperopt
  • Optuna
    • Quick start - shows how to tune params across multiple estimators and param spaces, with nice visualizations too
    • Optuna vs Hyperopt
      • It compares the 2 tools across several important aspects, and in every aspect Optuna appears to be better overall
      • Optuna's TPE appears better than Hyperopt's Adaptive TPE
    • Using Optuna with different models
  • Keras Tuner
    • It can be used to tune neural networks; the user interface is similar to Optuna's
    • Different types of tuners: Hyperband, Bayesian optimization, and random search
    • It also provides a tuner for sklearn models, sklearn tuner
  • FLAML
    • In some cases, FLAML can be more efficient than Optuna at param tuning, and can even deliver better testing performance in a shorter time
    • It developed 2 search algorithms (CFO, BlendSearch); CFO runs faster with higher testing performance in many cases
  • Bayesian Optimization
    • It uses information from past trials to select params for the next model
    • "This is a constrained global optimization package built upon bayesian inference and gaussian process, that attempts to find the maximum value of an unknown function in as few iterations as possible. This technique is particularly suited for optimization of high cost functions, situations where the balance between exploration and exploitation is important."
    • Example for tuning CatBoost, LightGBM, XGBoost
    • An example
      • Bayes_opt may not be faster than Hyperopt, but you can stop whenever you want and keep the current best results. It also shows the tuning progress, including which value was selected in each trial
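All of the HPO tools above share the same core loop: define an objective over a search space, run trials, and keep the best result. The pure-Python random-search sketch below shows that loop; the objective function, param names, and search ranges are toy examples, not any library's API. TPE, Bayesian optimization, and CFO differ mainly in replacing the random sampler with one that biases toward past good trials.

```python
import random

def objective(params):
    """Toy 'validation loss', minimized near learning_rate=0.1,
    n_estimators=200 (a made-up optimum for illustration)."""
    return (params["learning_rate"] - 0.1) ** 2 + (params["n_estimators"] - 200) ** 2 / 1e4

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        # Sample one candidate from the search space per trial.
        params = {
            "learning_rate": rng.uniform(0.001, 0.3),
            "n_estimators": rng.randrange(50, 500),
        }
        loss = objective(params)
        if loss < best_loss:          # keep the incumbent best
            best_params, best_loss = params, loss
    return best_params, best_loss

best_params, best_loss = random_search(200)
print(best_params, round(best_loss, 4))
```

Because the best-so-far is tracked every trial, you can stop early and still keep a usable result, which is the same property noted for bayes_opt above.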

Model Selection

  • Google Model Search
    • "The Model Search system consists of multiple trainers, a search algorithm, a transfer learning algorithm and a database to store the various evaluated models. The system runs both training and evaluation experiments for various ML models (different architectures and training techniques) in an adaptive, yet asynchronous fashion. While each trainer conducts experiments independently, all trainers share the knowledge gained from their experiments. At the beginning of every cycle, the search algorithm looks up all the completed trials and uses beam search to decide what to try next. It then invokes mutation over one of the best architectures found thus far and assigns the resulting model back to a trainer."
    • Model Search Intro
  • MLJAR is a nice AutoML tool
    • Besides EDA, model selection, and param tuning, it stacks models at the end to achieve better results
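The final stacking step can be illustrated with a minimal blending sketch: combine two base models' predictions and pick the blend weight that minimizes error on a holdout set. The models, predictions, and targets below are toy numbers, not MLJAR's implementation (which trains a meta-model over many base learners).

```python
# Holdout targets plus two base models' predictions on the same rows.
holdout_y     = [1.0, 2.0, 3.0, 4.0]
model_a_preds = [0.8, 2.2, 2.9, 4.3]   # hypothetical base model A
model_b_preds = [1.3, 1.7, 3.2, 3.8]   # hypothetical base model B

def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def blend(w):
    """Weighted average of the two base models: w*A + (1-w)*B."""
    return [w * a + (1 - w) * b for a, b in zip(model_a_preds, model_b_preds)]

# Grid-search the blend weight on the holdout set.
best_w = min((w / 100 for w in range(101)),
             key=lambda w: mse(blend(w), holdout_y))

print(best_w, mse(blend(best_w), holdout_y))
```

Because the two toy models' errors point in opposite directions, the blend beats either model alone, which is the intuition behind stacking diverse models.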

Security Threats to Machine Learning Systems