---
description: >-
  Coming up with features is difficult, time-consuming, requires expert
  knowledge. "Applied machine learning" is basically feature engineering. —
  Andrew Ng
---

# Feature Engineering

## Useful Approaches

### Automated

### Scaling

scikit-learn:

> Indeed many estimators are designed with the assumption that each feature takes values close to zero or more importantly that all features vary on comparable scales. In particular, metric-based and gradient-based estimators often assume approximately standardized data (centered features with unit variances). A notable exception are decision tree-based estimators that are robust to arbitrary scaling of the data.
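A minimal sketch of this kind of standardization using scikit-learn's `StandardScaler`; the toy feature matrix is invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales,
# e.g. age in years vs. income in dollars.
X = np.array([[25,  40_000],
              [32,  85_000],
              [47, 120_000],
              [51,  62_000]], dtype=float)

# Center each feature and scale it to unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```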

machinelearningmastery.com:

> **Decompose Categorical Attributes**
>
> Imagine you have a categorical attribute, like “Item_Color”, that can be Red, Blue or Unknown.
>
> Unknown may be special, but to a model, it looks like just another color choice. It might be beneficial to better expose this information.
>
> You could create a new binary feature called “Has_Color” and assign it a value of “1” when an item has a color and “0” when the color is unknown.
>
> Going a step further, you could create a binary feature for each value that Item_Color has. This would be three binary attributes: Is_Red, Is_Blue and Is_Unknown.
>
> These additional features could be used instead of the Item_Color feature (if you wanted to try a simpler linear model) or in addition to it (if you wanted to get more out of something like a decision tree).
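A sketch of both ideas with pandas; the tiny DataFrame is a made-up stand-in for the example above, with `None` marking an unknown color:

```python
import pandas as pd

df = pd.DataFrame({"Item_Color": ["Red", "Blue", None, "Red"]})

# Binary flag: 1 when an item has a known color, 0 when it is unknown.
df["Has_Color"] = df["Item_Color"].notna().astype(int)

# One binary indicator per value: Is_Red, Is_Blue, Is_Unknown.
df["Item_Color"] = df["Item_Color"].fillna("Unknown")
df = pd.concat([df, pd.get_dummies(df["Item_Color"], prefix="Is")], axis=1)
print(df)
```

Scikit-learn's `OneHotEncoder` achieves the same decomposition when you need it inside a preprocessing pipeline.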

2nd place in a Kaggle competition:

> I calculated the lag between "date_first_booking" and "date_account_created" and divided this lag feature into four categories (0, [1, 365], [-349, 0), NA).
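A rough sketch of how such a lag feature could be built with pandas; the bucket boundaries follow the quote, and the small DataFrame is invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "date_account_created": ["2014-01-01", "2014-03-10", "2014-06-05", "2014-07-01"],
    "date_first_booking":   ["2014-01-01", "2014-04-02", None,         "2013-07-20"],
})

# Lag in days between first booking and account creation
# (NaT for users who never booked becomes NaN here).
lag_days = (pd.to_datetime(df["date_first_booking"])
            - pd.to_datetime(df["date_account_created"])).dt.days

def lag_bucket(days):
    # The four categories from the quote: 0, [1, 365], [-349, 0), NA.
    if pd.isna(days):
        return "NA"
    if days == 0:
        return "0"
    if 1 <= days <= 365:
        return "[1, 365]"
    if -349 <= days < 0:
        return "[-349, 0)"
    return "other"  # lags outside the quoted ranges, if any

df["booking_lag_category"] = lag_days.map(lag_bucket)
print(df[["date_account_created", "date_first_booking", "booking_lag_category"]])
```

Binning the raw lag into a handful of categories like this lets a model treat "booked immediately", "booked later", "booked before signing up", and "never booked" as qualitatively different behaviors rather than points on one numeric scale.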