There is some debate about the relative merits of these approaches, and some models can handle label-encoded categorical variables without issue. Here is a good Stack Overflow discussion. I think (and this is just a personal opinion) that for categorical variables with many classes, one-hot encoding is the safest approach because it does not impose arbitrary numeric values on the categories. The main downside of one-hot encoding is that the number of features (dimensions of the data) can explode when a variable has many categories. To deal with this, we can perform one-hot encoding followed by PCA or another dimensionality reduction method to reduce the number of dimensions (while still trying to preserve information).
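A minimal sketch of the "one-hot, then reduce" idea, on made-up data. The `toy` frame, the `city` column, and the choice of 2 components are all assumptions for illustration. PCA is done here by hand with an SVD in NumPy to keep the example dependency-light; in practice something like scikit-learn's `PCA` would be the usual choice.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one categorical column with five distinct levels.
toy = pd.DataFrame(
    {"city": ["NYC", "LA", "SF", "NYC", "Boston", "LA", "Austin", "SF"]}
)

# One-hot encode: one 0/1 column per distinct category.
one_hot = pd.get_dummies(toy["city"], dtype=float)

# PCA via SVD on the mean-centered one-hot matrix.
matrix = one_hot.to_numpy()
centered = matrix - matrix.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

n_components = 2  # arbitrary choice for illustration
reduced = centered @ components[:n_components].T

# Fraction of the total variance kept by the retained components.
explained = (singular_values[:n_components] ** 2).sum() / (singular_values ** 2).sum()

print(one_hot.shape, "->", reduced.shape)
print(f"variance retained: {explained:.2f}")
```

With only a handful of categories the reduction is not worth it; the trade-off starts to matter when a column has hundreds or thousands of levels, and the retained-variance figure gives a rough sense of how much information is being thrown away.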
Per a MOOC:
- Label and frequency encodings are often used for tree-based models
- One-hot encoding is often used for non-tree-based models (e.g. kNN, neural networks)
This is a good summary of the common strategies, but I am not sure how helpful they really are, as Jeremy didn't talk about them.
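To make the three encodings concrete, here is a small sketch on a made-up `size` column (the data and the column names `size_label` and `size_freq` are assumptions for illustration, not from the course):

```python
import pandas as pd

toy = pd.DataFrame({"size": ["Small", "Large", "Medium", "Small", "Small", "Large"]})

# Label encoding: each category gets an integer code, in order of first appearance.
label_codes, uniques = pd.factorize(toy["size"])
toy["size_label"] = label_codes

# Frequency encoding: each category is replaced by its share of the rows.
freq = toy["size"].value_counts(normalize=True)
toy["size_freq"] = toy["size"].map(freq)

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(toy["size"], prefix="size", dtype=int)

print(toy)
print(one_hot)
```

Label and frequency encoding each keep a single column, which tree-based models can split on directly. One-hot encoding avoids implying any order or distance between categories, which matters more for distance-based or linear models.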
categorical_feats = [
    f for f in data.columns if data[f].dtype == 'object'
]
categorical_feats
for f_ in categorical_feats:
    # Replace each string with an integer code (missing values become -1)
    data[f_], _ = pd.factorize(data[f_])
    # Set feature type as categorical
    data[f_] = data[f_].astype('category')
cols_to_exclude = ['Program_Year', 'Date_of_Payment', 'Payment_Publication_Date']
for col in df.columns:
    # Treat low-cardinality columns as categorical, except the excluded ones
    if df[col].nunique() < 600 and col not in cols_to_exclude:
        df[col] = df[col].astype('category')
Olivier Grellier, Senior Data Scientist at H2O.ai, does so.
- We can define a custom sort order which can improve summarizing and reporting the data. In the example above, “X-Small” < “Small” < “Medium” < “Large” < “X-Large”. Alphabetical sorting would not be able to reproduce that order.
- Some of the Python visualization libraries can interpret the categorical data type to apply appropriate statistical models or plot types.
- Categorical data uses less memory which can lead to performance improvements.
But make sure you define all the possible categories; otherwise, any value you didn't define will become NaN. Search for "Let’s build" in this article for details.
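A short sketch of both points, the custom sort order and the NaN pitfall. The `sizes` data is made up, and "XX-Large" is deliberately left out of the category list to show what happens to an undefined value:

```python
import pandas as pd

sizes = pd.Series(["Medium", "X-Small", "Large", "Small", "XX-Large"])

# Hypothetical explicit order; "XX-Large" is deliberately not listed.
size_order = ["X-Small", "Small", "Medium", "Large", "X-Large"]
size_dtype = pd.CategoricalDtype(categories=size_order, ordered=True)

as_cat = sizes.astype(size_dtype)

# Sorting follows the declared order, not the alphabet; NaN is placed last.
sorted_sizes = as_cat.sort_values()

# Values missing from the category list were silently converted to NaN.
n_missing = as_cat.isna().sum()

print(sorted_sizes.tolist())
print("undefined values turned into NaN:", n_missing)
```

Because the conversion raises no error, it is worth comparing `isna().sum()` before and after the cast to catch categories that were forgotten.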
Per Jeremy, putting the categories in a sensible order is not very important, but it is good to do.
# `inplace` was removed from set_categories in pandas 2.0, so assign the result back
df_raw['UsageBand'] = df_raw['UsageBand'].cat.set_categories(
    ['High', 'Medium', 'Low'], ordered=True)