Categorical Variables

There is some debate about the relative merits of label encoding and one-hot encoding, and some models can handle label-encoded categorical variables with no issues. Here is a good Stack Overflow discussion. I think (and this is just a personal opinion) that for categorical variables with many classes, one-hot encoding is the safest approach because it does not impose arbitrary values on the categories. The main downside of one-hot encoding is that the number of features (dimensions of the data) can explode when a variable has many categories. To deal with this, we can follow one-hot encoding with PCA or another dimensionality reduction method to reduce the number of dimensions (while still trying to preserve information).
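As a rough illustration of one-hot encoding followed by dimensionality reduction, here is a minimal sketch. The `toy` frame, the `city` column, and the choice of two components are made up for the example, and PCA is computed directly with NumPy's SVD on the mean-centered matrix rather than with a dedicated PCA class.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical column with five distinct classes.
toy = pd.DataFrame(
    {"city": ["Paris", "Tokyo", "Lima", "Paris", "Cairo", "Tokyo", "Lima", "Oslo"]}
)

# One-hot encode: one 0/1 column per distinct category.
one_hot = pd.get_dummies(toy["city"], dtype=float)

# PCA via SVD of the mean-centered one-hot matrix.
matrix = one_hot.to_numpy()
centered = matrix - matrix.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

n_components = 2  # arbitrary choice for illustration
reduced = centered @ components[:n_components].T

# Share of the total variance retained by the kept components.
explained = (singular_values[:n_components] ** 2).sum() / (singular_values ** 2).sum()

print(one_hot.shape, reduced.shape)
print(f"variance retained: {explained:.2f}")
```

Checking the retained variance is one way to judge how much information the reduction throws away; with real data the number of components would be tuned rather than fixed at two.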

Per a MOOC:

  • Label and frequency encodings are often used for tree-based models
  • One-hot encoding is often used for non-tree-based models (e.g. kNN, neural networks)

This is a good summary of the common strategies, but I am not sure how helpful they really are, as Jeremy didn't talk about them.
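For reference, label encoding and frequency encoding can be sketched in a few lines of pandas. The `toy` frame and the `color` column are invented for the example; this is only one common way to compute each encoding.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"color": ["red", "blue", "red", "green", "red", "blue"]})

# Label encoding: one integer code per category, in order of first appearance.
toy["color_label"], uniques = pd.factorize(toy["color"])

# Frequency encoding: replace each category with its share of the rows.
freq = toy["color"].value_counts(normalize=True)
toy["color_freq"] = toy["color"].map(freq)

print(toy)
```

Note that frequency encoding maps two categories to the same value whenever they occur equally often, which may or may not matter for a given model.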

Change the data type from Object to Category

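import pandas as pd

# Collect the columns stored with the generic object dtype (typically strings)
categorical_feats = [
    f for f in data.columns if data[f].dtype == 'object'
]

categorical_feats  # inspect the selected columns (notebook display)
for f_ in categorical_feats:
    # Replace each distinct value with an integer code (missing values become -1)
    data[f_], _ = pd.factorize(data[f_])
    # Set feature type as categorical
    data[f_] = data[f_].astype('category')

# Alternative: convert every low-cardinality column, skipping the year and date columns
cols_to_exclude = ['Program_Year', 'Date_of_Payment', 'Payment_Publication_Date']
for col in df.columns:
    if df[col].nunique() < 600 and col not in cols_to_exclude:
        df[col] = df[col].astype('category')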

Olivier Grellier, Senior Data Scientist at H2O.ai, does so.

Benefit

  • We can define a custom sort order, which can improve summarizing and reporting the data. For example, “X-Small” < “Small” < “Medium” < “Large” < “X-Large”. Alphabetical sorting would not be able to reproduce that order.
  • Some of the Python visualization libraries can interpret the categorical data type to apply appropriate statistical models or plot types.
  • Categorical data uses less memory, which can lead to performance improvements.
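The sort-order and memory points can be checked with a small sketch. The `sizes` series is made up for the example, and the exact byte counts will vary with the pandas version and string storage.

```python
import pandas as pd

# Hypothetical data: 5,000 size labels drawn from five distinct values.
sizes = pd.Series(["Medium", "X-Small", "Large", "Small", "X-Large"] * 1000)

# Declare the logical order explicitly.
size_order = ["X-Small", "Small", "Medium", "Large", "X-Large"]
sizes_cat = sizes.astype(pd.CategoricalDtype(categories=size_order, ordered=True))

# Plain strings sort alphabetically; the ordered categorical sorts logically.
alphabetical = sorted(sizes.unique())
logical = list(sizes_cat.sort_values().unique())

# Compare memory footprints (deep=True counts the string contents too).
string_bytes = sizes.memory_usage(deep=True)
category_bytes = sizes_cat.memory_usage(deep=True)

print(alphabetical)
print(logical)
print(string_bytes, category_bytes)
```

The memory saving comes from storing one small integer code per row plus a single copy of each category, so it is largest when a column has few distinct values relative to its length.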

But make sure you define all the possible categories, otherwise any value you didn't define will become NaN. Search for "Let’s build" in this article for details.
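A minimal sketch of that pitfall, using invented size labels: the value "XX-Large" is deliberately left out of the category list, so it is silently converted to NaN.

```python
import pandas as pd

# Hypothetical data containing a value that the category list below omits.
sizes = pd.Series(["Small", "Medium", "Large", "XX-Large"])

# Incomplete category definition: "XX-Large" is missing.
incomplete = pd.CategoricalDtype(categories=["Small", "Medium", "Large"], ordered=True)
converted = sizes.astype(incomplete)

# The undefined value is now NaN, with no warning raised.
n_missing = converted.isna().sum()
print(converted)
print(n_missing)
```

Comparing the number of missing values before and after the conversion is a cheap way to catch categories that were forgotten.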

Set the order

Per Jeremy, setting the order is not essential, but it is good practice.

# The inplace argument was removed in pandas 2.0, so assign the result back
df_raw['UsageBand'] = df_raw['UsageBand'].cat.set_categories(
    ['High', 'Medium', 'Low'], ordered=True)
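To see what setting the order changes, here is a small sketch on a made-up `df_raw` with a `UsageBand` column. It assigns the result back to the column, since recent pandas versions no longer accept `inplace` for this method.

```python
import pandas as pd

# Hypothetical stand-in for the real df_raw, including one missing value.
df_raw = pd.DataFrame({"UsageBand": ["Low", "High", "Medium", "Low", None]})
df_raw["UsageBand"] = df_raw["UsageBand"].astype("category")

# By default the categories are sorted alphabetically.
default_categories = list(df_raw["UsageBand"].cat.categories)

# Impose the logical order and mark the categorical as ordered.
df_raw["UsageBand"] = df_raw["UsageBand"].cat.set_categories(
    ["High", "Medium", "Low"], ordered=True
)

# The integer codes follow the declared order; missing values get -1.
codes = list(df_raw["UsageBand"].cat.codes)

print(default_categories)
print(codes)
```

The codes are what a model sees once the column is converted to numbers, so a sensible order gives the integers a consistent meaning instead of an alphabetical accident.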