
Feature Column or Keras Preprocessing Layer

Qinlong Wang edited this page Feb 20, 2020 · 8 revisions


Problem

There are two options for feature engineering in TensorFlow: the feature column API and Keras preprocessing layers (for both numeric inputs and categorical inputs).

In the data analysis and transform design, we proposed some transform functions to extend the COLUMN syntax. We will generate the Python code for feature engineering from the COLUMN clause. Here we discuss which API the generated code should be built upon: feature columns or Keras preprocessing layers.

Long Term Trend From Open Source Community

In the motivation part of the RFC named Keras Category Inputs, we can see that the community plans to develop Keras preprocessing layers to replace the feature column API. These layers will be released in TF 2.2.
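As a rough intuition for what such a layer does, a categorical preprocessing layer fuses vocabulary lookup and encoding into a single Keras layer. Below is a minimal NumPy sketch of that computation, not the real Keras layer API:

```python
import numpy as np

# Minimal sketch (plain NumPy, not the real Keras preprocessing API)
# of what a categorical preprocessing layer computes: map each string
# through a vocabulary to an integer id, then one-hot encode the id.
def lookup_and_one_hot(values, vocabulary):
    index = {word: i for i, word in enumerate(vocabulary)}
    ids = np.array([index[v] for v in values])
    return np.eye(len(vocabulary), dtype=np.float32)[ids]

print(lookup_and_one_hot(['R', 'G', 'B'], ['R', 'G', 'B']))
# each row is the one-hot encoding of the corresponding input string
```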

Three pain points of feature columns are mentioned in that RFC. The following points are copied from the RFC:

* Users have to define both feature columns and Keras Inputs for the model, resulting in code duplication and deviation from DRY (Do not repeat yourself) principle. See this [Github issue](https://github.com/tensorflow/tensorflow/issues/27416).
* Users with large dimension categorical inputs will incur large memory footprint and computation cost, if wrapped with indicator column through `tf.keras.layers.DenseFeatures`.
* Currently there is no way to correctly feed Keras linear model or dense layer with multivalent categorical inputs or weighted categorical inputs.
  1. Users have to define both feature columns and Keras Inputs for the model, resulting in code duplication and deviation from DRY (Do not repeat yourself) principle. See this Github issue.

In the next code snippet, we use feature columns to transform features for a DNN model, and we have to use the feature names (e.g. "color", "frequencies") to define both the feature columns and tf.keras.Input. What's more, some feature columns are derived from other feature columns, and we do not need to create tf.keras.Input for them, like indicator_column in the next code snippet.

Code Snippet 1

import numpy as np
import tensorflow as tf

color_column = tf.feature_column.categorical_column_with_vocabulary_list(
    'color', ['R', 'G', 'B'], dtype=tf.string, default_value=-1
)
color_input = tf.keras.Input(name='color', shape=(1,), dtype=tf.string)

weighted_column = tf.feature_column.weighted_categorical_column(
    categorical_column=color_column, weight_feature_key='frequencies'
)
frequencies_input = tf.keras.Input(name='frequencies', shape=(1,), dtype=tf.float32)

indicator_column = tf.feature_column.indicator_column(weighted_column)

inputs = {
    'color': color_input,
    'frequencies': frequencies_input
}
feature_layer = tf.keras.layers.DenseFeatures(indicator_column)
feature_value = feature_layer(inputs)
dense = tf.keras.layers.Dense(1)(feature_value)

model = tf.keras.Model(inputs=inputs, outputs=dense)
model.compile(optimizer='sgd', loss='mse')

x = {
    'color': tf.constant([['R'],['G'],['B']]),
    'frequencies': tf.constant([[0.11],[0.23],[0.87]])
}
y = tf.constant([[1], [0], [0]])
model.fit(x, y, epochs=5)
  2. Users with large-dimension categorical inputs will incur a large memory footprint and computation cost if wrapped with an indicator column through tf.keras.layers.DenseFeatures.

indicator_column can be used to wrap any categorical column or crossed_column to represent a multi-hot encoding of the given column. However, the multi-hot representation using a dense matrix incurs a large memory footprint.

Code Snippet 2

color_column = tf.feature_column.categorical_column_with_vocabulary_list(
    'color', ['R', 'G', 'B'], dtype=tf.string, default_value=-1
)
color_one_hot = tf.feature_column.indicator_column(color_column)

If the values of color are [['R'], ['G'], ['B']], the output of indicator_column is

np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1]
])

The output will be very sparse (mostly zeros) if the vocabulary size of vocabulary_list is very large.
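To make the footprint concrete, here is a back-of-the-envelope comparison; the batch and vocabulary sizes below are made up for illustration:

```python
import numpy as np

# Assumed sizes for illustration: 1,000 examples, 100,000-entry vocabulary.
batch, vocab = 1000, 100000

# A dense multi-hot matrix from indicator_column stores every zero
# explicitly: batch * vocab float32 values.
dense_bytes = batch * vocab * np.dtype(np.float32).itemsize
print(dense_bytes)  # 400000000 bytes, ~400 MB of mostly zeros

# A sparse (id, value) representation stores only the nonzeros; with
# one active id per example, that is an int64 id plus a float32 value.
sparse_bytes = batch * (np.dtype(np.int64).itemsize + np.dtype(np.float32).itemsize)
print(sparse_bytes)  # 12000 bytes, ~12 KB
```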

  3. Currently there is no way to correctly feed Keras linear model or dense layer with multivalent categorical inputs or weighted categorical inputs.

For example, in NLP we may represent text documents as a collection of word frequencies. If we want to feed a Keras linear model or dense layer with weighted categorical inputs, we have to wrap weighted_categorical_column with indicator_column or embedding_column for DenseFeatures, because DenseFeatures cannot accept weighted_column directly. In Code Snippet 1, we showed an example with weighted categorical inputs that uses indicator_column to wrap weighted_categorical_column to feed a Keras linear model. However, this solution suffers from the 2nd problem above.

Another way is to use embedding_column instead of indicator_column to wrap weighted_categorical_column for DenseFeatures, which avoids the 2nd problem above.

Code Snippet 3

import numpy as np
import tensorflow as tf

color_column = tf.feature_column.categorical_column_with_vocabulary_list(
    'color', ['R', 'G', 'B'], dtype=tf.string, default_value=-1
)
weighted_column = tf.feature_column.weighted_categorical_column(
    categorical_column=color_column, weight_feature_key='frequencies'
)
embedding_column = tf.feature_column.embedding_column(
    weighted_column, dimension=1
)

inputs = {
    'color': tf.keras.Input(name='color', shape=(1,), dtype=tf.string),
    'frequencies': tf.keras.Input(name='frequencies', shape=(1,), dtype=tf.float32)
}
feature_layer = tf.keras.layers.DenseFeatures(embedding_column)
feature_value = feature_layer(inputs)

model = tf.keras.Model(inputs=inputs, outputs=feature_value)

In this code snippet, we specify dimension=1 to keep the same logic as tf.keras.layers.Dense(1)(feature_value) in Code Snippet 1. However, if Code Snippet 1 specified an activation function, such as tf.keras.layers.Dense(1, activation="relu")(feature_value), we would have to apply the activation function to the output of DenseFeatures with embedding_column ourselves, which may be tedious.

Code Snippet 4

feature_layer = tf.keras.layers.DenseFeatures(embedding_column)
feature_value = feature_layer(inputs)
relu = tf.keras.activations.relu(feature_value)
model = tf.keras.Model(inputs=inputs, outputs=relu)
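For intuition about why dimension=1 without an activation matches Dense(1): the tensor that indicator_column produces for the weighted inputs of Code Snippet 1 is a multi-hot row in which each active category carries its weight. A NumPy sketch of that output, not the feature-column implementation:

```python
import numpy as np

# Sketch of the weighted multi-hot that indicator_column produces over
# a weighted_categorical_column: out[row, id] = weight of that id.
def weighted_multi_hot(ids, weights, vocab_size):
    out = np.zeros((len(ids), vocab_size), dtype=np.float32)
    for row, (i, w) in enumerate(zip(ids, weights)):
        out[row, i] = w
    return out

# ids 0/1/2 correspond to 'R'/'G'/'B'; weights come from 'frequencies'.
print(weighted_multi_hot([0, 1, 2], [0.11, 0.23, 0.87], 3))
```

Multiplying such a row by a Dense(1) kernel is the same weighted sum that an embedding_column with dimension=1 computes, which is why the two snippets match.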

How to develop common models with feature columns and Keras preprocessing layers

  1. DNN
  2. Wide And Deep
  3. DeepFM

What should we do to cover the transform functions in SQLFlow

Feature Column

  1. Add a new concat_column for the CONCAT transform function.
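concat_column does not exist in tf.feature_column; the sketch below only illustrates the behavior we assume for CONCAT, namely joining the outputs of other transformed columns along the feature axis:

```python
import numpy as np

# Hypothetical sketch of the proposed concat_column: CONCAT joins the
# outputs of previously transformed columns along the feature axis.
# Both the assumed behavior and the name are illustrative, not a real API.
def concat_transform(transformed_columns):
    return np.concatenate(transformed_columns, axis=1)

one_hot = np.array([[1.0, 0.0, 0.0]])   # e.g. an indicator_column output
numeric = np.array([[0.5]])             # e.g. a numeric_column output
print(concat_transform([one_hot, numeric]))  # [[1.  0.  0.  0.5]]
```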

Keras Preprocessing Layer

  1. The built-in preprocessing layers will be released in TF 2.2. For versions earlier than 2.2, we will implement layers with the same definitions; for versions 2.2 and later, we will use the built-in layers directly.
  2. The built-in layers won't cover the CONCAT function, so we will provide this layer in the ElasticDL pip package.
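The version gate in point 1 can be sketched as follows; the helper name is hypothetical, and only the version-comparison logic is illustrated:

```python
# Sketch of how the TF-version gate could be implemented; the function
# name use_builtin_layer is hypothetical.
def use_builtin_layer(tf_version):
    # Compare the (major, minor) prefix of the version string.
    major, minor = (int(x) for x in tf_version.split(".")[:2])
    return (major, minor) >= (2, 2)

print(use_builtin_layer("2.1.0"))  # False: fall back to our own layers
print(use_builtin_layer("2.2.0"))  # True: use the TF built-in layers
```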

Integration with SQLFlow Model Zoo