
TFT Learning Notes


Starting from the Tutorial

Following this TensorFlow Transform (TFT) tutorial, let us review what we need to do to make our work at Alibaba compatible with the work of the TensorFlow community. To make the dependencies clear, let us walk through this tutorial in reverse order.

Train, Evaluate, and Export Model

This step of the tutorial shows the following points (a rough sketch follows the list):

  1. To train the model, we need to create an Estimator variable, which depends on

    1. feature columns,
    2. train_input_fn,
    3. eval_input_fn.
  2. To serve the model, we need to save the trained model, together with the serving_input_fn, into export_model_dir, which depends on the creation of

    1. serving_input_fn.
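A rough sketch of this step, loosely following the tutorial; the model class (tf.estimator.LinearClassifier), the file lists, and the step counts are placeholder assumptions, while get_feature_columns, _make_training_input_fn, and _make_serving_input_fn are the helpers discussed in the next section. train_data_file, working_dir, and export_model_dir are assumed to be defined as in the tutorial.

    import tensorflow as tf
    import tensorflow_transform as tft

    tf_transform_output = tft.TFTransformOutput(working_dir)

    # 1. Estimator built from the feature columns (next section).
    estimator = tf.estimator.LinearClassifier(
        feature_columns=get_feature_columns(tf_transform_output))

    # 2. Train and evaluate on the transformed TFRecord files.
    estimator.train(
        input_fn=_make_training_input_fn(
            tf_transform_output, transformed_train_files, batch_size=128),
        max_steps=1000)
    estimator.evaluate(
        input_fn=_make_training_input_fn(
            tf_transform_output, transformed_test_files, batch_size=1),
        steps=100)

    # 3. Export for serving: the serving_input_fn applies transform_fn to raw
    #    examples, so the server receives untransformed data.
    serving_input_fn = _make_serving_input_fn(tf_transform_output)
    estimator.export_saved_model(export_model_dir, serving_input_fn)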

Wrap our input data in FeatureColumns

This step explains the creation of feature columns.

The get_feature_columns function requires:

  1. tf_transform_output = tft.TFTransformOutput(working_dir), which appears to be generated by tft_beam.WriteTransformFn(working_dir) [outfile], and
  2. the Python dictionary of column names to types.

For each column in the Python dictionary (a sketch of get_feature_columns follows this list),

  1. if it is marked float32, the get_feature_columns function calls tf.feature_column.numeric_column;
  2. if it is marked string, it calls tf.feature_column.categorical_column_with_vocabulary_file, where the vocabulary files come from tf_transform_output.vocabulary_file_by_name, which implies that the training-data transformation writes analysis outputs to the filesystem [outfile].
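A minimal sketch of get_feature_columns along these lines; the COLUMN_TYPES dictionary below is a hypothetical stand-in for the tutorial's real column-name-to-type mapping, and tf_transform_output is the tft.TFTransformOutput(working_dir) object from the previous section.

    import tensorflow as tf

    # Hypothetical column-name-to-type dictionary; the real one comes from
    # the tutorial's schema definition.
    COLUMN_TYPES = {
        'age': 'float32',
        'hours_per_week': 'float32',
        'education': 'string',
        'occupation': 'string',
    }

    def get_feature_columns(tf_transform_output):
      """Builds feature columns from the outputs TFT wrote to working_dir."""
      columns = []
      for name, dtype in COLUMN_TYPES.items():
        if dtype == 'float32':
          columns.append(tf.feature_column.numeric_column(name))
        elif dtype == 'string':
          columns.append(
              tf.feature_column.categorical_column_with_vocabulary_file(
                  name,
                  # Vocabulary files were written by the analyzers during the
                  # transform step [outfile].
                  vocabulary_file=tf_transform_output.vocabulary_file_by_name(
                      vocab_filename=name)))
      return columns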

Questions

  • Should/could we add feature column APIs to handle complex cases like

    • text-encoded tensors 1,2,3;4,5,6,
    • text-encoded sparse tensors 100,9:12;7,16:5,
    • text-encoded one-hot tensors 100,9;7,16,
    • text/JSON-encoded dictionaries I:1;like:1;apples:1.
  • [outfile] Is it possible to write transformation outputs to MaxCompute instead of a filesystem? Or, would it meet data-security requirements to write them to Pangu?

    • tf_transform_output.transformed_feature_spec() in _make_training_input_fn
    • tf_transform_output.transform_raw_features() in _make_serving_input_fn
    • tf_transform_output.vocabulary_file_by_name() in get_feature_columns
  • How widely is TFT used in Google applications? How easy is it to extend the architecture to handle various cases?

  • tf_transform = tft.TFTransformOutput(working_dir) seems to contain the following:

    1. transformed training and testing data,
    2. analyzer outputs like vocabulary files,
    3. feature specs for use in serving_input_fn,
    4. the encoded transform_fn.
  • How to handle data security? Via the filesystem's security?

  • Formats of the input data to TFT.

    • Where do the Tensor and SparseTensor inputs to preprocessing_fn come from? The tutorial reads raw text with beam.io.ReadFromText(train_data_file); see the pipeline sketch at the end of this page.

    • We might need to read from MaxCompute.

  • How to control the output of transformed data?

    beam.io.WriteToTFRecord

  • What if I want to write transform_fn to a customized filesystem?

    tft_beam.WriteTransformFn(working_dir)

    • glusterfs and others that can be mounted into Kubernetes containers.
    • Pangu and others that can NOT be mounted.
  • The list of TensorFlow Transform Analyzers

    • How easy is it to extend TensorFlow Transform Analyzers?
    • How to determine bucketization boundaries?
  • How well does Flink run on ASI and Kubernetes?

    • How well does Beam run on Flink/Spark?
  • How does TFT implement analyzers like tft.scale_to_0_1(outputs[key])? Does it use Python syntax parser stuff?

  • How to do data augmentation?

    • In image classification, we need to duplicate each training image into multiple copies -- rotated, scaled, and noised. Does TFT force users to do this before training? Can we do it at training time, i.e., expand each data instance into multiple instances?

    • If both training and serving need to get tensors from a lookup service, in addition to the transformed data in the filesystem, do we need to extend TFT Transformers?

    Ref: https://www.tensorflow.org/tfx/guide/transform#transform_and_tensorflow_transform

    • Embedding: converting sparse features (like the integer IDs produced by a vocabulary) into dense features by finding a meaningful mapping from high-dimensional space to low-dimensional space. See the Embeddings unit in the Machine Learning Crash Course for an introduction to embeddings.
    • Vocabulary generation: converting strings or other non-numeric features into integers by creating a vocabulary that maps each unique value to an ID number.
    • Normalizing values: transforming numeric features so that they all fall within a similar range.
    • Bucketization: converting continuous-valued features into categorical features by assigning values to discrete buckets.
    • Enriching text features: producing features from raw data like tokens, n-grams, entities, sentiment, etc., to enrich the feature set.
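For reference, a minimal sketch of the Beam/TFT preprocessing pipeline that several of the questions above refer to (reading input, running analyzers such as normalization, vocabulary generation, and bucketization, and writing the outputs). It is loosely patterned on the census tutorial; the feature names, RAW_DATA_METADATA, and the parse_csv_line helper are assumptions, not code copied from the tutorial, and train_data_file and working_dir are assumed to be defined as in the tutorial.

    import os
    import tempfile

    import apache_beam as beam
    import tensorflow_transform as tft
    import tensorflow_transform.beam as tft_beam

    def preprocessing_fn(inputs):
      """Maps raw feature tensors to transformed ones. Calls like
      tft.scale_to_0_1 are analyzers: they trigger a full pass over the data."""
      outputs = {}
      outputs['age'] = tft.scale_to_0_1(inputs['age'])            # normalization
      outputs['education'] = tft.compute_and_apply_vocabulary(
          inputs['education'], vocab_filename='education')        # vocabulary generation
      outputs['hours_per_week'] = tft.bucketize(
          inputs['hours_per_week'], num_buckets=4)                 # bucketization
      outputs['label'] = inputs['label']
      return outputs

    with beam.Pipeline() as pipeline:
      with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        # Raw input is read from a text file; reading from MaxCompute would
        # need a custom Beam source instead of ReadFromText.
        raw_data = (
            pipeline
            | beam.io.ReadFromText(train_data_file)
            | beam.Map(parse_csv_line))  # parse_csv_line: hypothetical CSV parser

        # RAW_DATA_METADATA (the schema of the raw features) is assumed to be
        # defined elsewhere, as in the tutorial.
        raw_dataset = (raw_data, RAW_DATA_METADATA)
        transformed_dataset, transform_fn = (
            raw_dataset | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
        transformed_data, transformed_metadata = transformed_dataset

        # Transformed examples are written as TFRecord files [outfile].
        coder = tft.coders.ExampleProtoCoder(transformed_metadata.schema)
        _ = (
            transformed_data
            | beam.Map(coder.encode)
            | beam.io.WriteToTFRecord(
                os.path.join(working_dir, 'train_transformed')))

        # The transform graph and analyzer outputs (e.g. vocabulary files) are
        # also written to working_dir, for tft.TFTransformOutput(working_dir)
        # to load at training and serving time.
        _ = transform_fn | tft_beam.WriteTransformFn(working_dir)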