Added DynamicEmbedding RFC #446

rfcs/20230515-DynamicEmbedding.md (250 additions, 0 deletions)

# DynamicEmbedding layer for Keras

Status | Accepted
:------------ | :-----------------------------------------------------------
**RFC #** | 446
**Author(s)** | Divyashree Sreepathihalli ([email protected])
**Sponsor** | Rick Chao ([email protected])
**Updated** | 2023-05-16

## Objective
The objective of this proposal is to introduce the DynamicEmbedding layer to
the Keras ecosystem, providing a native solution for handling
colossal-scale problems in recommendation systems. The proposed solution
facilitates automatic vocabulary building and updates, and dynamic embedding
updates corresponding to evolving input patterns and vocabulary changes.
### Goal
* Works across accelerators (GPU / TPU)
* Works with parameter server strategy (asynchronous distributed training)

* The solution requires minimal user code changes
* Works with batched training and streamed training
* Has performance parity with existing training jobs without dynamic embedding
### Extended goals
* Works with synchronous distributed training

## Motivation
Recommendation systems and search ranking are crucial in powering the largest
revenue streams, such as PCTR/PCVR and video recommendation. However, as
recommendation models have become more complicated, three distinct challenges
need to be addressed: the difficulty of separating popular and less-popular
items, or of adapting to the seasonal cycle of popularity; the lack of a
cross-platform solution for handling ever-larger embedding tables; and the
dynamic nature of large embedding tables that comes from modeling large
unique-id-based features and the crossing features among them.

Currently, there are two ways to handle such limitations in TensorFlow:
* direct hashing without a vocabulary, or
* a pre-computed fixed vocabulary with out-of-vocabulary hashing.

Neither approach gives the user fine-grained control over the
vocab-to-embedding mapping. Hence, this proposal aims to provide a native
solution for handling these challenges by introducing the concept of
DynamicEmbedding.
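
For reference, here is a minimal sketch of what those two existing approaches
look like with stock Keras preprocessing layers (`Hashing` and `StringLookup`
are existing Keras APIs, not part of this proposal; the keys and bin counts are
illustrative):

```
import tensorflow as tf

inputs = tf.constant([["user_123", "item_9", "user_456"]])

# Option 1: direct hashing without a vocabulary. Collisions are possible and
# there is no explicit vocab-to-embedding mapping to inspect or control.
hashed = tf.keras.layers.Hashing(num_bins=1000)(inputs)

# Option 2: a pre-computed, fixed vocabulary with out-of-vocabulary hashing.
# Anything outside the vocab is folded into a small number of OOV buckets.
lookup = tf.keras.layers.StringLookup(
    vocabulary=["user_123", "item_9"], num_oov_indices=5)
indexed = lookup(inputs)
```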

### Why Keras?
We believe that internal and external users share many common pain points.
To support these features, external users today often need to rebuild an
entire suite of APIs, including optimizers, distributed training logic,
and customized TF kernels, to work around TensorFlow restrictions (that
variables are special-cased). As the middle layer of the TF tech stack,
we believe that we are in the best position to work with upstream 1P and
3P users, consolidate feedback, and collaborate to drive a hardware-agnostic
solution.

## User Benefit
This initiative offers several benefits, including:
* Providing a unified TensorFlow solution that allows for productive
  exploration and potential large model-quality gains across different use
  cases.
* Reducing computation cost and training latency by eliminating the need
  for a pre-computed vocab.
* Strengthening TensorFlow's advantage for third-party adoption (Nvidia,
  Spotify, Tencent/Alibaba); see the
  [sparse domain isolation RFC](https://github.com/tensorflow/recommenders-addons/blob/master/rfcs/20200424-sparse-domain-isolation.md)
  and the [vip.com case study](https://drive.google.com/file/d/1UEWtixlA_zucLLkXlmgbF-4DAZHHKNmo/view?resourcekey=0-QXC4KOuQ6_RXuaYiyyRfYQ).

Additionally, many external users that rely on TensorFlow have already
adopted this idea, and open-source libraries have been pushing on this
front - TorchRec with a native embedding distribution and a dynamic
embedding solution & HugeCTR (Merlin) with a highly-performant
embedding caching strategy. This makes it essential for TensorFlow to
introduce a native solution to stay competitive in the market.

## Design Proposal
In this design approach, the DynamicEmbedding layer is composed of two
layers: the DynamicLookup layer and the Embedding layer. The
DynamicLookup layer is responsible for the following tasks:
* Maintaining a vocabulary table, updated based on the input pattern, using an
  eviction policy.
* Performing a vocabulary lookup for the given input and returning integer
  indexes.
* Passing the resulting index to the Embedding layer, which looks up the
  embedding vector.

> **@Lifann** (Jun 5, 2023): We are currently using parameter sizes of about
> 1e13 bytes in production. Will it be very expensive to maintain the
> vocabulary and indexes for parameters that large?

> **@Lifann** (Jun 5, 2023): In many cases, we don't know exactly how many keys
> a feature has, since the properties of videos, images, commodities, video
> games, etc. are always changing. Presetting a vocab/index range may lead to
> wasted storage or feature conflicts.

The Embedding layer is responsible for the following tasks:
+ Looking up the embedding vector for the given integer index.
+ Returning the embedding vector.

The embedding vector is then used by the subsequent layer in the neural
network. The DynamicEmbedding layer is used in conjunction with
UpdateEmbeddingCallback. The callback is triggered at a predetermined time
interval. It aggregates the dynamic vocabulary table across all workers and
updates the vocabulary that is used for input lookup across all workers. This
ensures that the vocabulary is always up to date and that all workers use the
same vocabulary.


![DynamicEmbedding](20230515-DynamicEmbedding/DynamicEmbedding.png)
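
To make the composition above concrete, here is a rough sketch under the
assumption that DynamicEmbedding simply chains a lookup layer and a standard
Embedding layer. The `DynamicLookup` stand-in below uses a fixed `StringLookup`
internally; the real layer would additionally record every key it sees in the
dynamic vocabulary table so the eviction policy can promote or evict keys
later. Class and argument names are illustrative, not the final API:

```
import tensorflow as tf
from tensorflow import keras


class DynamicLookup(keras.layers.Layer):
  """Stand-in for the proposed lookup layer: maps raw keys to integer indexes.

  The real layer would also track every key it sees in a dynamic vocabulary
  table so that the eviction policy can update the vocabulary later.
  """

  def __init__(self, initial_vocabulary, **kwargs):
    super().__init__(**kwargs)
    # Fixed-size lookup; out-of-vocabulary keys map to the OOV index 0.
    self.lookup = keras.layers.StringLookup(
        vocabulary=initial_vocabulary, num_oov_indices=1)

  def call(self, inputs):
    return self.lookup(inputs)


class DynamicEmbedding(keras.layers.Layer):
  """Composition sketch: DynamicLookup (key -> index) + Embedding (index -> vector)."""

  def __init__(self, input_dim, output_dim, initial_vocabulary, **kwargs):
    super().__init__(**kwargs)
    self.dynamic_lookup = DynamicLookup(initial_vocabulary)
    # +1 row for the OOV index produced by StringLookup.
    self.embedding = keras.layers.Embedding(input_dim + 1, output_dim)

  def call(self, inputs):
    indices = self.dynamic_lookup(inputs)
    return self.embedding(indices)


layer = DynamicEmbedding(input_dim=5, output_dim=2,
                         initial_vocabulary=["a", "b", "c", "d", "e"])
out = layer(tf.constant([["a", "z"]]))  # shape (1, 2, 2); "z" falls into OOV
```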

Here is a deeper look at what is done in the DynamicLookup layer and how the
UpdateEmbeddingCallback updates the embeddings and vocabulary.

The DynamicEmbedding layer identifies and adds unique keys to the dynamic
vocabulary table for every input passed to it. This table is constantly
updated based on the eviction policy provided, such as TTL, LFU, or LRU. With
distributed training, the table is maintained on each worker, and the tables
on different workers may differ.

The UpdateEmbeddingCallback is a timed callback that uses a timer thread to
create a callback event when the timer expires. The callback:
* aggregates the dynamic vocabulary table values across all workers in a
  distributed training setup and updates the vocabulary on all workers;
* updates the vocab->index mapping (mutable hash table / tf.Variable) on all
  workers;
* updates/remaps the embedding matrix to reflect the new vocabulary->index
  mapping:
    * old vocab keys keep the same embedding vector;
    * new vocab keys get a newly initialized embedding vector.

This updated vocabulary is used for lookup in the DynamicLookup layer until
the callback event is triggered again after the time interval.

![DynamicLookup](20230515-DynamicEmbedding/DynamicLookup.png)
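
The following is a minimal, single-worker sketch of how such a timed callback
could be structured. It assumes the layer exposes helpers such as
`get_top_vocabulary()`, `set_vocabulary()`, a `vocabulary` attribute, and an
`embedding` sub-layer; these names are illustrative rather than the final API,
the vocabulary size (number of embedding rows) is assumed fixed, and the
cross-worker aggregation step is omitted:

```
import threading
import numpy as np
import tensorflow as tf
from tensorflow import keras


class UpdateEmbeddingCallback(keras.callbacks.Callback):
  """Sketch of the timed callback; names and internals are illustrative."""

  def __init__(self, dynamic_embedding_layer, interval):
    super().__init__()
    self.layer = dynamic_embedding_layer
    self.interval = interval
    self._timer_fired = threading.Event()
    self._timer = None

  def _start_timer(self):
    self._timer = threading.Timer(self.interval, self._timer_fired.set)
    self._timer.daemon = True
    self._timer.start()

  def on_train_begin(self, logs=None):
    self._start_timer()

  def on_train_batch_end(self, batch, logs=None):
    if not self._timer_fired.is_set():
      return
    self._timer_fired.clear()
    # 1) Aggregate the per-worker dynamic vocabulary tables (single worker
    #    here, so we just read this layer's table) and pick the new vocab.
    new_vocab = self.layer.get_top_vocabulary()      # assumed helper
    old_vocab = list(self.layer.vocabulary)          # assumed attribute
    old_matrix = self.layer.embedding.embeddings.numpy()
    # 2) Remap the embedding matrix: old keys keep their rows, new keys get
    #    freshly initialized rows. Row count stays fixed in this sketch.
    new_matrix = np.random.uniform(
        -0.05, 0.05, size=old_matrix.shape).astype(old_matrix.dtype)
    for new_idx, key in enumerate(new_vocab):
      if key in old_vocab:
        new_matrix[new_idx] = old_matrix[old_vocab.index(key)]
    self.layer.embedding.embeddings.assign(new_matrix)
    # 3) Update the vocab -> index mapping used by DynamicLookup.
    self.layer.set_vocabulary(new_vocab)             # assumed helper
    self._start_timer()

  def on_train_end(self, logs=None):
    if self._timer is not None:
      self._timer.cancel()
```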

The image below illustrates the workflow when the parameter server strategy
(PSS) is used. PSS supports asynchronous training. Each worker has a copy of
the vocabulary, which is consistent across all the workers. Each worker learns
the dynamic vocabulary table independently. At regular intervals, in the
update-embedding callback, the vocabulary table is aggregated from the values
across all the workers. The top-k vocabulary is extracted, and the vocabulary
lookup is updated with these values.

> **@rhdong** (May 18, 2023): Hi @divyashreepathihalli, may I have your
> confirmation here? Does this mean each worker will hold a full copy of the
> vocabulary that maps the vocab to an index, while the real embedding vectors
> are stored on some PSs in a dense format (for example, a tf.Variable)? Am I
> correct? Thank you so much!

> **@divyashreepathihalli:** That is correct. Each worker should have a copy of
> the vocabulary (vocab->index mapping). The embedding variable will be split
> across distributed servers.

> **@rhdong** (May 23, 2023): Hi @divyashreepathihalli, thank you for your
> comment! If we have a full copy of the key-index mapping on each worker,
> there should be an upper limit on the vocabulary size. To the best of my
> knowledge, vocabulary sizes in some industrial scenarios can be tens or
> hundreds of billions, which makes the memory consumption on GPU/TPU
> significantly large and unbearable. One practical solution is storing the
> key-value pairs as an abstract hashtable in a distributed way, as TFRA does.
> Hope it's helpful. Thanks!

> **@divyashreepathihalli:** I agree with you. The proposed design would be the
> initial implementation, and a distributed KV server would definitely be the
> way to go going forward.

![DynamicEmbedding asynchronous training](20230515-DynamicEmbedding/AsyncTraining.png)
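
As a worked example of the aggregation step, assume each worker exposes its
local key -> count table; the callback merges the counts and keeps the top-k
keys as the new shared vocabulary. The mechanism for collecting the per-worker
tables is glossed over here, and the function name is illustrative:

```
import collections


def aggregate_top_k_vocabulary(per_worker_tables, k):
  """per_worker_tables: list of dicts mapping key -> observed count on that worker."""
  totals = collections.Counter()
  for table in per_worker_tables:
    totals.update(table)
  # Keep the k most frequently seen keys as the new shared vocabulary.
  return [key for key, _ in totals.most_common(k)]


# Example: two workers saw overlapping but different keys.
worker_0 = {"a": 10, "b": 3, "x": 1}
worker_1 = {"a": 4, "c": 7, "y": 2}
new_vocab = aggregate_top_k_vocabulary([worker_0, worker_1], k=3)
# new_vocab == ["a", "c", "b"]; this vocabulary is then broadcast to every worker.
```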

## Performance implications
There are two options for a mutable data structure to maintain the dynamic
vocabulary table:
* Mutable hash tables
* Variables with dynamic shapes

Here are some additional details about each option. Mutable hash tables are a
data structure that allows for quick lookups of data. Variables with dynamic
shapes allow a variable to have different shapes at different times, which can
be useful for storing data that is constantly changing, such as the vocabulary
of a language. Right now, with the parameter server strategy, such variables
cannot be placed on parameter servers, and mutable hash tables are always
placed on the chief, which could have performance implications for lookups,
inserts, and updates to the vocabulary. However, if the TensorFlow distribute
side adds support for per-worker variable creation, this performance
implication can be overcome.
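
For the first option, TensorFlow already ships a mutable lookup structure; a
minimal sketch of using it as the vocab -> index mapping follows (the keys and
index values are illustrative):

```
import tensorflow as tf

# Mutable key -> index table; unseen keys return the default value (-1).
vocab_table = tf.lookup.experimental.MutableHashTable(
    key_dtype=tf.string, value_dtype=tf.int64, default_value=-1)

# Insert or update vocab entries as new keys are promoted.
vocab_table.insert(tf.constant(["a", "b", "c"]),
                   tf.constant([0, 1, 2], dtype=tf.int64))

indexes = vocab_table.lookup(tf.constant(["a", "z", "c"]))  # -> [0, -1, 2]
vocab_table.remove(tf.constant(["b"]))  # eviction of a stale key
```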

## Dependencies
The proposed feature does not introduce any new dependencies. It is
a stand-alone feature that can be used with any existing TensorFlow
workflow. There is no need to modify any existing code or workflows
to use this feature.

## Engineering Impact
This feature adds a small time overhead for updating the dynamic vocabulary
table, but this comes with improved model performance and less user
intervention to update the vocabulary and restart training. Training can be
continuous and on real-time data, and the model would continuously keep
updating its vocabulary. This is beneficial because it allows the model to
learn new input patterns, which can improve its accuracy and performance.
Additionally, it reduces the amount of time and effort required to maintain
the model, as the user does not need to manually update the vocabulary table
or restart training every time new data is available. These benefits are
particularly valuable in an online learning setting.

## Platforms and Environments
* GPU, TPU, CPU
* Asynchronous distributed training
* Synchronous distributed training

## Best Practices
The following are the best practices used so far:
* The user needs to stop training the model and update the vocabulary before
  restarting training.
* The vocabulary that needs to be provided to the model has to be generated by
  the user separately.

The DynamicEmbedding layer is a new layer that enables users to train a model
on a dataset with a dynamic vocabulary. This means that the vocabulary can
change over time without the user having to stop training the model and update
the vocabulary. The layer is used just like any other Keras layer. The initial
vocabulary can be provided, or the layer will learn the whole vocabulary on
its own.

## Tutorials and Examples
```
import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras.layers import DynamicEmbedding
# `dynamic_embedding` below refers to the module proposed in this RFC that
# provides UpdateEmbeddingCallback.

train_data = np.array([
    ['a', 'j', 'c', 'd', 'e'],
    ['a', 'h', 'i', 'j', 'b'],
    ['i', 'h', 'c', 'j', 'e'],
])
train_labels = np.array([0, 1, 2])
vocab = tf.constant(['a', 'b', 'c', 'd', 'e'])
eviction_policy = 'LFU'

# Define the model
model = keras.models.Sequential([
    DynamicEmbedding(
        input_dim=5,
        output_dim=2,
        input_length=5,
        eviction_policy=eviction_policy,
        initial_vocabulary=vocab,
    ),
    keras.layers.Flatten(),
    keras.layers.Dense(3, activation='softmax'),
])

# Compile the model
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
)

# Update the vocabulary/embeddings on a timed interval during training.
update_embedding_callback = dynamic_embedding.UpdateEmbeddingCallback(
    model.layers[0],
    interval=2,
)
with update_embedding_callback:
    result = model.fit(
        train_data,
        train_labels,
        epochs=100,
        batch_size=1,
        callbacks=[update_embedding_callback],
    )
```

Review discussion on `input_dim`, dynamic shapes, and dynamic vocabulary size:

> **Reviewer:** Can we support inputs with dynamic shapes?

> **@divyashreepathihalli:** The input to the layer can be dynamic, but if you
> are asking about `input_dim`, which is the same as the vocabulary size, that
> is not dynamic.

> **Reviewer:** Excellent, thank you for your answer! I would like to know what
> `input_dim` means. From my understanding, `input_dim` should be less than or
> equal to the vocabulary size, which is fixed while training goes on, is that
> right?

> **@divyashreepathihalli:** `input_dim` should be the vocabulary size.

> **@rhdong** (May 23, 2023): Thank you for the clarification! If `input_dim`
> and the vocabulary size are not dynamic, some critical scenarios may not be
> supported. Some industrial scenarios of real dynamic embedding require the
> algorithm engineers to use `uint64_t` for the encoded features, which has a
> possible range of [0, std::numeric_limits<uint64_t>::max]. That means
> `input_dim` and the vocabulary size should not be set, because the space is
> almost unlimited.

> **@divyashreepathihalli:** @rhdong, I would like to clarify that for the
> layer initialization `input_dim` is the vocabulary size (kept consistent with
> the [Embedding layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding)).
> The input to the layer can be of any dynamic shape.

> **@rhdong** (May 25, 2023): Hi @divyashreepathihalli, thank you for your
> clarification. I understand now. "The input to the layer can be of any
> dynamic shape" totally makes sense. But I'm afraid that the `input_dim`
> setting would limit the feature-encoding space. In the dynamic embedding
> context (compared to the original static embedding in current TensorFlow),
> `input_dim` should be std::numeric_limits<uint64_t>::max. I will try to
> explain this in a Google doc. Before that, you could refer to the TFRA API
> design, where only the embedding size needs to be configured (similar to
> `output_dim`):
> https://github.com/tensorflow/recommenders-addons/blob/master/tensorflow_recommenders_addons/dynamic_embedding/python/keras/layers/embedding.py#L117

> **@MoFHeka** (May 25, 2023): I think @divyashreepathihalli may slightly
> misunderstand the meaning of dynamic-shape embedding. For example, consider a
> training feature input that is both large-scale and sparse, such as USER_ID.
> If we apply the vocabulary method to USER_ID, it will only map USER_ID to the
> dimension of the vocabulary size, which is a compression of the information
> dimension. Since the vocabulary size is fixed, this is still a static
> embedding. Dynamic embedding means that all inputs can be processed without
> conflicts through a hashmap. The size of the dynamic embedding is not fixed
> and is unpredictable, because USER_ID grows with the growth of the business.

> **@thorneliu** (Jun 2, 2023): Besides the USER_ID example by @MoFHeka, in our
> recommender system we use user & item crossed features to enhance the
> accuracy and relevance of our recommendations. By combining multiple features
> into a unique identifier, we can create a more comprehensive representation
> of the relationship between users and items, resulting in better
> recommendations. When using tf.sparse.cross_hash or xxhash, a sparse key in
> the range of [0, std::numeric_limits<uint64_t>::max] is generated. For such a
> large-scale and sparse feature, a dynamic size is mandatory.

> **@divyashreepathihalli:** @rhdong @MoFHeka thank you for the clarification.
> I tried to read up further. If I understand correctly, you are looking for a
> dynamic vocabulary size and a dynamic embedding matrix as well, correct? One
> that would keep growing?
>
> As of now, our scope of work will be limited to maintaining a fixed-size
> vocabulary and a fixed embedding size, updating the vocabulary based on the
> inputs received by the layer and the eviction policies. The embedding values
> will be remapped whenever the vocabulary is updated based on input patterns
> (most frequently seen input, TTL, etc.). If an input key is not in the vocab,
> it will be mapped to a default value; however, we keep track of these keys
> and add them to the vocab when the updates are done in the callback (new keys
> are added to the vocab by kicking out old keys based on the specified
> eviction policies).

> **@rhdong** (Jun 5, 2023): @divyashreepathihalli It's my pleasure.
> Considering the practical dynamic embedding scenarios we have encountered, a
> hashtable-based dynamic vocabulary size would be a fundamental requirement. I
> guess one of the PROs of your current design is that there is no need to
> modify the tf.optimizer; that makes sense, but in addition to the
> considerations we discussed above, I'm also a little worried it will
> introduce data consistency issues caused by decoupling the embedding indexing
> from the embedding lookup, especially when eviction is involved. Applying
> atomic or lock mechanisms on the ID and embedding is challenging when they
> are operated on in two separate ops.

## Compatibility
This design is forward and backward compatible. The layer should work with
both synchronous and asynchronous distribution strategies. A model with
DynamicEmbedding can be saved and loaded just like any other Keras model, and
the vocabulary will be accessible for users to save and load as well.
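
A short sketch of the intended save/load flow, continuing from the example in
the Tutorials section; whether the layer must be passed via `custom_objects`,
and the `get_vocabulary()` accessor, are assumptions about the final API rather
than confirmed behavior:

```
# Save the trained model, including the DynamicEmbedding layer and its state.
model.save("dynamic_embedding_model.keras")

# Restore it later to continue training or to serve.
restored = keras.models.load_model(
    "dynamic_embedding_model.keras",
    custom_objects={"DynamicEmbedding": DynamicEmbedding},
)

# The learned vocabulary should also be retrievable, e.g. via an accessor
# like the assumed get_vocabulary() below, so users can save/load it separately.
vocab = restored.layers[0].get_vocabulary()
```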

## User Impact
Users will be able to access DynamicEmbedding as a new layer in Keras.
An illustration of how to use this layer is shown above.

## Acknowledgement
The [TensorFlow Recommenders Addon project](https://github.com/tensorflow/recommenders-addons/blob/master/docs/api_docs/tfra/dynamic_embedding.md)
maintained by TensorFlow SIG Recommenders is a community-led project that
aims to solve similar issues currently. This RFC is inspired by both
Google internal use cases as well as the TFRA project. We are thankful
for the contributions from TFRA maintainers (in particular, Haidong
Rong from Nvidia) and welcome future collaborations on this RFC.

## Questions and Discussion Topic