# DynamicEmbedding layer for Keras

Status        | Accepted
:------------ | :-----------------------------------------------------------
**RFC #**     | 446
**Author(s)** | Divyashree Sreepathihalli ([email protected])
**Sponsor**   | Rick Chao ([email protected])
**Updated**   | 2023-05-16

## Objective
The objective of this proposal is to introduce the DynamicEmbedding layer to
the Keras ecosystem, providing a native solution for handling
colossal-scale problems in recommendation systems. The proposed solution
facilitates automatic vocabulary building and updates, as well as dynamic
embedding updates that track evolving input patterns and vocabulary changes.

### Goals
* Works across accelerators (GPU / TPU)
* Works with parameter server strategy (asynchronous distributed training)
* Requires minimal user code changes
* Works with batched training and streamed training
* Has performance parity with existing training jobs without dynamic embedding

### Extended goals
* Works with synchronous distributed training

## Motivation
Recommendation systems and search ranking are crucial in powering the largest
revenue streams, such as PCTR/PCVR and video recommendation. However, as
recommendation models have become more complex, there are three distinct
challenges that need to be addressed:
* the difficulty of separating popular from less-popular items, or of
  adapting to the seasonal cycle of popularity;
* the lack of a cross-platform solution for handling ever-larger
  embedding tables;
* the dynamic nature of large embedding tables that model large unique
  id-based features and the crossing features among them.

Currently, there are two ways to handle such limitations in TensorFlow:
* direct hashing without a vocabulary, or
* a pre-computed fixed vocabulary with out-of-vocabulary hashing.

Neither approach gives the user fine-grained control over the
vocab-embedding mapping. Hence, this proposal aims to provide a native
solution for handling these challenges by introducing the concept of
DynamicEmbedding.

### Why Keras?
We believe that internal and external users share many common pain points.
To support these features, external users today often need to rebuild an
entire suite of APIs, including optimizers, distributed training logic,
and customized TF kernels, to work around TensorFlow restrictions (that
variables are special-cased). As the middle layer of the TF tech stack,
we believe that we are in the best position to work with upstream 1P and
3P users, consolidate feedback, and collaborate to drive a
hardware-agnostic solution.

## User Benefit
This initiative offers several benefits, including:
* Providing a unified TensorFlow solution that allows for productive
  exploration and potential large model-quality gains across different use
  cases.
* Reducing computation cost and training latency by eliminating the need
  for a pre-computed vocabulary.
* Strengthening TensorFlow's advantage for third-party adoption (Nvidia,
  Spotify, Tencent/Alibaba
  (RFC: https://github.com/tensorflow/recommenders-addons/blob/master/rfcs/20200424-sparse-domain-isolation.md),
  vip.com case study
  (https://drive.google.com/file/d/1UEWtixlA_zucLLkXlmgbF-4DAZHHKNmo/view?resourcekey=0-QXC4KOuQ6_RXuaYiyyRfYQ)).

Additionally, many external users that rely on TensorFlow have already
adopted this idea, and open-source libraries have been pushing on this
front: TorchRec with a native embedding distribution and a dynamic
embedding solution, and HugeCTR (Merlin) with a highly performant
embedding caching strategy. This makes it essential for TensorFlow to
introduce a native solution to stay competitive in the market.

## Design Proposal
In this design approach, the DynamicEmbedding layer is composed of two
layers: the DynamicLookup layer and the Embedding layer. The
DynamicLookup layer is responsible for the following tasks:
* Maintaining a vocabulary table, updated based on the input pattern
  using an eviction policy.
* Performing vocabulary lookup for the given input and returning
  integer indexes.

The index is then passed to the Embedding layer, which looks up the
embedding vector. The Embedding layer is responsible for the following
tasks:
* Looking up the embedding vector for the given integer index.
* Returning the embedding vector.

The embedding vector is then used by the subsequent layer in the
neural network. The DynamicEmbedding layer is used in conjunction
with the UpdateEmbeddingCallback. The callback is triggered at a
predetermined time interval. It aggregates the dynamic vocabulary
table across all workers and updates the vocabulary that is used
for input lookup across all workers. This ensures that the vocabulary
is always up-to-date and that all workers are using the same vocabulary.
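
As a rough illustration of this composition, here is a minimal sketch that
uses the existing `StringLookup` layer as a stand-in for the proposed
DynamicLookup layer; the class and attribute names are illustrative only and
are not part of the proposed API:

```
import tensorflow as tf
from tensorflow import keras

class DynamicEmbeddingSketch(keras.layers.Layer):
  """Illustrative composition only: a lookup layer followed by Embedding."""

  def __init__(self, output_dim, initial_vocabulary, **kwargs):
    super().__init__(**kwargs)
    # Stand-in for the proposed DynamicLookup layer: maps raw keys
    # (e.g. strings) to integer indexes. The real layer would keep this
    # vocabulary mutable and apply an eviction policy.
    self.lookup = keras.layers.StringLookup(vocabulary=initial_vocabulary)
    # Standard Embedding layer: maps integer indexes to embedding vectors.
    # input_dim must cover the lookup's full index range (vocab + OOV).
    self.embedding = keras.layers.Embedding(
        input_dim=self.lookup.vocabulary_size(), output_dim=output_dim)

  def call(self, inputs):
    indexes = self.lookup(inputs)   # keys -> integer indexes
    return self.embedding(indexes)  # indexes -> embedding vectors
```

In the actual proposal, the lookup layer's vocabulary table is mutable and is
rewritten by the UpdateEmbeddingCallback, as described below.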

![DynamicEmbedding](/20230515-DynamicEmbedding/DynamicEmbedding.png)

Here is a deeper look at what is done in the DynamicLookup layer and how
the UpdateEmbeddingCallback updates the embeddings and vocabulary.
The DynamicEmbedding layer identifies and adds unique keys to the dynamic
vocabulary table for every input passed to it. This table is constantly
updated based on the eviction policy provided, such as TTL, LFU, or LRU.
The table is maintained on each worker when used with distributed
training, and the tables on different workers may be different.
The UpdateEmbeddingCallback is a timed callback that uses a timed
thread to create a callback event when the timer expires. The callback:
* Aggregates the dynamic vocabulary table values across all workers in a
  distributed training setup and updates the vocabulary on all workers.
* Updates the vocab->index mapping (mutable hash table / tf.Variable) on
  all workers.
* Updates/remaps the embedding matrix to reflect the new
  vocabulary->index mapping:
  * Old vocab keys keep the same embedding vector.
  * New vocab keys get a newly initialized embedding vector.
This updated vocabulary is used for lookup in the DynamicLookup layer
until the callback event is triggered again after the time interval.
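
A minimal sketch of the remap step described above (the helper name and the
initializer are illustrative assumptions, not part of the proposed API): keys
that survive the vocabulary update keep their learned rows, while newly
admitted keys get freshly initialized rows.

```
import numpy as np

def remap_embeddings(old_vocab, new_vocab, old_matrix, seed=0):
  """Remap an embedding matrix after a vocabulary update (illustrative)."""
  rng = np.random.default_rng(seed)
  output_dim = old_matrix.shape[1]
  old_index = {key: i for i, key in enumerate(old_vocab)}
  # Start from freshly initialized rows for every key in the new vocab.
  new_matrix = rng.uniform(-0.05, 0.05, size=(len(new_vocab), output_dim))
  for new_i, key in enumerate(new_vocab):
    old_i = old_index.get(key)
    if old_i is not None:
      # Keys already in the vocabulary keep their learned vector.
      new_matrix[new_i] = old_matrix[old_i]
  return new_matrix

# Example: 'd' is evicted and 'j' is admitted; rows for 'a', 'b', 'c', 'e'
# are carried over unchanged.
old_matrix = np.arange(10, dtype=np.float32).reshape(5, 2)
new_matrix = remap_embeddings(['a', 'b', 'c', 'd', 'e'],
                              ['a', 'b', 'c', 'e', 'j'],
                              old_matrix)
```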

![DynamicLookup](/20230515-DynamicEmbedding/DynamicLookup.png)

The image below illustrates the workflow when the parameter server
strategy is used. PSS supports asynchronous training. Each worker
will have a copy of the vocabulary (the vocab->index mapping), which
will be consistent across all the workers, while the embedding variable
is split across the distributed servers. Each worker learns the dynamic
vocabulary table independently. At regular intervals, in the update
embedding callback, the vocabulary table is aggregated from values
across all the workers. The top k vocabulary is extracted and the
vocabulary lookup is updated with these values.
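
A minimal sketch of that aggregation step (the function and variable names
are illustrative assumptions, not the RFC API): each worker contributes the
key frequencies it has observed since the last update, and the merged top-k
keys become the new shared vocabulary on every worker.

```
from collections import Counter

def aggregate_topk_vocab(per_worker_counts, k):
  """Merge per-worker key frequencies and keep the k most frequent keys."""
  merged = Counter()
  for counts in per_worker_counts:
    merged.update(counts)
  return [key for key, _ in merged.most_common(k)]

# Example: three workers observed different key frequencies.
per_worker = [{'a': 5, 'b': 2}, {'a': 1, 'c': 4}, {'d': 3, 'b': 1}]
print(aggregate_topk_vocab(per_worker, k=3))  # ['a', 'c', 'b']
```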

![DynamicEmbedding asynchronous training](/20230515-DynamicEmbedding/AsyncTraining.png)

## Performance implications
There are two options for a mutable data structure to maintain the
dynamic vocabulary table:
* Mutable hash tables
* Variables with dynamic shapes

Here are some additional details about each option:
* Mutable hash tables are a type of data structure that allows for quick
  lookups of data.
* Variables with dynamic shapes are variables that can have different
  shapes at different times. This can be useful for storing data that is
  constantly changing, such as the vocabulary of a language.

Right now, with the parameter server strategy, variables cannot be placed
on parameter servers, and mutable hash tables are always placed on the
chief, which could have performance implications for lookups, inserts,
and updates to the vocabulary. However, if the TensorFlow distribute side
adds support for per-worker variable creation, this performance
implication can be overcome.
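
For illustration, here is a minimal sketch of the first option, using the
existing `tf.lookup.experimental.MutableHashTable` API to keep per-key
frequency counts for the dynamic vocabulary table; the counting scheme is an
assumption for the sketch, not the RFC's specified bookkeeping.

```
import tensorflow as tf

# Dynamic vocabulary table: string keys -> running frequency counts.
vocab_counts = tf.lookup.experimental.MutableHashTable(
    key_dtype=tf.string, value_dtype=tf.int64, default_value=0)

def observe_batch(keys):
  """Update the per-key counts for one batch of input keys."""
  unique_keys, _, counts = tf.unique_with_counts(keys)
  current = vocab_counts.lookup(unique_keys)
  vocab_counts.insert(unique_keys, current + tf.cast(counts, tf.int64))

observe_batch(tf.constant(['a', 'j', 'a', 'c']))
print(vocab_counts.lookup(tf.constant(['a', 'j', 'z'])))  # [2, 1, 0]
```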

## Dependencies
The proposed feature does not introduce any new dependencies. It is
a stand-alone feature that can be used with any existing TensorFlow
workflow. There is no need to modify any existing code or workflows
to use this feature.

## Engineering Impact
This feature adds a small time overhead to update the dynamic
vocabulary table, but this comes with improved model performance
and less user intervention to update the vocabulary and restart training.
Training can be continuous and with real-time data, and the model
would continuously keep updating its vocabulary. This is beneficial
because it allows the model to learn new input patterns, which can
improve its accuracy and performance. Additionally, it reduces the
amount of time and effort required to maintain the model, as the
user does not need to manually update the vocabulary table or
restart training every time new data is available. These benefits
are particularly valuable in an online learning setting.

## Platforms and Environments
* GPU, TPU, CPU
* Asynchronous distributed training
* Synchronous distributed training

## Best Practices
The following are the best practices used so far:
* The user needs to stop training the model and update the
  vocabulary before restarting training.
* The vocabulary that needs to be provided to the model has to be
  generated by the user separately.

The DynamicEmbedding layer is a new layer that enables users to
train a model on a dataset with a dynamic vocabulary. This means
that the vocabulary can change over time without the user having
to stop training the model and update the vocabulary. The layer
is used just like any other Keras layer. The initial vocabulary
can be provided, or the layer will learn the whole vocabulary on
its own.

## Tutorials and Examples
```
import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras.layers import DynamicEmbedding
# `dynamic_embedding` below refers to the module that provides the proposed
# UpdateEmbeddingCallback; its exact import path is not shown here.

train_data = np.array([
    ['a', 'j', 'c', 'd', 'e'],
    ['a', 'h', 'i', 'j', 'b'],
    ['i', 'h', 'c', 'j', 'e'],
])
train_labels = np.array([0, 1, 2])
vocab = tf.constant(['a', 'b', 'c', 'd', 'e'])
eviction_policy = 'LFU'

# Define the model
model = keras.models.Sequential([
    DynamicEmbedding(
        input_dim=5,
        output_dim=2,
        input_length=5,
        eviction_policy=eviction_policy,
        initial_vocabulary=vocab,
    ),
    keras.layers.Flatten(),
    keras.layers.Dense(3, activation='softmax'),
])

# Compile the model
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
)

# The callback aggregates and updates the vocabulary at the given interval
# and remaps the embedding matrix accordingly.
update_embedding_callback = dynamic_embedding.UpdateEmbeddingCallback(
    model.layers[0],
    interval=2,
)
with update_embedding_callback:
    result = model.fit(
        train_data,
        train_labels,
        epochs=100,
        batch_size=1,
        callbacks=[update_embedding_callback],
    )
```

## Compatibility
This design is forward and backward compatible. The layer should work with
both synchronous and asynchronous distribution strategies. A model with
DynamicEmbedding can be saved and loaded just like any other Keras model.
The vocabulary will be accessible to users to save and load as well.
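
As a minimal sketch of that save/load flow (assuming the `model` built in the
tutorial above; the exact serialization hooks for the layer's vocabulary are
not spelled out in this RFC):

```
# Standard Keras saving; the DynamicEmbedding layer's current vocabulary
# and embedding matrix would be serialized with the model.
model.save('dynamic_embedding_model.keras')
restored = keras.models.load_model('dynamic_embedding_model.keras')
```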
## User Impact
Users will be able to access DynamicEmbedding as a new layer in Keras.
An illustration of how to use this layer is shown above.

## Acknowledgement
The [TensorFlow Recommenders Addons project](https://github.com/tensorflow/recommenders-addons/blob/master/docs/api_docs/tfra/dynamic_embedding.md)
maintained by TensorFlow SIG Recommenders is a community-led project that
currently aims to solve similar issues. This RFC is inspired by both
Google internal use cases as well as the TFRA project. We are thankful
for the contributions from TFRA maintainers (in particular, Haidong
Rong from Nvidia) and welcome future collaborations on this RFC.

## Questions and Discussion Topics