
[RMP] Support pre-trained vector embeddings as input features into a model via the dataloader #211

Closed
karlhigley opened this issue Apr 14, 2022 · 15 comments


@karlhigley
Contributor

karlhigley commented Apr 14, 2022


Problem:

Customers need a way to load embeddings that have been pre-trained or trained by separate models into the model.
See #471

Goal:

Enable dataloading of separate embedding tables without having to add these embeddings to the interaction data during training. For serving, those embeddings need to be provided in the request to the model. The feature must be usable in a production setting.

Constraints:

  • External embedding tables may not fit on GPU.
  • Non-trainable embeddings
  • Embedding tables that fit in CPU memory; larger-than-CPU-memory tables are left as potential future work
  • Not generating the embeddings on the fly (future work)

Supporting pre-trained vector embeddings as features would provide baseline support for multi-modal use cases that rely on outside models to generate image/text embeddings.
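As a rough sketch of the intended usage (based on the dataloader transform discussed further down in this thread; the file and column names here are illustrative):

# Sketch only: attach a pre-trained embedding table to batches via a dataloader
# transform instead of materializing the vectors into the interaction data.
import numpy as np
from merlin.io import Dataset
from merlin.loader.tensorflow import Loader
from merlin.dataloader.ops.embeddings import NumpyEmbeddingOperator

pretrained = np.load("item_embeddings.npy")  # illustrative file; rows indexed by item id

loader = Loader(
    Dataset("train.parquet"),                # illustrative interaction data
    batch_size=1024,
    transforms=[
        NumpyEmbeddingOperator(
            pretrained,
            lookup_key="item_id",            # column holding the embedding row ids
            embedding_name="item_emb",       # name of the added embedding feature
        )
    ],
    shuffle=True,
)
features, target = next(iter(loader))        # the batch now includes the looked-up vectors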

NVTabular

Core

Dataloader

Transformers4Rec

These features under T4R will not be in scope for this RMP ticket. The development will happen in Models.
PR implementing pre-trained support in T4Rec: NVIDIA-Merlin/Transformers4Rec#690


Models (TF API)

PR #1083 implementing pre-trained support in MM

Merlin Systems

Examples

Documentation

@radekosmulski
Contributor

Ok, this issue now makes much more sense to me 🙂 I created a PR, NVIDIA-Merlin/models#508, but I think it is just a tiny step toward this. Not sure what the logical next step here would be.

I certainly need to continue bringing myself up to speed with Merlin Models; I still only have a narrow understanding of all the components and how they fit together. Regardless, I wonder what the next steps on this could be? @karlhigley, if you could offer a suggestion, that would be greatly appreciated 🙂 This is my first run-in with an RMP issue.

@karlhigley
Contributor Author

I'm honestly not entirely sure either! I captured this issue because I heard you were already working on it, but it's mostly a placeholder for a discussion on the scope of what we'd want to do and where that falls in terms of our team priorities. I don't think we've had that conversation yet, and I'm not entirely sure how/where it would happen either (given time zones etc.)

@karlhigley
Contributor Author

I put your face on it less to signal that you're responsible for the whole thing (I don't think you are), and more to signal that you'd be the person who is already doing relevant work and probably would have worthwhile thoughts about what we ought to be able to do with pre-trained embeddings.

@radekosmulski
Contributor

radekosmulski commented Jun 14, 2022

Thank you very much @karlhigley for these thoughts, they are very helpful! 🙂 Makes a lot of sense.

Just wanted to reference NVIDIA-Merlin/models#508 -- we now have a use case for using pre-trained embeddings, but I don't believe we have a good way of freezing them. It would be very good to have this option, as it is likely what most users would want.
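For reference, a plain Keras sketch (outside the Merlin Models API) of what freezing pre-trained embeddings amounts to; the table here is random for illustration:

# Plain Keras sketch, not the Merlin Models API: load a pre-trained table into an
# Embedding layer and mark it non-trainable so the optimizer never updates it.
import numpy as np
import tensorflow as tf

pretrained = np.random.rand(1000, 16).astype("float32")  # illustrative table

frozen_embedding = tf.keras.layers.Embedding(
    input_dim=pretrained.shape[0],
    output_dim=pretrained.shape[1],
    embeddings_initializer=tf.keras.initializers.Constant(pretrained),
    trainable=False,  # freeze: no gradient updates to this table
)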

@rnyak
Contributor

rnyak commented Aug 17, 2022

@EvenOldridge @karlhigley we now have an example for using pre-trained embeddings in MMs, and have a way of freezing them. fyi.

@EvenOldridge
Member

#471 has details on the customer request side.

@rnyak
Contributor

rnyak commented Aug 18, 2022

#471 has details on the customer request side.

@EvenOldridge yes we need this for TF4Rec. And I created this ticket NVIDIA-Merlin/Transformers4Rec#475 for that.

@karlhigley
Contributor Author

karlhigley commented Sep 2, 2022

@EvenOldridge If I'm understanding correctly, it sounds like the underlying customer request involves the dataloaders, the T4R library itself, and Merlin Systems (but not NVT.) Would it make sense to scope this issue more tightly to the customer request and punt additional features to a subsequent issue?

@karlhigley
Contributor Author

It also sounds like the customer request necessarily involves having PyTorch serving for T4R worked out. Assuming that the (known-to-be-slow) Python serving isn't sufficient, it sounds like we'll need to work out the issues with TorchScript serving.
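As a point of reference, a generic TorchScript export sketch (not tied to T4R; the toy module and file name are illustrative):

# Generic TorchScript sketch: trace a trained PyTorch module and save it so it can
# be served (e.g. from Triton's PyTorch backend) without the Python training code.
import torch

class TinyModel(torch.nn.Module):  # stand-in for a trained model
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

model = TinyModel().eval()
scripted = torch.jit.trace(model, torch.rand(2, 16))  # or torch.jit.script(model)
scripted.save("model.pt")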

@rhdong
Member

rhdong commented Sep 13, 2022

To the best of my knowledge, TensorFlow has a warm-start mechanism that serves a similar function. I think it has a meaningful design; maybe we can take inspiration from it: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/warm_starting_util.py#L419
I know some end users are using these APIs for pre-training, and the regular-expression matching gives users extra convenience.
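For reference, a rough sketch of the TF1/Estimator-style warm-start call being pointed at; the checkpoint path and variable regex are illustrative:

# TF1-style warm-start sketch: before training starts, initialize only the embedding
# variables of the newly built graph from a previous checkpoint, matched by regex.
import tensorflow as tf

# To be called inside the model/graph-building code (e.g. an Estimator model_fn):
tf.compat.v1.train.warm_start(
    ckpt_to_initialize_from="/path/to/previous/checkpoint",   # illustrative path
    vars_to_warm_start=".*input_layer.*embedding_weights.*",  # illustrative regex
)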

@bschifferer
Contributor

ToDo: How to integrate pre-trained embeddings into the schema file (tagging) so that they can be used in the architecture definition

@karlhigley
Contributor Author

karlhigley commented Oct 20, 2022

How to integrate pre-trained embedding in schema file (tagging)

Adding Tags.EMBEDDING as a "prefab" tag in the Merlin Core schema implementation seems like it could make sense 👍🏻
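For illustration, a minimal sketch of what that tagging could look like, assuming either a prefab Tags.EMBEDDING or a plain string tag until one is added; the column names are made up:

# Sketch only: mark a pre-trained embedding column in the schema so model code can
# select it by tag (uses a string tag; a prefab Tags.EMBEDDING would work the same way).
import numpy as np
from merlin.schema import ColumnSchema, Schema, Tags

schema = Schema([
    ColumnSchema("item_id", dtype=np.int64, tags=[Tags.CATEGORICAL, Tags.ITEM_ID]),
    ColumnSchema("item_emb", dtype=np.float32, tags=[Tags.CONTINUOUS, "embedding"]),
])

# Model-building code could then pick out the pre-trained embedding inputs:
embedding_cols = schema.select_by_tag("embedding").column_names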

@bschifferer
Contributor

bschifferer commented Apr 19, 2023

I am not sure if the main ticket is up to date. In some meetings we say that the feature is almost done, but there are many tickets that are not checked off (finished). I looked into the pre-trained embedding functionality of the dataloader and tried to provide a simple example as a minimal definition of done. That doesn't mean this simple example represents the definition of done - it's just how I imagine using this feature.

I only looked at the TensorFlow side and haven't tested the PyTorch side (assuming it works the same).

Open ToDos (from my point of view):

  • BUG: Getting a key error when combining a target with pre-trained embeddings: KeyError: 'target'
  • BUG: Sequence features are not embedded correctly
  • FEATURE: Convert input columns to emb_ids ( nvt.ops.LambdaOp(lambda x: x.map(emb1_map)) ) - this is similar to a request we have for the GTC Recommender. I am not sure if we want to do this in NVTabular OR if we apply this mapping in the dataloader
  • FEATURE: Merlin Models needs to use the pre-trained embeddings in the model architecture and use them for training. This should work for ranking models, retrieval models and session-based models. (For special architectures, such as DLRM, it should throw a meaningful error if the pre-trained embeddings do not fit)
  • FEATURE: Transformers4Rec needs to use the pre-trained embeddings in the model architecture and use them for training.
  • FEATURE: The schema object needs to represent the pre-trained embedding functionality so that MM and Transformers4Rec know that a feature is a pre-trained embedding (more below) -> this already exists via dataloader.output_schema
  • FEATURE: (Not sure if it already exists) - provide the embeddings during serving

I will explain my assumptions and the proposed open ToDos in more detail:

  1. My assumption is that the user has a separate process to generate the embeddings (np_emb1 and np_emb2). I am not sure if we can assume that the IDs in the dataset match the order of the numpy arrays. I assume there will be mapping tables to convert them (emb1_map and emb2_map). Either in NVT or in the dataloader, we should provide the functionality to map the input data to the IDs of the pre-trained embeddings.
  2. MM and Transformers4Rec define the neural network architecture and rely on the schema object. Currently, setting pre-trained embeddings in the dataloader as transforms does not modify the schema object, so MM and Transformers4Rec cannot know that they should expect pre-trained embeddings. We need to modify the schema object to make them aware of this. PROPOSAL (see the sketch below): we add the information to the schema object (e.g. schema['emb_id_1'].add(PreTrain(np_emb1, lookup_key='emb_id_1', embedding_name='emb_id_1'))). It would be great if we did not need to repeat the information in the dataloader (however, we cannot store the numpy object in the schema, so I guess we need to at least provide the numpy object to the dataloader).
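A rough sketch of how that proposal could look with the existing schema API, assuming the embedding metadata is carried as column tags/properties rather than a new PreTrain object; the tag and property names below are made up for illustration:

# Sketch only: carry the pre-trained embedding metadata in the schema via tags and
# properties (PreTrain does not exist; the property names below are hypothetical).
import numpy as np
from merlin.schema import ColumnSchema, Schema, Tags

np_emb1 = np.random.rand(1000, 10)

emb_col = ColumnSchema(
    "emb_id_1",
    dtype=np.int64,
    tags=[Tags.CATEGORICAL, "pretrained_embedding"],  # string tag until a prefab tag exists
).with_properties({
    "embedding_name": "emb_id_1",       # hypothetical property: name of the output feature
    "embedding_dim": np_emb1.shape[1],  # hypothetical property: width of np_emb1
})

schema = Schema([emb_col])
# The numpy table itself still has to be handed to the dataloader transform, e.g.
# NumpyEmbeddingOperator(np_emb1, lookup_key='emb_id_1', embedding_name='emb_id_1').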

BUGs:

  • If you uncomment the #>> nvt.ops.AddMetadata(tags=[Tags.BINARY_CLASSIFICATION, Tags.TARGET]) line in the workflow below, next(iter(data_loader)) will fail with KeyError: 'target'
import os

os.environ["CUDA_VISIBLE_DEVICES"]="1"

import glob

from merlin.io import Dataset
from merlin.loader.tensorflow import Loader
from merlin.schema import Tags

import numpy as np
import pandas as pd

import nvtabular as nvt
import merlin.models.tf as mm

import cudf

from merlin.dataloader.ops.embeddings import (  # noqa
    EmbeddingOperator,
    MmapNumpyEmbedding,
    NumpyEmbeddingOperator,
)

### Input
np_emb1 = np.random.rand(1000,10)
np_emb2 = np.random.rand(1000,20)
emb1_map = {
    10: 0,
    11: 1,
    12: 2,
    13: 3
}
emb2_map = {
    'a': 0,
    'b': 1,
    'c': 2,
    'd': 3
}
df = cudf.DataFrame({
    'emb_id_1': [10, 12, 11, 12, 11, 13],
    'emb_id_2': ['a', 'd', 'c', 'a', 'd', 'b'],
    'cat1': [1,5,6,3,5,7],
    'cat2': ['a', 'a', 'd', 'e', 'f', 'g'],
    'target': [0,1,1,0,1,0]
})

# NVTabular Workflow
emb1 = ['emb_id_1'] >> nvt.ops.LambdaOp(lambda x: x.map(emb1_map)) >> nvt.ops.AddTags([Tags.CATEGORICAL])
emb2 = ['emb_id_2'] >> nvt.ops.LambdaOp(lambda x: x.map(emb2_map)) >> nvt.ops.AddTags([Tags.CATEGORICAL])
cats = ['cat1', 'cat2'] >> nvt.ops.Categorify()
target = ['target'] #>> nvt.ops.AddMetadata(tags=[Tags.BINARY_CLASSIFICATION, Tags.TARGET])

features = emb1+emb2+cats+target
workflow = nvt.Workflow(features)

ds = Dataset(df)
workflow.fit(ds)
ds_transformed = workflow.transform(ds)
ds_transformed.compute()

data_loader = Loader(
    ds_transformed,
    batch_size=2,
    transforms=[
        NumpyEmbeddingOperator(
            np_emb1,
            lookup_key='emb_id_1',
            embedding_name='emb_id_1'
        ), 
        NumpyEmbeddingOperator(
            np_emb2, 
            lookup_key='emb_id_2',
            embedding_name='emb_id_2'
        )
    ],
    shuffle=False,
)
next(iter(data_loader))

model = mm.Model.from_block(
    mm.MLPBlock([64, 32]),
    data_loader.output_schema, 
    prediction_tasks=mm.BinaryOutput('target')
)
model.compile()
model.fit(data_loader)

Session-Based Bug:
I do not know if session-based is in scope (given that Transformers4Rec is mentioned, I guess yes?). Although there are only 2 examples in the batch, the emb tensor is [6, 10] - it does not keep the sequential structure. I do not know what the representation should be, but I think we might need to convert it to __values and __offsets (and the offsets are missing)? A small sketch of the values/offsets representation follows the repro below.

emb = np.random.rand(1000,10)
df = cudf.DataFrame({
    'idx': [0,1,2,3,4,5,6,7,8,9],
    'id1': [[0, 1], [1,2,3,4],[2],[3],[4],[5],[6],[8],[9],[10]]
})

dataset = Dataset(df)
schema = dataset.schema
for col_name in ['id1']:
    schema[col_name] = schema[col_name].with_tags(Tags.CATEGORICAL)
dataset.schema = schema
embeddings_np = emb
data_loader = Loader(
    dataset,
    batch_size=2,
    transforms=[NumpyEmbeddingOperator(
        embeddings_np, 
        lookup_key='id1',
        embedding_name='emb'
    )],
    shuffle=False,
)
next(iter(data_loader))
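A small sketch of the values/offsets representation described above (pure numpy, just to show the expected pairing):

# Sketch: the list column arrives as a flat __values tensor plus row __offsets; the
# looked-up embeddings should keep that pairing instead of a flat [6, 10] tensor.
import numpy as np

emb = np.random.rand(1000, 10)
values = np.array([0, 1, 1, 2, 3, 4])   # flattened ids of the first two rows of id1
offsets = np.array([0, 2, 6])           # row boundaries: row0 = [0, 1], row1 = [1, 2, 3, 4]

flat = emb[values]                       # shape (6, 10) -- what the loader returns today
per_row = [flat[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
# per_row[0].shape == (2, 10), per_row[1].shape == (4, 10): the sequence structure is kept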

@EvenOldridge EvenOldridge changed the title [RMP] Support pre-trained vector embeddings as features [RMP] Support pre-trained vector embeddings as input features into a model Apr 19, 2023
@EvenOldridge EvenOldridge changed the title [RMP] Support pre-trained vector embeddings as input features into a model [RMP] Support pre-trained vector embeddings as input features into a model via the dataloader Apr 19, 2023
@viswa-nvidia

@sararb to update this ticket
