
[RMP] Support pre-trained vector embeddings as input features into a model via the dataloader #211

Closed
karlhigley opened this issue Apr 14, 2022 · 15 comments


@karlhigley
Contributor

karlhigley commented Apr 14, 2022


Problem:

Customers need a way to load embeddings that have been pre-trained or trained by separate models into the model.
See #471

Goal:

Enable dataloading of separate embedding tables without having to add these embeddings to the interaction data during training. For serving, those embeddings need to be provided in the request to the model. The feature must be usable in a production setting.

Constraints:

  • External embedding tables may not fit on GPU.
  • Non-trainable embeddings
  • Embedding tables that fit in CPU memory; larger-than-CPU-memory tables are left as potential future work
  • Not generating the embeddings on the fly (future work)

Supporting pre-trained vector embeddings as features would provide baseline support for multi-modal use cases that rely on outside models to generate image/text embeddings.
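As a rough sketch of the intended usage (based on the dataloader transform discussed further down in this thread; the file and column names here are illustrative):

# Sketch only: attach a pre-trained embedding table to batches via a dataloader
# transform instead of materializing the vectors into the interaction data.
import numpy as np
from merlin.io import Dataset
from merlin.loader.tensorflow import Loader
from merlin.dataloader.ops.embeddings import NumpyEmbeddingOperator

pretrained = np.load("item_embeddings.npy")  # illustrative file; rows indexed by item id

loader = Loader(
    Dataset("train.parquet"),                # illustrative interaction data
    batch_size=1024,
    transforms=[
        NumpyEmbeddingOperator(
            pretrained,
            lookup_key="item_id",            # column holding the embedding row ids
            embedding_name="item_emb",       # name of the added embedding feature
        )
    ],
    shuffle=True,
)
features, target = next(iter(loader))        # the batch now includes the looked-up vectors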

NVTabular

Core

Dataloader

Transformers4Rec

These features under T4R will not be in scope for this RMP ticket. The development will happen in Models.
PR implementing pre-trained support in T4Rec: NVIDIA-Merlin/Transformers4Rec#690


Models (TF API)

PR #1083 implementing pre-trained support in MM

Merlin Systems

Examples

Documentation

@radekosmulski
Contributor

Ok, this issue now makes much more sense to me 🙂 I created a PR, NVIDIA-Merlin/models#508, but I think it is just a tiny step toward this. Not sure what the logical next step here would be.

I certainly need to continue bringing myself up to speed with Merlin Models; I still only have a narrow understanding of all the components and how they fit together. Regardless, I wonder what the next steps on this could be? @karlhigley, if you could offer a suggestion, that would be greatly appreciated 🙂 This is my first run-in with an RMP issue.

@karlhigley
Contributor Author

I'm honestly not entirely sure either! I captured this issue because I heard you were already working on it, but it's mostly a placeholder for a discussion on the scope of what we'd want to do and where that falls in terms of our team priorities. I don't think we've had that conversation yet, and I'm not entirely sure how/where it would happen either (given time zones etc.)

@karlhigley
Contributor Author

I put your face on it less to signal that you're responsible for the whole thing (I don't think you are), and more to signal that you'd be the person who is already doing relevant work and probably would have worthwhile thoughts about what we ought to be able to do with pre-trained embeddings.

@radekosmulski
Contributor

radekosmulski commented Jun 14, 2022

Thank you very much @karlhigley for these thoughts, they are very helpful! 🙂 Makes a lot of sense.

Just wanted to reference NVIDIA-Merlin/models#508 -- we now have a use case for using pre-trained embeddings, but I don't believe we have a good way of freezing them. It would be very good to have this option, as it is likely what most users would want.
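For reference, a plain Keras sketch (outside the Merlin Models API) of what freezing pre-trained embeddings amounts to; the table here is random for illustration:

# Plain Keras sketch, not the Merlin Models API: load a pre-trained table into an
# Embedding layer and mark it non-trainable so the optimizer never updates it.
import numpy as np
import tensorflow as tf

pretrained = np.random.rand(1000, 16).astype("float32")  # illustrative table

frozen_embedding = tf.keras.layers.Embedding(
    input_dim=pretrained.shape[0],
    output_dim=pretrained.shape[1],
    embeddings_initializer=tf.keras.initializers.Constant(pretrained),
    trainable=False,  # freeze: no gradient updates to this table
)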

@rnyak
Contributor

rnyak commented Aug 17, 2022

@EvenOldridge @karlhigley we now have an example for using pre-trained embeddings in MMs, and have a way of freezing them. fyi.

@EvenOldridge
Member

#471 has details on the customer request side.

@rnyak
Contributor

rnyak commented Aug 18, 2022

#471 has details on the customer request side.

@EvenOldridge yes we need this for TF4Rec. And I created this ticket NVIDIA-Merlin/Transformers4Rec#475 for that.

@karlhigley
Contributor Author

karlhigley commented Sep 2, 2022

@EvenOldridge If I'm understanding correctly, it sounds like the underlying customer request involves the dataloaders, the T4R library itself, and Merlin Systems (but not NVT.) Would it make sense to scope this issue more tightly to the customer request and punt additional features to a subsequent issue?

@karlhigley
Contributor Author

It also sounds like the customer request necessarily involves having PyTorch serving for T4R worked out. Assuming that the (known-to-be-slow) Python serving isn't sufficient, it sounds like we'll need to work out the issues with TorchScript serving.
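As a point of reference, a generic TorchScript export sketch (not tied to T4R; the toy module and file name are illustrative):

# Generic TorchScript sketch: trace a trained PyTorch module and save it so it can
# be served (e.g. from Triton's PyTorch backend) without the Python training code.
import torch

class TinyModel(torch.nn.Module):  # stand-in for a trained model
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

model = TinyModel().eval()
scripted = torch.jit.trace(model, torch.rand(2, 16))  # or torch.jit.script(model)
scripted.save("model.pt")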

@rhdong
Member

rhdong commented Sep 13, 2022

To the best of my knowledge, TensorFlow has a warm-start mechanism that serves a similar function. I think it has a meaningful design; maybe we can take inspiration from it: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/warm_starting_util.py#L419
I know some end users are using these APIs for pre-training, and the regular-expression matching gives users extra convenience.
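For reference, a rough sketch of the TF1/Estimator-style warm-start call being pointed at; the checkpoint path and variable regex are illustrative:

# TF1-style warm-start sketch: before training starts, initialize only the embedding
# variables of the newly built graph from a previous checkpoint, matched by regex.
import tensorflow as tf

# To be called inside the model/graph-building code (e.g. an Estimator model_fn):
tf.compat.v1.train.warm_start(
    ckpt_to_initialize_from="/path/to/previous/checkpoint",   # illustrative path
    vars_to_warm_start=".*input_layer.*embedding_weights.*",  # illustrative regex
)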

@bschifferer
Contributor

ToDo: How to integrate pre-trained embeddings into the schema file (tagging) so that they can be used in the architecture definition

@karlhigley
Contributor Author

karlhigley commented Oct 20, 2022

How to integrate pre-trained embedding in schema file (tagging)

Adding Tags.EMBEDDING as a "prefab" tag in the Merlin Core schema implementation seems like it could make sense 👍🏻
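For illustration, a minimal sketch of what that tagging could look like, assuming either a prefab Tags.EMBEDDING or a plain string tag until one is added; the column names are made up:

# Sketch only: mark a pre-trained embedding column in the schema so model code can
# select it by tag (uses a string tag; a prefab Tags.EMBEDDING would work the same way).
import numpy as np
from merlin.schema import ColumnSchema, Schema, Tags

schema = Schema([
    ColumnSchema("item_id", dtype=np.int64, tags=[Tags.CATEGORICAL, Tags.ITEM_ID]),
    ColumnSchema("item_emb", dtype=np.float32, tags=[Tags.CONTINUOUS, "embedding"]),
])

# Model-building code could then pick out the pre-trained embedding inputs:
embedding_cols = schema.select_by_tag("embedding").column_names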

@bschifferer
Contributor

bschifferer commented Apr 19, 2023

I am not sure if the main ticket is up to date. In some meetings we say that the feature is almost done, but there are many tickets that are not checked off (finished). I looked into the pre-trained embedding functionality of the dataloader and tried to provide a simple example as a minimal definition of done. That doesn't mean this simple example represents the definition of done - it's just how I imagine using this feature.

I only looked at the TensorFlow side and haven't tested the PyTorch side (assuming it works the same).

Open ToDos (from my point of view):

  • BUG: Getting a key error when combining a target with pre-trained embeddings: KeyError: 'target'
  • BUG: Sequence features are not embedded correctly
  • FEATURE: Convert input columns to emb_ids ( nvt.ops.LambdaOp(lambda x: x.map(emb1_map)) ) - this is similar to a request we have for the GTC Recommender. I am not sure if we want to do this in NVTabular OR if we apply this mapping in the dataloader
  • FEATURE: Merlin Models needs to use the pre-trained embeddings in the model architecture and use them for training. This should work for ranking models, retrieval models and session-based models. (For special architectures, such as DLRM, it should throw a meaningful error if the pre-trained embeddings do not fit)
  • FEATURE: Transformers4Rec needs to use the pre-trained embeddings in the model architecture and use them for training.
  • FEATURE: The schema object needs to represent the pre-trained embedding functionality so that MM and Transformers4Rec know that a feature is a pre-trained embedding (more below) -> this already exists via dataloader.output_schema
  • FEATURE: (Not sure if it already exists) - provide the embeddings during serving

I will explain my assumptions and the proposed open ToDos in more detail:

  1. My assumption is that the user has a separate process to generate the embeddings (np_emb1 and np_emb2). I am not sure if we can assume that the IDs in the dataset match the order of the numpy arrays. I assume there will be mapping tables to convert them (emb1_map and emb2_map). Either in NVT or in the dataloader, we should provide the functionality to map the input data to the IDs of the pre-trained embeddings.
  2. MM and Transformers4Rec define the neural network architecture and rely on the schema object. Currently, setting pre-trained embeddings in the dataloader as transforms does not modify the schema object, so MM and Transformers4Rec cannot know that they should expect pre-trained embeddings. We need to modify the schema object to make them aware of this. PROPOSAL (see the sketch below): we add the information to the schema object (e.g. schema['emb_id_1'].add(PreTrain(np_emb1, lookup_key='emb_id_1', embedding_name='emb_id_1'))). It would be great if we did not need to repeat the information in the dataloader (however, we cannot store the numpy object in the schema, so I guess we need to at least provide the numpy object to the dataloader).
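A rough sketch of how that proposal could look with the existing schema API, assuming the embedding metadata is carried as column tags/properties rather than a new PreTrain object; the tag and property names below are made up for illustration:

# Sketch only: carry the pre-trained embedding metadata in the schema via tags and
# properties (PreTrain does not exist; the property names below are hypothetical).
import numpy as np
from merlin.schema import ColumnSchema, Schema, Tags

np_emb1 = np.random.rand(1000, 10)

emb_col = ColumnSchema(
    "emb_id_1",
    dtype=np.int64,
    tags=[Tags.CATEGORICAL, "pretrained_embedding"],  # string tag until a prefab tag exists
).with_properties({
    "embedding_name": "emb_id_1",       # hypothetical property: name of the output feature
    "embedding_dim": np_emb1.shape[1],  # hypothetical property: width of np_emb1
})

schema = Schema([emb_col])
# The numpy table itself still has to be handed to the dataloader transform, e.g.
# NumpyEmbeddingOperator(np_emb1, lookup_key='emb_id_1', embedding_name='emb_id_1').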

BUGs:

  • If you uncomment the #>> nvt.ops.AddMetadata(tags=[Tags.BINARY_CLASSIFICATION, Tags.TARGET]) line in the workflow below, next(iter(data_loader)) will fail with KeyError: 'target'
import os

os.environ["CUDA_VISIBLE_DEVICES"]="1"

import glob

from merlin.io import Dataset
from merlin.loader.tensorflow import Loader
from merlin.schema import Tags

import numpy as np
import pandas as pd

import nvtabular as nvt
import merlin.models.tf as mm

import cudf

from merlin.dataloader.ops.embeddings import (  # noqa
    EmbeddingOperator,
    MmapNumpyEmbedding,
    NumpyEmbeddingOperator,
)

### Input
np_emb1 = np.random.rand(1000,10)
np_emb2 = np.random.rand(1000,20)
emb1_map = {
    10: 0,
    11: 1,
    12: 2,
    13: 3
}
emb2_map = {
    'a': 0,
    'b': 1,
    'c': 2,
    'd': 3
}
df = cudf.DataFrame({
    'emb_id_1': [10, 12, 11, 12, 11, 13],
    'emb_id_2': ['a', 'd', 'c', 'a', 'd', 'b'],
    'cat1': [1,5,6,3,5,7],
    'cat2': ['a', 'a', 'd', 'e', 'f', 'g'],
    'target': [0,1,1,0,1,0]
})

# NVTabular Workflow
emb1 = ['emb_id_1'] >> nvt.ops.LambdaOp(lambda x: x.map(emb1_map)) >> nvt.ops.AddTags([Tags.CATEGORICAL])
emb2 = ['emb_id_2'] >> nvt.ops.LambdaOp(lambda x: x.map(emb2_map)) >> nvt.ops.AddTags([Tags.CATEGORICAL])
cats = ['cat1', 'cat2'] >> nvt.ops.Categorify()
target = ['target'] #>> nvt.ops.AddMetadata(tags=[Tags.BINARY_CLASSIFICATION, Tags.TARGET])

features = emb1+emb2+cats+target
workflow = nvt.Workflow(features)

ds = Dataset(df)
workflow.fit(ds)
ds_transformed = workflow.transform(ds)
ds_transformed.compute()

data_loader = Loader(
    ds_transformed,
    batch_size=2,
    transforms=[
        NumpyEmbeddingOperator(
            np_emb1,
            lookup_key='emb_id_1',
            embedding_name='emb_id_1'
        ), 
        NumpyEmbeddingOperator(
            np_emb2, 
            lookup_key='emb_id_2',
            embedding_name='emb_id_2'
        )
    ],
    shuffle=False,
)
next(iter(data_loader))

model = mm.Model.from_block(
    mm.MLPBlock([64, 32]),
    data_loader.output_schema, 
    prediction_tasks=mm.BinaryOutput('target')
)
model.compile()
model.fit(data_loader)

Session-Based Bug:
I do not know if session-based is in scope (given that Transformers4Rec is mentioned, I guess yes?). Although there are only 2 examples in the batch, the emb tensor is [6, 10] - it does not keep the sequential structure. I do not know what the representation should be, but I think we might need to convert it to __values and __offsets (and the offsets are missing)? A small sketch of the values/offsets representation follows the repro below.

emb = np.random.rand(1000,10)
df = cudf.DataFrame({
    'idx': [0,1,2,3,4,5,6,7,8,9],
    'id1': [[0, 1], [1,2,3,4],[2],[3],[4],[5],[6],[8],[9],[10]]
})

dataset = Dataset(df)
schema = dataset.schema
for col_name in ['id1']:
    schema[col_name] = schema[col_name].with_tags(Tags.CATEGORICAL)
dataset.schema = schema
embeddings_np = emb
data_loader = Loader(
    dataset,
    batch_size=2,
    transforms=[NumpyEmbeddingOperator(
        embeddings_np, 
        lookup_key='id1',
        embedding_name='emb'
    )],
    shuffle=False,
)
next(iter(data_loader))
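A small sketch of the values/offsets representation described above (pure numpy, just to show the expected pairing):

# Sketch: the list column arrives as a flat __values tensor plus row __offsets; the
# looked-up embeddings should keep that pairing instead of a flat [6, 10] tensor.
import numpy as np

emb = np.random.rand(1000, 10)
values = np.array([0, 1, 1, 2, 3, 4])   # flattened ids of the first two rows of id1
offsets = np.array([0, 2, 6])           # row boundaries: row0 = [0, 1], row1 = [1, 2, 3, 4]

flat = emb[values]                       # shape (6, 10) -- what the loader returns today
per_row = [flat[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
# per_row[0].shape == (2, 10), per_row[1].shape == (4, 10): the sequence structure is kept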

@EvenOldridge EvenOldridge changed the title [RMP] Support pre-trained vector embeddings as features [RMP] Support pre-trained vector embeddings as input features into a model Apr 19, 2023
@EvenOldridge EvenOldridge changed the title [RMP] Support pre-trained vector embeddings as input features into a model [RMP] Support pre-trained vector embeddings as input features into a model via the dataloader Apr 19, 2023
@viswa-nvidia

@sararb to update this ticket
