Replace the clip with radio #68

StuHude opened this issue Jul 2, 2024 · 5 comments

StuHude commented Jul 2, 2024

Congratulations, what fantastic work!

I am now trying to replace CLIP with RADIO in an image-text task. Can RADIO be used with the CLIP text encoder directly? If so, are there adaptor code and weights available, or do I need to train the projection layer myself?

gheinrich (Collaborator) commented

Hello, yes you can use the CLIP adaptor and the corresponding tokenizer and text encoder. There is an example at https://github.com/NVlabs/RADIO/blob/main/examples/zero_shot_imagenet.py.

mranzinger (Collaborator) commented

In addition, here's a minimal example that should work:

import torch
import torch.nn.functional as F

# Load RADIO together with its CLIP adaptor.
model = torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2', adaptor_names='clip')
model.eval()

# `images` is a float tensor of shape (B, 3, H, W) with values between 0 and 1.
output = model(images)
bb_summary, bb_features = output['backbone']
clip_summary, clip_features = output['clip']  # These are the DFN CLIP embeddings

# To get the text embeddings
clip_adaptor = model.adaptors['clip']
tokens = clip_adaptor.tokenizer(['foo', 'bar'])
clip_text_embeddings = clip_adaptor.encode_text(tokens)

# B x B compatibility matrix from each image embedding to each text embedding
# (e.g. the CLIP objective): cosine similarity via L2 normalization.
alignment = F.normalize(clip_summary, dim=1) @ F.normalize(clip_text_embeddings.T, dim=0)
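To turn that alignment matrix into zero-shot classification probabilities, a common follow-up looks like the sketch below. The temperature of 100 is an assumption borrowed from the standard CLIP convention (the exponentiated logit scale), so verify it against the adaptor's actual logit scale before relying on it:

    # Softmax over scaled cosine similarities, per image.
    # The 100.0 temperature is the usual CLIP default, assumed here.
    probs = (100.0 * alignment).softmax(dim=-1)
    top_prob, top_idx = probs.topk(1, dim=-1)  # best-matching text per image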


StuHude commented Jul 3, 2024

Thank you very much for your answer!
In addition, may I ask whether you can release the model structure of RADIO? I would like to get the output of each layer in the model. If possible, that would be of great help to me. Thank you very much!

gheinrich (Collaborator) commented

Hello, the model architecture is defined in https://github.com/NVlabs/RADIO/blob/main/radio/radio_model.py; however, the bulk of the instantiation is performed by the TIMM library, since we use a mostly standard VisionTransformer model.
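
If you just want to see the layer layout, a generic PyTorch inspection (not a RADIO-specific API) already goes a long way:

    import torch

    model = torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2')
    print(model)  # prints the module tree, including the TIMM VisionTransformer blocks
    for name, module in model.named_modules():
        if name.endswith('blocks'):
            print(name, len(module))  # number of transformer blocks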

We are contemplating adding an API to fetch intermediate activations in the future. In the meantime, assuming you are using RADIO (not E-RADIO), this can be achieved by re-writing the _forward_cpe method in https://github.com/NVlabs/RADIO/blob/main/radio/enable_cpe_support.py.

For example, you might write it as:

    def forward_features(self, x):
        """Return the normalized activations of every transformer block."""
        features = []

        if isinstance(self.model, VisionTransformer):
            # The CPE patch generator replaces the standard patch embedding.
            x = self.model.patch_generator(x)

            for blk in self.model.blocks:
                x = blk(x)
                # Apply the final LayerNorm to each intermediate activation.
                features.append(self.model.norm(x))
        else:
            raise ValueError("Only VisionTransformer is supported here")

        return features
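
Once that method is patched in, a hypothetical call could look like the following (a sketch; `images` is the same [0, 1] float tensor as above, and the number of tokens per feature depends on the patch generator's prefix tokens):

    feats = model.forward_features(images)  # list with one tensor per block
    print(len(feats), feats[0].shape)       # num_blocks, then (B, num_tokens, C)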

mranzinger (Collaborator) commented

Btw, @gheinrich has made support for intermediate activations part of the official API:
https://github.com/NVlabs/RADIO?tab=readme-ov-file#intermediate-layer-activations
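
For reference, a minimal sketch of that API; the call below assumes a timm-style forward_intermediates convention, and the block indices are illustrative, so consult the linked README section for the authoritative signature and options:

    # `indices` selects which blocks to return intermediate activations from.
    outputs = model.forward_intermediates(images, indices=[7, 15, 23, 31])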
