Replace the clip with radio #68
Comments
Hello, yes you can use the CLIP adaptor and the corresponding tokenizer and text encoder. There is an example at https://github.com/NVlabs/RADIO/blob/main/examples/zero_shot_imagenet.py.
In addition, here is minimal pseudocode that should work:
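(A sketch reconstructing that pseudocode from the zero_shot_imagenet.py example linked above; the adaptor_names argument, the model.adaptors['clip'] registry, and the adaptor's tokenizer/encode_text attributes all come from that script and may differ across versions.)

import torch

# Load RADIO together with its CLIP adaptor.
model = torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2',
                       adaptor_names='clip', progress=True)
model.cuda().eval()

clip_adaptor = model.adaptors['clip']   # CLIP head trained to match the CLIP teacher
tokenizer = clip_adaptor.tokenizer      # matching CLIP text tokenizer

x = torch.rand(1, 3, 224, 224, device='cuda')  # RADIO expects values in [0, 1]

with torch.no_grad():
    # With adaptors attached, the model returns a dict keyed by adaptor name;
    # each value holds a (summary, spatial_features) pair.
    clip_summary, clip_features = model(x)['clip']

    tokens = tokenizer(['a photo of a cat']).cuda()
    text_features = clip_adaptor.encode_text(tokens)

    # CLIP-style cosine similarity between the image and text embeddings.
    image_emb = clip_summary / clip_summary.norm(dim=-1, keepdim=True)
    text_emb = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = image_emb @ text_emb.T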
Thank you very much for your answer!
Hello, the model architecture is defined in https://github.com/NVlabs/RADIO/blob/main/radio/radio_model.py; however, the bulk of the instantiation is performed by the TIMM library, since we use a mostly standard VisionTransformer.

We are contemplating adding an API to fetch intermediate activations in the future. In the meantime, assuming you are using RADIO (not E-RADIO), this can be achieved by re-writing the forward_features function. For example, you might write it as:

def forward_features(self, x):
    """Return the per-block intermediate features from the model."""
    features = []
    if isinstance(self.model, VisionTransformer):
        x = self.model.patch_generator(x)
        for blk in self.model.blocks:
            x = blk(x)
            # Collect the activation after every block, passed through the
            # final norm layer so each entry matches the model's output space.
            features.append(self.model.norm(x))
    else:
        raise ValueError("Only VisionTransformer is supported here")
    return features
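As a sketch of how to wire this in (the hub entrypoint and the wrapper holding the ViT as self.model follow radio_model.py; the version string is illustrative, so verify both against your checkout):

import types

import torch
from timm.models.vision_transformer import VisionTransformer  # needed by the isinstance check

model = torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2', progress=True)
model.eval()

# Bind the custom forward_features above as a method on the loaded wrapper.
model.forward_features = types.MethodType(forward_features, model)

x = torch.rand(1, 3, 224, 224)  # RADIO expects values in [0, 1]
with torch.no_grad():
    features = model.forward_features(x)
print(len(features), features[-1].shape)  # one tensor per transformer block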
Btw, @gheinrich has made support for intermediate activations part of the official API.
Congrats! What fantastic work!
But now I am trying to replace CLIP with RADIO in the image-text task. Can RADIO be used with the CLIP text encoder directly? If so, are there adaptor codes and weights? Or do I need to train the projection layer?