Contrastive Language-Image Pre-Training with EVA (EVA-CLIP)
| model name | #param. | precision | data | batch size | IN-1K zero-shot top-1 | Weights |
|---|---|---|---|---|---|---|
| eva-clip | 1.3B | fp16 | LAION-400M | 41K | 78.5 | ModelHub Link |
To our knowledge, EVA-CLIP is the largest performant open-source CLIP model, as measured by zero-shot classification performance.
For more details of EVA-CLIP, please refer to Section 2.3.5 of the paper.
Top-1 accuracy on ImageNet-1K variants and ObjectNet:
model | IN-1K | IN-V2 | IN-Adv. | IN-Ren. | IN-Ske. | ObjectNet |
---|---|---|---|---|---|---|
OpenAI CLIP-L | 75.55 | 69.86 | 70.76 | 87.83 | 59.58 | 68.98 |
Open CLIP-H | 77.96 | 70.87 | 59.33 | 89.33 | 66.58 | 69.71 |
Open CLIP-g | 76.65 | 69.56 | 57.19 | 88.69 | 65.17 | 67.53 |
EVA CLIP-g | 78.53 | 71.52 | 73.59 | 92.5 | 67.31 | 72.33 |
Performance on video action recognition benchmarks:
model | UCF-101 | Kinetics-400 | Kinetics-600 | Kinetics-700 |
---|---|---|---|---|
OpenAI CLIP-L | 76.39 | 64.47 | 64.21 | 57.68 |
Open CLIP-H | 78.16 | 63.06 | 63.58 | 56.09 |
Open CLIP-g | 77.73 | 61.69 | 62.16 | 54.99 |
EVA CLIP-g | 76.05 | 65.23 | 64.38 | 58.4 |
For video action recognition, we sample only a single center frame from each video, turning it into an image classification task. Following conventional settings, we report top-1 accuracy for UCF-101 and the mean of top-1 and top-5 accuracy for Kinetics-400/600/700.
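As a rough illustration of this protocol (not the exact evaluation code; the tensors and helper names below are hypothetical), center-frame sampling and the reported metrics can be sketched as follows:

```python
import torch

def center_frame(frames: torch.Tensor) -> torch.Tensor:
    # frames: (num_frames, C, H, W); keep only the single center frame,
    # which turns video classification into image classification.
    return frames[frames.shape[0] // 2]

def topk_accuracy(scores: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    # scores: (N, num_classes) image-text similarity scores, labels: (N,)
    topk = scores.topk(k, dim=-1).indices                 # (N, k)
    hit = (topk == labels.unsqueeze(-1)).any(dim=-1)      # (N,)
    return hit.float().mean().item()

# Hypothetical scores for 8 clips of a 400-class benchmark (e.g. Kinetics-400):
scores = torch.randn(8, 400)
labels = torch.randint(0, 400, (8,))
top1, top5 = topk_accuracy(scores, labels, 1), topk_accuracy(scores, labels, 5)
ucf101_metric = top1                    # UCF-101: top-1 accuracy
kinetics_metric = (top1 + top5) / 2     # Kinetics-400/600/700: mean of top-1 and top-5
```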
Zero-shot retrieval performance (Recall@K) on Flickr30k and MSCOCO:
| Dataset | Model | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 |
|---|---|---|---|---|---|---|---|
| Flickr30k | OpenAI CLIP-L | 65.18 | 87.28 | 92.00 | 85.20 | 97.30 | 99.00 |
| Flickr30k | Open CLIP-H | 77.78 | 94.14 | 96.62 | 90.80 | 99.30 | 99.70 |
| Flickr30k | Open CLIP-g | 76.52 | 93.62 | 96.28 | 90.80 | 99.10 | 99.80 |
| Flickr30k | EVA CLIP-g | 72.64 | 91.60 | 95.12 | 88.30 | 98.30 | 99.30 |
| MSCOCO | OpenAI CLIP-L | 36.51 | 61.01 | 71.11 | 56.34 | 79.32 | 86.66 |
| MSCOCO | Open CLIP-H | 49.47 | 73.40 | 81.53 | 65.96 | 86.06 | 91.90 |
| MSCOCO | Open CLIP-g | 47.99 | 72.37 | 80.75 | 64.96 | 85.30 | 91.46 |
| MSCOCO | EVA CLIP-g | 44.07 | 68.50 | 77.33 | 61.76 | 83.28 | 89.96 |
The zero-shot retrieval performance of EVA-CLIP is somewhat inferior to its Open CLIP-H / -g counterparts. We speculate there are two main reasons:
- The language tower of EVA-CLIP is much smaller and weaker than that of Open CLIP-H and Open CLIP-g (124M vs. 354M parameters, only ~1/8 the size of the vision tower), and retrieval tasks depend more on the capacity of the language branch than classification tasks do.
- Retrieval tasks seem to benefit more from a larger training dataset (LAION-2B for Open CLIP), while EVA-CLIP is trained only on LAION-400M.

Nevertheless, it is hard to make a head-to-head comparison between different CLIP models. In the future, we will further scale up the language encoder and the training data to improve retrieval performance.
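For reference, a minimal sketch of how Recall@K for zero-shot retrieval is typically computed from paired, L2-normalized image and text embeddings (illustrative only; `recall_at_k` and the random features are hypothetical, not the exact CLIP Benchmark code):

```python
import torch
import torch.nn.functional as F

def recall_at_k(image_feats: torch.Tensor, text_feats: torch.Tensor, k: int) -> float:
    # image_feats, text_feats: (N, D) L2-normalized embeddings; the i-th image
    # and the i-th text form the ground-truth pair.
    sim = image_feats @ text_feats.T                      # (N, N), sim[i, j] = image i vs. text j
    # Text-to-image retrieval: for each text, rank all images by similarity.
    topk = sim.T.topk(k, dim=-1).indices                  # (N, k) retrieved image indices per text
    targets = torch.arange(sim.shape[0]).unsqueeze(-1)    # (N, 1) ground-truth image index
    return (topk == targets).any(dim=-1).float().mean().item()

# Illustrative usage with random features:
img = F.normalize(torch.randn(1000, 768), dim=-1)
txt = F.normalize(torch.randn(1000, 768), dim=-1)
print(recall_at_k(img, txt, k=5))   # text-to-image R@5
```

The example below shows how to load EVA-CLIP with FlagAI and compute image-text matching probabilities for a single image.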
```python
import io
import urllib.request

import torch
from PIL import Image
from flagai.auto_model.auto_loader import AutoLoader
from flagai.data.dataset.mm.clip_dataset import clip_transform
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
loader = AutoLoader(task_name="txt_img_matching",  # contrastive learning
                    model_name="eva-clip")
model = loader.get_model()
model.eval()
model.to(device)
tokenizer = loader.get_tokenizer()
transform = clip_transform(img_size=model.visual.image_size)
def download_image(url):
    # Fetch an image over HTTP and return it as an in-memory byte stream.
    urllib_request = urllib.request.Request(
        url,
        data=None,
        headers={"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"},
    )
    with urllib.request.urlopen(urllib_request, timeout=10) as r:
        img_stream = io.BytesIO(r.read())
    return img_stream
def inference():
    # Local image:
    # image = Image.open("/path/to/image")
    # Online image:
    image = Image.open(download_image("https://bkimg.cdn.bcebos.com/pic/4610b912c8fcc3ce2d02315d9d45d688d53f209a?x-bce-process=image/watermark,image_d2F0ZXIvYmFpa2UxMTY=,g_7,xp_5,yp_5"))
    image = transform(image).unsqueeze(0).to(device)
    text = tokenizer.tokenize_as_tensor(["a tomato", "a cat"]).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        text_probs = (image_features @ text_features.T).softmax(dim=-1)
    print(text_probs.cpu().numpy()[0].tolist())  # [1.0, 0.0]

inference()
```
The code below performs zero-shot prediction using EVA-CLIP. This example takes an image from the CIFAR-100 dataset and predicts the most likely labels among the 100 textual labels of the dataset.
```python
import os
import torch
from torchvision.datasets import CIFAR100
from flagai.auto_model.auto_loader import AutoLoader
from flagai.data.dataset.mm.clip_dataset import clip_transform
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
loader = AutoLoader(task_name="txt_img_matching",  # contrastive learning
                    model_name="eva-clip")
model = loader.get_model()
model.eval()
model.to(device)
tokenizer = loader.get_tokenizer()
transform = clip_transform(img_size=model.visual.image_size)
# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)
# Prepare the inputs
image, class_id = cifar100[3637]
image_input = transform(image).unsqueeze(0).to(device)
text_inputs = torch.cat([tokenizer.tokenize_as_tensor(f"a photo of a {c}") for c in cifar100.classes]).to(device)
# Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)
# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)
# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")
```
The output will look like the following (the exact numbers may be slightly different depending on the compute device):
```
Top predictions:

           snake: 100.00%
          turtle: 0.00%
     caterpillar: 0.00%
            worm: 0.00%
         leopard: 0.00%
```
EVA-CLIP is built upon OpenAI CLIP, Open CLIP, and CLIP Benchmark. Thanks for their awesome work!