Contrastive Language-Image Pre-Training with EVA (EVA-CLIP)
| model name | #param. | precision | data | batch size | IN-1K zero-shot top-1 | Weights |
|---|---|---|---|---|---|---|
| eva-clip | 1.3B | fp16 | LAION-400M | 41K | 78.5 | ModelHub Link |
To our knowledge, EVA-CLIP is the largest performant open-source CLIP model, as measured by zero-shot classification performance.
For more details of EVA-CLIP, please refer to Section 2.3.5 of the paper.
Top-1 accuracy on ImageNet-1K variants and ObjectNet:
model | IN-1K | IN-V2 | IN-Adv. | IN-Ren. | IN-Ske. | ObjectNet |
---|---|---|---|---|---|---|
OpenAI CLIP-L | 75.55 | 69.86 | 70.76 | 87.83 | 59.58 | 68.98 |
Open CLIP-H | 77.96 | 70.87 | 59.33 | 89.33 | 66.58 | 69.71 |
Open CLIP-g | 76.65 | 69.56 | 57.19 | 88.69 | 65.17 | 67.53 |
EVA CLIP-g | 78.53 | 71.52 | 73.59 | 92.5 | 67.31 | 72.33 |
Performance on video action recognition benchmarks:
model | UCF-101 | Kinetics-400 | Kinetics-600 | Kinetics-700 |
---|---|---|---|---|
OpenAI CLIP-L | 76.39 | 64.47 | 64.21 | 57.68 |
Open CLIP-H | 78.16 | 63.06 | 63.58 | 56.09 |
Open CLIP-g | 77.73 | 61.69 | 62.16 | 54.99 |
EVA CLIP-g | 76.05 | 65.23 | 64.38 | 58.4 |
For video action recognition, we sample only a single center frame from each video, turning it into an image classification task. Following conventional settings, we report top-1 accuracy for UCF-101 and the mean of top-1 and top-5 accuracy for Kinetics-400/600/700.
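As a rough illustration of this protocol (not the exact evaluation code; the tensors and helper names below are hypothetical), center-frame sampling and the reported metrics can be sketched as follows:

```python
import torch

def center_frame(frames: torch.Tensor) -> torch.Tensor:
    # frames: (num_frames, C, H, W); keep only the single center frame,
    # which turns video classification into image classification.
    return frames[frames.shape[0] // 2]

def topk_accuracy(scores: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    # scores: (N, num_classes) image-text similarity scores, labels: (N,)
    topk = scores.topk(k, dim=-1).indices                 # (N, k)
    hit = (topk == labels.unsqueeze(-1)).any(dim=-1)      # (N,)
    return hit.float().mean().item()

# Hypothetical scores for 8 clips of a 400-class benchmark (e.g. Kinetics-400):
scores = torch.randn(8, 400)
labels = torch.randint(0, 400, (8,))
top1, top5 = topk_accuracy(scores, labels, 1), topk_accuracy(scores, labels, 5)
ucf101_metric = top1                    # UCF-101: top-1 accuracy
kinetics_metric = (top1 + top5) / 2     # Kinetics-400/600/700: mean of top-1 and top-5
```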
Zero-shot retrieval performance (Recall@K) on Flickr30k and MSCOCO:
| Dataset | Model | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 |
|---|---|---|---|---|---|---|---|
| Flickr30k | OpenAI CLIP-L | 65.18 | 87.28 | 92.00 | 85.20 | 97.30 | 99.00 |
| Flickr30k | Open CLIP-H | 77.78 | 94.14 | 96.62 | 90.80 | 99.30 | 99.70 |
| Flickr30k | Open CLIP-g | 76.52 | 93.62 | 96.28 | 90.80 | 99.10 | 99.80 |
| Flickr30k | EVA CLIP-g | 72.64 | 91.60 | 95.12 | 88.30 | 98.30 | 99.30 |
| MSCOCO | OpenAI CLIP-L | 36.51 | 61.01 | 71.11 | 56.34 | 79.32 | 86.66 |
| MSCOCO | Open CLIP-H | 49.47 | 73.40 | 81.53 | 65.96 | 86.06 | 91.90 |
| MSCOCO | Open CLIP-g | 47.99 | 72.37 | 80.75 | 64.96 | 85.30 | 91.46 |
| MSCOCO | EVA CLIP-g | 44.07 | 68.50 | 77.33 | 61.76 | 83.28 | 89.96 |
The zero-shot retrieval performance of EVA-CLIP is somewhat inferior to its Open CLIP-H / -g counterparts. We speculate there are two main reasons:
- The language tower of EVA-CLIP is much smaller and weaker than that of Open CLIP-H and Open CLIP-g (124M vs. 354M parameters, only ~1/8 the size of the vision tower), and retrieval tasks depend more on the capacity of the language branch than classification tasks do.
- Retrieval tasks seem to benefit more from a larger training dataset (LAION-2B for Open CLIP), while EVA-CLIP is trained only on LAION-400M.

Nevertheless, it is hard to make a head-to-head comparison between different CLIP models. In the future, we will further scale up the language encoder and the training data to improve retrieval performance.
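For reference, a minimal sketch of how Recall@K for zero-shot retrieval is typically computed from paired, L2-normalized image and text embeddings (illustrative only; `recall_at_k` and the random features are hypothetical, not the exact CLIP Benchmark code):

```python
import torch
import torch.nn.functional as F

def recall_at_k(image_feats: torch.Tensor, text_feats: torch.Tensor, k: int) -> float:
    # image_feats, text_feats: (N, D) L2-normalized embeddings; the i-th image
    # and the i-th text form the ground-truth pair.
    sim = image_feats @ text_feats.T                      # (N, N), sim[i, j] = image i vs. text j
    # Text-to-image retrieval: for each text, rank all images by similarity.
    topk = sim.T.topk(k, dim=-1).indices                  # (N, k) retrieved image indices per text
    targets = torch.arange(sim.shape[0]).unsqueeze(-1)    # (N, 1) ground-truth image index
    return (topk == targets).any(dim=-1).float().mean().item()

# Illustrative usage with random features:
img = F.normalize(torch.randn(1000, 768), dim=-1)
txt = F.normalize(torch.randn(1000, 768), dim=-1)
print(recall_at_k(img, txt, k=5))   # text-to-image R@5
```

The example below shows how to load EVA-CLIP with FlagAI and compute image-text matching probabilities for a single image.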
```python
import io
import urllib.request

import torch
from PIL import Image
from flagai.auto_model.auto_loader import AutoLoader
from flagai.data.dataset.mm.clip_dataset import clip_transform
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
loader = AutoLoader(task_name="txt_img_matching",  # contrastive learning
                    model_name="eva-clip")
model = loader.get_model()
model.eval()
model.to(device)
tokenizer = loader.get_tokenizer()
transform = clip_transform(img_size=model.visual.image_size)
def download_image(url):
    # Fetch an image over HTTP and return it as an in-memory byte stream.
    urllib_request = urllib.request.Request(
        url,
        data=None,
        headers={"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"},
    )
    with urllib.request.urlopen(urllib_request, timeout=10) as r:
        img_stream = io.BytesIO(r.read())
    return img_stream
def inference():
    # Local image:
    # image = Image.open("/path/to/image")
    # Online image:
    image = Image.open(download_image("https://bkimg.cdn.bcebos.com/pic/4610b912c8fcc3ce2d02315d9d45d688d53f209a?x-bce-process=image/watermark,image_d2F0ZXIvYmFpa2UxMTY=,g_7,xp_5,yp_5"))
    image = transform(image).unsqueeze(0).to(device)
    text = tokenizer.tokenize_as_tensor(["a tomato", "a cat"]).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        text_probs = (image_features @ text_features.T).softmax(dim=-1)
    print(text_probs.cpu().numpy()[0].tolist())  # [1.0, 0.0]

inference()
```
The code below performs zero-shot prediction using EVA-CLIP. This example takes an image from the CIFAR-100 dataset and predicts the most likely labels among the 100 textual labels of the dataset.
```python
import os
import torch
from torchvision.datasets import CIFAR100
from flagai.auto_model.auto_loader import AutoLoader
from flagai.data.dataset.mm.clip_dataset import clip_transform
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
loader = AutoLoader(task_name="txt_img_matching",  # contrastive learning
                    model_name="eva-clip")
model = loader.get_model()
model.eval()
model.to(device)
tokenizer = loader.get_tokenizer()
transform = clip_transform(img_size=model.visual.image_size)
# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)
# Prepare the inputs
image, class_id = cifar100[3637]
image_input = transform(image).unsqueeze(0).to(device)
text_inputs = torch.cat([tokenizer.tokenize_as_tensor(f"a photo of a {c}") for c in cifar100.classes]).to(device)
# Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)
# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)
# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")
```
The output will look like the following (the exact numbers may be slightly different depending on the compute device):
```
Top predictions:

           snake: 100.00%
          turtle: 0.00%
     caterpillar: 0.00%
            worm: 0.00%
         leopard: 0.00%
```
EVA-CLIP is built upon OpenAI CLIP, Open CLIP, and CLIP Benchmark. Thanks for their awesome work!