Non-reproducible MSRVTT results - I get R@1 accuracy less than 1% #51
Comments
Hi @lennartmoritz, I'm currently using this model for my project and I'm having the same issue with eval_msrvtt.sh. I wrote my own script for model evaluation. Unfortunately, the FT models do not show the expected results, but the Large models are OK (LanguageBind_Video, LanguageBind_Audio). You may try running my script; it gave me around 41.50 R@1, 65.80 R@5, 75.50 R@10:
from collections import defaultdict
import torch
import pandas as pd
import numpy as np
from more_itertools import chunked
from tqdm.auto import tqdm
from languagebind import LanguageBind, to_device, transform_dict, LanguageBindImageTokenizer


def compute_metrics(x):
    # x is a square similarity matrix; the matching pair sits on the diagonal.
    sx = np.sort(-x, axis=1)
    d = np.diag(-x)
    d = d[:, np.newaxis]
    ind = sx - d
    ind = np.where(ind == 0)
    ind = ind[1]
    metrics = {}
    metrics['R1'] = float(np.sum(ind == 0)) * 100 / len(ind)
    metrics['R5'] = float(np.sum(ind < 5)) * 100 / len(ind)
    metrics['R10'] = float(np.sum(ind < 10)) * 100 / len(ind)
    metrics['MR'] = np.median(ind) + 1
    metrics["MedianR"] = metrics['MR']
    metrics["MeanR"] = np.mean(ind) + 1
    # metrics["cols"] = [int(i) for i in list(ind)]
    return metrics


def main():
    device = torch.device('cuda:0')
    clip_type = {
        'video': 'LanguageBind_Video',  # also LanguageBind_Video_FT
        'audio': 'LanguageBind_Audio',  # also LanguageBind_Audio_FT
        # 'image': 'LanguageBind_Image',
        # 'thermal': 'LanguageBind_Thermal',
        # 'depth': 'LanguageBind_Depth',
    }
    model = LanguageBind(clip_type=clip_type, cache_dir='./cache_dir').to(device)
    model.eval()
    tokenizer = LanguageBindImageTokenizer.from_pretrained('lb203/LanguageBind_Image', cache_dir='./cache_dir/tokenizer_cache_dir')
    modality_transform = {c: transform_dict[c](model.modality_config[c]) for c in clip_type.keys()}

    df = pd.read_csv('../data/MSRVTT/MSRVTT_JSFUSION_test.csv')
    language_data = df['sentence'].values.tolist()
    video_data = df['video_id'].apply(lambda x: f'../data/MSRVTT/videos/all/{x}.mp4').values.tolist()

    def embed(x: list[list], dtypes: list[str]) -> dict:
        # Build one input dict with an entry per modality, then run a single forward pass.
        inputs = {}
        for data, dtype in zip(x, dtypes):
            if dtype == 'language':
                inputs['language'] = to_device(tokenizer(data, max_length=77, padding='max_length', truncation=True, return_tensors='pt'), device)
            elif dtype in ['image', 'video', 'audio', 'depth', 'thermal', 'language']:
                inputs[dtype] = to_device(modality_transform[dtype](data), device)
            else:
                raise ValueError(f'Unsupported modality: {dtype}')
        with torch.no_grad():
            embeddings = model(inputs)
        embeddings = {k: v.detach().cpu().numpy() for k, v in embeddings.items()}
        return embeddings

    batch_size = 16
    # Start from empty (0, 768) arrays so each batch can be appended with np.concatenate.
    results = defaultdict(lambda: np.empty((0, 768)))
    for batch in tqdm(list(zip(
        chunked(language_data, batch_size),
        chunked(video_data, batch_size)
    ))):
        embeddings = embed(
            batch,
            dtypes=['language', 'video']
        )
        results['language'] = np.concatenate([results['language'], embeddings['language']])
        results['video'] = np.concatenate([results['video'], embeddings['video']])

    video = results['video']
    language = results['language']
    np.save('experiments/MSR-VTT_test_video_embeddings.npy', video)
    np.save('experiments/MSR-VTT_test_language_embeddings.npy', language)

    # Rows are videos, columns are captions; compute_metrics expects a NumPy array.
    sim_matrix = video @ language.T
    print('VT', compute_metrics(sim_matrix))
    print('TV', compute_metrics(sim_matrix.T))


if __name__ == '__main__':
    main()
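As a sanity check on the metric itself, the recall computation can be tried on a tiny hand-made similarity matrix before running the full model. The sketch below is not the script's compute_metrics; it is an alternative formulation that computes each rank by counting how many candidates score at least as high as the matching (diagonal) one. The function names retrieval_ranks and recall_metrics are hypothetical. The difference matters when scores tie: in compute_metrics, np.where(ind == 0) returns one entry per matching position, so a row containing duplicates of the diagonal score contributes more than one entry, whereas here every query contributes exactly one (pessimistic) rank.

```python
import numpy as np


def retrieval_ranks(sim: np.ndarray) -> np.ndarray:
    """Return the 0-based rank of the diagonal (matching) item in each row.

    Ties are resolved pessimistically: a candidate with the same score as the
    match is counted as ranked ahead of it.
    """
    sim = np.asarray(sim)
    diag = np.diag(sim)[:, np.newaxis]
    # Count candidates scoring >= the match, then subtract the match itself.
    return (sim >= diag).sum(axis=1) - 1


def recall_metrics(sim: np.ndarray) -> dict:
    """Recall@1/5/10 in percent, plus median and mean rank (1-based)."""
    ranks = retrieval_ranks(sim)
    return {
        "R1": float(np.mean(ranks < 1)) * 100,
        "R5": float(np.mean(ranks < 5)) * 100,
        "R10": float(np.mean(ranks < 10)) * 100,
        "MedianR": float(np.median(ranks)) + 1,
        "MeanR": float(np.mean(ranks)) + 1,
    }


if __name__ == "__main__":
    # Toy 3x3 matrix: queries 0 and 2 rank their match first, query 1 ranks it last.
    toy = np.array([
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.5],
        [0.1, 0.2, 0.7],
    ])
    print(retrieval_ranks(toy).tolist())  # [0, 2, 0]
    print(recall_metrics(toy))
```

For the text-to-video direction, pass the transposed matrix, as the script above does with sim_matrix.T.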
Hey @e1four15f, thank you for your code example. In the meantime, I wrote a script similar to yours, based on the inference example script from the repo. But I've noticed that it is considerably slower than the eval script. I suspect it has to do with the batch sizes used. Have you found a way to select a batch size for inference with your script?
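In the script above, the batch size is the batch_size variable (set to 16). One way to explore the question is to time the embedding step for a few candidate batch sizes on a small subset of the data. The sketch below is a generic, minimal version of that idea and makes no claim about where the slowdown actually comes from: embed_in_batches, time_batch_sizes, and fake_embed are hypothetical names, and fake_embed is only a stand-in for a real embedding call such as the script's embed function.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, List, Sequence


def batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(items: Sequence, embed_fn: Callable, batch_size: int) -> List:
    """Run `embed_fn` batch by batch and collect the per-item outputs in order."""
    outputs: List = []
    for batch in batches(items, batch_size):
        outputs.extend(embed_fn(batch))
    return outputs


def time_batch_sizes(items: Sequence, embed_fn: Callable,
                     candidates: Iterable[int]) -> Dict[int, float]:
    """Return wall-clock seconds needed to embed `items` for each batch size."""
    timings: Dict[int, float] = {}
    for batch_size in candidates:
        start = time.perf_counter()
        embed_in_batches(items, embed_fn, batch_size)
        timings[batch_size] = time.perf_counter() - start
    return timings


def fake_embed(batch: Sequence[str]) -> List[List[float]]:
    """Hypothetical stand-in for a model call: one 1-d 'embedding' per item."""
    return [[float(len(text))] for text in batch]


if __name__ == "__main__":
    captions = ["a cat", "a dog runs", "people talking", "a car", "music"]
    for size, seconds in time_batch_sizes(captions, fake_embed, [1, 2, 4]).items():
        print(f"batch_size={size}: {seconds:.6f}s")
```

With a real model, the largest batch size that fits in GPU memory is a common starting point, but whether it helps depends on how much of the time goes into loading and preprocessing the videos rather than the forward pass.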
I am trying to verify/reproduce your paper's validation results without training the model myself, and I expected 42.6% R@1 accuracy on MSR-VTT.
But when I follow the instructions from TRAIN_AND_VALIDATE.md (I only ran eval.sh, no training), I get results that are as bad as random guessing, with about 0.1% R@1 accuracy. See my out.log here:
What I need:
Please tell me how I can select your final model for the eval script so that it reproduces the results you published.
What I suspect is wrong:
I guess the issue is that I am evaluating the untrained model here instead of your trained version.
Maybe I misunderstood the instructions, and the pretrained weights I downloaded are not the same as the fully trained model described in the paper.
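The "random guessing" reading of 0.1% can be checked with a little arithmetic: if the correct item is equally likely to land at any rank among N candidates, the expected Recall@k is k/N. The sketch below assumes N = 1000, the number of video-caption pairs in the MSRVTT JSFUSION test split; chance_recall is a hypothetical helper name.

```python
def chance_recall(num_candidates: int, k: int) -> float:
    """Expected Recall@k in percent when the correct item's rank is uniformly random."""
    if num_candidates < 1 or k < 1:
        raise ValueError("num_candidates and k must be at least 1")
    return 100.0 * min(k, num_candidates) / num_candidates


if __name__ == "__main__":
    n = 1000  # assumed size of the test split
    for k in (1, 5, 10):
        print(f"chance R@{k}: {chance_recall(n, k):.1f}%")
```

Under that assumption, chance level is 0.1% for R@1, 0.5% for R@5, and 1.0% for R@10, so a measured R@1 of about 0.1% is consistent with embeddings that carry no usable signal.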
I have also tried to get your final model by running my eval_msrvtt.sh script with the TRANSFORMERS_OFFLINE=0 environment variable and an empty cache_dir, in hopes of downloading the fully trained version. Strangely enough, this leads to slightly different results in my out.log:
How to reproduce:
I follow TRAIN_AND_VALIDATE.md. I take eval.sh and save it as eval_msrvtt.sh. Then I execute the script.
This is my eval_msrvtt.sh: