eval_func #13
Comments
Hi @tinaboya2023, thanks for your reply. You are right, the metric used for evaluation is not correct; at the time I implemented it, I had no clue about metric evaluation. I have since found ANLS and will include it in the scripts shortly (life has just been busy). Could you help me with the scripts? I wanted to ask you whether the code below for ANLS is correct.

```python
import datasets
import evaluate
from Levenshtein import ratio

## ANLS Calculations
## Ref: https://github.com/huggingface/evaluate/pull/413/files
_CITATION = """\
@article{,
title = {Binary codes capable of correcting deletions, insertions, and reversals},
journal = {Soviet physics doklady},
volume = {10},
number = {8},
pages = {707--710},
year = {1966},
url = {https://nymity.ch/sybilhunting/pdf/Levenshtein1966a.pdf},
author = {V. I. Levenshtein},
}
"""
_DESCRIPTION = """\
ANLS refers to the Average Normalized Levenshtein Similarity.
"""
_KWARGS_DESCRIPTION = """
Computes Average Normalized Levenshtein Similarity (ANLS).
Args:
    predictions: List of question-answer dictionaries with the following key-values:
        - 'question_id': id of the question-answer pair as given in the references (see below)
        - 'prediction_text': the text of the answer
    references: List of question-answer dictionaries with the following key-values:
        - 'question_id': id of the question-answer pair (see above)
        - 'answers': list of possible texts for the answer, as a list of strings
Returns:
    'anls': The ANLS score of predicted tokens versus the gold answer
Examples:
    >>> predictions = [{'prediction_text': 'Denver Broncos', 'question_id': '56e10a3be3433e1400422b22'}]
    >>> references = [{'answers': ['Denver Broncos', 'Denver R. Broncos'], 'question_id': '56e10a3be3433e1400422b22'}]
    >>> anls_metric = evaluate.load("anls")
    >>> results = anls_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'anls_score': 100.0}
"""
def compute_score(predictions, ground_truths):
    theta = 0.5  # threshold: answers with normalized distance >= theta contribute 0
    anls_score = 0
    count = 0
    for qid, prediction in predictions.items():
        max_value = 0
        if qid in ground_truths:
            for x in ground_truths[qid]:
                # normalized Levenshtein distance between prediction and this gold answer
                nl = 1 - ratio(prediction, x)
                if nl < theta:
                    score = 1 - nl
                    if score > max_value:
                        max_value = score
            count += 1
            anls_score += max_value
    return anls_score / count
class Anls(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": {
                        "question_id": datasets.Value("string"),
                        "prediction_text": datasets.Value("string"),
                    },
                    "references": {
                        "question_id": datasets.Value("string"),
                        "answers": datasets.features.Sequence(datasets.Value("string")),
                    },
                }
            ),
        )

    def _compute(self, predictions, references):
        ground_truths = {x["question_id"]: x["answers"] for x in references}
        predictions = {x["question_id"]: x["prediction_text"] for x in predictions}
        anls_score = compute_score(predictions=predictions, ground_truths=ground_truths)
        return {"anls_score": anls_score}
```
Hi again, I'm so sorry for the delayed reply. I think that, following M4C, you should add `EvalAIAnswerProcessor`, and then, following the `TextVQAAccuracy` class, you can use this metric, e.g. `from m4c_evaluators import TextVQAAccuracyEvaluator` inside `training_step(self, batch, batch_idx)`.
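A minimal sketch of what that wiring could look like, assuming the repository's `m4c_evaluators.py` matches the MMF/M4C version (so `EvalAIAnswerProcessor` and `TextVQAAccuracyEvaluator.eval_pred_list` exist); the answer strings below are hypothetical:

```python
from m4c_evaluators import EvalAIAnswerProcessor, TextVQAAccuracyEvaluator

processor = EvalAIAnswerProcessor()      # normalizes case, articles, punctuation, number words
evaluator = TextVQAAccuracyEvaluator()   # computes the soft VQA accuracy

print(processor("The Coca-Cola Company!"))  # lower-cased, de-punctuated, article-stripped string

# eval_pred_list expects one dict per sample: the decoded prediction and the
# list of human ground-truth answers (10 per question in TextVQA).
pred_list = [
    {
        "pred_answer": "coca cola",                       # hypothetical model output
        "gt_answers": ["coca cola"] * 6 + ["coke"] * 4,   # hypothetical annotations
    }
]
print(evaluator.eval_pred_list(pred_list))  # soft accuracy in [0, 1]
```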
Hi @tinaboya2023, I tried to make some changes to the evaluation method according to your idea, but the results were poor. Could you please help me check the problems in my method? Maybe I'm wrong.

```python
import torch
from m4c_evaluators import TextVQAAccuracyEvaluator

evaluator = TextVQAAccuracyEvaluator()
predictions = []

def calculate_acc_score(pred, gt):
    ## Function ignores the calculation of the padding part
    ## Shape (seq_len, seq_len)
    mask = torch.clamp(gt, min=0, max=1)
    last_non_zero_argument = (mask != 0).nonzero()[1][-1]
    pred = pred[:last_non_zero_argument]
    gt = gt[:last_non_zero_argument]  ## Include all the arguments till the first padding index
    pred_answer = convert_token_to_ques(pred, tokenizer)
    gt_answer = convert_token_to_ques(gt, tokenizer)
    predictions.append({
        "pred_answer": pred_answer,
        "gt_answers": gt_answer,
    })

def calculate_metrics(self, prediction, labels):
    ## Calculate the accuracy score between the prediction and the ground label for a batch,
    ## taking the pad sequence into account
    batch_size = len(prediction)
    ac_score = 0
    for (pred, gt) in zip(prediction, labels):
        calculate_acc_score(pred.detach().cpu(), gt.detach().cpu())
    ac_score = evaluator.eval_pred_list(predictions)
    ac_score = torch.tensor(ac_score).cuda()
    return ac_score
```
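To make the padding-trim step easier to check in isolation, here is a tiny, self-contained sketch (my own 1D example with 0 as the hypothetical padding id, not the repo's actual tensors):

```python
import torch

gt = torch.tensor([23, 7, 451, 9, 0, 0, 0])         # hypothetical token ids, 0 = padding
mask = torch.clamp(gt, min=0, max=1)                 # 1 for real tokens, 0 for padding
last_non_zero = (mask != 0).nonzero()[-1].item()     # index of the last real token
print(gt[: last_non_zero + 1])                       # tensor([ 23,   7, 451,   9])
```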
Hi @FrankZxShen and @tinaboya2023, sorry for the late reply; I have been a bit busy with some of my work. I will start modifying this repository again, since there are a few things I felt were missing. The main reason was that I had just been exploring how to implement the models, and had no idea about the evaluation criteria of TextVQA. Here is what I am planning now. Currently, the entire approach is extractive in nature, but I believe the authors took a generative approach, since that removes a lot of data-preparation steps as well as the problem of answers not being present in the context.
Earlier, I wasn't even sure how to use T5 for this, but I have since implemented a similar paper, and after going through T5's implementation in Hugging Face, I think I can improve it. I will also shortly try to add the pre-training script, as I was able to prepare a subset of the IDL Dataset for pre-training. I know it is a fairly big task, but I will try to add it. As mentioned, I am learning along the way, and would really love to hear your suggestions and comments. Regarding the remark made by @tinaboya2023, I will have a look at the metric and let you know soon. I had been busy learning and studying, hence the long delay. Again, apologies, and I am looking forward to working on the repo.
Hi @FrankZxShen
I changed len(answers) in m4c_evaluator.py, but that didn't solve it.
Hi @uakarsh, thank you for your follow-up; I am very eager to see the changes soon. It seems that if these changes are applied, the accuracy will be much higher.
Hi @tinaboya2023, for your 2nd question, I believe it is the correct approach to use
Hey @tinaboya2023
Thank you for your work! @uakarsh I hope the problem can be resolved soon.
Hi @FrankZxShen
Hi again @tinaboya2023

```python
def _compute_answer_scores(self, raw_answers):
    """
    compute the accuracy (soft score) of human answers
    """
    answers = [self.answer_processor(a) for a in raw_answers]
    # assert len(answers) == 10
    gt_answers = list(enumerate(answers))
    unique_answers = set(answers)
    unique_answer_scores = {}
    for unique_answer in unique_answers:
        accs = []
        for gt_answer in gt_answers:
            other_answers = [
                item for item in gt_answers if item != gt_answer
            ]
            matching_answers = [
                item for item in other_answers if item[1] == unique_answer
            ]
            acc = min(1, float(len(matching_answers)) / 3)
            accs.append(acc)
        unique_answer_scores[unique_answer] = sum(accs) / len(accs)
    return unique_answer_scores
```

Of course, we can expect changes from the repo author. I'm not sure that's the right thing to do in response to your questions.
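Just to illustrate what this returns (assuming an MMF-style `TextVQAAccuracyEvaluator`; the ten answers below are made up):

```python
from m4c_evaluators import TextVQAAccuracyEvaluator

evaluator = TextVQAAccuracyEvaluator()
scores = evaluator._compute_answer_scores(["coca cola"] * 8 + ["coke"] * 2)
print(scores)  # {'coca cola': 1.0, 'coke': 0.6} with the leave-one-out averaging above
```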
Hi Mr @FrankZxShen
Hi @tinaboya2023
Hi @FrankZxShen
I had the same question. As you can see here, in the last part of the notebook there are multiple answers, and I am not sure how to handle that. I figured out encoding the context and the question in a single sequence (since there is a need to include both the question and the context), however I am still not sure about how to encode the answer. By the way, I realized that a part of the codebase needs to be rewritten (since I complicated stuff earlier), but I will complete it. I found some references here; maybe they can help in the evaluation part. Could you suggest any way in which we can encode the answer and evaluate it? Once that is clear, I can handle the other part. Here is what I am thinking: take one of the answers from the given answers, as done here (a rough sketch of what I mean follows below). Maybe that is good to go? I would be going through this repo to know more. Any comments @tinaboya2023 @FrankZxShen?
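Here is one possible way (only a sketch, not the repo's actual pipeline) to encode the question and the OCR context in a single sequence and pick one of the human answers as the T5 target; the sample strings, the `t5-base` checkpoint, and the most-frequent-answer heuristic are all assumptions on my side:

```python
from transformers import T5TokenizerFast

# Hypothetical sample; in TextVQA each question has 10 human answers,
# and here we simply pick the most frequent one as the training target.
question = "what is the brand of this soda?"
context = "coca cola classic 12 fl oz"            # OCR tokens joined into one string
answers = ["coca cola", "coca cola", "coke"]
target = max(set(answers), key=answers.count)     # -> "coca cola"

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
inputs = tokenizer(f"question: {question}  context: {context}",
                   truncation=True, max_length=512, return_tensors="pt")
labels = tokenizer(target, padding="max_length", truncation=True,
                   max_length=32, return_tensors="pt").input_ids
labels[labels == tokenizer.pad_token_id] = -100   # ignore padding in the loss
```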
Hi @FrankZxShen @tinaboya2023, I tried implementing the metrics (since the other parts have been completed). It is here; like @FrankZxShen mentioned, I commented out the
Hi @uakarsh, @FrankZxShen, I have another question too, about the step-by-step walkthrough you wrote in
So now you should change some parts of this class. For example
Hi @tinaboya2023, I did try to make a set of notebooks here: https://github.com/uakarsh/latr/tree/main/examples/new_textvqa I think they should be helpful.
Hi again @uakarsh,
Hi, in the `calculate_acc_score` function it seems you calculate the evaluation with only a sum and an average (`accuracy_score` in Python), but in fact for TextVQA you should probably calculate the evaluation with the formula below. Of course, maybe I'm wrong.

Acc(ans) = min(ha / 3, 1), where ha is the number of human annotators who gave the answer ans.
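A tiny sketch of that formula in isolation (my own simplification: it skips answer normalization and the leave-one-out averaging that the official evaluator uses):

```python
from collections import Counter

def vqa_soft_accuracy(pred_answer, human_answers):
    # Acc(ans) = min(#humans who gave ans / 3, 1)
    counts = Counter(human_answers)
    return min(counts[pred_answer] / 3.0, 1.0)

print(vqa_soft_accuracy("coca cola", ["coca cola"] * 4 + ["coke"] * 6))  # 1.0
print(vqa_soft_accuracy("coke", ["coca cola"] * 8 + ["coke"] * 2))       # 0.666...
```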