Metrics to report in the paper #145
Comments
Hi @fedenanni, thanks for the careful explanation. I do agree with you after reading this, but we should add your explanation to the paper. I like it because it ties in very well with the type of application we aim to develop, one that helps historians explore a particular sense, and this makes the overall idea of targeted sense disambiguation clearer to the reviewer. If you could work the above explanation into the paper, that'd be great! I guess we don't need to change the code a lot, just rerun the scripts for computing the tables.
Ciao @kasparvonbeelen! No problem, I'll write it out and check that the flow is consistent across the paper. I added a flag in the
@fedenanni, haha, no worries. After reading your comments I totally understand your point and agree. I think it will make the paper stronger (even if the numbers are a bit lower ;-) )
Thanks @fedenanni, sounds good to me as well! I've updated the notebook where results are computed and changed the numbers in the paper accordingly. Could you have a look just in case? https://github.com/Living-with-machines/HistoricalDictionaryExpansion/blob/dev/create_results_tables.ipynb
Ah @kasparvonbeelen, one question: is BERT1900 trained on data until 1900 or 1920? If the latter, should we change the name to BERT1920 so it's aligned with the experiment?
@mcollardanuy I am actually not sure if there was a cut-off for training BERT, I think @kasra-hosseini knows. 1900 is a proxy for the "whole nineteenth-century book corpus" (which has a few books later than 1900, I suppose). For the experiments, I used 1760-1920 to refer to the "long nineteenth century", as it is a more historically motivated periodization. Hope this is clear?
Hi @kasra-hosseini, I meant BLERT. Is 1900 the end date as well? |
Hi, no, there is no end date on BLERT. Sorry, let me correct myself. The two BERT models used here are:
I come to this late, but I agree with @fedenanni 's explanation, it's convincing especially for the use case (historical research) we have in mind |
Hi all, me again on this point. I thought about it a bit and I'll try to explain here why I would report precision, recall and F1 for the label 1 instead of the macro average.
Remember that we are in a binary classification scenario with very unbalanced labels and we want to know which method is the best one at correctly predicting the 1 label (so the best one at finding correct occurrences of a specific sense). Now, consider this setting, where you have gold labels and three approaches: a majority class baseline (which always predicts 0), a random baseline and "our approach" which sometimes predicts it correctly. We want to know if we are better than the baselines in capturing the 1 cases.
If we compute precision and recall for each class, for each method, this is what we get:
Note! I computed the macro F1 score manually from the values of precision and recall. If you use scikit-learn out of the box you will get `[0.667, 0.75, 0.583]`, where 0.583 is not the harmonic mean of p (0.667) and r (0.75), but the average of the F1 of label_1 (0.5) and label_0 (0.667), because of this issue. Reporting `[0.667, 0.75, 0.583]`, I believe, would make the reader (and especially the reviewer) very confused, so I added a patch in #144 at least to compute the `macro F1` correctly.
You can try yourself - you'll get the same behaviour for all the other methods.
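The difference between the two ways of computing macro F1 can be sketched in plain Python. This is only an illustration: the gold labels and the random predictions below are assumptions, picked so that the scores match the numbers quoted above, and they are not the tables from the original comment. The helper `prf` is a hypothetical name and is not taken from the patch in #144.

```python
# Illustrative data only: these labels are assumptions chosen to match the
# numbers quoted above, not the tables from the original comment.
gold = [1, 0, 0, 0, 0]
random_pred = [1, 1, 1, 0, 0]


def prf(gold, pred, label):
    """Precision, recall and F1 for one label (0.0 when undefined)."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


p0, r0, f0 = prf(gold, random_pred, 0)
p1, r1, f1 = prf(gold, random_pred, 1)

macro_p = (p0 + p1) / 2
macro_r = (r0 + r1) / 2

# Macro F1 as the mean of the per-label F1 scores
# (the behaviour described above for scikit-learn).
macro_f1_mean = (f0 + f1) / 2

# Macro F1 as the harmonic mean of macro precision and macro recall
# (the manual computation described above).
macro_f1_harmonic = 2 * macro_p * macro_r / (macro_p + macro_r)

print(round(macro_p, 3), round(macro_r, 3), round(macro_f1_mean, 3))  # 0.667 0.75 0.583
print(round(macro_f1_harmonic, 3))  # 0.706
```

With these assumed labels, the mean of the per-label F1 scores is 0.583, while the harmonic mean of macro precision and macro recall is about 0.706, so the two definitions give visibly different values for the same predictions.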
Now, remember that our goal is to assess which method is better at finding 1s. If we consider `macro`, it seems that `random` and `our` are not that distant, and overall this seems an easy task (if you just randomly predict, you get it right around 70% of the time, for the different metrics):
However, if you look at label 1:
The story is a bit different and it is closer to reality (especially for precision). The task is actually hard and if you guess randomly you will return lots of false positives. Majority is a useless approach for label 1 because the majority class is label 0, so we will never return anything for label 1.
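As a rough sketch of the label 1 comparison, the snippet below scores three sets of predictions against the same gold labels. All the labels are illustrative assumptions, chosen to be consistent with the figures mentioned in this comment (33% and 50% precision, perfect recall); they are not the original tables, and `prf` is a hypothetical helper.

```python
# Illustrative data only (assumed, not the original tables).
gold = [1, 0, 0, 0, 0]
predictions = {
    "majority": [0, 0, 0, 0, 0],  # always predicts the majority class 0
    "random": [1, 1, 1, 0, 0],
    "our": [1, 1, 0, 0, 0],
}


def prf(gold, pred, label):
    """Precision, recall and F1 for one label (0.0 when undefined)."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Scores for label 1 only, which is the label the end user cares about.
label_1_scores = {name: prf(gold, pred, 1) for name, pred in predictions.items()}

for name, (p, r, f) in label_1_scores.items():
    print(name, round(p, 3), round(r, 3), round(f, 3))
# majority 0.0 0.0 0.0
# random 0.333 1.0 0.5
# our 0.5 1.0 0.667
```

Under these assumptions the majority baseline scores zero on label 1 because it never predicts it, and the gap between `random` and `our` shows up mainly in precision.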
If we have to suggest to a historian the best method for finding occurrences of a specific sense of machine (so the goal of our ACL paper), based on these numbers we will tell them that with our approach 50% of the retrieved results will be correct (precision) with a perfect recall, while if they go with random only 33% of the retrieved results will be correct. The performance on macro is not informative to the final user, because the final user does not care about how we perform on label `0`.
To conclude, I would report the results for `label 1`, because I think it is the most meaningful metric for the task (even if the numbers will all be a bit lower - but they will more precisely represent the experimental setting and the goal of the paper). I can write the part of the paper justifying it.
@kasparvonbeelen @BarbaraMcG @mcollardanuy @kasra-hosseini @GiorgiatolfoBL let me know what you think, and especially if you spot any error, as I might have missed something. However, if you prefer to go with `macro`, no problem, but we should maybe change the argumentation in the paper a bit, so that the metric is more in line with the problem.