Clarify in docs and tutorials the output column in the results' dataset #617

Closed
rubenalv opened this issue Jun 27, 2024 · 1 comment · Fixed by #618

@rubenalv

(Apologies if the question is too obvious).
After training and testing with the tutorial https://github.com/DeepRank/deeprank2/blob/main/tutorials/training.ipynb, I get to this:

>>> output_test
     phase  epoch                      entry                                     output  target      loss
0  testing   18.0  residue-ppi:M-P:BA-113208   [0.4746413230895996, 0.5253586769104004]     1.0  0.668939
1  testing   18.0  residue-ppi:M-P:BA-135488   [0.4774721562862396, 0.5225278735160828]     1.0  0.668939
2  testing   18.0  residue-ppi:M-P:BA-136144    [0.533593475818634, 0.4664064645767212]     0.0  0.668939
3  testing   18.0  residue-ppi:M-P:BA-114113   [0.4767359495162964, 0.5232640504837036]     0.0  0.668939

In the same tutorial, the target for PPI classification is defined as not binding (0) or binding (1). I thought the values in output were the result of the softmax (left for probability of not binding, right for prob of binding), but that does not match with the binary target column in the example above. So what are these values?

@rubenalv rubenalv added the feature New feature to add label Jun 27, 2024
@gcroci2
Collaborator

gcroci2 commented Jun 27, 2024

> I thought the values in output were the result of the softmax (left for probability of not binding, right for prob of binding), but that does not match with the binary target column in the example above. So what are these values?

They are indeed the result of a softmax, as you can see from here. The output column (output) contains a list with two elements, as you thought: the predicted probabilities that the data point belongs to class 0 (first element, the non-binder class in the tutorial's example) and to class 1 (second element, the binder class).
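
As a rough illustration of why the two values in output always sum to 1, here is a minimal softmax sketch in plain Python. The raw scores passed in are hypothetical and are not taken from the tutorial or from the model's code; the function only shows the general technique.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for (class 0, class 1); assumed values for illustration.
probs = softmax([-0.05, 0.05])
p_non_binder, p_binder = probs
print(probs)
```

The first element plays the role of the non-binder probability and the second the binder probability, matching the order described above.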

The binary target column (target) is, as the name suggests, the true label you are trying to predict. If the output does not match the target, the model's prediction is wrong for that entry. In the 4 rows you reported above, the model predicts the right class 3 times out of 4 if you take 0.5 as the threshold on the second element: entry 0 would be a 1 (a probability of 0.52 is above the 0.5 threshold, so the prediction counts as a 1), entry 1 would be a 1, and entry 2 would be a 0. Only entry 3 would be incorrect, since a threshold of 0.5 gives a 1 while the target is 0. In general you can't expect good results in the tutorials, since we're using only 100 data points in total, and even fewer to train the models.
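
The check described above can be sketched in a few lines of plain Python. The rows are copied from the output_test excerpt in the question; the helper name predict_class and the choice to threshold the second element are assumptions made for this example, not part of the deeprank2 API.

```python
# Rows copied from the output_test excerpt in the question.
rows = [
    {"entry": "residue-ppi:M-P:BA-113208", "output": [0.4746413230895996, 0.5253586769104004], "target": 1.0},
    {"entry": "residue-ppi:M-P:BA-135488", "output": [0.4774721562862396, 0.5225278735160828], "target": 1.0},
    {"entry": "residue-ppi:M-P:BA-136144", "output": [0.533593475818634, 0.4664064645767212], "target": 0.0},
    {"entry": "residue-ppi:M-P:BA-114113", "output": [0.4767359495162964, 0.5232640504837036], "target": 0.0},
]


def predict_class(output, threshold=0.5):
    """Hypothetical helper: return 1 if the class-1 probability exceeds the threshold."""
    return 1 if output[1] > threshold else 0


predictions = [predict_class(row["output"]) for row in rows]
correct = [pred == int(row["target"]) for pred, row in zip(predictions, rows)]
accuracy = sum(correct) / len(rows)

print(predictions)  # [1, 1, 0, 1]
print(accuracy)     # 0.75
```

Only the last row is misclassified, which is the "3 times out of 4" result.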

The reason for keeping the probabilities for all the classes (in this case only two, class 0 and class 1) is that there are many different ways of computing metrics. Depending on the predictor's application, some users may be interested in the class 0 probability only, or in the class 1 probability only, and the threshold for assigning one class or the other can be tuned in very different ways.
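
To make the point about thresholds concrete, here is a small sketch that applies three different thresholds to the class 1 probabilities from the four rows in the question. The threshold values and the helper name classify are arbitrary choices for illustration, not recommendations.

```python
# Class 1 (binder) probabilities from the four rows in the question.
binder_probs = [
    0.5253586769104004,
    0.5225278735160828,
    0.4664064645767212,
    0.5232640504837036,
]


def classify(probabilities, threshold):
    """Hypothetical helper: label each entry 1 if its class-1 probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


# Arbitrary example thresholds; a real application would tune these on validation data.
results = {t: classify(binder_probs, t) for t in (0.45, 0.5, 0.525)}

for threshold, labels in results.items():
    print(threshold, labels)
```

With such closely clustered probabilities, a small change in threshold flips several predictions, which is why the raw probabilities are more useful to keep than a single hard label.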

I will leave this issue open for clarifying this further in the tutorials and in the documentation.

I hope this clears up your question, but please let me know if that's not the case :)

@gcroci2 gcroci2 changed the title from "In output_test, what are the values in the column "outputs"?" to "Clarify in docs and tutorials the output column in the results' dataset" Jun 27, 2024
@gcroci2 gcroci2 added docs Improvements or additions to documentation and removed feature New feature to add labels Jun 27, 2024
@gcroci2 gcroci2 self-assigned this Jul 1, 2024
@gcroci2 gcroci2 linked a pull request Jul 1, 2024 that will close this issue
@gcroci2 gcroci2 closed this as completed Jul 5, 2024