Accuracy of your genbank annotations #15

Open
leannmlindsey opened this issue Jan 21, 2024 · 0 comments
Hello, I wanted to check that compare_predictions_to_phages.py was working, so I ran it with a TSV file containing the reference locations of the phages in NC_002655.

I was expecting to get perfect results, since I was using the reference intervals from the Casjens 2003 paper as reported on the PHASTER website statistics page. Instead I got these results:

(base) [u1323098@notch164:scripts]$ python3 compare_predictions_to_phages.py -t /uufs/chpc.utah.edu/common/home/u1323098/sundar-group-space2/PHAGE/BENCHMARKING/Philympics_dataset/Escherichia_coli_O157-H7_EDL933.gb -r reference.tsv --fp --fn -v
Reading reference.tsv
Reading /uufs/chpc.utah.edu/common/home/u1323098/sundar-group-space2/PHAGE/BENCHMARKING/Philympics_dataset/Escherichia_coli_O157-H7_EDL933.gb again to get the phage regions
Getting from 1879335 to 1897622
Getting from 3551577 to 3565707
Getting from 2966382 to 3015014
Getting from 2668339 to 2688870
Getting from 2285976 to 2330172
Getting from 300073 to 310251
Getting from 1897625 to 1908911
Getting from 1702185 to 1725748
Getting from 310756 to 323112
Getting from 1250521 to 1295458
Getting from 1330857 to 1391923
Getting from 1678706 to 1693737
Getting from 1849488 to 1879269
Getting from 1909139 to 1930250
Getting from 892845 to 930943
Getting from 1730065 to 1756006
Getting from 1626722 to 1673485
Getting from 1655548 to 1696145
Getting from 2743223 to 2788348
Getting from 2118738 to 2165694
Getting from 3263064 to 3270404
Getting from 1521574 to 1530771
Found 789 predicted prophage features
Reading /uufs/chpc.utah.edu/common/home/u1323098/sundar-group-space2/PHAGE/BENCHMARKING/Philympics_dataset/Escherichia_coli_O157-H7_EDL933.gb
Comparing real and predicted
Found:

Test set:
Phage: 676 Not phage: 4832

Predictions:
Phage: 789 Not phage: 4709

TP: 641
FP: 158
TN: 4674
FN: 35

Accuracy: 0.965 (this is the ratio of the correctly labeled phage genes to the whole pool of genes
Precision: 0.802 (This is the ratio of correctly labeled phage genes to all predictions)
Recall: 0.948 (This is the fraction of actual phage genes we got right)
Specificity: 0.967 (This is the fraction of non phage genes we got right)
f1_score: 0.869 (this is the harmonic mean of precision and recall, and is the best measure when, as in this case, there is a big difference between the number of phage and non-phage genes)
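
For reference, the metrics above can be re-derived from the four printed counts. This is a minimal sketch assuming the standard confusion-matrix definitions; the function name `confusion_metrics` is illustrative and is not taken from compare_predictions_to_phages.py. Note that TP + FP is 799 here, not the 789 printed under "Predictions", so the sketch uses the TP/FP/TN/FN counts as printed.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return the usual classification metrics from confusion-matrix counts.

    Hypothetical helper for illustration; not part of the script discussed here.
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)   # correct phage calls / all phage calls
    recall = tp / (tp + fn)      # correct phage calls / all real phage genes
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),  # correct non-phage calls / all real non-phage genes
        "f1_score": 2 * precision * recall / (precision + recall),
    }


# Counts copied from the output above.
metrics = confusion_metrics(tp=641, fp=158, tn=4674, fn=35)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Rounded to three decimals, these values agree with the numbers the script printed.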

It seems that there are some differences between the reference intervals listed in your supplementary table and the intervals listed on the PHASTER website.
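
One way to pin down such differences is to compare the two interval lists directly. The following is only a rough sketch: the helper names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the `reference` list contains placeholder values for illustration, not real PHASTER coordinates.

```python
def overlap(a, b):
    """Number of shared bases between two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def compare_intervals(reference, annotated):
    """For each reference interval, report the best-overlapping annotated interval.

    Returns a list of (reference_interval, annotated_interval_or_None, status),
    where status is "exact", "shifted", or "missing".
    """
    report = []
    for ref in reference:
        best = max(annotated, key=lambda ann: overlap(ref, ann), default=None)
        if best is None or overlap(ref, best) == 0:
            report.append((ref, None, "missing"))
        elif best == ref:
            report.append((ref, best, "exact"))
        else:
            report.append((ref, best, "shifted"))
    return report


# Three intervals copied from the "Getting from ... to ..." lines above.
annotated = [(300073, 310251), (310756, 323112), (892845, 930943)]

# Placeholder reference list (NOT real PHASTER data): the first entry is
# identical, the second has a made-up end coordinate, the third is made up.
reference = [(300073, 310251), (310756, 323000), (5000000, 5001000)]

report = compare_intervals(reference, annotated)
for ref, ann, status in report:
    print(ref, ann, status)
```

Running this on the real reference.tsv intervals and the intervals extracted from the GenBank file would show which regions differ and by how much.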

Do you have a list of the sources for the annotations you are using? Thank you,
LeAnn
