About difficult proteins #20

Open
zhanght28 opened this issue Apr 16, 2024 · 6 comments

Comments

@zhanght28

The paper defines difficult proteins as follows: for a difficult protein, the sequence identity to its most similar (homologous) protein in the training set is less than 60%.
Hello, was the difficult-protein data obtained with max(hsp.identities / rec.query_length for hsp in alignment.hsps) < 0.6?
The numbers of difficult proteins I get this way for CC, MF, and BP differ from those reported in the paper by about 10.
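
The criterion in the question can be sketched in isolation as follows. This is a minimal illustration, not the repository's code: the `Hsp` record and the `max_identity` and `is_difficult` helpers are hypothetical names, and treating a query with no hits as having identity 0 (and therefore as difficult) is an assumption the thread does not confirm.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Hsp:
    """Hypothetical stand-in for one BLAST high-scoring segment pair."""
    identities: int  # number of identical aligned residues


def max_identity(hsps: Iterable[Hsp], query_length: int) -> float:
    """Largest identities / query_length over all HSPs of one query.

    Assumption: a query with no HSPs gets 0.0 instead of raising.
    """
    return max((h.identities / query_length for h in hsps), default=0.0)


def is_difficult(hsps: Iterable[Hsp], query_length: int, cutoff: float = 0.6) -> bool:
    """True when the best training-set hit is below the identity cutoff."""
    return max_identity(hsps, query_length) < cutoff


# Toy example: a 200-residue query with two HSPs against training proteins.
hsps = [Hsp(identities=90), Hsp(identities=130)]
print(max_identity(hsps, 200))
print(is_difficult(hsps, 200))
print(is_difficult([], 200))  # difficult only under the no-hit assumption
```

If the records come from Biopython's `NCBIXML` parser, `alignment.hsps` holds the HSPs of a single hit, so the maximum would also need to range over every alignment of the record.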

@yourh
Owner

yourh commented Apr 16, 2024

I'm not sure why. The BLAST output already includes an identity between 0 and 1, and the cutoff is 0.6. I ran BLAST with 1 iteration.

@zhanght28
Author

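Thank you for your reply. I based my lookup on the test set's psiblast query results, xx-test-ppi-blast-out.xml. Perhaps I need to run psiblast to get results against the training set?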

@yourh
Owner

yourh commented Apr 17, 2024

Oh, right, it needs to be run against the training set.
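
The step agreed on above, searching the test proteins against the training set, could look like the sketch below. The file names `train.fasta`, `test.fasta`, `train_db`, and `test-vs-train.xml` are hypothetical placeholders, and every option other than the single iteration mentioned in the thread (for example the XML output format) is an assumption rather than the author's setting.

```python
import subprocess
from typing import List


def build_commands(train_fasta: str, test_fasta: str,
                   db_prefix: str, out_xml: str) -> List[List[str]]:
    """Return two BLAST+ command lines: build the database, then search it."""
    make_db = ["makeblastdb", "-in", train_fasta, "-dbtype", "prot", "-out", db_prefix]
    search = [
        "psiblast", "-query", test_fasta, "-db", db_prefix,
        "-num_iterations", "1",  # one iteration, as stated in the thread
        "-outfmt", "5",          # XML output (assumed, to match the .xml file mentioned)
        "-out", out_xml,
    ]
    return [make_db, search]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in order; requires BLAST+ to be installed and on PATH."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = build_commands("train.fasta", "test.fasta", "train_db", "test-vs-train.xml")
for cmd in commands:
    print(" ".join(cmd))
# Call run_all(commands) to execute them once the FASTA files exist.
```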

@zhanght28
Author

Does the identity also need to be computed further from the BLAST results?

@yourh
Owner

yourh commented Apr 17, 2024

Yes. The BLAST output includes the identity directly, and then you take the maximum over all HSPs.

@zhanght28
Author

I computed it with:
max(hsp.identities / rec.query_length for hsp in alignment.hsps)
The result from this calculation deviates, so I'm wondering whether the calculation method itself is wrong.
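
One possible source of a small discrepancy, which the thread does not confirm, is the denominator: `hsp.identities / rec.query_length` divides by the full query length, whereas the percent identity BLAST reports for an HSP is relative to the alignment length. The sketch below computes both from BLAST XML using only the standard library. The embedded XML is a made-up example with two hits, the helper name `max_identities` is hypothetical, and the code assumes one `Iteration` element per query (a single search round).

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

# Made-up fragment in the layout of BLAST XML output (-outfmt 5).
SAMPLE_XML = """<BlastOutput><BlastOutput_iterations>
<Iteration>
  <Iteration_query-def>P1</Iteration_query-def>
  <Iteration_query-len>200</Iteration_query-len>
  <Iteration_hits>
    <Hit><Hit_hsps>
      <Hsp><Hsp_identity>110</Hsp_identity><Hsp_align-len>150</Hsp_align-len></Hsp>
    </Hit_hsps></Hit>
    <Hit><Hit_hsps>
      <Hsp><Hsp_identity>60</Hsp_identity><Hsp_align-len>80</Hsp_align-len></Hsp>
    </Hit_hsps></Hit>
  </Iteration_hits>
</Iteration>
</BlastOutput_iterations></BlastOutput>"""


def max_identities(xml_text: str) -> Dict[str, Tuple[float, float]]:
    """Map each query to (max identities/query length, max identities/alignment length)."""
    result: Dict[str, Tuple[float, float]] = {}
    root = ET.fromstring(xml_text)
    for iteration in root.iter("Iteration"):
        name = iteration.findtext("Iteration_query-def")
        query_len = int(iteration.findtext("Iteration_query-len"))
        by_query = by_align = 0.0  # assumption: no hits means identity 0
        # Maximum over every HSP of every hit for this query.
        for hsp in iteration.iter("Hsp"):
            identities = int(hsp.findtext("Hsp_identity"))
            align_len = int(hsp.findtext("Hsp_align-len"))
            by_query = max(by_query, identities / query_len)
            by_align = max(by_align, identities / align_len)
        result[name] = (by_query, by_align)
    return result


for name, (by_query, by_align) in max_identities(SAMPLE_XML).items():
    # Whether the query would count as difficult under each denominator.
    print(name, by_query < 0.6, by_align < 0.6)
```

In this toy case the same protein falls below the 0.6 cutoff with one denominator and above it with the other, so it may be worth checking which definition the reported counts rely on.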
