About difficult proteins #20

zhanght28 · 2024-04-16T15:19:32Z

The definition of a difficult protein is: the sequence identity between the difficult protein and its most similar (homologous) protein in the training set is less than 60%.
Hello, was the set of difficult proteins obtained with max(hsp.identities / rec.query_length for hsp in alignment.hsps) < 0.6?
Using this criterion, the numbers of difficult proteins I get for CC, MF, and BP differ from those reported in the paper by about 10.

yourh · 2024-04-16T15:54:17Z

I'm not sure why. The BLAST output already includes an identity between 0 and 1, and the cutoff is 0.6. I used 1 BLAST iteration.

zhanght28 · 2024-04-17T07:25:27Z

Thank you for your reply. I based my lookup on the test set's psiblast query results, xx-test-ppi-blast-out.xml. Perhaps I need to run psiblast for the training set instead?

yourh · 2024-04-17T07:44:24Z

Oh, yes, it needs to be run on the training set.
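A minimal sketch of the selection step described in this thread, assuming the best identity of each test protein against the training set has already been computed. The names `select_difficult` and `max_identity_to_train` are hypothetical and not from the repository. Treating a test protein with no hit in the training set as identity 0.0 is also an assumption; the thread does not say how such proteins are handled.

```python
from typing import Dict, Iterable, List


def select_difficult(
    test_ids: Iterable[str],
    max_identity_to_train: Dict[str, float],
    cutoff: float = 0.6,
) -> List[str]:
    """Return test proteins whose best identity to any training protein is below cutoff."""
    difficult = []
    for pid in test_ids:
        # Assumption: no BLAST hit against the training set counts as identity 0.0,
        # so the protein is considered difficult.
        best = max_identity_to_train.get(pid, 0.0)
        if best < cutoff:
            difficult.append(pid)
    return difficult


# Hypothetical identities for three test proteins; P3 has no hit at all.
example = select_difficult(["P1", "P2", "P3"], {"P1": 0.9, "P2": 0.35})
```

Whether no-hit proteins are included or dropped changes the count, so it may be worth checking when the numbers differ slightly from the paper.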

zhanght28 · 2024-04-17T09:41:41Z

Does the identity also need to be computed further from the BLAST results?

yourh · 2024-04-17T14:58:30Z

Yes. The BLAST output already contains the identity directly, and then you take the maximum over all HSPs.

zhanght28 · 2024-04-17T15:03:35Z

I computed it with:
max(hsp.identities / rec.query_length for hsp in alignment.hsps)
The result from this calculation is off, so I'm wondering whether the calculation method itself is the problem.
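One possible source of such a discrepancy is the denominator. The expression above divides the identity count by the query length, whereas the percent identity that BLAST reports for an HSP is the identity count divided by the alignment length. The thread does not confirm which one was used for the paper, so the sketch below simply computes both from BLAST XML output (`-outfmt 5`) with the standard library. The function name `max_identities` and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple


def max_identities(xml_text: str) -> Dict[str, Tuple[float, float]]:
    """Map each query to (max identities/query_len, max identities/align_len) over all HSPs."""
    root = ET.fromstring(xml_text)
    result: Dict[str, Tuple[float, float]] = {}
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        query_len = int(iteration.findtext("Iteration_query-len"))
        # Start from any earlier value for this query (PSI-BLAST may emit several iterations).
        by_query_len, by_align_len = result.get(query, (0.0, 0.0))
        for hsp in iteration.iter("Hsp"):
            identities = int(hsp.findtext("Hsp_identity"))
            align_len = int(hsp.findtext("Hsp_align-len"))
            by_query_len = max(by_query_len, identities / query_len)
            by_align_len = max(by_align_len, identities / align_len)
        result[query] = (by_query_len, by_align_len)
    return result


# Hypothetical record: one HSP with 40 identities over 50 aligned columns, query length 100.
SAMPLE = (
    "<BlastOutput><BlastOutput_iterations><Iteration>"
    "<Iteration_query-def>Q1</Iteration_query-def>"
    "<Iteration_query-len>100</Iteration_query-len>"
    "<Iteration_hits><Hit><Hit_hsps><Hsp>"
    "<Hsp_identity>40</Hsp_identity><Hsp_align-len>50</Hsp_align-len>"
    "</Hsp></Hit_hsps></Hit></Iteration_hits>"
    "</Iteration></BlastOutput_iterations></BlastOutput>"
)

ratios = max_identities(SAMPLE)
```

For the sample record the two ratios are 0.4 and 0.8, so the same protein would fall below a 0.6 cutoff under one definition and above it under the other. Comparing the counts produced by each definition is a cheap way to test whether the denominator explains the gap.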

