FASTA DB requirements #23

Open
tobiasko opened this issue Apr 18, 2023 · 9 comments
Comments

@tobiasko

In your manuscript entitled "Triqler for Protein Summarization of Data from Data-Independent Acquisition Mass Spectrometry" you state that:

"The pipeline generated decoys for FDR calculations, which were discarded after DIA-NN processing. To circumvent the lack of decoys in output for Triqler, we concatenated shuffled entrapment sequences in the FASTA database."

Could you explain what these shuffled entrapment sequences are? Is this something one needs to add for the DIA-NN reports to be usable with Triqler?

@patruong
Contributor

Hi Tobias,

Triqler needs decoys to calculate q-values. However, the PSMs in the report.tsv output from DIA-NN are usually not mapped to decoy proteins. To circumvent this, DIA-NN can be run with a spectral library that includes shuffled entrapment sequences: you first add the shuffled entrapment sequences to your FASTA file and then construct the spectral library from it. These shuffled entrapment sequences are basically shuffled amino acid sequences of the proteins in the FASTA file.
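For illustration, appending shuffled copies of every protein to a FASTA file could look roughly like the Python sketch below. This is not the script used for the manuscript. The `decoy_` prefix, its placement at the start of the header, the whole-protein shuffle, and all function names are assumptions made for the example; check which prefix your Triqler setup expects before using anything like this.

```python
import random
from typing import Iterable, Iterator, List, Tuple

DECOY_PREFIX = "decoy_"  # assumption: adjust to the prefix your Triqler run expects

Entry = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def shuffled_entry(header: str, sequence: str, rng: random.Random) -> Entry:
    """Return a prefixed entry whose residues are a random permutation of the target."""
    residues = list(sequence)
    rng.shuffle(residues)
    return DECOY_PREFIX + header, "".join(residues)


def add_shuffled_entrapment(lines: Iterable[str], seed: int = 1) -> List[Entry]:
    """Return all target entries followed by one shuffled entry per target."""
    rng = random.Random(seed)  # fixed seed so the database is reproducible
    targets = list(read_fasta(lines))
    return targets + [shuffled_entry(h, s, rng) for h, s in targets]


def to_fasta(entries: Iterable[Entry], width: int = 60) -> str:
    """Format entries as FASTA text with wrapped sequence lines."""
    out = []
    for header, seq in entries:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Toy input, not a real protein record.
example = [">sp|P00001|TEST_HUMAN", "MKWVTFISLLLLFSSAYSR", "GVFRR"]
combined = add_shuffled_entrapment(example)
print(to_fasta(combined))
```

With real data you would pass an open file handle instead of `example` and write the result of `to_fasta` to a new FASTA file, which then serves as input for building the spectral library.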

Alternatively, you could use OpenSwathDecoyGenerator to add decoys to your spectral library, but this method has crashed on a couple of data sets I tried it on. I am not sure why.

Hope this clarifies.

@tobiasko
Author

tobiasko commented Apr 18, 2023

Hmmm... How would I do this when DIA-NN is run in library-free mode? I thought DIA-NN already uses decoys internally, because it outputs a Decoy.Evidence and Decoy.CScore for each feature in the main report. Can't Triqler use these?

@tobiasko
Author

tobiasko commented Apr 18, 2023

The library-free search starts the in silico digestion from a target-only FASTA database. I guess decoy generation happens at the peptide or library level. One can write the resulting spectral library to disk, and it contains a Decoy column. So I guess the library is supplemented with decoy targets/transitions.

@patruong
Contributor

Indeed, DIA-NN already uses decoy peptides internally to compute the FDRs. However, these decoy peptides cannot be written to the output report.tsv.

I am not entirely sure what Decoy.Evidence and Decoy.CScore are used for, but they are floats, whereas Triqler decides whether an entry is a decoy by parsing the prefix of its protein name, i.e. it needs a binary indicator.
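To make the "binary indicator" concrete, here is a minimal Python sketch of prefix-based decoy labelling. It is not Triqler's actual code. The `decoy_` prefix, the `Protein.Ids` column name, the `;` separator, and the rule that a group counts as a decoy only if every protein in it carries the prefix are all assumptions for the example, and the rows are made-up toy data.

```python
import csv
import io

DECOY_PREFIX = "decoy_"  # assumption: must match the prefix used in the FASTA file


def is_decoy_protein(protein_ids: str, prefix: str = DECOY_PREFIX) -> bool:
    """Return True only if every protein in a ';'-separated group has the prefix.

    Assumption: a precursor shared between a target and a decoy protein is
    treated as a target.
    """
    ids = [p for p in protein_ids.split(";") if p]
    return bool(ids) and all(p.startswith(prefix) for p in ids)


# Toy stand-in for a DIA-NN report; the values are invented for illustration.
toy_report = (
    "Protein.Ids\tPrecursor.Id\n"
    "P00001\tAAAK2\n"
    "decoy_P00001\tCCCK2\n"
    "P00002;decoy_P00003\tDDDK2\n"
)

rows = list(csv.DictReader(io.StringIO(toy_report), delimiter="\t"))
labels = [is_decoy_protein(row["Protein.Ids"]) for row in rows]
print(labels)  # [False, True, False]
```

A float column such as Decoy.Evidence carries no such prefix, so it cannot be used this way without first deciding on a rule that turns it into a yes/no label.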

See:
DIA-NN-generated decoy peptides: vdemichev/DiaNN#6
DIA-NN cannot output the internally generated decoys as decoy proteins: vdemichev/DiaNN#117
DIA-NN cannot output the internally generated decoy peptides: vdemichev/DiaNN#468

@tobiasko
Author

tobiasko commented Apr 18, 2023

Well, I guess those floats are the scores and evidence values of the corresponding decoy entry. Instead of adding a new line for each decoy, the report just records how the decoy scored (skipping the details of how the decoy entity is structured).

@patruong
Contributor

Hmm, interesting... I thought about that too, but I could not find any information about how to threshold the scores. Perhaps the same threshold as for Mass.Evidence applies, where values between 0.5 and 1.0 are considered decoys. Perhaps Decoy.Evidence could be mapped to a binary indicator for the decoy PSM, and then the proteins these peptides belong to could be marked as decoys. Let me think about this. Perhaps @MatthewThe can give some more feedback on this?

@tobiasko
Author

Let's ask Vadim what it really contains ;-) I also couldn't find any documentation on this.

@tobiasko
Author

tobiasko commented Apr 18, 2023

Do I understand Clemens's suggestion correctly? He generates a target + decoy FASTA DB with a specific decoy prefix (50% target + 50% decoy) and runs it through DIA-NN (which internally generates decoys of decoys) only to get explicit reporting? That sounds pretty wild! And if the decoy function uses sequence reversal, a decoy of a decoy turns into a target again.
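The reversal point can be checked with a few lines of Python. This only illustrates the general property that reversing twice restores the original sequence, while shuffling twice generally does not. It makes no claim about how DIA-NN builds its internal decoys, and the peptide string and function names are made up for the example.

```python
import random


def reverse(seq: str) -> str:
    """Full sequence reversal."""
    return seq[::-1]


def pseudo_reverse(seq: str) -> str:
    """Reverse everything except the last residue (keeps the C-terminal residue in place)."""
    return seq[:-1][::-1] + seq[-1:]


def shuffle(seq: str, rng: random.Random) -> str:
    """Random permutation of the residues."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


peptide = "ELVISLIVESK"  # toy peptide

# Both reversal variants undo themselves when applied twice,
# so a reversed decoy of a reversed decoy is the target again.
print(reverse(reverse(peptide)) == peptide)                # True
print(pseudo_reverse(pseudo_reverse(peptide)) == peptide)  # True

# Shuffling twice just yields another permutation; it would match the
# original only by chance.
rng = random.Random(0)
twice_shuffled = shuffle(shuffle(peptide, rng), rng)
print(twice_shuffled)
```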

@patruong
Contributor

Hmmm.. seems like it is redundant information.

Having a FASTA file with a 50/50 target-decoy ratio is correct. However, you might need to generate a separate spectral library before running DIA-NN in library mode. I can't recall whether it worked with a FASTA file alone, without a spectral library, but with a spectral library it will certainly work.

Hahaha that's just funny :D... However, I'm not sure it works that way when they generate their decoy peptides.
