HGT_v_Contamination_assessor

This script takes existing metadata about each DNA/AA sequence and uses it, in combination with an alien index (AI) value, to determine whether each sequence should be flagged as a contaminant.

Inputs

(parameter) AI cutoff used for screening. (default AI_cutoff = 0.01)
(file) FASTA file to decontaminate
(file) A 'supertsv' metadata file that contains the following fields:
(seq_id | alien_index_value | num_splice_variants | lineage_of_best_BLAST_hit | spliced_leader[True/False] | polyA_tail[True/False])
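
As a rough illustration of the metadata layout above, here is a minimal sketch of reading one 'supertsv' row per sequence. The field order follows the list above; everything else is an assumption made for the example: that the file is tab-delimited with no header row, that booleans are written as the literal strings True/False, and the names `SeqMeta` and `parse_supertsv`, which are hypothetical and not taken from hgt_v_contam.py.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class SeqMeta:
    """One row of the 'supertsv' file (hypothetical container)."""
    seq_id: str
    alien_index: float
    num_splice_variants: int
    best_hit_lineage: str
    spliced_leader: bool
    polya_tail: bool


def parse_supertsv(handle):
    """Read rows into a dict keyed by seq_id.

    Assumes tab-delimited text, six columns in the documented order,
    no header row, and booleans spelled 'True' or 'False'.
    """
    records = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        seq_id, ai, n_var, lineage, sl, polya = row[:6]
        records[seq_id] = SeqMeta(
            seq_id=seq_id,
            alien_index=float(ai),
            num_splice_variants=int(n_var),
            best_hit_lineage=lineage,
            spliced_leader=(sl.strip() == "True"),
            polya_tail=(polya.strip() == "True"),
        )
    return records


# Made-up example row, for illustration only.
example = "seq1\t0.5\t2\tBacteria;Proteobacteria\tFalse\tTrue\n"
records = parse_supertsv(io.StringIO(example))
```

In practice the handle would come from `open(path, newline="")` rather than an in-memory string.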

Rationale

Alien indices (AI) can be used as heuristics to infer whether a sequence is likely to be native or foreign (e.g. a contaminant or HGT). But decontaminating a 'dirty' dataset based on AI alone is inadvisable, since this approach is also likely to remove bona fide HGT. To mitigate this problem of 'overcleaning,' I have broken AI cleaning into two steps:

1. A first-pass "flagging" step that flags all sequences whose AI exceeds the cutoff.
2. A second-pass "rescue" step that uses sequence metadata to redeem certain sequences. For example:
   A. Any sequence with a dinoflagellate spliced leader is unflagged, since it is treated as native.
   B. Any sequence whose best BLAST hit is to prokaryotes is unflagged if it has:
      i. A poly-A tail
      ii. Multiple splice isoforms
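
The two passes above can be sketched roughly as follows. This is not the code from hgt_v_contam.py, only an illustration under stated assumptions: the function names are hypothetical, a prokaryotic best hit is recognized by the lineage string containing "Bacteria" or "Archaea", and rule B is read as "a poly-A tail or multiple splice isoforms" (the text does not say whether both are required).

```python
AI_CUTOFF = 0.01  # default cutoff given under Inputs

# Assumption: a prokaryotic best hit is recognized by these lineage keywords.
PROKARYOTE_MARKERS = ("Bacteria", "Archaea")


def is_flagged(alien_index, ai_cutoff=AI_CUTOFF):
    """First pass: flag any sequence whose AI exceeds the cutoff."""
    return alien_index > ai_cutoff


def is_rescued(lineage, spliced_leader, polya_tail, num_splice_variants):
    """Second pass: decide whether metadata redeems a flagged sequence."""
    # Rule A: a spliced leader marks the transcript as native.
    if spliced_leader:
        return True
    # Rule B: prokaryotic best hit with eukaryote-like transcript features.
    # Assumption: either feature alone is enough.
    best_hit_is_prokaryote = any(m in lineage for m in PROKARYOTE_MARKERS)
    if best_hit_is_prokaryote and (polya_tail or num_splice_variants > 1):
        return True
    return False


def is_contaminant(alien_index, lineage, spliced_leader, polya_tail,
                   num_splice_variants, ai_cutoff=AI_CUTOFF):
    """A sequence stays flagged only if no rescue rule applies."""
    return is_flagged(alien_index, ai_cutoff) and not is_rescued(
        lineage, spliced_leader, polya_tail, num_splice_variants
    )
```

For instance, `is_contaminant(5.0, "Bacteria;Proteobacteria", False, False, 1)` stays flagged, while the same sequence with `polya_tail=True` is rescued. If both features should be required for rule B, replacing `or` with `and` is the only change needed.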

This script also gathers metrics about the frequency of HGT from various groups.
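
One simple way such metrics could be gathered is to tally the best-hit lineage of every rescued sequence by group. The sketch below is only an assumption about how that might look: it supposes lineage strings are semicolon-separated with the broadest group first, and the name `tally_donor_groups` is hypothetical.

```python
from collections import Counter


def tally_donor_groups(lineages, rank=0, sep=";"):
    """Count putative HGT donors by the group at a given lineage rank.

    Assumes each lineage is a sep-delimited string, broadest group first.
    Lineages too short to have the requested rank are counted as 'unknown'.
    """
    counts = Counter()
    for lineage in lineages:
        parts = [p.strip() for p in lineage.split(sep) if p.strip()]
        group = parts[rank] if len(parts) > rank else "unknown"
        counts[group] += 1
    return counts


# Made-up lineages, for illustration only.
demo_counts = tally_donor_groups(
    ["Bacteria;Proteobacteria", "Bacteria;Firmicutes", "Archaea", ""]
)
```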
