
How can we discriminate contaminants from HGT? Alien indices are often used to screen out foreign sequences, but can 'overclean' by removing bona fide HGT. This script leverages metadata about each DNA/AA sequence (i.e. whether it is spliced, has a polyA tail or spliced leader), and uses that to assess the extent to which AI-based cleaning is re…


ggavelis/HGT_v_Contamination_assessor


HGT_v_Contamination_assessor

This script takes existing metadata about each DNA/AA sequence and uses it, in combination with an alien index (AI) value, to determine whether each sequence should be flagged as a contaminant.

Inputs

  1. (parameter) AI cutoff used for screening (default AI_cutoff = 0.01)
  2. (file) FASTA file to decontaminate
  3. (file) A 'supertsv' metadata file that contains the following fields:
    (seq_id | alien_index_value | num_splice_variants | lineage_of_best_BLAST_hit | spliced_leader[True/False] | polyA_tail[True/False])
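
To make the metadata layout concrete, here is a minimal Python sketch of reading such a file. It is not taken from the script itself. It assumes the 'supertsv' is tab-separated, has no header row, and lists the six fields in the order shown above. The names `SeqMeta`, `parse_bool` and `read_supertsv` are hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class SeqMeta:
    """One row of the 'supertsv' (field order is an assumption)."""
    seq_id: str
    alien_index: float
    num_splice_variants: int
    best_hit_lineage: str
    spliced_leader: bool
    polya_tail: bool


def parse_bool(text: str) -> bool:
    # The README gives the values as True/False; compare case-insensitively.
    return text.strip().lower() == "true"


def read_supertsv(handle) -> dict:
    """Read tab-separated metadata rows into a dict keyed by seq_id."""
    records = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        seq_id, ai, n_variants, lineage, sl, polya = (f.strip() for f in row[:6])
        records[seq_id] = SeqMeta(
            seq_id=seq_id,
            alien_index=float(ai),
            num_splice_variants=int(n_variants),
            best_hit_lineage=lineage,
            spliced_leader=parse_bool(sl),
            polya_tail=parse_bool(polya),
        )
    return records


# Two made-up rows, purely for illustration.
example = (
    "seq1\t0.5\t2\tBacteria\tFalse\tTrue\n"
    "seq2\t0.0\t1\tDinophyceae\tTrue\tFalse\n"
)
records = read_supertsv(io.StringIO(example))
```

If the real file carries a header line or uses a different column order, the unpacking step would need to be adjusted accordingly.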

Rationale

Alien indices can be used as heuristics to infer whether a sequence is likely to be native or foreign (e.g. a contaminant or HGT). But decontaminating a 'dirty' dataset based on AI alone is inadvisable, since this approach is also likely to remove bona fide HGT. To mitigate this problem of 'overcleaning,' I have broken AI cleaning into two steps.

  1. A first-pass "flagging" step that flags all seqs whose AI exceeds the cutoff.
  2. A second-pass "rescue" step that uses sequence metadata to redeem certain sequences. For example:
    A. Any sequence with a dinoflagellate spliced-leader is unflagged as native.
    B. Any best-hit to prokaryotes is unflagged if it has:
          i. A poly-A tail
          ii. Multiple splice isoforms
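
The two passes above can be sketched in Python as follows. This is an illustration under stated assumptions, not the script's actual code: it treats either a poly-A tail or multiple splice isoforms as sufficient for rule B (the list does not say whether both are required), it guesses that prokaryote best hits are labeled "Bacteria" or "Archaea", and the class, function and label names are all hypothetical.

```python
from dataclasses import dataclass

AI_CUTOFF = 0.01  # default cutoff given under Inputs

# Assumption: how prokaryote best-hit lineages are spelled in the metadata.
PROKARYOTE_LINEAGES = ("Bacteria", "Archaea")


@dataclass
class SeqMeta:
    seq_id: str
    alien_index: float
    num_splice_variants: int
    best_hit_lineage: str
    spliced_leader: bool
    polya_tail: bool


def is_flagged(meta: SeqMeta, ai_cutoff: float = AI_CUTOFF) -> bool:
    """First pass: flag any sequence whose AI exceeds the cutoff."""
    return meta.alien_index > ai_cutoff


def classify(meta: SeqMeta, ai_cutoff: float = AI_CUTOFF) -> str:
    """Return a hypothetical label after the flagging and rescue passes."""
    if not is_flagged(meta, ai_cutoff):
        return "unflagged"
    # Rule A: a dinoflagellate spliced leader marks the sequence as native.
    if meta.spliced_leader:
        return "rescued_native"
    # Rule B: a prokaryote best hit is rescued if it shows eukaryotic mRNA
    # features. Assumption: either feature alone is enough.
    if meta.best_hit_lineage in PROKARYOTE_LINEAGES:
        if meta.polya_tail or meta.num_splice_variants > 1:
            return "rescued_hgt"
    return "contaminant"


# Made-up example: bacterial best hit with a poly-A tail.
example = SeqMeta("seq1", 0.5, 1, "Bacteria", False, True)
label = classify(example)
```

Requiring both features instead would only change the `or` in rule B to `and`. Which reading matches the script is not stated here.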

This script also gathers metrics about the frequency of HGT from various groups.
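
Which metrics the script reports is not specified here. As one plausible illustration, the Python sketch below tallies, for each best-hit lineage, how many sequences ended up with an HGT-like label. The function name, the `(lineage, label)` input shape and the `"rescued_hgt"` label are assumptions made for this example.

```python
from collections import Counter


def hgt_frequency_by_lineage(calls, hgt_label="rescued_hgt"):
    """Summarize putative HGT per lineage.

    `calls` is an iterable of (lineage, label) pairs. Returns a dict
    mapping lineage -> (n_hgt, n_total, fraction_hgt).
    """
    totals = Counter()
    hgt = Counter()
    for lineage, label in calls:
        totals[lineage] += 1
        if label == hgt_label:
            hgt[lineage] += 1
    return {
        lineage: (hgt[lineage], n, hgt[lineage] / n)
        for lineage, n in totals.items()
    }


# Made-up calls, purely for illustration.
calls = [
    ("Bacteria", "rescued_hgt"),
    ("Bacteria", "contaminant"),
    ("Archaea", "contaminant"),
]
summary = hgt_frequency_by_lineage(calls)
```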
