Iterative Neighborhood detection to classify mixtures from Whole genome sequencing (WGS)

Objective:

Identify WGS samples from hosts infected with multiple strains of Mycobacterium or other slow-evolving pathogens.

Motivation:

A phylogenetic tree shows the overall ancestral relationships among a set of samples. However, a tree structure is limited in that each child node can have only one parent node.

Mixture samples, however, may carry the SNP patterns of multiple ancestral strains. Identifying mixture samples before constructing a phylogenetic tree, and splitting these samples into their contributing strains, will improve inference of who infected whom.

This is especially useful when infections involve different host species, where identifying mixing patterns helps to infer the direction of transmission.

Method:

  • Original data:

    • Genotypes from SNP calling of WGS data
    • Minor allele read count for each SNP & sample
  • Input data: Principal components calculated from the allele frequencies or the minor allele read count matrix

  • Overview:

  • The Iterative KNN has 5 steps:

    1. Mix: Select samples from each cluster and mix them to create pseudo mixtures.

    2. Merge: Merge the pseudo mixtures with the original data.

    3. Calculate: After running KNN, calculate q for each sample, the adjusted fraction of its neighbors that were pseudo mixtures (n_pseduo_neighbors).

    4. Remove: Automatically threshold q using its bimodal distribution, and remove samples exceeding the threshold from the next round of KNN.

    5. Repeat: Iterate steps 1 to 4 until no samples exceed the cutoff.
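The five steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code in this repository: all function and parameter names (`make_pseudo_mixtures`, `pseudo_neighbor_fraction`, `otsu_threshold`, `iterative_knn`, `k`, `min_q`) are hypothetical, cluster labels are assumed to come from a separate clustering step, q is taken as the raw fraction of pseudo-mixture neighbors because the adjustment is not specified here, and Otsu's method stands in for the automatic bimodal thresholding. Mixing is done directly in principal component space, which assumes the components are a linear projection of the allele frequencies.

```python
import numpy as np

def make_pseudo_mixtures(pcs, labels, n_pseudo, rng):
    """Step 1 (Mix): blend one sample from each of two different clusters."""
    clusters = np.unique(labels)
    mixes = np.empty((n_pseudo, pcs.shape[1]))
    for i in range(n_pseudo):
        a, b = rng.choice(clusters, size=2, replace=False)
        xa = pcs[rng.choice(np.flatnonzero(labels == a))]
        xb = pcs[rng.choice(np.flatnonzero(labels == b))]
        w = rng.uniform(0.2, 0.8)  # assumed range of mixing ratios
        mixes[i] = w * xa + (1.0 - w) * xb
    return mixes

def pseudo_neighbor_fraction(points, is_pseudo, k):
    """Step 3 (Calculate): share of each point's k nearest neighbors that are pseudo mixtures."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    return is_pseudo[nearest].mean(axis=1)

def otsu_threshold(q):
    """Cut that maximizes between-class variance (Otsu's method, an assumed stand-in)."""
    values = np.unique(q)
    if values.size < 2:
        return None  # all q identical, so there is nothing to split
    best_cut, best_score = None, -1.0
    for cut in (values[:-1] + values[1:]) / 2.0:
        low, high = q[q <= cut], q[q > cut]
        score = low.size * high.size * (low.mean() - high.mean()) ** 2
        if score > best_score:
            best_cut, best_score = cut, score
    return best_cut

def iterative_knn(pcs, labels, k=10, n_pseudo=None, min_q=0.5, max_iter=20, seed=0):
    """Return a boolean mask marking samples flagged as mixtures."""
    rng = np.random.default_rng(seed)
    keep = np.ones(len(pcs), dtype=bool)
    for _ in range(max_iter):
        idx = np.flatnonzero(keep)
        if np.unique(labels[idx]).size < 2:
            break  # mixing needs at least two clusters
        n_mix = n_pseudo or idx.size
        mixes = make_pseudo_mixtures(pcs[idx], labels[idx], n_mix, rng)
        # Step 2 (Merge): stack the remaining real samples and the pseudo mixtures.
        merged = np.vstack([pcs[idx], mixes])
        is_pseudo = np.concatenate([np.zeros(idx.size, dtype=bool),
                                    np.ones(n_mix, dtype=bool)])
        q = pseudo_neighbor_fraction(merged, is_pseudo, k)[:idx.size]
        # Step 4 (Remove): threshold q; min_q is an assumed floor so the loop can stop.
        cut = otsu_threshold(q)
        cut = min_q if cut is None else max(cut, min_q)
        flagged = q > cut
        if not flagged.any():
            break  # Step 5 (Repeat) ends when nothing exceeds the cutoff
        keep[idx[flagged]] = False
    return ~keep
```

Calling `iterative_knn(pcs, labels)` with a samples-by-components array and one cluster label per sample returns a mask of suspected mixtures. The brute-force distance matrix keeps the sketch dependency-free but grows quadratically with the number of samples, so a tree-based nearest-neighbor search would be a better fit for large datasets.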

  • Below is a simple illustration of the iterative KNN method: 19 of 20 samples mixed at various ratios of two strains were detected correctly.

  • Below is a barplot showing the strain abundances within the first ten samples of three-strain mixtures.