Identify WGS samples infected with multiple strains of slow evolving pathogens

yyw-informatics/MixtureDetection_KNN

Iterative neighborhood detection to classify mixtures from whole genome sequencing (WGS)

Objective:

Identify WGS samples infected with multiple strains of Mycobacterium or other slow-evolving pathogens

Motivation:

A phylogenetic tree shows the overall ancestral relations between a set of samples. However, a tree structure is limited in that each child node can have only one parent node.

Mixture samples, however, may carry the SNP patterns of multiple ancestral strains. Identifying mixture samples before constructing a phylogenetic tree, and splitting these samples by each of their contributing strains, will improve inference of who-infected-whom.

Especially when the infected individuals belong to different host species, identifying mixing patterns will help to infer the directionality of transmission.

Method:

  • Original data:

    • Genotypes from SNP calling of WGS data
    • Minor allele read count for each SNP & sample
  • Input data: Principal components calculated from the allele frequencies or the minor allele read count matrix

  • Overview:

  • The iterative KNN has 5 steps:

    1. Mix: Select samples from each cluster and mix them to create pseudo mixtures.

    2. Merge: Merge the pseudo mixtures with the original data.

    3. Calculate: After running KNN, calculate q for each sample, the adjusted fraction of its neighbors that are pseudo mixtures (n_pseduo_neighbors).

    4. Remove: Automatically threshold q using its bimodal distribution, and remove samples exceeding the threshold from the next round of KNN.

    5. Repeat: Iterate steps 1 to 4 until no samples exceed the cutoff.

  • Figure: a simple illustration of the iterative KNN method. In total, 19 of 20 samples mixed at various ratios of two strains were detected correctly.

  • Figure: barplot of the strain abundances within the first ten samples of the three-strain mixtures.
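
The input preparation and the five steps listed under Method can be sketched roughly as follows. This is a minimal illustration using NumPy only, not the code in this repository. All function names and parameters (`principal_components`, `otsu_threshold`, `iterative_knn`, `k`, `n_pseudo`, `min_cutoff`) are hypothetical. The sketch also makes several assumptions that the README does not confirm: cluster labels are supplied by the caller, pseudo mixtures are built by mixing two samples in principal-component space at a ratio between 0.2 and 0.8, q is the plain (unadjusted) fraction of pseudo-mixture neighbors, and the bimodal thresholding is replaced by Otsu's method with a lower bound on the cutoff.

```python
import numpy as np


def principal_components(freq, n_components=2):
    """Project samples (rows) of a samples x SNPs matrix onto its top PCs."""
    freq = np.asarray(freq, dtype=float)
    centered = freq - freq.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def otsu_threshold(q, bins=20):
    """Assumed stand-in for the bimodal thresholding: Otsu's method on q.

    Returns 1.0 when all values fall on one side of every candidate split.
    """
    hist, edges = np.histogram(q, bins=bins, range=(0.0, 1.0))
    mids = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = 1.0, 0.0
    for i in range(1, bins):
        w0, w1 = hist[:i].sum(), hist[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (hist[:i] * mids[:i]).sum() / w0
        m1 = (hist[i:] * mids[i:]).sum() / w1
        var = w0 * w1 * (m0 - m1) ** 2  # between-class variance (unnormalized)
        if var > best_var:
            best_t, best_var = edges[i], var
    return best_t


def iterative_knn(pcs, labels, k=5, n_pseudo=200, min_cutoff=0.5,
                  max_rounds=10, seed=0):
    """Return the indices of samples flagged as mixtures."""
    pcs = np.asarray(pcs, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    keep = np.ones(len(pcs), dtype=bool)
    for _round in range(max_rounds):
        idx = np.flatnonzero(keep)
        clusters = np.unique(labels[idx])
        if len(clusters) < 2:
            break
        # 1. Mix: combine one sample from each of two different clusters.
        pseudo = []
        for _m in range(n_pseudo):
            c1, c2 = rng.choice(clusters, size=2, replace=False)
            a = pcs[rng.choice(idx[labels[idx] == c1])]
            b = pcs[rng.choice(idx[labels[idx] == c2])]
            r = rng.uniform(0.2, 0.8)  # assumed range of mixing ratios
            pseudo.append(r * a + (1 - r) * b)
        # 2. Merge: remaining original samples first, pseudo mixtures after.
        merged = np.vstack([pcs[idx], np.array(pseudo)])
        is_pseudo = np.arange(len(merged)) >= len(idx)
        # 3. Calculate: q = fraction of the k nearest neighbors that are pseudo.
        dist = np.linalg.norm(pcs[idx][:, None, :] - merged[None, :, :], axis=2)
        dist[np.arange(len(idx)), np.arange(len(idx))] = np.inf  # skip self
        nearest = np.argsort(dist, axis=1)[:, :k]
        q = is_pseudo[nearest].mean(axis=1)
        # 4. Remove: flag samples whose q exceeds the cutoff.
        cutoff = max(otsu_threshold(q), min_cutoff)
        flagged = idx[q > cutoff]
        # 5. Repeat until no sample exceeds the cutoff.
        if flagged.size == 0:
            break
        keep[flagged] = False
    return np.flatnonzero(~keep)
```

A call such as `iterative_knn(principal_components(freq), labels)` would return the row indices of the samples flagged as mixtures, where `freq` is the allele frequency (or minor allele read count) matrix and `labels` holds one cluster label per sample. The lower bound `min_cutoff` is there because Otsu's method returns a split for any distribution with more than one occupied bin, so without it the loop could keep removing samples even when q is no longer bimodal.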
