Identify WGS samples infected with multiple strains of Mycobacterium or other slow-evolving pathogens
Phylogenetic trees show the overall ancestral relations between a set of samples. However, a tree structure has one limitation: each child node can have only one parent node.
Mixture samples, however, may carry the SNP patterns of multiple ancestral strains. Identifying mixture samples before constructing a phylogenetic tree, and splitting these samples into their contributing strains, will improve inference of who-infected-whom.
This is especially useful when the infected individuals belong to different host species, where identifying mixing patterns helps to infer the directionality of transmission.
Original data:
- Genotypes from SNP calling of WGS data
- Minor allele read count for each SNP & sample

Input data: Principal components calculated from the allele frequency matrix or the minor allele read count matrix
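
As a rough illustration of how such input could be prepared, the sketch below derives allele frequencies from read counts and projects the samples onto their top principal components with NumPy. The `depth` matrix (total reads per SNP and sample), the toy values, and the function name `principal_components` are assumptions made for the example and are not part of this repository.

```python
import numpy as np

def principal_components(freq, n_components=2):
    """Project samples (rows) of a samples x SNPs matrix onto its top PCs."""
    x = np.asarray(freq, dtype=float)
    x = x - x.mean(axis=0)                      # centre each SNP column
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Hypothetical toy data: 4 samples x 3 SNPs.
minor_reads = np.array([[0, 30, 0],
                        [0, 28, 1],
                        [25, 0, 31],
                        [12, 15, 16]])
depth = np.array([[30, 30, 30],                 # assumed total read depth
                  [29, 28, 30],
                  [25, 30, 31],
                  [24, 30, 32]])

freq = minor_reads / depth                      # minor allele frequency
pcs = principal_components(freq)                # shape: (4 samples, 2 PCs)
```

In this toy matrix the last sample has frequencies halfway between the first and third samples, so it also lands halfway between them in principal component space.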
Overview:
The Iterative KNN has 5 steps:
- Mix: Select samples from each cluster and mix these samples to create pseudo mixtures.
- Merge: Merge the pseudo mixtures with the original data.
- Calculate: After a KNN, for each sample, calculate q, the adjusted fraction of neighbors that were pseudo mixtures (n_pseduo_neighbors).
- Remove: Automatically threshold q using its bimodal distribution, and remove samples exceeding the threshold from the next round of KNN.
- Repeat: Iterate the four steps above (Mix to Remove) until no samples exceed the cutoff.
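
The five steps can be sketched in Python roughly as follows. This is a minimal illustration under several assumptions, not the implementation in this repository: cluster labels are taken as given, each pseudo mixture is a random weighted average of two samples from different clusters in principal component space (a weighted average of allele frequencies projects to the same weighted average of principal components), q is the raw fraction of pseudo-mixture neighbors because the adjustment is not specified here, and an Otsu-style split with a floor (`min_cutoff`) stands in for the automatic bimodal thresholding. All function and parameter names are hypothetical.

```python
import numpy as np

def bimodal_cutoff(q):
    """Otsu-style split: cut sorted q where between-group variance is largest."""
    s = np.sort(q)
    best_t, best_var = np.inf, 0.0
    for i in range(1, len(s)):
        if s[i] == s[i - 1]:
            continue  # only split between distinct values
        lo, hi = s[:i], s[i:]
        var = len(lo) * len(hi) * (lo.mean() - hi.mean()) ** 2
        if var > best_var:
            best_t, best_var = (s[i - 1] + s[i]) / 2, var
    return best_t  # inf when all q are identical, so nothing is flagged

def iterative_knn(pcs, clusters, k=10, n_pseudo=200, min_cutoff=0.5,
                  max_rounds=20, seed=0):
    """Return a boolean mask marking samples flagged as mixtures."""
    rng = np.random.default_rng(seed)
    pcs, clusters = np.asarray(pcs, dtype=float), np.asarray(clusters)
    keep = np.ones(len(pcs), dtype=bool)        # samples still in play
    for _ in range(max_rounds):
        idx = np.flatnonzero(keep)
        labels = np.unique(clusters[idx])
        if len(labels) < 2:
            break
        # Mix: average two samples drawn from two different clusters.
        pseudo = np.empty((n_pseudo, pcs.shape[1]))
        for j in range(n_pseudo):
            a, b = rng.choice(labels, size=2, replace=False)
            i = rng.choice(idx[clusters[idx] == a])
            m = rng.choice(idx[clusters[idx] == b])
            w = rng.uniform(0.2, 0.8)           # assumed mixing ratios
            pseudo[j] = w * pcs[i] + (1 - w) * pcs[m]
        # Merge: real samples first, pseudo mixtures after them.
        merged = np.vstack([pcs[idx], pseudo])
        is_pseudo = np.r_[np.zeros(len(idx), bool), np.ones(n_pseudo, bool)]
        # Calculate: brute-force KNN, excluding each sample itself.
        d = np.linalg.norm(pcs[idx][:, None, :] - merged[None, :, :], axis=2)
        d[np.arange(len(idx)), np.arange(len(idx))] = np.inf
        nn = np.argsort(d, axis=1)[:, :k]
        q = is_pseudo[nn].mean(axis=1)
        # Remove: flag samples above the cutoff; stop when none remain.
        cutoff = max(bimodal_cutoff(q), min_cutoff)
        flagged = idx[q > cutoff]
        if flagged.size == 0:
            break
        keep[flagged] = False                   # Repeat with the rest
    return ~keep

# Toy data: two tight clusters plus five samples mixed at various ratios.
rng = np.random.default_rng(1)
strain_a = rng.normal([0.0, 0.0], 0.1, size=(30, 2))
strain_b = rng.normal([10.0, 0.0], 0.1, size=(30, 2))
w = np.array([0.7, 0.6, 0.5, 0.4, 0.3])[:, None]
mixtures = w * strain_a[:5] + (1 - w) * strain_b[:5]
pcs = np.vstack([strain_a, strain_b, mixtures])
clusters = np.array([0] * 30 + [1] * 30 + [0, 0, 0, 1, 1])

mixed = iterative_knn(pcs, clusters)
print(np.flatnonzero(mixed))  # indices of samples flagged as mixtures
```

The floor on the cutoff is needed in this sketch because a two-group split always finds some threshold, even when every remaining sample is unmixed; without it the loop would keep removing samples.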

Figure: a simple illustration of the iterative KNN method. A total of 19/20 samples, mixed at various ratios of two strains, were detected correctly.

Figure: barplot of the strain abundances within the first ten samples of the three-strain mixtures.