Corrections

- Added detail to usage.rst for metafounders. - Change of warning to user where they have multiple metafounders in an alternative allele probability file and use -est_alt_allele_prob as well. - Removal of redundant code and tidying in accuracy simulation for metafounders.
AlphaGenes · Nov 14, 2024 · ac219eb · ac219eb
1 parent f778cc2
commit ac219eb
Show file tree

Hide file tree

Showing 3 changed files with 70 additions and 95 deletions.
diff --git a/docs/source/usage.rst b/docs/source/usage.rst
@@ -136,7 +136,9 @@ For hybrid peeling, where a large amount (millions of segregating sites) of sequ
 
 The ``-geno_error_prob``, ``-seq_error_prob`` and ``-rec_length`` arguments control some of the model parameters used in the model. ``-seq_error_prob`` must not be zero. |Software| is robust to deviations in genotyping error rate and sequencing error rate so it is not recommended to use these options unless large deviations from the default are known. Changing the ``-length`` argument to match the genetic map length can increase accuracy in some situations.
 
-The ``-est_geno_error_prob`` and ``-est_seq_error_prob`` options estimate the genotyping error rate and the sequencing error rate based on miss-match between observed and inferred states. This option is generally not necessary and can increase runtime. ``-est_alt_allele_prob`` estimates the alternative allele frequencies before peeling using all available observed genotypes. This option can be useful if there are a large number of non-genotyped founders. ``-update_alt_allele_prob`` re-estimates the alternative allele frequencies per metafounder after each peeling cycle using the inferred genotype probabilities of the founders. For implementation of metafounders (**without** ``-alt_allele_prob_file``), both ``-est_alt_allele_prob`` and ``-update_alt_allele_prob`` should be used.
+The ``-est_geno_error_prob`` and ``-est_seq_error_prob`` options estimate the genotyping error rate and the sequencing error rate based on miss-match between observed and inferred states. This option is generally not necessary and can increase runtime. ``-est_alt_allele_prob`` estimates the alternative allele frequency for the base population before peeling using all available genotypes. This option can be useful if there are a large number of non-genotyped founders. ``-update_alt_allele_prob`` re-estimates the alternative allele frequencies per metafounder after each peeling cycle using the inferred genotype probabilities of the founders. 
+
+For a pedigree with multiple metafounders, the user has three options: (1) use ``-update_alt_allele_prob`` only, (2) use ``est_alt_allele_prob`` and ``-update_alt_allele_prob``, or (3) input estimates of the alternative allele frequencies for each metafounder via the ``-alt_allele_prob_file`` with or without ``-update_alt_allele_prob``.
 
 Hybrid peeling arguments 
 ------------------------
@@ -235,7 +237,7 @@ Example:
 Alternative Allele Probability File
 ===================================
 
-The alternative allele probability file allows for user-defined population alternative allele probabilities. This file contains the metafounder group denoted MF_x, where x is by default "1" but see ``-main_metafounder``, followed by alternative allele probabilities for all the markers. In case of multiple metafounders, provide multiple rows in the file. The default starting alternative allele probabilities are 0.5 for each marker. If you don't have information for some markers, provide 0.5 for these in the file.
+The alternative allele probability file allows for user-defined population alternative allele frequencies. This file contains the metafounder group denoted MF_x, where x is by default "1" but see ``-main_metafounder``, followed by alternative allele frequencies for all the markers. In case of multiple metafounders, provide multiple rows in the file. The default starting alternative allele frequencies are 0.5 for each marker. If you don't have information for some markers, provide 0.5 for these in the file.
 
 Example:
 

diff --git a/src/tinypeel/tinypeel.py b/src/tinypeel/tinypeel.py
@@ -56,9 +56,9 @@ def runPeelingCycles(pedigree, peelingInfo, args, singleLocusMode=False):
                         peelingInfo.nLoci, 0.5, dtype=np.float32
                     )
     if args.est_alt_allele_prob:
-        if args.alt_allele_prob_file is not None:
+        if len(pedigree.AAP) > 1:
             warnings.warn(
-                "-est_alt_allele_prob will update the inputted alternative allele frequencies using all available observed genotypes. Therefore, will overwrite any differences between metafounders."
+                "-est_alt_allele_prob will overwrite any differences between metafounders. To avoid this, please use -update_alt_allele_prob instead"
             )
         PeelingUpdates.updateMaf(pedigree, peelingInfo)
     for i in range(args.n_cycle):

diff --git a/tests/accuracy_tests/sim_for_alphapeel_accu_test/sim_for_metafounder_accu.R b/tests/accuracy_tests/sim_for_alphapeel_accu_test/sim_for_metafounder_accu.R
@@ -13,8 +13,6 @@ for (parameter in (1:nparams)) {
 nIndPerGen <- nInd / nGen
 
 nLociAllPerChr <- floor(nLociAll / nChr)
-nLociHDPerChr <- nLociHD / nChr
-nLociLDPerChr <- nLociLD / nChr
 
 founderGenomes <- runMacs(nInd = nIndPerGen, nChr = nChr, segSites = nLociAllPerChr, split = 1000)
 SP <- SimParam$new(founderPop = founderGenomes)
@@ -31,14 +29,10 @@ collectData <- function(pop, data = NULL, population, generation) {
   remove <- FALSE
   if (is.null(data)) {
     remove <- TRUE
-    data <- vector(mode = "list", length = 5)
-    names(data) <- c("pedigree", "haploIBD", "genoIBS", "Haplotype", "Genotype")
+    data <- vector(mode = "list", length = 3)
+    names(data) <- c("pedigree", "Haplotype", "Genotype")
     data$pedigree <- data.frame(id = NA, population = NA, generation = NA,
-                                mid = NA, fid = NA,
-                                gv1 = NA, pv1 = NA,
-                                gv2 = NA, pv2 = NA)
-    data$haploIBD <- matrix(data = NA, ncol = sum(pop@nLoci))
-    data$genoIBS <- matrix(data = NA, ncol = sum(pop@nLoci))
+                                mid = NA, fid = NA)
     data$Haplotype <- matrix(data = NA, ncol = sum(pop@nLoci))
     data$Genotype <- matrix(data = NA, ncol = sum(pop@nLoci))
   }
@@ -47,56 +41,46 @@ collectData <- function(pop, data = NULL, population, generation) {
                                     population = population,
                                     generation = generation,
                                     mid = pop@mother,
-                                    fid = pop@father,
-                                    gv1 = pop@gv[, 1],
-                                    pv1 = pop@pheno[, 1],
-                                    gv2 = pop@gv[, 2],
-                                    pv2 = pop@pheno[, 2]))
-  data$haploIBD <- rbind(data$haploIBD,
-                         pullIbdHaplo(pop = pop))
-  data$genoIBS <- rbind(data$genoIBS,
-                        pullSegSiteGeno(pop = pop))
-  listLoci <- seq.int(from = 1, to = sum(pop@nLoci), by = 1)
-  marker <- ""
-  i <- 1
-  for (i in 1:length(listLoci)){
-    ind <- paste("1_", listLoci[i], sep = "")
-    marker <- c(marker, ind)
-  }
-  marker <- marker[-1]
-
+                                    fid = pop@father
+                                    ))
   data$Haplotype <- rbind(data$Haplotype,
-                         pullMarkerHaplo(pop = pop, markers = marker))
+                         pullSegSiteHaplo(pop = pop))
   data$Genotype <- rbind(data$Genotype,
-                          pullMarkerGeno(pop = pop, markers = marker))
+                        pullSegSiteGeno(pop = pop))
   if (remove) {
     data$pedigree <- data$pedigree[-1, ]
-    data$haploIBD <- data$haploIBD[-1, ]
-    data$genoIBS <- data$genoIBS[-1, ]
     data$Haplotype <- data$Haplotype[-1, ]
     data$Genotype <- data$Genotype[-1, ]
   }
   return(data)
 }
 
 # Founder population & split
+# Number of individuals for each pop (A, B, and AB)
+subPopEarlyGen <- nIndPerGen*0.5
+subPopAperGen <- nIndPerGen*0.375
+subPopBperGen <- nIndPerGen*0.375
+subPopABperGen <- nIndPerGen*0.25
+nIndFoundersPerSubPop <- nIndPerGen*0.5
+nSelInd <- nIndPerGen*0.05
+
 founders <- newPop(rawPop = founderGenomes)
 founders <- setPheno(pop = founders, varE = diag(varE))
-popA <- founders[1:100]
-popB <- founders[101:200]
+popA <- founders[1:nIndFoundersPerSubPop]
+popB <- founders[(nIndFoundersPerSubPop+1):nIndPerGen]
 data <- collectData(pop = popA, data = NULL, population = "A", generation = 0)
 data <- collectData(pop = popB, data = data, population = "B", generation = 0)
 
 # Select on each trait and keep the populations separate
 for (generation in 1:nGen) {
-  parentsA <- selectInd(pop = popA, nInd = 10, trait = 1)
-  parentsB <- selectInd(pop = popB, nInd = 10, trait = 2)
+  parentsA <- selectInd(pop = popA, nInd = nSelInd, trait = 1)
+  parentsB <- selectInd(pop = popB, nInd = nSelInd, trait = 2)
   if (generation == nGen){
-    popA <- randCross(pop = parentsA, nCrosses = 75)
-    popB <- randCross(pop = parentsB, nCrosses = 75)
+    popA <- randCross(pop = parentsA, nCrosses = subPopAperGen)
+    popB <- randCross(pop = parentsB, nCrosses = subPopBperGen)
   } else {
-    popA <- randCross(pop = parentsA, nCrosses = 100)
-    popB <- randCross(pop = parentsB, nCrosses = 100)
+    popA <- randCross(pop = parentsA, nCrosses = subPopEarlyGen)
+    popB <- randCross(pop = parentsB, nCrosses = subPopEarlyGen)
   }
 
   popA <- setPheno(pop = popA, varE = diag(varE))
@@ -107,21 +91,21 @@ for (generation in 1:nGen) {
 
 # Continued selection on each trait in each separate population,
 # but add also continually admixed population selected on an index
-popAB <- randCross(pop = c(parentsA, parentsB), nCrosses = 50)
+popAB <- randCross(pop = c(parentsA, parentsB), nCrosses = subPopABperGen)
 popAB <- setPheno(pop = popAB, varE = diag(varE))
 data <- collectData(pop = popAB, data = data, population = "AB", generation = generation)
 economicWeights <- c(1, 1)
 selIndexWeights <- smithHazel(econWt = economicWeights, varG = varG, varP = varG + varE)
-for (generation in 6:10) {
-  parentsA <- selectInd(pop = popA, nInd = 10, trait = 1)
-  parentsB <- selectInd(pop = popB, nInd = 10, trait = 2)
-  parentsAB <- selectInd(pop = popAB, nInd = 6, trait = selIndex, scale = TRUE, b = selIndexWeights)
-  parentsA4AB <- selectInd(pop = popA, nInd = 2, trait = selIndex, scale = TRUE, b = selIndexWeights)
-  parentsB4AB <- selectInd(pop = popB, nInd = 2, trait = selIndex, scale = TRUE, b = selIndexWeights)
+for (generation in (nGen+1):(nGen*2)) {
+  parentsA <- selectInd(pop = popA, nInd = nSelInd, trait = 1)
+  parentsB <- selectInd(pop = popB, nInd = nSelInd, trait = 2)
+  parentsAB <- selectInd(pop = popAB, nInd = nSelInd*0.6, trait = selIndex, scale = TRUE, b = selIndexWeights)
+  parentsA4AB <- selectInd(pop = popA, nInd = nSelInd*0.2, trait = selIndex, scale = TRUE, b = selIndexWeights)
+  parentsB4AB <- selectInd(pop = popB, nInd = nSelInd*0.2, trait = selIndex, scale = TRUE, b = selIndexWeights)
   parentsAB <- c(parentsAB, parentsA4AB, parentsB4AB)
-  popA <- randCross(pop = parentsA, nCrosses = 75)
-  popB <- randCross(pop = parentsB, nCrosses = 75)
-  popAB <- randCross(pop = parentsAB, nCrosses = 50)
+  popA <- randCross(pop = parentsA, nCrosses = subPopAperGen)
+  popB <- randCross(pop = parentsB, nCrosses = subPopBperGen)
+  popAB <- randCross(pop = parentsAB, nCrosses = subPopABperGen)
   popA <- setPheno(pop = popA, varE = diag(varE))
   popB <- setPheno(pop = popB, varE = diag(varE))
   popAB <- setPheno(pop = popAB, varE = diag(varE))
@@ -132,72 +116,62 @@ for (generation in 6:10) {
 
 data$pedigree$population <- factor(data$pedigree$population, levels = c("A", "B", "AB"))
 summary(data$pedigree$population)
-data$pedigree$gv <- c(selIndex(Y = as.matrix(data$pedigree[, c("gv1", "gv2")]), scale = TRUE, b = selIndexWeights))
-data$pedigree$pv <- c(selIndex(Y = as.matrix(data$pedigree[, c("pv1", "pv2")]), scale = TRUE, b = selIndexWeights))
-
-data$pedigree$generationPlotShift <- data$pedigree$generation +
-  c(-0.25, +0.25, 0)[data$pedigree$population]
 
 # Get pedigree and assign metafounders
 pedigree <- data.frame(data$pedigree$id, data$pedigree$mid, data$pedigree$fid, data$pedigree$population, data$pedigree$generation)
 colnames(pedigree) <- c("id", "mid", "fid", "population", "generation")
-pedigree <- pedigree[c(401:1400),]
-pedigree$mid[c(1:100)] <- "MF_1"
-pedigree$mid[c(101:200)] <- "MF_2"
-pedigree$fid[c(1:100)] <- "MF_1"
-pedigree$fid[c(101:200)] <- "MF_2"
+# Take nInd (1000) individuals across nGen (five) generations.
+pedStart <- nIndPerGen*2
+pedEnd <- pedStart + nInd
+pedigree <- pedigree[c((pedStart + 1):pedEnd),]
+pedigree$mid[c(1:nIndFoundersPerSubPop)] <- "MF_1"
+pedigree$mid[c((nIndFoundersPerSubPop + 1):nIndPerGen)] <- "MF_2"
+pedigree$fid[c(1:nIndFoundersPerSubPop)] <- "MF_1"
+pedigree$fid[c((nIndFoundersPerSubPop + 1):nIndPerGen)] <- "MF_2"
 
 pedigree <- pedigree[c(-4, -5)]
-pedigree_noMF <- pedigree
-pedigree_noMF$mid[c(1:200)] <- "0"
-pedigree_noMF$fid[c(1:200)] <- "0"
 
 # Get the genotypes (true) and observed (i.e with some missing)
 genotypes <- data.frame(data$Genotype)
-genotypes <- genotypes[c(401:1400),]
-
-randFounder <- sort(sample.int(100, 70))
+genotypes <- genotypes[(pedStart + 1):pedEnd, ]
 genotypes_obsv <- genotypes
-genotypes_obsv[randFounder, 1:2000] <- "9"
-randFounder <- sort(sample.int(100, 70)) + 100
-genotypes_obsv[randFounder, 1:2000] <- "9"
-randFounder <- sort(sample.int(100, 70)) + 600
-genotypes_obsv[randFounder, 1:2000] <- "9"
-randFounder <- sort(sample.int(100, 70)) + 200
-genotypes_obsv[randFounder, 1:2000] <- "9"
-randFounder <- sort(sample.int(100, 70)) + 700
-genotypes_obsv[randFounder, 1:2000] <- "9"
-randFounder <- sort(sample.int(100, 40)) + 300
-genotypes_obsv[randFounder, 1:2000] <- "9"
-randFounder <- sort(sample.int(100, 40)) + 800
-genotypes_obsv[randFounder, 1:2000] <- "9"
+
+# Define sample sizes and corresponding offsets
+nUnknown <- c(0.7, 0.7, 0.7, 0.7, 0.4, 0.4) * nIndFoundersPerSubPop
+offsets <- c(0, 1, 2, 3, 4, 5) * nIndFoundersPerSubPop
+
+for (i in seq_along(nUnknown)){
+  randFounder <- sort(sample.int(nIndFoundersPerSubPop, nUnknown[i])) + offsets[i]
+  genotypes_obsv[randFounder, 1:nLociAllPerChr] <- "9"
+}
 
 # Estimate the alternative allele frequencies.
-founders_genotypes <- genotypes[c(1:100),]
+founders_genotypes <- genotypes[c(1:nIndFoundersPerSubPop),]
 alt_allele_prob_A <- colSums(founders_genotypes)
 alt_allele_prob_A <- alt_allele_prob_A/(2*nrow(founders_genotypes))
 
-founders_genotypes <- genotypes[c(101:200),]
+founders_genotypes <- genotypes[c((nIndFoundersPerSubPop + 1):nIndPerGen),]
 alt_allele_prob_B <- colSums(founders_genotypes)
 alt_allele_prob_B <- alt_allele_prob_B/(2*nrow(founders_genotypes))
 
-alt_allele_prob_input_MF <- data.frame(matrix(nrow = 2, ncol = 2001))
+alt_allele_prob_input_MF <- data.frame(matrix(nrow = 2, ncol = (nLociAllPerChr + 1)))
 alt_allele_prob_input_MF[1] <- c("MF_1", "MF_2")
-alt_allele_prob_input_MF[1,2:2001] <- alt_allele_prob_A
-alt_allele_prob_input_MF[2,2:2001] <- alt_allele_prob_B
+alt_allele_prob_input_MF[1,2:(nLociAllPerChr + 1)] <- alt_allele_prob_A
+alt_allele_prob_input_MF[2,2:(nLociAllPerChr + 1)] <- alt_allele_prob_B
 
-founders_genotypes <- genotypes[c(1:200),]
+# Singular alternative allele frequency is not used in testing, but calculated for comparison.
+founders_genotypes <- genotypes[c(1:nIndPerGen),]
 alt_allele_prob <- colSums(founders_genotypes)
 alt_allele_prob <- alt_allele_prob/(2*nrow(founders_genotypes))
 
-alt_allele_prob_input_noMF <- data.frame(matrix(nrow = 1, ncol = 2001))
+alt_allele_prob_input_noMF <- data.frame(matrix(nrow = 1, ncol = (nLociAllPerChr + 1)))
 alt_allele_prob_input_noMF[1] <- "MF_1"
-alt_allele_prob_input_noMF[1,2:2001] <- alt_allele_prob
+alt_allele_prob_input_noMF[1,2:(nLociAllPerChr + 1)] <- alt_allele_prob
 
 # Haplotypes
 # Get haplotypes for phased and unphased genotype probabilities
 haplotypes <- data.frame(data$Haplotype)
-haplotypes <- haplotypes[c(401:1400),]
+haplotypes <- data$Haplotype[c((1+pedStart*2):(pedEnd*2)),]
 
 for (ind in 1:nInd) {
   maternal <- haplotypes[ind * 2 - 1, ]
@@ -206,7 +180,6 @@ for (ind in 1:nInd) {
 }
 
 # ----- Phased genotypes probability------
-haplotypes <- data$Haplotype[c(801:2800),]
 
 phasedGenotypes <- matrix(data = 0, nrow = nInd * 4, ncol = nLociAll + 1)
 for (ind in (1:nInd)) {
@@ -246,9 +219,9 @@ for (ind in (1:nInd)) {
   UnphasedGenotypes[((ind - 1) * 3 + 1):(ind * 3), 1] <- ind
 }
 
-# To match the ids (+ 400)
-phasedGenotypes[,1] <- phasedGenotypes[,1] + 400
-UnphasedGenotypes[,1] <- UnphasedGenotypes[,1] + 400
+# To match the ids
+phasedGenotypes[,1] <- phasedGenotypes[,1] + pedStart
+UnphasedGenotypes[,1] <- UnphasedGenotypes[,1] + pedStart
 
 # Save the results/data for testing
 write.table(x = pedigree, file = "metafounder_ped_file.txt",