Skip to content

Commit

Permalink
Corrections
Browse files Browse the repository at this point in the history
- Added detail to usage.rst for metafounders.
- Change of warning to user where they have multiple metafounders in an alternative allele probability file and use -est_alt_allele_prob as well.
- Removal of redundant code and tidying in accuracy simulation for metafounders.
  • Loading branch information
RosCraddock committed Nov 14, 2024
1 parent f778cc2 commit ac219eb
Show file tree
Hide file tree
Showing 3 changed files with 70 additions and 95 deletions.
6 changes: 4 additions & 2 deletions docs/source/usage.rst
Original file line number Diff line number Diff line change
Expand Up @@ -136,7 +136,9 @@ For hybrid peeling, where a large amount (millions of segregating sites) of sequ

The ``-geno_error_prob``, ``-seq_error_prob`` and ``-rec_length`` arguments control some of the model parameters used in the model. ``-seq_error_prob`` must not be zero. |Software| is robust to deviations in genotyping error rate and sequencing error rate so it is not recommended to use these options unless large deviations from the default are known. Changing the ``-length`` argument to match the genetic map length can increase accuracy in some situations.

The ``-est_geno_error_prob`` and ``-est_seq_error_prob`` options estimate the genotyping error rate and the sequencing error rate based on miss-match between observed and inferred states. This option is generally not necessary and can increase runtime. ``-est_alt_allele_prob`` estimates the alternative allele frequencies before peeling using all available observed genotypes. This option can be useful if there are a large number of non-genotyped founders. ``-update_alt_allele_prob`` re-estimates the alternative allele frequencies per metafounder after each peeling cycle using the inferred genotype probabilities of the founders. For implementation of metafounders (**without** ``-alt_allele_prob_file``), both ``-est_alt_allele_prob`` and ``-update_alt_allele_prob`` should be used.
The ``-est_geno_error_prob`` and ``-est_seq_error_prob`` options estimate the genotyping error rate and the sequencing error rate based on miss-match between observed and inferred states. This option is generally not necessary and can increase runtime. ``-est_alt_allele_prob`` estimates the alternative allele frequency for the base population before peeling using all available genotypes. This option can be useful if there are a large number of non-genotyped founders. ``-update_alt_allele_prob`` re-estimates the alternative allele frequencies per metafounder after each peeling cycle using the inferred genotype probabilities of the founders.

For a pedigree with multiple metafounders, the user has three options: (1) use ``-update_alt_allele_prob`` only, (2) use ``est_alt_allele_prob`` and ``-update_alt_allele_prob``, or (3) input estimates of the alternative allele frequencies for each metafounder via the ``-alt_allele_prob_file`` with or without ``-update_alt_allele_prob``.

Hybrid peeling arguments
------------------------
Expand Down Expand Up @@ -235,7 +237,7 @@ Example:
Alternative Allele Probability File
===================================

The alternative allele probability file allows for user-defined population alternative allele probabilities. This file contains the metafounder group denoted MF_x, where x is by default "1" but see ``-main_metafounder``, followed by alternative allele probabilities for all the markers. In case of multiple metafounders, provide multiple rows in the file. The default starting alternative allele probabilities are 0.5 for each marker. If you don't have information for some markers, provide 0.5 for these in the file.
The alternative allele probability file allows for user-defined population alternative allele frequencies. This file contains the metafounder group denoted MF_x, where x is by default "1" but see ``-main_metafounder``, followed by alternative allele frequencies for all the markers. In case of multiple metafounders, provide multiple rows in the file. The default starting alternative allele frequencies are 0.5 for each marker. If you don't have information for some markers, provide 0.5 for these in the file.

Example:

Expand Down
4 changes: 2 additions & 2 deletions src/tinypeel/tinypeel.py
Original file line number Diff line number Diff line change
Expand Up @@ -56,9 +56,9 @@ def runPeelingCycles(pedigree, peelingInfo, args, singleLocusMode=False):
peelingInfo.nLoci, 0.5, dtype=np.float32
)
if args.est_alt_allele_prob:
if args.alt_allele_prob_file is not None:
if len(pedigree.AAP) > 1:
warnings.warn(
"-est_alt_allele_prob will update the inputted alternative allele frequencies using all available observed genotypes. Therefore, will overwrite any differences between metafounders."
"-est_alt_allele_prob will overwrite any differences between metafounders. To avoid this, please use -update_alt_allele_prob instead"
)
PeelingUpdates.updateMaf(pedigree, peelingInfo)
for i in range(args.n_cycle):
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -13,8 +13,6 @@ for (parameter in (1:nparams)) {
nIndPerGen <- nInd / nGen

nLociAllPerChr <- floor(nLociAll / nChr)
nLociHDPerChr <- nLociHD / nChr
nLociLDPerChr <- nLociLD / nChr

founderGenomes <- runMacs(nInd = nIndPerGen, nChr = nChr, segSites = nLociAllPerChr, split = 1000)
SP <- SimParam$new(founderPop = founderGenomes)
Expand All @@ -31,14 +29,10 @@ collectData <- function(pop, data = NULL, population, generation) {
remove <- FALSE
if (is.null(data)) {
remove <- TRUE
data <- vector(mode = "list", length = 5)
names(data) <- c("pedigree", "haploIBD", "genoIBS", "Haplotype", "Genotype")
data <- vector(mode = "list", length = 3)
names(data) <- c("pedigree", "Haplotype", "Genotype")
data$pedigree <- data.frame(id = NA, population = NA, generation = NA,
mid = NA, fid = NA,
gv1 = NA, pv1 = NA,
gv2 = NA, pv2 = NA)
data$haploIBD <- matrix(data = NA, ncol = sum(pop@nLoci))
data$genoIBS <- matrix(data = NA, ncol = sum(pop@nLoci))
mid = NA, fid = NA)
data$Haplotype <- matrix(data = NA, ncol = sum(pop@nLoci))
data$Genotype <- matrix(data = NA, ncol = sum(pop@nLoci))
}
Expand All @@ -47,56 +41,46 @@ collectData <- function(pop, data = NULL, population, generation) {
population = population,
generation = generation,
mid = pop@mother,
fid = pop@father,
gv1 = pop@gv[, 1],
pv1 = pop@pheno[, 1],
gv2 = pop@gv[, 2],
pv2 = pop@pheno[, 2]))
data$haploIBD <- rbind(data$haploIBD,
pullIbdHaplo(pop = pop))
data$genoIBS <- rbind(data$genoIBS,
pullSegSiteGeno(pop = pop))
listLoci <- seq.int(from = 1, to = sum(pop@nLoci), by = 1)
marker <- ""
i <- 1
for (i in 1:length(listLoci)){
ind <- paste("1_", listLoci[i], sep = "")
marker <- c(marker, ind)
}
marker <- marker[-1]

fid = pop@father
))
data$Haplotype <- rbind(data$Haplotype,
pullMarkerHaplo(pop = pop, markers = marker))
pullSegSiteHaplo(pop = pop))
data$Genotype <- rbind(data$Genotype,
pullMarkerGeno(pop = pop, markers = marker))
pullSegSiteGeno(pop = pop))
if (remove) {
data$pedigree <- data$pedigree[-1, ]
data$haploIBD <- data$haploIBD[-1, ]
data$genoIBS <- data$genoIBS[-1, ]
data$Haplotype <- data$Haplotype[-1, ]
data$Genotype <- data$Genotype[-1, ]
}
return(data)
}

# Founder population & split
# Number of individuals for each pop (A, B, and AB)
subPopEarlyGen <- nIndPerGen*0.5
subPopAperGen <- nIndPerGen*0.375
subPopBperGen <- nIndPerGen*0.375
subPopABperGen <- nIndPerGen*0.25
nIndFoundersPerSubPop <- nIndPerGen*0.5
nSelInd <- nIndPerGen*0.05

founders <- newPop(rawPop = founderGenomes)
founders <- setPheno(pop = founders, varE = diag(varE))
popA <- founders[1:100]
popB <- founders[101:200]
popA <- founders[1:nIndFoundersPerSubPop]
popB <- founders[(nIndFoundersPerSubPop+1):nIndPerGen]
data <- collectData(pop = popA, data = NULL, population = "A", generation = 0)
data <- collectData(pop = popB, data = data, population = "B", generation = 0)

# Select on each trait and keep the populations separate
for (generation in 1:nGen) {
parentsA <- selectInd(pop = popA, nInd = 10, trait = 1)
parentsB <- selectInd(pop = popB, nInd = 10, trait = 2)
parentsA <- selectInd(pop = popA, nInd = nSelInd, trait = 1)
parentsB <- selectInd(pop = popB, nInd = nSelInd, trait = 2)
if (generation == nGen){
popA <- randCross(pop = parentsA, nCrosses = 75)
popB <- randCross(pop = parentsB, nCrosses = 75)
popA <- randCross(pop = parentsA, nCrosses = subPopAperGen)
popB <- randCross(pop = parentsB, nCrosses = subPopBperGen)
} else {
popA <- randCross(pop = parentsA, nCrosses = 100)
popB <- randCross(pop = parentsB, nCrosses = 100)
popA <- randCross(pop = parentsA, nCrosses = subPopEarlyGen)
popB <- randCross(pop = parentsB, nCrosses = subPopEarlyGen)
}

popA <- setPheno(pop = popA, varE = diag(varE))
Expand All @@ -107,21 +91,21 @@ for (generation in 1:nGen) {

# Continued selection on each trait in each separate population,
# but add also continually admixed population selected on an index
popAB <- randCross(pop = c(parentsA, parentsB), nCrosses = 50)
popAB <- randCross(pop = c(parentsA, parentsB), nCrosses = subPopABperGen)
popAB <- setPheno(pop = popAB, varE = diag(varE))
data <- collectData(pop = popAB, data = data, population = "AB", generation = generation)
economicWeights <- c(1, 1)
selIndexWeights <- smithHazel(econWt = economicWeights, varG = varG, varP = varG + varE)
for (generation in 6:10) {
parentsA <- selectInd(pop = popA, nInd = 10, trait = 1)
parentsB <- selectInd(pop = popB, nInd = 10, trait = 2)
parentsAB <- selectInd(pop = popAB, nInd = 6, trait = selIndex, scale = TRUE, b = selIndexWeights)
parentsA4AB <- selectInd(pop = popA, nInd = 2, trait = selIndex, scale = TRUE, b = selIndexWeights)
parentsB4AB <- selectInd(pop = popB, nInd = 2, trait = selIndex, scale = TRUE, b = selIndexWeights)
for (generation in (nGen+1):(nGen*2)) {
parentsA <- selectInd(pop = popA, nInd = nSelInd, trait = 1)
parentsB <- selectInd(pop = popB, nInd = nSelInd, trait = 2)
parentsAB <- selectInd(pop = popAB, nInd = nSelInd*0.6, trait = selIndex, scale = TRUE, b = selIndexWeights)
parentsA4AB <- selectInd(pop = popA, nInd = nSelInd*0.2, trait = selIndex, scale = TRUE, b = selIndexWeights)
parentsB4AB <- selectInd(pop = popB, nInd = nSelInd*0.2, trait = selIndex, scale = TRUE, b = selIndexWeights)
parentsAB <- c(parentsAB, parentsA4AB, parentsB4AB)
popA <- randCross(pop = parentsA, nCrosses = 75)
popB <- randCross(pop = parentsB, nCrosses = 75)
popAB <- randCross(pop = parentsAB, nCrosses = 50)
popA <- randCross(pop = parentsA, nCrosses = subPopAperGen)
popB <- randCross(pop = parentsB, nCrosses = subPopBperGen)
popAB <- randCross(pop = parentsAB, nCrosses = subPopABperGen)
popA <- setPheno(pop = popA, varE = diag(varE))
popB <- setPheno(pop = popB, varE = diag(varE))
popAB <- setPheno(pop = popAB, varE = diag(varE))
Expand All @@ -132,72 +116,62 @@ for (generation in 6:10) {

data$pedigree$population <- factor(data$pedigree$population, levels = c("A", "B", "AB"))
summary(data$pedigree$population)
data$pedigree$gv <- c(selIndex(Y = as.matrix(data$pedigree[, c("gv1", "gv2")]), scale = TRUE, b = selIndexWeights))
data$pedigree$pv <- c(selIndex(Y = as.matrix(data$pedigree[, c("pv1", "pv2")]), scale = TRUE, b = selIndexWeights))

data$pedigree$generationPlotShift <- data$pedigree$generation +
c(-0.25, +0.25, 0)[data$pedigree$population]

# Get pedigree and assign metafounders
pedigree <- data.frame(data$pedigree$id, data$pedigree$mid, data$pedigree$fid, data$pedigree$population, data$pedigree$generation)
colnames(pedigree) <- c("id", "mid", "fid", "population", "generation")
pedigree <- pedigree[c(401:1400),]
pedigree$mid[c(1:100)] <- "MF_1"
pedigree$mid[c(101:200)] <- "MF_2"
pedigree$fid[c(1:100)] <- "MF_1"
pedigree$fid[c(101:200)] <- "MF_2"
# Take nInd (1000) individuals across nGen (five) generations.
pedStart <- nIndPerGen*2
pedEnd <- pedStart + nInd
pedigree <- pedigree[c((pedStart + 1):pedEnd),]
pedigree$mid[c(1:nIndFoundersPerSubPop)] <- "MF_1"
pedigree$mid[c((nIndFoundersPerSubPop + 1):nIndPerGen)] <- "MF_2"
pedigree$fid[c(1:nIndFoundersPerSubPop)] <- "MF_1"
pedigree$fid[c((nIndFoundersPerSubPop + 1):nIndPerGen)] <- "MF_2"

pedigree <- pedigree[c(-4, -5)]
pedigree_noMF <- pedigree
pedigree_noMF$mid[c(1:200)] <- "0"
pedigree_noMF$fid[c(1:200)] <- "0"

# Get the genotypes (true) and observed (i.e with some missing)
genotypes <- data.frame(data$Genotype)
genotypes <- genotypes[c(401:1400),]

randFounder <- sort(sample.int(100, 70))
genotypes <- genotypes[(pedStart + 1):pedEnd, ]
genotypes_obsv <- genotypes
genotypes_obsv[randFounder, 1:2000] <- "9"
randFounder <- sort(sample.int(100, 70)) + 100
genotypes_obsv[randFounder, 1:2000] <- "9"
randFounder <- sort(sample.int(100, 70)) + 600
genotypes_obsv[randFounder, 1:2000] <- "9"
randFounder <- sort(sample.int(100, 70)) + 200
genotypes_obsv[randFounder, 1:2000] <- "9"
randFounder <- sort(sample.int(100, 70)) + 700
genotypes_obsv[randFounder, 1:2000] <- "9"
randFounder <- sort(sample.int(100, 40)) + 300
genotypes_obsv[randFounder, 1:2000] <- "9"
randFounder <- sort(sample.int(100, 40)) + 800
genotypes_obsv[randFounder, 1:2000] <- "9"

# Define sample sizes and corresponding offsets
nUnknown <- c(0.7, 0.7, 0.7, 0.7, 0.4, 0.4) * nIndFoundersPerSubPop
offsets <- c(0, 1, 2, 3, 4, 5) * nIndFoundersPerSubPop

for (i in seq_along(nUnknown)){
randFounder <- sort(sample.int(nIndFoundersPerSubPop, nUnknown[i])) + offsets[i]
genotypes_obsv[randFounder, 1:nLociAllPerChr] <- "9"
}

# Estimate the alternative allele frequencies.
founders_genotypes <- genotypes[c(1:100),]
founders_genotypes <- genotypes[c(1:nIndFoundersPerSubPop),]
alt_allele_prob_A <- colSums(founders_genotypes)
alt_allele_prob_A <- alt_allele_prob_A/(2*nrow(founders_genotypes))

founders_genotypes <- genotypes[c(101:200),]
founders_genotypes <- genotypes[c((nIndFoundersPerSubPop + 1):nIndPerGen),]
alt_allele_prob_B <- colSums(founders_genotypes)
alt_allele_prob_B <- alt_allele_prob_B/(2*nrow(founders_genotypes))

alt_allele_prob_input_MF <- data.frame(matrix(nrow = 2, ncol = 2001))
alt_allele_prob_input_MF <- data.frame(matrix(nrow = 2, ncol = (nLociAllPerChr + 1)))
alt_allele_prob_input_MF[1] <- c("MF_1", "MF_2")
alt_allele_prob_input_MF[1,2:2001] <- alt_allele_prob_A
alt_allele_prob_input_MF[2,2:2001] <- alt_allele_prob_B
alt_allele_prob_input_MF[1,2:(nLociAllPerChr + 1)] <- alt_allele_prob_A
alt_allele_prob_input_MF[2,2:(nLociAllPerChr + 1)] <- alt_allele_prob_B

founders_genotypes <- genotypes[c(1:200),]
# Singular alternative allele frequency is not used in testing, but calculated for comparison.
founders_genotypes <- genotypes[c(1:nIndPerGen),]
alt_allele_prob <- colSums(founders_genotypes)
alt_allele_prob <- alt_allele_prob/(2*nrow(founders_genotypes))

alt_allele_prob_input_noMF <- data.frame(matrix(nrow = 1, ncol = 2001))
alt_allele_prob_input_noMF <- data.frame(matrix(nrow = 1, ncol = (nLociAllPerChr + 1)))
alt_allele_prob_input_noMF[1] <- "MF_1"
alt_allele_prob_input_noMF[1,2:2001] <- alt_allele_prob
alt_allele_prob_input_noMF[1,2:(nLociAllPerChr + 1)] <- alt_allele_prob

# Haplotypes
# Get haplotypes for phased and unphased genotype probabilities
haplotypes <- data.frame(data$Haplotype)
haplotypes <- haplotypes[c(401:1400),]
haplotypes <- data$Haplotype[c((1+pedStart*2):(pedEnd*2)),]

for (ind in 1:nInd) {
maternal <- haplotypes[ind * 2 - 1, ]
Expand All @@ -206,7 +180,6 @@ for (ind in 1:nInd) {
}

# ----- Phased genotypes probability------
haplotypes <- data$Haplotype[c(801:2800),]

phasedGenotypes <- matrix(data = 0, nrow = nInd * 4, ncol = nLociAll + 1)
for (ind in (1:nInd)) {
Expand Down Expand Up @@ -246,9 +219,9 @@ for (ind in (1:nInd)) {
UnphasedGenotypes[((ind - 1) * 3 + 1):(ind * 3), 1] <- ind
}

# To match the ids (+ 400)
phasedGenotypes[,1] <- phasedGenotypes[,1] + 400
UnphasedGenotypes[,1] <- UnphasedGenotypes[,1] + 400
# To match the ids
phasedGenotypes[,1] <- phasedGenotypes[,1] + pedStart
UnphasedGenotypes[,1] <- UnphasedGenotypes[,1] + pedStart

# Save the results/data for testing
write.table(x = pedigree, file = "metafounder_ped_file.txt",
Expand Down

0 comments on commit ac219eb

Please sign in to comment.