Parametric bootstrap simulations for global models #125

MengLu-flw · 2024-01-03T16:09:31Z

Hi all :^)

I would like to run parametric bootstraps with ‘gimble simulate’ for two purposes: (1) to compare the two nested models DIV and IM_BA by ΔlnCL (the improvement in fit), and (2) to obtain 95% confidence intervals (CIs) for the parameters inferred under the best-fitting global model.
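
For purpose (1), the usual parametric-bootstrap logic can be sketched as below. This is only an illustration under stated assumptions: each replicate is simulated under the simpler model (DIV) and then fitted under both DIV and IM_BA, giving one ΔlnCL value per replicate. The function name `bootstrap_p_value` and all numbers are hypothetical placeholders, not gimble output.

```python
# Sketch of a parametric-bootstrap comparison of two nested models.
# Assumption: replicates are simulated under the simpler model (DIV) and
# fitted under both DIV and IM_BA, giving one delta lnCL per replicate.
# All numbers below are made-up placeholders, not gimble output.


def bootstrap_p_value(observed_delta, null_deltas):
    """Fraction of null replicates whose delta lnCL is >= the observed one."""
    exceed = sum(1 for d in null_deltas if d >= observed_delta)
    # Add-one correction so the estimate is never exactly zero.
    return (exceed + 1) / (len(null_deltas) + 1)


placeholder_null = [0.4, 1.1, 0.0, 2.3, 0.7, 3.9, 0.2, 1.6, 0.9, 0.5]
placeholder_observed = 3.0
p_value = bootstrap_p_value(placeholder_observed, placeholder_null)
print(f"bootstrap p-value: {p_value:.3f}")
```

With 100 replicates the smallest p-value this estimator can return is 1/101, which sets the resolution of the test.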

I am not sure how to simulate a dataset partitioned into windows that is analogous to the empirical data - i.e., how to define -w (the number of windows per replicate) and -n (the number of blocks per window).

For example, here is the ‘gimble info’ output for my empirical data:

……
[+] │  ├── sample sets ................................................................................       91
[+] │  ├── INTER-population sample-sets (X)                                                          49
[+] │  ├── INTRA-population sample-sets (A)                                                          21
[+] │  └── INTRA-population sample-sets (B)                                                          21
……
[+] blocks/
[+] ├── '-l 64 -m 128 -u 3 -i 3' ............ 238,732,763 blocks (22.54% discarded)
[+] │                                                          X             A             B
[+] ├── BED interval sites in blocks (estimated %)        16.94%        13.90%        50.31%
[+] ├── Total blocks                                  90,953,757    31,988,378   115,790,628
[+] ├── Invariant blocks                                  18.84%        39.64%        39.19%
[+] ├── Four-gamete-violation blocks                       1.60%         0.29%         0.24%
……
[+] windows/
[+] └── '-w 76560 -s 15312' ............................ 5,913 windows of inter-population (X) blocks

To simulate an analogous dataset without down-sampling, I thought I would use:

-l 64 (block length 64, same as I defined for my empirical data)

-w (I am not sure how to set this, but if recombination occurs within each window, it makes biological sense to me to set the number of windows equal to the number of chromosomes in my empirical data)

-n (I am not sure how to set this either. I thought the total number of blocks to simulate would be the empirical total number of X blocks divided by the empirical number of X sample sets, i.e. 90,953,757 / 49 = 1,856,199.122, so that w × n ≈ 1,856,199.)

--replicates 100, --kmax 2,2,2,2.
--samples_A, --samples_B and --mu will be the same as my empirical data.
--model, --Ne_A, --Ne_B, --Ne_AB, --T, --me will use the values estimated under the corresponding models from the empirical data.

My questions at this step:
(1) Is my interpretation of -w and -n correct?
(2) If I need to down-sample, how should I do it? E.g., for 10% down-sampling of the simulated data relative to the amount of empirical data, should I just decrease the number of windows (-w) in my simulations?
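
The arithmetic behind questions (1) and (2) can be written out as a quick check. This is only a sketch of the reasoning in this post, not a statement of how gimble itself partitions simulated data. The helper names are hypothetical, the window count 294 is the value passed to --windows in the command below, and "10% down-sampling" is assumed here to mean keeping 10% of the windows.

```python
# Sketch of the block-count reasoning in this post (not gimble's own logic).
# Assumption: one replicate should hold about
# (total X blocks) / (number of X sample sets) blocks, split into w windows.

TOTAL_X_BLOCKS = 90_953_757  # "Total blocks" (X) from gimble info
X_SAMPLE_SETS = 49           # INTER-population sample sets (X)


def blocks_per_window(total_blocks, sample_sets, windows):
    """Return n such that windows * n approximates total_blocks / sample_sets."""
    return round(total_blocks / sample_sets / windows)


def downsampled_windows(windows, keep_fraction):
    """Assumed down-sampling: keep a fraction of the windows, leave n unchanged."""
    return max(1, round(windows * keep_fraction))


windows = 294  # value passed to --windows in the command in this post
n = blocks_per_window(TOTAL_X_BLOCKS, X_SAMPLE_SETS, windows)
target = TOTAL_X_BLOCKS / X_SAMPLE_SETS

print(f"target blocks per replicate: {target:,.3f}")
print(f"-w {windows} -n {n} gives {windows * n:,} blocks")
print(f"keeping 10% of windows: -w {downsampled_windows(windows, 0.10)} -n {n}")
```

Whether down-sampling should reduce -w, -n, or both is exactly what question (2) asks, so the second helper only encodes the option proposed there.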

Although I didn’t fully understand how to define -w and -n (apologies), I had a go at ‘gimble simulate’ with the command below:

gimble simulate -z Lina_Gstore.z \
    -e 79 -p 10 \
    -k 2,2,2,2 -u 7e-09 --rec_rate 1.85 \
    -m DIV -s DIV_294w -r 100 \
    --windows 294 --blocks 6213 --block_length 64 \
    -a 7 -b 7 \
    -A 372683 -B 251619 -C 1009867 -T 763249

The recombination rate that I used is an averaged rate for plants (https://doi.org/10.1098/rstb.2016.0455, Table 1), so it is a very crude estimate...

The command above took 30h:27m:00.704s to complete.

To get the 95% CIs for the DIV model, I re-fitted a DIV model to this simulated data, using the same parameter boundaries that I set for my empirical dataset.

gimble optimize -z Lina_Gstore.z -d simulate/DIV_294w -w \
                -g CRS2 -e 38 -i 10000 \
                -r A -u 7e-09 \
                -A 20_000,1_500_000 -B 20_000,1_500_000 \
                -C 200_000,3_000_000 -T 100_000,5_000_000 \
                -m DIV -l DIV_DIVmodelCompare

Then I queried this label to get a summary of the 100 replicates:

gimble query -z Lina_Gstore.z -l optimize/DIV_294w.windowsum/DIV_DIVmodelCompare

I found that the ranges across the 100 bootstrap replicates do not cover the DIV parameter values estimated from my empirical data.
For example, the ranges based on the simulated data are:
Ne_A_B: 890,126.572 - 895,261.233
T: 1,025,487.446 - 1,032,241.796
Ne_A: 457,520.165 - 462,095.671
Ne_B: 300,091.436 - 302,540.218

In contrast, the parameters estimated from the empirical data (the same values used in the simulation) are:
Ne_A_B: 1,009,867
T: 763,249
Ne_A: 372,683
Ne_B: 251,619
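
The comparison between the two lists above can be made explicit with a small check. This is a minimal sketch using only the numbers quoted in this post; the function name `coverage_report` is hypothetical, and the midpoint of each range is used as a rough stand-in for the bootstrap mean because only the range endpoints are given here.

```python
# Bootstrap ranges (min, max over 100 replicates) and the empirical estimates
# that were used as simulation inputs, both copied from this post.
bootstrap_ranges = {
    "Ne_A_B": (890_126.572, 895_261.233),
    "T": (1_025_487.446, 1_032_241.796),
    "Ne_A": (457_520.165, 462_095.671),
    "Ne_B": (300_091.436, 302_540.218),
}
empirical = {"Ne_A_B": 1_009_867, "T": 763_249, "Ne_A": 372_683, "Ne_B": 251_619}


def coverage_report(ranges, truth):
    """For each parameter, report coverage and the relative offset of the midpoint."""
    report = {}
    for name, (low, high) in ranges.items():
        midpoint = (low + high) / 2  # rough stand-in for the bootstrap mean
        report[name] = {
            "covered": low <= truth[name] <= high,
            "relative_offset": (midpoint - truth[name]) / truth[name],
        }
    return report


report = coverage_report(bootstrap_ranges, empirical)
for name, row in report.items():
    print(f"{name}: covered={row['covered']}, offset={row['relative_offset']:+.1%}")
```

None of the four ranges contains its simulated value: Ne_A_B is recovered too low, while T, Ne_A and Ne_B are recovered too high.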

My question: Is this ‘deviation’ caused by the recombination rate? Or have I done something wrong when simulating data?

Thank you so much! I am very much looking forward to your insightful reply!

Happy New Year :^)
Meng

GertjanBisschop · 2024-01-29T21:22:01Z

Maintainer

@KLohse could you provide some insight here?
