SimPrily

Created by Ariella Gladstein, based on code from Consuelo Quinto Cortes and Krishna Veeramah. Also worked on by David Christy, Logan Gantner, and Mack Skodiak.

About

SimPrily runs genome simulations with user-defined parameters or with parameters drawn at random from prior distributions, and computes genomic statistics on the simulation output.
Version 1

  1. Run a genome simulation with a model defined by prior distributions of parameters and a demographic model structure.
  2. Account for SNP array ascertainment bias by creating a pseudo array based on priors for the number of samples from the discovery populations and the allele frequency cut-off.
  3. Calculate genomic summary statistics on the simulated genomes and pseudo arrays.

This is ideal for use with Approximate Bayesian Computation on whole genome or SNP array data.

Uses the C++ programs macs and GERMLINE. For more information on these programs, see:
https://github.com/gchen98/macs
https://github.com/sgusev/GERMLINE

Install

cd to the directory you want to work in, then clone the repository:

git clone https://github.com/agladstein/SimPrily.git

Environment Set up

If using Vagrant (recommended when running on a non-Linux OS):

Start Vagrant, ssh into Vagrant, and cd to the SimPrily directory:

vagrant up
vagrant ssh
cd /vagrant

Install the virtual environment and install the requirements.

./setup/setup_env_vbox_2.7.sh

If not using Vagrant, just install the virtual environment and install the requirements:

./setup/setup_env_2.7.sh

Usage

For example, to run one test simulation:

python simprily.py -p examples/eg1/param_file_eg1_asc.txt -m examples/eg1/model_file_eg1_asc.csv -g genetic_map_b37/genetic_map_GRCh37_chr1.txt.macshs -a array_template/ill_650_test.bed -i 1 -o output_dir -v
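To run several simulations, each needs its own job id. The sketch below builds the command line shown above for a given id, using only the flags documented in this README. The helper name build_command and its keyword arguments are illustrative and are not part of SimPrily.

```python
def build_command(job_id, out_dir, param, model,
                  genetic_map=None, array=None, verbose=0):
    """Return the argument list for one simprily.py run (hypothetical helper)."""
    cmd = ["python", "simprily.py",
           "-p", param, "-m", model,
           "-i", str(job_id), "-o", out_dir]
    if genetic_map:
        cmd += ["-g", genetic_map]      # optional genetic map file
    if array:
        cmd += ["-a", array]            # optional array template (bed)
    if verbose:
        cmd.append("-" + "v" * verbose)  # -v, -vv, or -vvv
    return cmd

# Possible usage, run from the SimPrily directory:
#   import subprocess
#   for job_id in range(1, 4):
#       subprocess.check_call(build_command(
#           job_id, "output_dir",
#           "examples/eg1/param_file_eg1_asc.txt",
#           "examples/eg1/model_file_eg1_asc.csv",
#           genetic_map="genetic_map_b37/genetic_map_GRCh37_chr1.txt.macshs",
#           array="array_template/ill_650_test.bed", verbose=1))
```

Passing the command as a list avoids shell quoting problems with file paths.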

For quick help:

python simprily.py --help

Input

simprily.py takes 4 required arguments and 2 optional arguments, plus help, verbose, and profile options.

Run as

python simprily.py [-h] -p PARAM -m MODEL -i ID -o OUT [-g MAP] [-a ARRAY] [-v] [--profile]
Required

-p PARAM or --param PARAM = The location of the parameter file
-m MODEL or --model MODEL = The location of the model file
-i ID or --id ID = The unique identifier of the job
-o OUT or --out OUT = The location of the output directory

Optional

-h or --help = shows a help message and exits
-v = increase output verbosity. This includes 3 levels, -v, -vv, and -vvv
--profile = Print a log file containing the time in seconds and memory use in Mb for main functions
-g MAP or --map MAP = The location of the genetic map file
-a ARRAY or --array ARRAY = The location of the array template file, in bed form
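As an illustration of the interface listed above, here is a minimal argparse sketch that accepts the same flags. It is not SimPrily's actual parser; the function name build_parser, the help strings, and the verbose destination name are assumptions made for this example.

```python
import argparse

def build_parser():
    """Build a parser mirroring the documented simprily.py options (sketch only)."""
    parser = argparse.ArgumentParser(prog="simprily.py")
    # Required arguments
    parser.add_argument("-p", "--param", required=True, help="parameter file")
    parser.add_argument("-m", "--model", required=True, help="model file")
    parser.add_argument("-i", "--id", required=True, help="unique job identifier")
    parser.add_argument("-o", "--out", required=True, help="output directory")
    # Optional arguments
    parser.add_argument("-g", "--map", help="genetic map file")
    parser.add_argument("-a", "--array", help="array template file (bed)")
    # Each repeated -v raises the verbosity level: -v, -vv, -vvv
    parser.add_argument("-v", action="count", default=0, dest="verbose")
    parser.add_argument("--profile", action="store_true",
                        help="log time and memory use of main functions")
    return parser
```

With this sketch, parsing `-p a -m b -i 1 -o out -vv` gives a verbosity count of 2 and leaves the map and array options unset.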

Output

Three subdirectories are created in the output directory given with -o (output_dir in the example above).

output_dir/results
output_dir/sim_data
output_dir/germline_out
Intermediate files

Intermediate files go to output_dir/sim_data and output_dir/germline_out.
output_dir/sim_data contains PLINK formatted .ped and .map files created from the pseudo array, which are necessary to run GERMLINE.
output_dir/germline_out contains the GERMLINE .match output and .log. The .match file contains all of the identified IBD segments.
These files are NOT automatically removed by the Python script, but they are unnecessary once the job is complete.
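Since the intermediate files are left in place, they can be cleaned up by hand once a job has finished. A minimal sketch, assuming the directory layout described above; the helper name remove_intermediate is hypothetical. Note that it deletes the intermediate files of every job that wrote to the same output directory, so only run it when no job is still in progress there.

```python
import os
import shutil

# Subdirectories that hold intermediate files, as described above.
INTERMEDIATE_DIRS = ("sim_data", "germline_out")

def remove_intermediate(out_dir):
    """Delete the intermediate subdirectories of out_dir; return the paths removed."""
    removed = []
    for name in INTERMEDIATE_DIRS:
        path = os.path.join(out_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(path)
    return removed
```

The results subdirectory is not touched.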

Results files

Output files go to output_dir/results.
output_dir/results contains the parameter values used in the simulation and the summary statistics calculated from the simulation.
The first line is a header with the parameter names and summary statistics names. The second line is the parameter values and summary statistics values.
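The two-line layout can be read into a dictionary keyed by the header names. The sketch below assumes the fields are separated by whitespace (tabs or spaces), which this README does not specify; the function names are illustrative.

```python
def _to_number(text):
    """Convert a field to float when possible, otherwise keep the string."""
    try:
        return float(text)
    except ValueError:
        return text

def read_results(path):
    """Read a results file: line 1 is the header, line 2 holds the values."""
    with open(path) as handle:
        header = handle.readline().split()   # assumption: whitespace-delimited
        values = handle.readline().split()
    if len(header) != len(values):
        raise ValueError("header has %d fields but values line has %d"
                         % (len(header), len(values)))
    return {name: _to_number(value) for name, value in zip(header, values)}
```

Collecting the dictionaries from many runs gives one table row per simulation, which is the usual input for Approximate Bayesian Computation.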


Open Science Grid

You must have an Open Science Grid Connect account.
Create an account at https://osgconnect.net/signup

Log on to Open Science Grid Connect.

The working directory must be pegasus_workflow.

Submit a Pegasus workflow:

./submit -p PARAM -m MODEL -j NUM [-g MAP] [-a ARRAY]

e.g.

./submit -p ../examples/eg2/param_file_eg2_asc.txt -m ../examples/eg2/model_file_eg2_asc.csv -j 10 -a ../array_template/ill_650_test.bed -g ../genetic_map_b37/genetic_map_GRCh37_chr1.txt.macshs

The results will appear in

/local-scratch/user-name/workflows/simprily_id

where user-name is specific to the user, and id is the workflow id.


Known Issues

  • If exponential growth is large, the macs simulation will not finish. (This is a macs bug.)
  • If the same id is reused with the same output directory as a previous run, the existing .map file is appended to rather than overwritten.
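One way to avoid the second issue is to give every run its own output directory and to refuse to reuse an existing one. This is only a sketch of a workaround and not a fix inside SimPrily; the helper name fresh_output_dir and the run_ID directory naming are assumptions.

```python
import os

def fresh_output_dir(base_dir, job_id):
    """Create and return base_dir/run_<job_id>, failing if it already exists."""
    out_dir = os.path.join(base_dir, "run_%s" % job_id)
    if os.path.exists(out_dir):
        # Reusing the directory with the same id would append to the old .map file.
        raise OSError("output directory already exists: %s" % out_dir)
    os.makedirs(out_dir)
    return out_dir
```

The returned path can then be passed to simprily.py with -o.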