Author: Vince Forgetta
Contact: [email protected]
To conduct rare variant analysis on a genome wide scale using programs such as VT, SKAT, and RR. The pipeline uses Grid Engine to parallelize computation.
- R libraries:
- VT and its dependencies: Rsge, getopt, doMC
- SKAT and its dependencies.
- Linux
- Grid Engine with a queue named all.q
- Python, Perl, gawk
- tabix
Update the PATH variable to the tabix executable in bin/tabix.bash
Once all requirements have been satisfied, run the test suite:
cd test
make test
Verify that results appear in */*_results.txt:
watch cat \*/\*_results.txt
Results may take a short while to appear. Check the status of the job using qstat.
The following files are required by the pipeline:
-
VCF file(s): One VCF file for each chromosome, named [0-9XY]+.*.vcf.gz, e.g., 1.beagle.impute2.anno.csq.20120427.vcf.gz. If available, PolyPhen scores are extracted and used as weights in the rare variant association tests. If not available, weight value is set to 0.5. This directory is specified to the pipeline via the -v option.
-
Region file: A single file specifying the regions from which to extract SNPs from the VCF files. Format of the file contents are:
10 100003847 100004653 C10orf28
10 100007442 100008748 LOXL4
10 100010821 100010933 LOXL4
10 100011322 100011459 LOXL4
10 100012109 100012225 LOXL4
10 100013309 100013553 LOXL4
10 100015333 100015496 LOXL4
10 100016536 100016704 LOXL4Each line consists of 4 fields: chrom, start, end, target_name. Chrom number should match the chromosome designation used for naming the VCF files. A region file may consist of exon coordinates for all human genes, where target_name is the gene name. This file is specified to the pipeline via the -r option.
-
Target file: A file containing target names, one per line. Target names are the values in column 4 of the region file that you want to process through the pipeline. Example:
C10orf28
LOXL4
This file is specified to the pipeline via the -t option.
-
Phenotype file: A file containing phenotype values per sample. Format of the file is:
sample_id_1 value1
sample_id_2 value2
sample_id_3 value3
sample_id_4 value4
sample_id_5 value5
sample_id_6 value6Each line consists of 2 fields: sample name, phenotype value. The file is specified to the pipeline via the -p option.
-
** Parameter file:** To avoid entering a myraid of parameters on the commandline, a file containing additional parameters is required by the pipeline. Format of the file is:
# SNP FILTERS
MAF=0.01
# WINDOW PARAMETERS
WIN_RUN=Y
WIN_SIZE=50
WIN_STEP=25
WIN_MIN=20
# VT PARAMETERS
VT_RUN=Y
VT_NPERM=1000
VT_USE_WEIGHTS=TRUE
# SKAT PARAMETERS
SKAT_RUN=Y
SKAT_USE_WEIGHTS=TRUE
SKAT_TRAIT_TYPE=C
SKAT_RESAMP_SIZE=1000
# RR PARAMETERS
RR_RUN=Y
RR_USE_WEIGHTS=TRUE
RR_NPERM=1000
RR_MAXPERM=1000000
RR_NONSIG=250
A template has been provided for you to easily modify these parameters. Copy paste the area indicated on the template to a text file e.g. params.txt. Specify this file as a parameter to the pipeline (-x option).
The following describes each of these parameters:
-
MAF: Maximum minor allele frequency to retain SNPs, i.e. drop SNPs above 0.01 MAF.
-
WIN_RUN: Whether to split targets into overlapping windows.
-
WIN_SIZE: Window size. This number of SNPs will be in each window for analysis.
-
WIN_STEP: Step size. The next window will step to the right of the previous window by this many SNPs.
-
WIN_MIN: Minimum window size. If the last window for a target contains fewer that this many SNPs it is merged with the previous window.
-
VT_RUN: Choose to run VT analysis.
-
VT_NPERM: Permutations for VT analysis
-
VT_USE_WEIGHTS: Apply Polyphen weights in VT analysis
-
SKAT_RUN: Choose to run SKAT analysis.
-
SKAT_USE_WEIGHTS: Apply Polyphen weights in SKAT analysis
-
SKAT_TRAIT: Phenotype is continuouis (C) or discrete (D).
-
SKAT_RESAMP_SIZE: Resampling size for SKAT.
-
RR_RUN: Choose to run RR analysis.
-
RR_USE_WEIGHTS: Apply Polyphen weights in RR analysis
-
RR_NPERM: Permutations for RR analysis.
-
RR_MAXPERM: Max allowable permutations for RR analysis.
-
RR_NONSIG: Non-significant test count during permutation at which to stop the permutation process.
-
Covariates file: A file containing covariate values per sample. This file is optional, and is not used for VT (this RV test doesnot support covariates). Format of the file is:
sample_id_1 value value
sample_id_2 value value
sample_id_3 value value
sample_id_4 value value
sample_id_5 value value
sample_id_6 value value
Each line consists of 2 or more columns. The first column is always the sample name. Subsequent columns are the covariate values. The file is specified to the pipeline via the -c option.
The pipeline consists of a series of steps.
A flow diagram illustrating the pipeline is depicted below:
Three input files are required (in black background): a directory with VCF files, a regions file, a target list file, and a phenotype file, and a parameter file (not shown). Intermediate results are boxed and programs are within pointed boxes.
Once these input file are ready, you can proceed to execute the pipeline (see "How to Run the Pipeline" below).
The followin described each major step in the pipeline:
The pipeline's smallest unit of work is a target (e.g. gene), which consists of one or more regions (e.g. exons) with a common target name (e.g. gene name). This step in the pipeline splits the regoin file (-r option) into a series of smaller files based on the target name in the region file.This step of the pipeline does not used Grid Engine. For ~20,000 human genes it completes in under 1 min.
This step fetches the SNPs from the VCF files (-v option) within the set of regions for each target specified in the target file (-t option). Upon completion of all RV tests, these files are deleted to save disk space.
If the WIN_RUN parameter be set to "Y", this step further splits each target VCF file into overlapping windows of SNPs. Results of each window are saved and new target names are created by appending the window number to the target name e.g. LOXL4.1, LOXL4.2, etc.
This step runs the selected RV tests on all targets specified in the target file (-t option) using the phenotype specified in the phenotype file (-p option).
Each rare variant association test produced one output directory i.e. <rv_test_name>.output (e.g. skat.output). For example, the SKAT results directory would contain:
- skat_results.txt: This is the PRIMARY results file. Depending on the association test, the columns will vary, but all should contains the target name being tested and one or more p-values from the association test.
- output/: Output file for the association tests on a per target basis. While highly verbose, may be useful in getting more details or inspecting why certain jobs failed.
- jobs/: A directory containing jobs submitted to Grid Engine job array.
- log/: A directory containing output from the pipeline. Might be useful when inspecting why certain jobs failed
- submit_jobs.skat5544.bash: The script used to submit the Grid Engine job array.
To run the pipeline execute the following command:
run\_pipeline.bash
-r regions.txt \
-t targets.txt \
-v <path_to_VCF_files> \
-p pheno.txt \
-x params.txt \
A description of all command line flags:
-h Show help.
-r [filename] Regions file (required).
-t [filename] Target file (required).
-v [directory] Source directory with genotype files in VCF format (required).
-p [file] Phenotype file (required).
-x [file] Parameter file (required).
-c [file] Covariates file (optional).
-k Keep temporary VCF files (optional).
Example running pipeline for all exons from from the human genome:
Generate a list of target names from the region file:
cut -f 4 -d ' ' regions.txt | sort | uniq > targets.txt
Alternatively, to try 50 randomly selected genes:
cut -f 4 -d ' ' regions.txt | sort | uniq | sort -R | head -n 50 > targets.txt
Generate phenotype file (this one as multiple phenotypes per sample, the 5th phenotype is chosen, header is also removed):
cut -f 1,6 pheno.txt | tail -n +2 > pheno.txt
Copy default parameters from template to param.txt.
Run pipeline for the 50 random targets:
run_pipeline.bash \
-r regions.txt \
-t targets.rand50.txt \
-v <path_to_chromosome_VCF_files> \
-p pheno.txt \
-x params.txt