Skip to content

Analyze genomics regions using multiple rare variant association tests

Notifications You must be signed in to change notification settings

vforget/rare-variant-pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 

Repository files navigation

Rare Variant Analysis Pipeline

Author: Vince Forgetta

Contact: [email protected]

Synopsis

To conduct rare variant analysis on a genome wide scale using programs such as VT, SKAT, and RR. The pipeline uses Grid Engine to parallelize computation.

Requirements

  • R libraries:
  • VT and its dependencies: Rsge, getopt, doMC
  • SKAT and its dependencies.
  • Linux
  • Grid Engine with a queue named all.q
  • Python, Perl, gawk
  • tabix

Install

Update the PATH variable to the tabix executable in bin/tabix.bash

Once all requirements have been satisfied, run the test suite:

cd test
make test

Verify that results appear in */*_results.txt:

watch cat \*/\*_results.txt

Results may take a short while to appear. Check the status of the job using qstat.

Input Files

The following files are required by the pipeline:

  • VCF file(s): One VCF file for each chromosome, named [0-9XY]+.*.vcf.gz, e.g., 1.beagle.impute2.anno.csq.20120427.vcf.gz. If available, PolyPhen scores are extracted and used as weights in the rare variant association tests. If not available, weight value is set to 0.5. This directory is specified to the pipeline via the -v option.

  • Region file: A single file specifying the regions from which to extract SNPs from the VCF files. Format of the file contents are:

    10 100003847 100004653 C10orf28
    10 100007442 100008748 LOXL4
    10 100010821 100010933 LOXL4
    10 100011322 100011459 LOXL4
    10 100012109 100012225 LOXL4
    10 100013309 100013553 LOXL4
    10 100015333 100015496 LOXL4
    10 100016536 100016704 LOXL4

    Each line consists of 4 fields: chrom, start, end, target_name. Chrom number should match the chromosome designation used for naming the VCF files. A region file may consist of exon coordinates for all human genes, where target_name is the gene name. This file is specified to the pipeline via the -r option.

  • Target file: A file containing target names, one per line. Target names are the values in column 4 of the region file that you want to process through the pipeline. Example:

    C10orf28
    LOXL4

This file is specified to the pipeline via the -t option.

  • Phenotype file: A file containing phenotype values per sample. Format of the file is:

    sample_id_1 value1
    sample_id_2 value2
    sample_id_3 value3
    sample_id_4 value4
    sample_id_5 value5
    sample_id_6 value6

    Each line consists of 2 fields: sample name, phenotype value. The file is specified to the pipeline via the -p option.

  • ** Parameter file:** To avoid entering a myraid of parameters on the commandline, a file containing additional parameters is required by the pipeline. Format of the file is:

    # SNP FILTERS
    MAF=0.01
    # WINDOW PARAMETERS
    WIN_RUN=Y
    WIN_SIZE=50
    WIN_STEP=25
    WIN_MIN=20
    # VT PARAMETERS
    VT_RUN=Y
    VT_NPERM=1000
    VT_USE_WEIGHTS=TRUE
    # SKAT PARAMETERS
    SKAT_RUN=Y
    SKAT_USE_WEIGHTS=TRUE
    SKAT_TRAIT_TYPE=C
    SKAT_RESAMP_SIZE=1000
    # RR PARAMETERS
    RR_RUN=Y
    RR_USE_WEIGHTS=TRUE
    RR_NPERM=1000
    RR_MAXPERM=1000000
    RR_NONSIG=250

A template has been provided for you to easily modify these parameters. Copy paste the area indicated on the template to a text file e.g. params.txt. Specify this file as a parameter to the pipeline (-x option).

The following describes each of these parameters:

  • MAF: Maximum minor allele frequency to retain SNPs, i.e. drop SNPs above 0.01 MAF.

  • WIN_RUN: Whether to split targets into overlapping windows.

  • WIN_SIZE: Window size. This number of SNPs will be in each window for analysis.

  • WIN_STEP: Step size. The next window will step to the right of the previous window by this many SNPs.

  • WIN_MIN: Minimum window size. If the last window for a target contains fewer that this many SNPs it is merged with the previous window.

  • VT_RUN: Choose to run VT analysis.

  • VT_NPERM: Permutations for VT analysis

  • VT_USE_WEIGHTS: Apply Polyphen weights in VT analysis

  • SKAT_RUN: Choose to run SKAT analysis.

  • SKAT_USE_WEIGHTS: Apply Polyphen weights in SKAT analysis

  • SKAT_TRAIT: Phenotype is continuouis (C) or discrete (D).

  • SKAT_RESAMP_SIZE: Resampling size for SKAT.

  • RR_RUN: Choose to run RR analysis.

  • RR_USE_WEIGHTS: Apply Polyphen weights in RR analysis

  • RR_NPERM: Permutations for RR analysis.

  • RR_MAXPERM: Max allowable permutations for RR analysis.

  • RR_NONSIG: Non-significant test count during permutation at which to stop the permutation process.

  • Covariates file: A file containing covariate values per sample. This file is optional, and is not used for VT (this RV test doesnot support covariates). Format of the file is:

    sample_id_1 value value
    sample_id_2 value value
    sample_id_3 value value
    sample_id_4 value value
    sample_id_5 value value
    sample_id_6 value value

Each line consists of 2 or more columns. The first column is always the sample name. Subsequent columns are the covariate values. The file is specified to the pipeline via the -c option.

Pipeline Overview

The pipeline consists of a series of steps.

A flow diagram illustrating the pipeline is depicted below:

Three input files are required (in black background): a directory with VCF files, a regions file, a target list file, and a phenotype file, and a parameter file (not shown). Intermediate results are boxed and programs are within pointed boxes.

Once these input file are ready, you can proceed to execute the pipeline (see "How to Run the Pipeline" below).

The followin described each major step in the pipeline:

STEP 1: Split Region File by Target

The pipeline's smallest unit of work is a target (e.g. gene), which consists of one or more regions (e.g. exons) with a common target name (e.g. gene name). This step in the pipeline splits the regoin file (-r option) into a series of smaller files based on the target name in the region file.This step of the pipeline does not used Grid Engine. For ~20,000 human genes it completes in under 1 min.

STEP 2: Fetch SNPs By Target

This step fetches the SNPs from the VCF files (-v option) within the set of regions for each target specified in the target file (-t option). Upon completion of all RV tests, these files are deleted to save disk space.

STEP3: Split Target VCF into Overlapping Windows

If the WIN_RUN parameter be set to "Y", this step further splits each target VCF file into overlapping windows of SNPs. Results of each window are saved and new target names are created by appending the window number to the target name e.g. LOXL4.1, LOXL4.2, etc.

STEP 4: Run Rare Variant Association Tests

This step runs the selected RV tests on all targets specified in the target file (-t option) using the phenotype specified in the phenotype file (-p option).

Each rare variant association test produced one output directory i.e. <rv_test_name>.output (e.g. skat.output). For example, the SKAT results directory would contain:

  • skat_results.txt: This is the PRIMARY results file. Depending on the association test, the columns will vary, but all should contains the target name being tested and one or more p-values from the association test.
  • output/: Output file for the association tests on a per target basis. While highly verbose, may be useful in getting more details or inspecting why certain jobs failed.
  • jobs/: A directory containing jobs submitted to Grid Engine job array.
  • log/: A directory containing output from the pipeline. Might be useful when inspecting why certain jobs failed
  • submit_jobs.skat5544.bash: The script used to submit the Grid Engine job array.

How to Run the Pipeline

To run the pipeline execute the following command:

run\_pipeline.bash  
    -r regions.txt \
	-t targets.txt \
	-v <path_to_VCF_files> \
	-p pheno.txt \
	-x params.txt \

A description of all command line flags:

-h                  Show help.
-r    [filename]    Regions file (required).
-t    [filename]    Target file (required).
-v    [directory]   Source directory with genotype files in VCF format (required).
-p    [file]        Phenotype file (required).
-x    [file]        Parameter file (required).
-c    [file]        Covariates file (optional).
-k                  Keep temporary VCF files (optional).

Example running pipeline for all exons from from the human genome:

Generate a list of target names from the region file:

cut -f 4 -d ' ' regions.txt | sort | uniq > targets.txt

Alternatively, to try 50 randomly selected genes:

cut -f 4 -d ' ' regions.txt | sort | uniq | sort -R | head -n 50 > targets.txt

Generate phenotype file (this one as multiple phenotypes per sample, the 5th phenotype is chosen, header is also removed):

cut -f 1,6 pheno.txt | tail -n +2 > pheno.txt

Copy default parameters from template to param.txt.

Run pipeline for the 50 random targets:

run_pipeline.bash \  
    -r regions.txt \  
	-t targets.rand50.txt \  
	-v <path_to_chromosome_VCF_files> \  
	-p pheno.txt \  
	-x params.txt

About

Analyze genomics regions using multiple rare variant association tests

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published