A method for allele specific analyses across cell states and conditions. stratAS
implements population-scale tests and predictive models for allele-specific activity (typically RNA or ATAC data).
- A
.vcf
file containing phased individual genotypes and an "AS" field listing the allelic counts. This gets turned into a flat file to be loaded in the--input
flag (see below). - A samples file with the columns
ID
andCONDITION
that is in the same order as the vcf input.CONDITION
is currently restricted to 0/1. This file goes into the--samples
flag. - A
.bed
file listing the physical positions of the features to be tested for allelic imbalance. This file goes into the--peaks
flag. - Global/local parameter files generated by
params.R
(see below). At minimum a global overdispersion parameter file with the columnsID
andPHI
with entries in any order goes into the--global_params
flag. - R, and the
VGAM
,optparse
libraries installed.
- Phase and impute your genotype data with EAGLE on the Haplotype Reference Consortium imputation server.
- Process all sequence data with the WASP mapping pipeline.
- Use the GATK ASEReadCounter to count reads then convert into vcf (see
pipeline/
scripts).
params.R
infers the read count distribution (beta binomial) parameters and needs to be run on a whole genome vcf one individual at a time. The input is an --inp_counts
file which must contain the headers CHR POS HAP REF.READS ALT.READS
where HAP
is the 0|1
or 1|0
vcf haplotype code.
A vcf file containing one individual is converted to counts as follows:
zcat $VCF \
| grep -v '#' \
| cut -f 1,2,10 \
| tr ':' '\t' \
| awk '$3 != "1|1" && $3 != "0|0"' \
| cut -f 1-3,7 \
| tr ',' '\t' \
| sed 's/chr//' \
| awk 'BEGIN { print "CHR POS HAP REF.READS ALT.READS" } $4 + $5 > 0' \
> $OUT.counts
For tumor data, local CNV-specific parameters are also estimated, and the --inp_cnv
file must be provided with headers CHR P0 P1
listing the boundaries of CNV regions. This file can additionally contain a CNV
column listing the CNV estimate (centered to zero as in TCGA calls) for inclusion as a covariate in the final analysis.
inference is then performed by running:
Rscript params.R --min_cov 5 --inp_counts $OUT.counts --inp_cnv $CNV --out $OUT
A $OUT.global.params
file is generated containing the parameters (with header PHI MU N
) and (optionally) an $OUT.local.params
file is generated containing the positions and parameters for each CNV (with header CHR P0 P1 PHI MU N
).
stratas.R
computes the AS statistics from a VCF of all individuals and the prior parameters estimated above.
A vcf file containing all individuals is converted to counts and split into batches as follows:
zcat $VCF\
| grep -v '#' \
| cut -f 1-5,9- \
| tr ':' '\t' \
| awk '{ for(i=1;i<=5;i++) { printf "%s ",$i; } for(i=6;i<=10;i++) { if($i == "GT") gtnum=i-6; if($i=="AS") asnum=i-6; } for(i=11;i<=NF;i++) { if( (i-11) % 5 == asnum || (i-11) % 5 == gtnum ) printf " %s",$i; } print ""; }' \
| sed 's/[|,]/ /g' | split -d -l 15000 - $OUT.MAT.split.
Each batch is then processed as follows:
Rscript stratas.R \
--input $OUT.MAT.split.00 \
--samples SAMPLES.ID \
--peaks gencode.protein_coding.transcripts.bed \
--global_param GLOBAL.params \
--local_param LOCAL.params \
stratas.R
can compute multi-variate predictive models for downstream integration with GWAS data, using both allelic and total activity.
Inputs are similar to the tests for allele specificity described above and additionally require a total activity matrix, covariates, and the predict flags:
Rscript stratas.R \
--input $OUT.MAT.split.00 \
--samples SAMPLES.ID \
--peaks gencode.protein_coding.transcripts.bed \
--global_param GLOBAL.params \
--local_param LOCAL.params \
--total_matrix KIRC.gexp \
--covar KIRC.gexp.cov \
--predict_snps HM3.extract \
--predict --predict_only > $OUT.profile
This generates predictive model files for every feature in the TWAS/FUSION format, which can then be integrated with GWAS data using the FUSION software. The generated model types inside each model file are "lasso", "lasso.as", "lasso.combined", "top1.as", "top1", "top1.combined"
; where lasso
versus top
corresponds to full locus penalized versus top SNP models; and *
, *.as
, *.combined
indicates models using total activity, allele-specific activity, or both jointly. An $OUT.profile
file is generated which lists model performance characteristics.
Important: This analysis differs from the individual allele-specific tests in two ways: (1) the CONDITION value is not used and one model is trained on all samples; (2) models are still trained even if no allele-specific informatino is available. To restrict to models with both sources of signal (which are likely to be more accurate as a group), run the allele-specific tests described above and retain only peaks that have non-NA values.
Example total activity input files (KIRC.gexp, KIRC.gexp.cov, and HM3.extract) based on TCGA data are provided in the example/
directory.
By default, the outputs are printed to screen with each line containing the following entries:
Column | Description |
---|---|
CHR | Chromosome |
POS | Position of test SNP |
RSID | ID of test SNP |
P0 | Start of gene/peak |
P1 | End of gene/peak |
NAME | Name of gene/peak |
CENTER | Center position of peak (or TSS for gene) |
N.HET | # of heterozygous individuals tested |
N.READS | # of reads tested in total |
ALL.AF | Allelic fraction estimate from beta binomial test across both conditions |
ALL.BBINOM.P | Beta-binomial test for imbalance across both conditions |
C0.AF | Allelic fraction estimate from condition 0 |
C0.BBINOM.P | Beta-binomial test for imbalance in condition 0 |
C1.AF | Allelic fraction estimate from condition 1 |
C1.BBINOM.P | Beta-binomial test for imbalance in condition 1 |
DIFF.BBINOM.P | Beta-binomial test for difference between conditions |
Enabling the --total_matrix
flag additionally runs a standard eQTL test (and interaction term), producing the following columns:
Column | Description |
---|---|
ALL.TOTAL.Z | Test statisic from linear model across all samples |
ALL.TOTAL.P | P-value from linear model across all samples |
C0.TOTAL.Z | Test statisic from linear model across condition 0 |
C0.TOTAL.P | P-value from linear model across condition 0 |
C1.TOTAL.Z | Test statisic from linear model across condition 1 |
C1.TOTAL.P | P-value from linear model across condition 1 |
DIFF.TOTAL.Z | Test statistic for difference/interaction between conditions |
DIFF.TOTAL.P | P-value for difference/interaction between conditions |
Enabling the --combine
flag additionally estimates the (inverse-variance weighted) combined statistics from the allele-specific and total models, producing the following columns:
Column | Description |
---|---|
ALL.COMBINE.Z | Test statisic from linear model across all samples |
ALL.COMBINE.P | P-value from linear model across all samples |
C0.COMBINE.Z | Test statisic from linear model across condition 0 |
C0.COMBINE.P | P-value from linear model across condition 0 |
C1.COMBINE.Z | Test statisic from linear model across condition 1 |
C1.COMBINE.P | P-value from linear model across condition 1 |
DIFF.COMBINE.Z | Test statistic for difference between conditions |
DIFF.COMBINE.P | P-value for difference between conditions |
Enabling the --binom
flag additionally runs a standard binomial test (and Fisher's test for interaction), producing the following columns:
Column | Description |
---|---|
ALL.BINOM.P | Binomial test for imbalance across both conditions |
ALL.C0.BINOM.P | Binomial test for imbalance in condition 0 |
ALL.C1.BINOM.P | Binomial test for imbalance in condition 1 |
FISHER.OR | Fisher's test odd's ratio for difference between conditions |
FISHER.DIFF.P | Fisher's test difference between conditions |
Enabling the --bbreg
flag additionally runs a beta binomial regression with CNV as covariate, and produces the following columns:
Column | Description |
---|---|
ALL.BBREG.P | Beta binomial regression (with covariates) for imbalance across both conditions |
DIFF.BBREG.P | Beta binomial regression (with covariates) for imbalance difference between conditions |
CNV.BBREG.P | Beta binomial regression (with covariates) for imbalance along CNV covariate |
Enabling the --indiv
flag additionally produces the following columns:
Column | Description |
---|---|
IND.C0 | Number of each condition 0 individual included in this test (comma separated) |
IND.C0.COUNT.REF | condition 0 REF allele counts of each individual included in this test (comma separated) |
IND.C0.COUNT.ALT | condition 0 ALT allele counts of each individual included in this test (comma separated) |
IND.C1 | Number of each condition 1 individual included in this test (comma separated) |
IND.C1.COUNT.REF | condition 1 REF allele counts of each individual included in this test (comma separated) |
IND.C1.COUNT.ALT | condition 1 ALT allele counts of each individual included in this test (comma separated) |
An example locus with significant AS associations can be run by calling:
Rscript stratas.R \
--input example/ENSG00000075240.12.mat \
--samples example/KIRC.ALL.AS.PHE \
--peaks example/ENSG00000075240.12.bed \
--global_param example/KIRC.ALL.AS.CNV \
--local_param=example/KIRC.ALL.AS.CNVLOCAL
All parameters can be displayed by running --help
.
Parameter | |
---|---|
--input |
Path to input file |
--samples |
Path to sample identifier file, must have ID and CONDITION columns |
--peaks |
Path to file containing peak/gene boundaries, must contain CHR P0 P1 NAME CENTER columns |
--global_param |
Path to global overdispersion parameter file |
--local_param |
Path to local overdispersion parameter file |
--out |
Path to output |
Parameter | |
---|---|
--exclude |
The mimium distance between SNPs allowed in the haplotype (to exclude variants in the same read) |
--keep |
Path to file listing samples to retain for analyses |
--min_cov |
Individuals must have at least this many reads (for both alleles) to be tested |
--min_het |
Minimum minor heterozygous frequency for test SNP |
--min_maf |
Minimum minor allele frequency for test SNP |
--max_rho |
Maximum local/global over-dispersion parameter for which to include individual in test |
--window |
Window (in bp) for SNPs to test around the peak boundary |
Parameter | |
---|---|
--covar |
Path to covariates for total activity |
--total_matrix |
Path to matrix of total activity, enables the linear model. Note: if --local_param is on, then CNV is included as covariate |
--total_rn |
Rank normalize the total expression phenotype |
Parameter | |
---|---|
--predict |
Build predictive models for each feature (for downstream TWAS/RWAS/CWAS analysis) |
--predict_only |
Do not perform individual allele-specific tests |
--predict_snps |
Path to file listing SNPs for which predictor weights should be fit (typically common HapMap3 SNPs) |
Parameter | |
---|---|
--bbreg |
Also perform a beta binomial regression with local CNV status as a covariate (must also provide --local_param file) |
--binom |
Also perform a standard binomial test |
--collapse_reads |
Merge all sites for each individual prior to analysis (not recommended) |
--combine |
Output combined BBINOM and TOTAL statistics by Stouffer's method, must have all relevant flags for basic and total input |
--mbased |
Also perform the MBASED test for differences (requires MBASED libraries) |
Parameter | |
---|---|
--cond_cnv_mean |
Use MU as the covariate for BBREG, otherwise uses CNV as the covariate (requires --bbreg) |
--fill_cnv |
Set individuals with missing CNV calls to diploid and rho=0.01 (assuming they are neutral) |
--mask_cnv_cutoff |
Mask out any sites that have an absolute CNV value above this cutoff (NA = no masking) |
--min_cnv_round |
Absolute CNV values below this cutoff get set to zero |
Parameter | |
---|---|
--sim |
Simulate imbalance based on the input data and test it. Allelic-fraction specified by --sim_af |
--sim_af1 |
Specify the allelic fraction for CONDITION==1 |
--sim_af0 |
Specify the allelic fraction for CONDITION==0 |
--sim_cnv |
Add local CNVs to simulations |
--sim_cnv_allelic |
CNVs always impact one allele |
Parameter | |
---|---|
--indiv |
Also report the per-individual allele fractions (Warning, this can produce large files, typically used for visualization of one feature) |
--perm |
# of permutations to shuffle the allele labels (0=off) |
--perm_cond |
# of permutations to shuffle the condition labels (0=off) |
--seed |
Random seed (for simulations/permtuations) |