Skip to content

Latest commit

 

History

History
175 lines (120 loc) · 10.7 KB

README.md

File metadata and controls

175 lines (120 loc) · 10.7 KB

Principal Interaction Component Analysis through Likelihood Optimization (PICALO)

Expression quantitative trait loci (eQTL) offer insights into the regulatory mechanisms of trait associated variants, but their effects often rely on contexts that are either unknown or unmeasured. We introduce PICALO, a method for hidden variable inference of eQTL contexts. PICALO identifies and disentangles technical from biological context in heterogeneous eQTL datasets.

How it works

PICALO is an expectation maximization (EM) based algorithm. Using an initial guess to start the optimization, PICALO identifies eQTLs that interact with this context (interaction eQTLs; ieQTLs) and subsequently optimizes this interaction.

PICALO applied to a single ieQTL

Per sample the context value is adjusted to maximize the log-likelihood of the whole model (i.e. minimize the residuals) while maintaining the model parameters. The underlying assumption is that while the per sample context value might be noisy, the whole model captures the true interaction. Considering the case in which only one ieQTL is optimized, this results in all samples moving towards their genotype's regression line:

GraphicalAbstractPart1.mp4

A collection of ieQTLs are optimized simultaneously

In practice, however, PICALO applies this optimization over all ieQTLs simultaneously. The optimal value is identified by modelling the log-likelihood function for different context values (x-axis) as a second-degree polynomial. By summing the polynomials, a new context value is identified per sample that maximizes the log-likelihood (i.e. minimized the residuals) over all ieQTLs. Repeat this for all samples and this will yield a new vector that maximally interacts with the ieQTLs.

GraphicalAbstractPart2.mp4

Process is repeated to identify Principal Interaction Components (PICs)

This process is then repeated using the optimized vector as input and reoptimized over an updated set of ieQTLs. Generally, with every iteration the number of ieQTLs increases. Shown below is that after a limited number of iterations an optimum can be reached. Once converged, the resulting vector is referred to as Principal Interaction Components (PICs).

GraphicalAbstractPart3.mp4

Finally, multiple PICs are identified by removing the previous PICs and repeating the process.

Prerequisites

This program is written in Python v3.7.4. The program requires the following packages to be installed:

  • numpy (v1.19.5)
  • pandas (v1.2.1)
  • scipy (v1.6.0)
  • statsmodels (v0.12.2)
  • matplotlib (v3.3.4)
  • seaborn (v0.11.1)

See 'Installing' on how to install these packages. Note that the performance / reliability of this program on other versions is not guaranteed.

Installing

Install the required packages. Consider using a virtual environment to ensure the right version of packages is used.

python3 -m venv <name of virtual environment>
source <name of virtual environment>/bin/activate

The python packager manager (pip) and the requirements file can be used to install all the necessary packages. Note that the requirements.txt file includes dependencies with their correct versions. Therefore, include the flag --no-dependencies when installing the packages to prevent unnecessary upgrading.

pip install -r requirements.txt --no-dependencies

You can exit the virtual environment by typing:

deactivate

Usage

./picalo.py -h
Information arguments
  • -h, --help: show this help message and exit
  • -v, --version: show program's version number and exit
Required data arguments
  • -eq, --eqtl: The path to the eqtl matrix. Expects to contain the columns 'SNPName' (the eQTL SNP) and 'ProbeName' (the eQTL gene).

  • -ge, --genotype: The path to the genotype matrix. Expects to contain genotype dosages (between 0 en 2). Missing genotypes are by default -1 (can be changed used -na / --genotype_na). The rows should contain the SNPs on the same order as the -eq / --eqtl files. The columns should contain the samples on the same order as the -ex / --expression and -co / --covariate files.

  • -ex, --expression: The path to the expression matrix. The rows should contain the gene expression on the same order as the -eq / --eqtl files. The columns should contain the samples on the same order as the -ge / --genotype and -co / --covariate files.

  • -co, --covariate: The path to the covariate matrix (i.e. the matrix used as initial guess for the optimization). The rows should contain the different covariates on the same order as the -eq / --eqtl files. The columns should contain the samples on the same order as the -ge / --genotype and -ex / --expression files.

Optional data arguments
  • -na, --genotype_na: The genotype value that equals a missing value. Default: -1.

  • -tc, --tech_covariate: The path to the technical covariate matrix (excluding an interaction with genotype). Default: None. The rows should contain the technical covariates to correct for. The columns should contain the samples on the same order as the -ge / --genotype, -ex / --expression file, and -co / --covariate files.

  • -tci, --tech_covariate_with_inter: The path to the technical covariate matrix(including an interaction with genotype). Default: None. The rows should contain the technical covariates to correct for including an interaction term with genotype. The columns should contain the samples on the same order as the -ge / --genotype, -ex / --expression file, and -co / --covariate files.

  • -std, --sample_to_dataset: The path to the sample-dataset link matrix. Default: None. Note that his argument is required if the input data conists of multiple datasets. The rows should contain sample - dataset links on the same order as the -ge / --genotype, -ex / --expression file, and -co / --covariate files.

eQTL inclusion arguments
  • -mds, --min_dataset_size: The minimal number of samples per dataset. Default: >=30.
  • -ea, --eqtl_alpha: The eQTL significance cut-off. Default: <=0.05.
  • -cr, --call_rate: The minimal call rate of a SNP (per dataset).Equals to (1 - missingness). Default: >= 0.95.
  • -hw, --hardy_weinberg_pvalue: The Hardy-Weinberg p-value threshold.Default: >= 1e-4.
  • -maf, --minor_allele_frequency: The MAF threshold. Default: >0.01.
  • -mgs, --min_group_size: The minimal number of samples per genotype group. Default: >= 2.
  • -iea, --ieqtl_alpha: The interaction eQTL significance cut-off. Default: <=0.05.
PICALO optimization arguments
  • -n_components: The number of components to extract. Default: 10.
  • -min_iter: The minimum number of optimization iterations to perform per component. Default: 5.
  • -max_iter: The maximum number of optimization iterations to perform per component. Default: 100.
  • -tol: The convergence threshold. The optimization will stop when the 1 - Pearson correlation coefficient is below this threshold. Default: 1e-3.
  • -force_continue: Force to identify more components even if the previous one did not converge. Default: False.
Output arguments
  • -o, --outdir: The name of the output folder.
  • -verbose: Enable verbose output. Default: False.

Output

PICALO generates a set of general and PIC specific output files. Click on the headers for more information.

General output files
  • call_rate.txt.gz: containing the per dataset SNP call rates.
  • components.txt.gz: containing the identified components (also including non-converged components in contrast to 'PICs.txt.gz').
  • genotype_stats.txt.gz: containing the genotype summary statistics like sample size, Hardy-Weinberg equilibrium p-value, minor allele frequency (MAF), etc.
  • log.log: log file containing all terminal output.
  • PICs.txt.gz: containing the identified PICs.
  • SummaryStats.txt.gz: containing the number of interacting eQTLs per PIC.
PIC specific output files
  • PICN/component.npy: numpy binary file containing the converged PIC for easy loading in case of a restart.
  • PICN/covariate_selection.txt.gz: containing the number of interacting eQTLs per initial guess (referred to as covariate).
  • PICN/info.txt.gz: containing optimization statistics per iteration like the number of interacting eQTLs, the correlation with the previous iteration etc.
  • PICN/iteration.txt.gz: containing the component loadings per iteration.
  • PICN/n_hits_per_sample.txt.gz: containing the number of eQTLs used for optimization per sample per iteration.
  • PICN/results_iterationN.txt.gz: containing the interaction eQTL summary statistics per iteration.

Finally, the number of interacting eQTLs is determined without removing PICs. The summary statistics of which are stored as: PIC_interactions/PICN.txt.gz.

Github Content

  • src: the source code for the PICALO program
  • dev: the source code used for development
  • dev/*/plot_scripts: scripts used to visualise the output of PICALO
  • dev/*/postprocess_scripts: script used for post-processing the PICALO output (i.e. downstream analyses)
  • dev/*/preprocess_scripts: script used for pre-processing the input data
  • dev/*/presentation_scripts: script used for creating plots for presentations
  • dev/*/test_scripts: scripts used to test things / visualise stuff / etc.

The files are split between general and dataset specific scripts (BIOS / MetaBrain).

How to cite

Please cite: Vochteloo, M., Deelen, P., Vink, B. et al. PICALO: principal interaction component analysis for the identification of discrete technical, cell-type, and environmental factors that mediate eQTLs. Genome Biol 25, 29 (2024). https://doi.org/10.1186/s13059-023-03151-0

Author

Martijn Vochteloo ([email protected]) (1)

  1. Department of Genetics, University Medical Center Groningen, University of Groningen, Hanzeplein 1, Groningen, The Netherlands

License

This project is licensed under the BSD 3-Clause "New" or "Revised" License - see the LICENSE file for details