PEPS: Polygenic Epistatic Phenotype Simulation

PEPS simulates phenotypes on top of real genotype data. Polygenic Risk Score (PRS) models have been shown to produce better risk indicators for polygenic diseases. However, these models are limited to the additive effects of individual SNPs. What if a polygenic trait is formed in a more complex way, where a set of individual SNPs together with sets of interacting SNPs (higher-order complex epistatic interactions) contribute to the phenotype? PEPS was developed to generate such complex phenotypes. The phenotype is simulated for real genotype data (e.g. the 1000 Genomes Project), so all genomic patterns already exist in the data and only the phenotype is simulated.

Why Simulation?

Whether such complex phenotypes exist in the real world is an open research question. Before it can be answered, we need to find software that can use individual and interacting SNPs together. Simulated phenotypes, where the contributing variables are known, make it possible to test candidate software. Once such software is identified, it can be applied to real-world datasets to form more accurate risk indicators for complex diseases.

What are the Challenges?

There are two mathematical challenges in forming such complex phenotypes:

  • Producing a phenotype that depends on the additive effect of many variables. In PEPS, a variable is either a single SNP or a set of SNPs that interact in a complex way (i.e. a 2-way, 3-way ... n-way epistatic interaction).
  • Producing each of the n-way epistatic interactions, which is itself a complex problem.

Existing epistasis simulation software (EpiSim and GAMETES) simulates both phenotype and genotype. Even so, GAMETES cannot guarantee that the phenotype is formed.

How Does PEPS Work?

For more details see "PEPS Workflow" below. PEPS bypasses the above challenges by using a feedback loop simulation. In the first step, PEPS forms the variables the user asked for (e.g. 20 individual SNPs, 15 2-way interactions and 35 3-way interactions, that is 20+15+35 = 70 variables in total). Then a randomly generated phenotype is assigned to the samples such that half of the samples are cases and the other half are controls.

Next, PEPS estimates the probability of being a case for each genotype of each variable. A SNP variable has 3 genotypes: 0/0 (R), 0/1 (H) and 1/1 (A). For epistasis variables, the genotype is the concatenation of the genotypes of the individual SNPs in the interaction. HRAH, HHAR and RRAR are examples of genotypes for a 4-way interaction variable. Thus an n-way interaction can have 3^n different genotypes.
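The genotype labels of an interaction variable can be enumerated directly. The following is a minimal sketch, not PEPS code; the function name `interaction_genotypes` is hypothetical and only the R/H/A encoding comes from the description above.

```python
from itertools import product

def interaction_genotypes(n):
    """Return every genotype label of an n-way interaction variable."""
    # Each SNP contributes one of R (0/0), H (0/1) or A (1/1),
    # and the labels of the SNPs are concatenated.
    return ["".join(combo) for combo in product("RHA", repeat=n)]

labels = interaction_genotypes(4)
print(len(labels))       # number of 4-way genotypes, 3**4
print("HRAH" in labels)  # the example label from the text is among them
```

Not every one of the 3^n labels has to occur in a given dataset; rare combinations may be absent from the samples.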

Finally, PEPS computes the probability of being a case for each sample by summing, over all variables, the probability of being a case for that sample's genotype. Samples whose probability of being a case is higher than average are assigned to the case group. All other samples are assigned to the control group.
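The three steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the PEPS implementation: the function names, the list-of-lists data layout and the `seed` argument are hypothetical, and the loop count plays the role of the `numLoop` config parameter.

```python
import random
from collections import defaultdict

def case_probabilities(genotypes, phenotype):
    """Estimate P(case | genotype) for one variable from the current labels."""
    cases, totals = defaultdict(int), defaultdict(int)
    for g, y in zip(genotypes, phenotype):
        totals[g] += 1
        cases[g] += y
    return {g: cases[g] / totals[g] for g in totals}

def feedback_loop(variables, num_loop, seed=0):
    """variables: one genotype list per variable, each with one entry per sample."""
    rng = random.Random(seed)
    n = len(variables[0])
    # Start from a random phenotype: half cases (1), half controls (0).
    phenotype = [1] * (n // 2) + [0] * (n - n // 2)
    rng.shuffle(phenotype)
    for _ in range(num_loop):
        scores = [0.0] * n
        for genotypes in variables:
            prob = case_probabilities(genotypes, phenotype)
            for s, g in enumerate(genotypes):
                scores[s] += prob[g]  # add this variable's contribution
        mean = sum(scores) / n
        # Samples scoring above average become cases, the rest controls.
        phenotype = [1 if score > mean else 0 for score in scores]
    return phenotype

# Toy data: 3 SNP variables, 20 samples, random genotypes.
toy_rng = random.Random(1)
toy = [[toy_rng.choice("RHA") for _ in range(20)] for _ in range(3)]
phenotype = feedback_loop(toy, num_loop=5)
print(phenotype)
```

Because each round re-estimates the genotype probabilities from the labels produced by the previous round, the phenotype drifts towards whatever the variables can jointly explain.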

What are the Drawbacks and How to Address them?

The feedback loop PEPS uses for simulation does not guarantee that every variable ends up associated with the phenotype. The resulting phenotype is therefore a function of only a subset of the variables PEPS starts with. To address this, PEPS computes the chi2 p-value of each variable against the resulting phenotype. Only variables that pass the p-value threshold (given by the user) are considered truth variables. Subsequently, the SNPs that form truth variables are called truth SNPs.
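To make the association test concrete, here is a minimal sketch of a Pearson chi-square test between one variable and a binary phenotype, using only the standard library. It is not the PEPS code. The function names are hypothetical, and the closed-form p-value only holds for exactly 2 degrees of freedom (a SNP variable with all three genotypes present); other cases would need a general chi-square distribution routine.

```python
import math
from collections import Counter

def chi2_statistic(genotypes, phenotype):
    """Pearson chi-square statistic and degrees of freedom, genotype vs phenotype."""
    observed = Counter(zip(genotypes, phenotype))
    row = Counter(genotypes)   # counts per genotype
    col = Counter(phenotype)   # counts per phenotype class
    n = len(genotypes)
    stat = 0.0
    for g in row:
        for y in col:
            expected = row[g] * col[y] / n
            stat += (observed[(g, y)] - expected) ** 2 / expected
    dof = (len(row) - 1) * (len(col) - 1)
    return stat, dof

def chi2_pvalue_dof2(stat):
    """Chi-square survival function, valid only for 2 degrees of freedom."""
    return math.exp(-stat / 2)

genotypes = list("RRRRHHHHAAAA")
phenotype = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
stat, dof = chi2_statistic(genotypes, phenotype)
p_value = chi2_pvalue_dof2(stat)
print(stat, dof, p_value)
```

In this toy example every expected count is 2, so the statistic is 8 with 2 degrees of freedom. How the p-value is compared with `pvalueThr` is up to PEPS; a common convention, assumed here and not confirmed by the source, is to keep variables whose p-value is below the threshold.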

When using PEPS, the input genotype should not include SNPs that are associated with any population structure. For example, with the 1000 Genomes Project data we observe that if we include SNPs that can predict ethnicity, the resulting synthetic phenotype mimics the ethnicity of the population. To avoid this problem, exclude all SNPs associated with any population structure from the genotype file given to PEPS. Such SNPs should not be truth SNPs for a synthetic phenotype.

Config file

This file is in JSON format and includes all the parameters for the simulation:

  • inputType str: "csv" or "vcf"
  • dumpCSV boolean: If the input is vcf, also save it in csv format (for faster access in later simulations)
  • shuffleSnps boolean: Shuffle SNPs before assigning them to variables.
  • outputPrefix str: Output file prefix
  • inputPrefix str: Input file prefix. Depending on inputType ".csv" or ".vcf" is added to the prefix.
  • pvalueThr float: Threshold for chi2 p-value of truth variable.
  • numTree int: Number of trees in the RandomForest model used to compute AUC
  • numLoop int: Number of times the simulation feedback loop is repeated
  • variables: array of
    • numSnpsInVar int: Type of variable. 1 for a SNP variable, 2 for a 2-way (2-SNP) interaction variable and n for an n-way (n-SNP) interaction variable
    • numVar int: Number of variables of this type
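As an illustration, a config with the parameters listed above could be built and written out as follows. All values and both file prefixes are hypothetical placeholders, not the contents of the shipped config files; the `variables` entries reuse the 20 + 15 + 35 example from "How PEPS Works".

```python
import json

config = {
    "inputType": "csv",
    "dumpCSV": False,
    "shuffleSnps": True,
    "outputPrefix": "out/example",  # hypothetical output prefix
    "inputPrefix": "data/example",  # hypothetical; ".csv" is appended
    "pvalueThr": 0.05,              # illustrative threshold
    "numTree": 100,                 # illustrative value
    "numLoop": 10,                  # illustrative value
    "variables": [
        {"numSnpsInVar": 1, "numVar": 20},  # 20 individual SNPs
        {"numSnpsInVar": 2, "numVar": 15},  # 15 2-way interactions
        {"numSnpsInVar": 3, "numVar": 35},  # 35 3-way interactions
    ],
}

# Print the JSON text that would go into the config file.
print(json.dumps(config, indent=2))
```

The config files in the SampleData directory remain the authoritative examples of the expected layout.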

Sample Data

The SampleData directory includes examples of input, output and config files. The HTML_Notebook contains processed PEPS notebooks in HTML format.

Input:

There are two input files, both subsets of the 1000 Genomes data and available in vcf and csv format: a small input with ~162 SNPs and a large input with 4969 SNPs.

config:

There are 2 config files: config-small.json and config-large.json, made to process the small vcf file and the large csv file respectively.

Output:

Output files produced with both config files above.

PEPS processed notebook

PEPS-small.html and PEPS-large.html are examples that show what the PEPS notebook looks like after processing the small and large input.

VCF to CSV conversion

To parse VCF files, PEPS uses a library called pdbio. This library is slow and only works for tiny vcf files. We strongly recommend preparing input data in CSV format. To do so, you can use the VCF_2_CSV.sh script.

How to Run PEPS3

Use the PEPS3 Jupyter Notebook PEPS3.ipynb. We strongly recommend keeping an eye on all the charts and intermediate data plotted in the notebook. The path to the config file is set in the first cell of the notebook.

PEPS3

PEPS3.ipynb uses a probabilistic model to calculate a phenotype from a set of variables within a population.

Let the desired frequency of controls be $Q$ and, similarly, the frequency of cases be $P = 1 - Q$. If we let the $k$th variable on average decrease the probability of a sample being a control by the fraction $q_k$, then for $g$ variables we arrive at:

$$Q = (1 - q_1)(1 - q_2) \cdots (1 - q_g)$$

In the interests of making each variable similar in importance, we set each $q_k$ equal to the same value $q$, which leaves us with $Q = (1 - q)^g$, and rearrange to find $q = 1 - Q^{1/g}$. As $q$ is an average over all the values of a variable, it can be expressed as

$$q = p_1 f_1 + p_2 f_2 + \cdots + p_{n+1} f_{n+1}$$

for $n+1$ possible values of the variable, where $f_k$ is the frequency of the $k$th value. What remains is to choose the effect on the phenotype $p_k$ for each value of the variable. We choose to set the effect of the value that is most frequent in the population to 0, and set the others such that they all contribute an equal amount, i.e. $p_1 f_1 = p_2 f_2 = \cdots = p_n f_n$. So we have for $p_k$

$$q = n p_k f_k$$

$$p_k = \frac{q}{n f_k}$$

$$p_k = \frac{1 - Q^{1/g}}{n f_k}$$

This $p$ is calculated for each value of each variable (capped at a maximum value of 1) and represents the fraction by which the probability of a sample with that value being in the control group is reduced.
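A short numerical sketch of this formula may help. It is not taken from PEPS3.ipynb: the function name `value_effects`, the dictionary layout and the example frequencies are hypothetical, and all frequencies are assumed to be strictly positive.

```python
def value_effects(freqs, Q, g):
    """Return p_k for each value of one variable.

    freqs: dict mapping each value (e.g. "R", "H", "A") to its frequency.
    Q: the target frequency Q from the derivation above.
    g: total number of variables.
    """
    q = 1 - Q ** (1 / g)
    most_common = max(freqs, key=freqs.get)
    n = len(freqs) - 1  # number of values with a non-zero effect
    effects = {}
    for value, f in freqs.items():
        if value == most_common:
            effects[value] = 0.0  # the most frequent value has no effect
        else:
            effects[value] = min(1.0, q / (n * f))  # capped at 1
    return effects

freqs = {"R": 0.6, "H": 0.3, "A": 0.1}
effects = value_effects(freqs, Q=0.5, g=10)
print(effects)

# When no value is capped, the weighted effects add back up to q.
q = 1 - 0.5 ** (1 / 10)
print(sum(effects[v] * freqs[v] for v in freqs), q)
```

Rare values receive larger effects than common ones, which is what makes every value contribute the same amount $p_k f_k$ to the average.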

The probability that a sample s is in the control group is then

$$Q_s = (1 - p_{s1})(1 - p_{s2}) \cdots (1 - p_{sg})$$

where $p_{sk}$ is the $p$ associated with the value that sample $s$ has for variable $k$.

The cases and controls could then be calculated stochastically, but in the interests of minimising noise and maximising the importance of each variable, the samples are sorted by increasing Qs and the first P fraction are selected as cases.
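The deterministic assignment described above can be sketched as follows. This is an illustration rather than the notebook's code: the function names, the `case_fraction` argument (the fraction of samples selected as cases) and the toy effect values are hypothetical.

```python
def control_probability(sample_values, effects):
    """Q_s: product over variables k of (1 - p_sk) for one sample."""
    prob = 1.0
    for k, value in enumerate(sample_values):
        prob *= 1.0 - effects[k][value]
    return prob

def assign_phenotype(samples, effects, case_fraction):
    """Sort samples by increasing Q_s; the first case_fraction become cases (1)."""
    q_s = [control_probability(s, effects) for s in samples]
    order = sorted(range(len(samples)), key=lambda i: q_s[i])
    num_cases = round(case_fraction * len(samples))
    phenotype = [0] * len(samples)
    for i in order[:num_cases]:
        phenotype[i] = 1
    return phenotype

# Two variables; effects[k][value] is the p for that value of variable k.
effects = [
    {"R": 0.0, "H": 0.1, "A": 0.3},
    {"R": 0.0, "H": 0.2, "A": 0.4},
]
samples = [("R", "R"), ("H", "A"), ("A", "H"), ("R", "H")]
phenotype = assign_phenotype(samples, effects, case_fraction=0.5)
print(phenotype)
```

Here the second and third samples have the lowest $Q_s$ (about 0.54 and 0.56), so they are the two selected as cases.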

PEPS3 uses the "seed" parameter in the config file to ensure that a phenotype can be reproduced. If the "shuffleSnps" parameter is true the seed is ignored and the phenotype will be random.

An example config file: config-peps3.json

An example processed notebook: PEPS3.html

PEPS3 also treats the SNPs included in epistasis interactions as individual variables. For example, if O3V5 is made of 3 SNPs (rs123, rs456, rs789), then each of these SNPs also forms a variable (O3V5S1, O3V5S2, O3V5S3) and independently affects the phenotype.
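The naming in this example can be reproduced with a small helper. This is a sketch only: the function `expand_interaction` is hypothetical, and reading O3V5 as "order 3, variable 5" is an assumption based on the example rather than something the source states.

```python
def expand_interaction(order, var_index, snps):
    """Map each SNP of an interaction variable to its own variable name."""
    name = f"O{order}V{var_index}"  # assumed meaning: order and variable index
    # SNPs are numbered from 1 in the order they appear in the interaction.
    return {f"{name}S{i}": snp for i, snp in enumerate(snps, start=1)}

singles = expand_interaction(3, 5, ["rs123", "rs456", "rs789"])
print(singles)
```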

The paper

Reguant, R., O’Brien, M. J., Bayat, A., Hosking, B., Jain, Y., Twine, N. A., & Bauer, D. C. (2024). PEPS: Polygenic Epistatic Phenotype Simulation. In MEDINFO 2023—The Future Is Accessible (pp. 810-814). IOS Press.

Link: https://ebooks.iospress.nl/doi/10.3233/SHTI231077