Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simulating multiple traits #54

Closed
jeromekelleher opened this issue Jul 15, 2023 · 9 comments
Closed

Simulating multiple traits #54

jeromekelleher opened this issue Jul 15, 2023 · 9 comments

Comments

@jeromekelleher
Copy link
Member

In #52 we added the sim_trait(ts, num_causal) function which returns a pandas data frame with num_causal rows. It seems to me that we will often want to simulate multiple traits, which perhaps we should try to support natively. I'm guessing (and correct me if I'm wrong!) that people will do things like this:

num_traits = 100
# Choose number of causal sites uniformly for each trait
num_causal = np.random.uniform(10, 100, size=num_traits)
for j in range(num_traits):
    df_trait = tstrait.sim_trait(ts, num_causal[j])
    # df_trait has cols  site_id, causal_state, and effect_size
    X = tstrait.sim_phenotypes(ts, df_trait)
   # Do something with X

A better pattern (and one more suitable for interoperability with things like SLiM etc might be:

num_traits = 100
# Choose number of causal sites uniformly for each trait
num_causal = np.random.uniform(10, 100, size=num_traits)
df_traits = tstrait.sim_traits(ts, num_causal)
# df_traits has cols trait_id,  site_id, causal_state, and effect_size
for trait_id, X in enumerate(tstrait.sim_phenotypes(ts, df_traits)):
    # Do something with X

The gain here is that we can communicate multiple traits at once - otherwise, it might get quite clunky when trying to work with many traits. The sim_phenotypes function returns an iterator over the traits and the generated phenotypes for each trait (much like msprime does for random seeds).

Thoughts?

@gregorgorjanc
Copy link
Member

I second that working on multiple traits at once is a way forward since that also enables straightforward simulation of correlated traits. Below are some thoughts on what can be done for a single trait at a time or multiple traits. I am not adding anything regarding the API though!

Genetic variance for a trait and covariance between traits is a function of two sources - 1) causal mutation effects and 2) genotypes at 1)

  1. We would want the ability to simulate correlated mutation effects for some loci, potentially all. This can be done:

a) for multiple traits at once (easiest and will involve Cholesky or some other transformation of a user-provided covariance matrix) or

b) for one trait at a time, but taking into account genetic covariance with previous traits through a Cholesky factor.

  1. Once we have 1) we can calculate genetic values of nodes or individuals, by applying causal mutation effects to corresponding genotypes; this can be done independently for each trait, possibly under the hood so it looks multi-trait to a user.

Finally, phenotypic variance for a trait and covariance between traits is a function of at least two sources - 1) genetic values of individuals simulated above and 2) environmental effects experienced by individuals. The latter can be again independent or correlated - through a user-provided environmental covariance matrix. Application of this environmental covariance can be again done:

a) for multiple traits at once (easiest and will involve Cholesky or some other transformation of a user-provided covariance matrix) or

b) for one trait at a time, but taking into account environmental covariance with previous traits through a Cholesky factor.

Above, I say "at least" because a user could be adding other effects on top of the genetic and environmental effects, say sex, year, location, etc. That part can be done after sim_phenotypes().

@daikitag
Copy link
Collaborator

@jeromekelleher
Thank you for your feedback on #55, and I think it is very interesting to try simulating multiple traits.
I guess we can simulate multiple traits by adding a new TraitModel that can simulate values based on a multivariate normal distribution, or to simulate multivariate normal random variables from Cholesky decomposition.

However, I think the biggest issue is to obtain mutiple trait results by using a Pandas dataframe. Would you suggest that we try adding new columns in the dataframe to represent simulated results from multiple traits, or to obtain multiple dataframes for each trait? I'm currently stuck with how I can produce the output, and I would appreciate your suggestions.

@gregorgorjanc
Copy link
Member

We usually have one wide object for multiple traits - for mutation effects and for node/individual values (I mean these are two multi-trait objects) - some fixed number of columns for site or individual ids and such metadata and then multiple columns for traits. That form makes it quite easy to work with in my view.

Depending on the implementation, you can also go with the “tall” object (speaking in tidyverse sense) where a mutation has multiple rows - one per trait, and you then need trait column and value column.

Multiple objects for multiple traits would get messy in my view. For example, it's common to calculate correlation between traits, so having traits in one objects makes that simple.

@jeromekelleher
Copy link
Member Author

I think a single "tall" data frame is the way to go. In the simplest case we can have columns:

  • trait_id
  • site_id
  • causal_state
  • effect_size

then when we call sim_phenotypes on this, it will do something like this (but moer efficiently!)

def sim_phenotypes(ts, df_traits, ...):
     trait_ids = df_traits.trait_id.unique()
     for trait_id in trait_ids:
           yield _sim_trait_phenotypes(ts, df_traits[df_traits.trait_id == trait_id])

So we generate the phenotypes for traits one-by-one. The correlations between traits can be handled in the sim_traits function, right?

@gregorgorjanc
Copy link
Member

I think a single "tall" data frame is the way to go. In the simplest case we can have columns:

  • trait_id
  • site_id
  • causal_state
  • effect_size

Should you link to a mutation ID or to site ID? If you link to a mutation ID, then you get site ID from there.

So we generate the phenotypes for traits one-by-one. The correlations between traits can be handled in the sim_traits function, right?

sim_traits is simulating mutation effects, which will take care of genetic correlations, but only the mutation effects part, so would it not be better to call this sim_mutation_effects (this one should be run for multiple traits to take genetic correlation into account)? Namely, genetic correlation between traits is a function of mutation effects AND genotypes, so genetic correlation between traits comes out only after sim_phenotypes is called.

sim_phenotypes is calculating genetic values (by applying mutation effects to genotypes) and on top of these simulating phenotype values by adding environmental noise. This noise can be correlated though! In light of this, you might want to have two functions here: calculate_genetic_values (this one can be run trait by trait) and sim_phenotypes (this one should be run for multiple traits to take environmental correlation into account).

@daikitag
Copy link
Collaborator

@gregorgorjanc Thank you very much for your detailed feedback. I also thoroughly looked at your lecture videos on AlphaSimR, https://media.ed.ac.uk/playlist/dedicated/76589901/1_os6ikczi/1_4il26qf1, and it was super helpful for me to understand the importance of quantitative trait simulation and multiple trait simulation in agriculture. Considering that the correlation is important in multiple trait simulation, I think doing the following simulation might be a good idea, but what do you think about it?

Input of sim_traits (Simulate effect sizes):

  • Tree sequence
  • Number of causal sites (We assume complete pleiotropy like AlphaSimR)
  • model (Only the Multivariate normal trait model supports multiple trait simulation)

Input of multivariate normal trait model:

  • Correlation matrix
  • Variance vector
  • Mean vector

Output of sim_traits (based on @jeromekelleher suggestions):
Dataframe with

  • trait_id
  • site_id
  • causal_state
  • effect_size

Here, we are considering site ID, as it would be possible for multiple mutations at a particular site to have the same derived state, and it would not be a good idea to consider them differently.

To simulate the phenotypes, I am thinking about inputting the vector of heritability, and simulating environmental values based on those values. It would be great to consider environmental correlation as well, but I think as @gregorgorjanc had suggested in slack, it might be better to allow users to define their own environmental noise by allowing heritability = 1 option, instead of us trying to implement a complex structure.

Do you think this sounds like a good idea? If so, I will try posting a pull request to implement this functionality to tstrait, and I really appreciate your time and help.

@gregorgorjanc
Copy link
Member

Input of sim_traits (Simulate effect sizes):

  • Tree sequence
  • Number of causal sites (We assume complete pleiotropy like AlphaSimR)
  • model (Only the Multivariate normal trait model supports multiple trait simulation)

I think these inputs look good.

Input of multivariate normal trait model:

  • Correlation matrix
  • Variance vector
  • Mean vector

Also good. Start with mean, then var, then corr;)

Output of sim_traits (based on @jeromekelleher suggestions): Dataframe with

  • trait_id
  • site_id
  • causal_state
  • effect_size

Here, we are considering site ID, as it would be possible for multiple mutations at a particular site to have the same derived state, and it would not be a good idea to consider them differently.

It depends. If we assume that a mutation, say from G to C at a site X, has the same effect irrespective of sequence context then this suffices. Most statistical genetics research is taking this view! I have been using this view for years! It's only natural to take this view if the only thing you see in front of you is a matrix of genotypes and you have no sequence context of where mutations happened.

Since I am working with tree sequences and it's notion of multiple mutations at a site, I have started to think that this view is possibly too simplistic. Let me explain with a simple example from protein-coding DNA - we know how that part of biology works very well:

  • if we mutate GGG codon (GLY amino acid) to CGG codon (mutating G to C at the 1st site) we get ARG amino acid
  • if we mutate GCG codon (ALA amino acid) to CCG codon (mutating G to C at the 1st site) we get PRO amino acid

Both of the above mutations happen at the same site, and they both mutate G to C; however, they are not the same mutation in terms of starting and resulting amino acids due to different sequence contexts! Assuming that this polymorphism (G vs C) has the same effect, imposes an average effect of this polymorphism across different sequence contexts. This is what we are estimating in GWAS and polygenic risk score studies, but note that in real data, we will likely not have an equal proportion of haplotypes (GGG, CGG, GCG, and CCG codons in the above example), meaning that the actual effect we estimate could be more influenced by the GGG->CGG mutation (if that one is more prevalent) than the GCG->CCG mutation. Maybe this partly explains why polygenic risk scores don't port well between populations? That's a huge research question. Technically speaking, by considering different mutation effects depending on sequence context, we are allowing for some form(s) of epistasis.

The above is quite a specific example and very much a research question. It would be neat if we could simulate one or the other scenario and later test how different estimation methods deal with these different settings. I understand you might want to implement the simple case of the same effect of a mutation at a site first! For one, the more complex case opens many questions - how do we simulate these variable effects depending on the sequence context and do all mutations have such "variable" effects for the same site etc.!?

To simulate the phenotypes, I am thinking about inputting the vector of heritability, and simulating environmental values based on those values. It would be great to consider environmental correlation as well, but I think as @gregorgorjanc had suggested in slack, it might be better to allow users to define their own environmental noise by allowing heritability = 1 option, instead of us trying to implement a complex structure.

I agree that the vector of h2 will take you far. It really depends on how versatile you want to make this simulator. Environmental correlations between traits are the norm though, as are genetic correlations;)

@daikitag daikitag mentioned this issue Jul 18, 2023
@daikitag
Copy link
Collaborator

I just created a new pull request to incorporate the things that were mentioned in this issue. I would appreciate it if you could look through the codes whenever you have some time.
#56

@daikitag
Copy link
Collaborator

The issue is incorporated into #56. Thank you @gregorgorjanc for your detailed suggestions, and they were super helpful for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants