-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simulating multiple traits #54
Comments
I second that working on multiple traits at once is a way forward since that also enables straightforward simulation of correlated traits. Below are some thoughts on what can be done for a single trait at a time or multiple traits. I am not adding anything regarding the API though! Genetic variance for a trait and covariance between traits is a function of two sources - 1) causal mutation effects and 2) genotypes at 1)
a) for multiple traits at once (easiest and will involve Cholesky or some other transformation of a user-provided covariance matrix) or b) for one trait at a time, but taking into account genetic covariance with previous traits through a Cholesky factor.
Finally, phenotypic variance for a trait and covariance between traits is a function of at least two sources - 1) genetic values of individuals simulated above and 2) environmental effects experienced by individuals. The latter can be again independent or correlated - through a user-provided environmental covariance matrix. Application of this environmental covariance can be again done: a) for multiple traits at once (easiest and will involve Cholesky or some other transformation of a user-provided covariance matrix) or b) for one trait at a time, but taking into account environmental covariance with previous traits through a Cholesky factor. Above, I say "at least" because a user could be adding other effects on top of the genetic and environmental effects, say sex, year, location, etc. That part can be done after |
@jeromekelleher However, I think the biggest issue is to obtain mutiple trait results by using a Pandas dataframe. Would you suggest that we try adding new columns in the dataframe to represent simulated results from multiple traits, or to obtain multiple dataframes for each trait? I'm currently stuck with how I can produce the output, and I would appreciate your suggestions. |
We usually have one wide object for multiple traits - for mutation effects and for node/individual values (I mean these are two multi-trait objects) - some fixed number of columns for site or individual ids and such metadata and then multiple columns for traits. That form makes it quite easy to work with in my view. Depending on the implementation, you can also go with the “tall” object (speaking in tidyverse sense) where a mutation has multiple rows - one per trait, and you then need trait column and value column. Multiple objects for multiple traits would get messy in my view. For example, it's common to calculate correlation between traits, so having traits in one objects makes that simple. |
I think a single "tall" data frame is the way to go. In the simplest case we can have columns:
then when we call def sim_phenotypes(ts, df_traits, ...):
trait_ids = df_traits.trait_id.unique()
for trait_id in trait_ids:
yield _sim_trait_phenotypes(ts, df_traits[df_traits.trait_id == trait_id]) So we generate the phenotypes for traits one-by-one. The correlations between traits can be handled in the |
Should you link to a mutation ID or to site ID? If you link to a mutation ID, then you get site ID from there.
|
@gregorgorjanc Thank you very much for your detailed feedback. I also thoroughly looked at your lecture videos on AlphaSimR, https://media.ed.ac.uk/playlist/dedicated/76589901/1_os6ikczi/1_4il26qf1, and it was super helpful for me to understand the importance of quantitative trait simulation and multiple trait simulation in agriculture. Considering that the correlation is important in multiple trait simulation, I think doing the following simulation might be a good idea, but what do you think about it? Input of
Input of multivariate normal trait model:
Output of
Here, we are considering site ID, as it would be possible for multiple mutations at a particular site to have the same derived state, and it would not be a good idea to consider them differently. To simulate the phenotypes, I am thinking about inputting the vector of heritability, and simulating environmental values based on those values. It would be great to consider environmental correlation as well, but I think as @gregorgorjanc had suggested in slack, it might be better to allow users to define their own environmental noise by allowing Do you think this sounds like a good idea? If so, I will try posting a pull request to implement this functionality to |
I think these inputs look good.
Also good. Start with mean, then var, then corr;)
It depends. If we assume that a mutation, say from G to C at a site X, has the same effect irrespective of sequence context then this suffices. Most statistical genetics research is taking this view! I have been using this view for years! It's only natural to take this view if the only thing you see in front of you is a matrix of genotypes and you have no sequence context of where mutations happened. Since I am working with tree sequences and it's notion of multiple mutations at a site, I have started to think that this view is possibly too simplistic. Let me explain with a simple example from protein-coding DNA - we know how that part of biology works very well:
Both of the above mutations happen at the same site, and they both mutate G to C; however, they are not the same mutation in terms of starting and resulting amino acids due to different sequence contexts! Assuming that this polymorphism (G vs C) has the same effect, imposes an average effect of this polymorphism across different sequence contexts. This is what we are estimating in GWAS and polygenic risk score studies, but note that in real data, we will likely not have an equal proportion of haplotypes (GGG, CGG, GCG, and CCG codons in the above example), meaning that the actual effect we estimate could be more influenced by the GGG->CGG mutation (if that one is more prevalent) than the GCG->CCG mutation. Maybe this partly explains why polygenic risk scores don't port well between populations? That's a huge research question. Technically speaking, by considering different mutation effects depending on sequence context, we are allowing for some form(s) of epistasis. The above is quite a specific example and very much a research question. It would be neat if we could simulate one or the other scenario and later test how different estimation methods deal with these different settings. I understand you might want to implement the simple case of the same effect of a mutation at a site first! For one, the more complex case opens many questions - how do we simulate these variable effects depending on the sequence context and do all mutations have such "variable" effects for the same site etc.!?
I agree that the vector of |
I just created a new pull request to incorporate the things that were mentioned in this issue. I would appreciate it if you could look through the codes whenever you have some time. |
The issue is incorporated into #56. Thank you @gregorgorjanc for your detailed suggestions, and they were super helpful for me. |
In #52 we added the
sim_trait(ts, num_causal)
function which returns a pandas data frame with num_causal rows. It seems to me that we will often want to simulate multiple traits, which perhaps we should try to support natively. I'm guessing (and correct me if I'm wrong!) that people will do things like this:A better pattern (and one more suitable for interoperability with things like SLiM etc might be:
The gain here is that we can communicate multiple traits at once - otherwise, it might get quite clunky when trying to work with many traits. The
sim_phenotypes
function returns an iterator over the traits and the generated phenotypes for each trait (much like msprime does for random seeds).Thoughts?
The text was updated successfully, but these errors were encountered: