Replies: 6 comments 1 reply
-
I like the idea of having tried and tested models, but I think including them in the manual is likely a problem, because it will require constant updating and is then liable to get out of sync with the software itself. However, we pretty much already have a list of tried and tested models, and that's what's included by default in -m MF. These could easily be listed via --help, and/or in the logfile when using ModelFinder.
Implementing and testing new optimisers for new models like +I+H[n] is something that would need to be funded by a grant. It takes a lot of developer time to get that right, and a lot of time to then prove that it works well enough for biological inference.
-
Let me follow up on the issue of tried-and-tested RHAS models.
The output of the --help command lists the following RHAS models as options:
RATE HETEROGENEITY AMONG SITES:
-m ...+I A proportion of invariable sites
-m ...+G[n] Discrete Gamma model with n categories (default n=4)
-m ...*G[n] Discrete Gamma model with unlinked model parameters
-m ...+I+G[n] Invariable sites plus Gamma model with n categories
-m ...+R[n] FreeRate model with n categories (default n=4)
-m ...*R[n] FreeRate model with unlinked model parameters
-m ...+I+R[n] Invariable sites plus FreeRate model with n categories
-m ...+Hn Heterotachy model with n classes
-m ...*Hn Heterotachy model with n classes and unlinked parameters
so it is fair to assume that these nine RHAS models are tried-and-tested (I certainly assumed so).
This brings me to several related issues:
1. When I used the -MF option (i.e., without the --mrate option), I only got results from five RHAS models: I, G, I+G, R, and I+R. This is a subset of the models above. Does this mean that the *G, *R, H, and *H models should not be considered as tried-and-tested?
2. The results included in the spreadsheet attached to the original message suggest that optimisation is not working properly for the *G, *R, and *H RHAS models. It may be sensible to remove these models from the list above (until a computational solution has been implemented for these models).
3. As for the H model, there was no indication suggesting that it is not working properly, so should it be included in the list of models considered when the -MF option is used on its own (i.e., without the --mrate option)? I would welcome that.
4. In my original survey, I included RHAS models not included in the list above (i.e., I+H, I*G, I*R, I*H). I did so because I wanted to know if they work (I had hoped so). However, they returned odd results, so I will stop using them, even though they are interesting from a process point of view.
5. Finally, I compared four model-selection procedures and got the following optimal models:
* -MF : GTR+F+I+R4 (BIC: 262633.915)
* -MF --mtree : GTR+F+I+R4 (BIC: 262629.659)
* -MF --mrate E,I,G,I+G,R,I+R,H : GTR+F+H4 (BIC: 262104.525)
* -MF --mrate E,I,G,I+G,R,I+R,H --mtree : GTR+F+H4 (BIC: 262105.060)
These results suggest, as I have seen many times previously, that we should include as many of the RHAS models as possible during model selection, and that we should encourage users to use the --mtree option, because the tree topology determines whether a given site is fast-evolving or slow-evolving. One option would be to include the H model in the -MF option and to create a new option called -MFA, which combines the -MF and --mtree options.
6. I agree that it is time-consuming and expensive to incorporate and test the accuracy of new RHAS models, but it wouldn’t take a lot of time to revise the code so that:
* The -MF option includes the following RHAS models: E, I, G, I+G, R, I+R, and H
* The *G, *R, and *H models are removed from the list above. My survey suggests they give misleading results, which is extremely unfortunate because I’d love to see them work efficiently and accurately.
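To make the size of the differences in item 5 easier to see, here is a minimal Python sketch that expresses each reported BIC relative to the lowest one. The BIC values and run labels are copied from the list above as written; the function name `delta_bic` is just an illustrative helper, not anything from IQ-TREE.

```python
# BIC values copied from the four ModelFinder runs listed in item 5.
# Keys are the option strings exactly as written in the post.
runs = {
    "-MF": ("GTR+F+I+R4", 262633.915),
    "-MF --mtree": ("GTR+F+I+R4", 262629.659),
    "-MF --mrate E,I,G,I+G,R,I+R,H": ("GTR+F+H4", 262104.525),
    "-MF --mrate E,I,G,I+G,R,I+R,H --mtree": ("GTR+F+H4", 262105.060),
}


def delta_bic(runs):
    """Return each run's BIC minus the lowest BIC among all runs."""
    best = min(bic for _, bic in runs.values())
    return {name: round(bic - best, 3) for name, (_, bic) in runs.items()}


deltas = delta_bic(runs)
for name, d in deltas.items():
    print(f"{name:40s} {runs[name][0]:12s} dBIC = {d:8.3f}")
```

On these numbers, adding H to the candidate set lowers the BIC by more than 500 units, whereas adding --mtree changes it by only a few units in either direction.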
Enough for now.
All the best,
Lars
From: roblanf ***@***.***>
Date: Tuesday, 11 June 2024 at 07:36
To: iqtree/iqtree2 ***@***.***>
Cc: Jermiin, Lars ***@***.***>, Author ***@***.***>
Subject: Re: [iqtree/iqtree2] Valid (tried-and-tested) combinations of substitution models and rate-heterogeneity across sites models for DNA (Discussion #224)
EXTERNAL EMAIL: This email originated outside the University of Galway. Do not open attachments or click on links unless you believe the content is safe.
I like the idea of having tried and tested models, but I think including them in the manual is likely a problem, because it will require constant updating and is then liable to get out of sync with the software itself. However, we pretty much already have a list of tried and tested models, and that's what's included by default in -m MF. These could easily be listed via --help, and/or in the logfile when using ModelFinder.
Implementing and testing new optimisers for new models like +I+H[n] is something that would need to be funded by a grant. It takes a lot of developer time to get that right, and a lot of time to then prove that it works well enough for biological inference.
-
To be honest, I object to including the +H model in the default ModelFinder. For most users this model is an over-parameterisation, unless they have a very long alignment.
-
Just back from leave.
OK, but excluding the +H model means that model selection only considers homotachous models of RHAS. For short alignments (with, say, 300 sites of amino acids or 900 sites of nucleotides) that may be OK, but for longer alignments, I doubt it.
What is the evidence behind saying that model “selection is generally insensitive to the tree used”? The original paper suggested otherwise.
From: Bui Quang Minh ***@***.***>
Date: Friday, 14 June 2024 at 15:09
To: iqtree/iqtree2 ***@***.***>
Cc: Jermiin, Lars ***@***.***>, Author ***@***.***>
Subject: Re: [iqtree/iqtree2] Valid (tried-and-tested) combinations of substitution models and rate-heterogeneity across sites models for DNA (Discussion #224)
To be honest, I object to including the +H model in the default ModelFinder. For most users this model is an over-parameterisation, unless they have a very long alignment.
I also object to encouraging the use of the -mtree option. Model selection is generally insensitive to the tree used. Doing a tree search for every single model is likely a waste of computing time and energy.
-
What literature are you referring to? Most of the literature that I am aware of on model selection only compares homotachous models of RHAS, so your argument does not convince me. The same applies to the --mtree option.
Lars
From: roblanf ***@***.***>
Date: Tuesday, 18 June 2024 at 03:59
To: iqtree/iqtree2 ***@***.***>
Cc: Jermiin, Lars ***@***.***>, Author ***@***.***>
Subject: Re: [iqtree/iqtree2] Valid (tried-and-tested) combinations of substitution models and rate-heterogeneity across sites models for DNA (Discussion #224)
I agree here. The +H models are a special case and need to be (very) carefully examined whenever they are used. The same goes for running a tree search for every model. Given the information in the literature on the relatively minor differences that model selection makes to inference, I lean a long way in the other direction. We should test a small and sensible set of models, in a way that is quick and efficient (while, of course, allowing users to do more computationally demanding things if they so choose, with options like -mtree).
To put it another way, I'd need to see some overwhelming evidence that phylogenetic inference is dramatically improved by things like -mtree, on a huge range of typical empirical datasets, before I would be convinced that it should be a default. The kind of thing that would start to convince me is if a large fraction of datasets showed differences in highly supported nodes when using that option.
-
Hi guys,
Prompted by our discussion of which models of sequence evolution (SE) are supported by IQ-TREE 2 and ModelFinder, I did a survey of the models of SE using an alignment of mtDNA. The following describes the data, a definition (of "model of SE"), and what I did, what I found, and what I recommend be done.
DATA
Alignment has 24 sequences with 13421 columns, 5537 distinct patterns
6130 parsimony-informative, 1226 singleton sites, 6065 constant sites
DEFINITION
A model of SE combines a substitution (S) model and a rate-heterogeneity across sites (RHAS) model. Hence, HKY is included in S, and I+G is included in RHAS.
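As a rough illustration of this definition, the sketch below builds model-of-SE strings by pairing each S model with each RHAS model. The helper name `se_models` and the short example lists are assumptions made for illustration (they are not the full set surveyed here), and "E" is assumed to mean equal rates with no suffix, following its use in the --mrate lists elsewhere in this thread.

```python
from itertools import product


def se_models(s_models, rhas_models):
    """Combine every substitution (S) model with every RHAS model.

    Assumption: "E" (equal rates) adds no suffix to the model string.
    """
    names = []
    for s, rhas in product(s_models, rhas_models):
        names.append(s if rhas == "E" else f"{s}+{rhas}")
    return names


# Illustrative subsets only: two S models and six RHAS models.
models = se_models(["HKY", "GTR"], ["E", "I", "G", "I+G", "R", "I+R"])
print(len(models))
print(models)
```

With two S models and six RHAS models this gives twelve models of SE, which shows how quickly the candidate set grows as RHAS models are added.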
WHAT I DID
To examine the performance of ModelFinder, I ran it twice, once in default mode (-m MF) and once in advanced mode (-m MF --mtree). This meant that for each model of SE, I simply had to compare the log likelihood values to see whether using the advanced mode improved the fit between tree, model, and data. It also meant that I did not have to use BIC or AIC.
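The reason BIC is not needed for this comparison can be sketched in a few lines of Python. It assumes the usual definition BIC = -2 lnL + k ln(n), and that the same model of SE has the same number of free parameters k and the same number of sites n in both modes. The log likelihoods and k below are hypothetical placeholders, not values from the spreadsheet; only n is the alignment length reported under DATA.

```python
import math


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: -2 lnL + k ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)


# Hypothetical log likelihoods for one model of SE in the two modes.
lnl_default = -131250.0
lnl_mtree = -131248.0
k = 56     # placeholder parameter count, assumed identical in both modes
n = 13421  # alignment length reported under DATA

diff_bic = bic(lnl_mtree, k, n) - bic(lnl_default, k, n)
diff_lnl = lnl_mtree - lnl_default

# The penalty term k ln(n) cancels, so the BIC difference is -2 times
# the difference in log likelihood.
print(diff_bic, -2.0 * diff_lnl)
```

Because the penalty term is identical in the two runs, ranking the two modes by log likelihood gives the same answer as ranking them by BIC (and the same cancellation applies to AIC).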
To evaluate all RHAS models, I specified them (i.e. I used models that I could think of). I used the following four commands:
I could have used two commands but I chose to use four because I wanted to challenge a concern that I have about models with unlinked parameters.
I realise that some of these models of RHAS might not be supported, but I was not sure which ones are OK and which ones not.
WHAT I FOUND
The results are summarised in a spreadsheet called Comparison_LSJermiin.xlsx (attached; the log file can be provided if needed).
Comparison_by_LSJermiin.xlsx
Each line in the file contains some estimates and observations for a model of SE:
Focusing on the results obtained using the E,I,G,I+G,R,I+R,H,I+H RHAS models (i.e., lines 3-417 of the attached file), several features are clear:
These observations suggest that
Focusing on the results obtained using the G,IG,R,IR,H,IH RHAS models (i.e., lines 418-812), several features are clear:
In summary, I am now quite worried about using models where model parameters are unlinked. I realise that some of the models might not have been tested properly or may not even be supported by IQ-TREE 2, but given the lack of clear guidelines about 'tried-and-tested' combinations of S and RHAS, I thought it better to test all those listed above.
RECOMMENDATION
I would really like to see:
As always, I'll be happy to help run further tests.