Valid (tried-and-tested) combinations of substitution models and rate-heterogeneity across sites models for DNA #224

lsjermiin · 2024-06-08T16:35:12Z

Hi guys,
Prompted by our discussion on which models of sequence evolution (SE) are supported by IQ-TREE 2 and ModelFinder, I did a survey of the models of SE using an alignment of mtDNA. The following describes the data, a definition (of "model of SE"), and what I did, found, and recommend be done.

DATA
Alignment has 24 sequences with 13421 columns, 5537 distinct patterns
6130 parsimony-informative, 1226 singleton sites, 6065 constant sites

DEFINITION
A model of SE combines a substitution (S) model and a rate-heterogeneity across sites (RHAS) model. Hence, HKY is included in S, and I+G is included in RHAS.

WHAT I DID
To examine the performance of ModelFinder, I ran it twice, once in default mode (-m MF) and once in advanced mode (-m MF --mtree). This meant that for each model of SE, I simply had to compare the log likelihood values to see whether using the advanced mode improved the fit between tree, model, and data. It also meant that I did not have to use BIC or AIC, because the number of parameters is the same in both runs.
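Why the log likelihood comparison is enough here can be sketched in a few lines of Python. For a given model of SE, both runs have the same number of free parameters k and the same sample size n, so the BIC penalty cancels and the BIC difference is simply -2 times the log likelihood difference. The function names, k, and the two log likelihood values below are illustrative assumptions and are not taken from the spreadsheet; I also assume that n is the number of alignment columns given under DATA.

```python
import math


def bic(lnl: float, k: int, n: int) -> float:
    """Bayesian information criterion: -2*lnL + k*ln(n)."""
    return -2.0 * lnl + k * math.log(n)


def delta_lnl(lnl_default: float, lnl_advanced: float) -> float:
    """Positive when the advanced (--mtree) run found a more likely estimate."""
    return lnl_advanced - lnl_default


# Hypothetical values for one model of SE (not taken from the spreadsheet).
n_sites = 13421            # alignment columns, from the DATA section
k_params = 55              # same model of SE, so the same k in both runs
lnl_default = -131000.0
lnl_advanced = -130990.0

d_lnl = delta_lnl(lnl_default, lnl_advanced)
d_bic = bic(lnl_advanced, k_params, n_sites) - bic(lnl_default, k_params, n_sites)

# With k and n fixed, the penalty term cancels: delta BIC == -2 * delta lnL.
print(f"delta lnL = {d_lnl:.3f}, delta BIC = {d_bic:.3f}")
```

The same cancellation holds for AIC, whose penalty depends only on k.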

To evaluate all RHAS models, I specified them (i.e. I used models that I could think of). I used the following four commands:

iqtree2 -s 24concat_masked.fst --seqtype DNA -m MF --mrate E,I,G,I+G,R,I+R,H,I+H -cmax 15 --merit BIC --safe -T 16
iqtree2 -s 24concat_masked.fst --seqtype DNA -m MF --mrate E,I,G,I+G,R,I+R,H,I+H -cmax 15 --mtree --merit BIC --safe -T 16

iqtree2 -s 24concat_masked.fst --seqtype DNA -m MF --mrate *G,I*G,*R,I*R,*H,I*H -cmax 15 --merit BIC --safe -T 16
iqtree2 -s 24concat_masked.fst --seqtype DNA -m MF --mrate *G,I*G,*R,I*R,*H,I*H -cmax 15 --mtree --merit BIC --safe -T 16

I could have used two commands, but I chose to use four because I wanted to test a concern that I have about models with unlinked parameters.
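For anyone wanting to repeat the survey, the four commands above can be generated rather than typed. The sketch below only rebuilds the command lines already shown (same flags, same order); the helper name modelfinder_cmd is hypothetical, and nothing is executed. One reason to build them as argument lists is that the unquoted * in the --mrate value could be glob-expanded by a shell if matching files happen to exist.

```python
import shlex

ALIGNMENT = "24concat_masked.fst"
LINKED_RATES = "E,I,G,I+G,R,I+R,H,I+H"
UNLINKED_RATES = "*G,I*G,*R,I*R,*H,I*H"


def modelfinder_cmd(mrate: str, mtree: bool) -> list[str]:
    """Rebuild one of the four survey commands as an argument list."""
    cmd = ["iqtree2", "-s", ALIGNMENT, "--seqtype", "DNA", "-m", "MF",
           "--mrate", mrate, "-cmax", "15"]
    if mtree:
        cmd.append("--mtree")      # advanced mode: tree search for every model
    cmd += ["--merit", "BIC", "--safe", "-T", "16"]
    return cmd


# Two RHAS sets x (default, advanced) = four commands.
commands = [modelfinder_cmd(mrate, mtree)
            for mrate in (LINKED_RATES, UNLINKED_RATES)
            for mtree in (False, True)]

for cmd in commands:
    # shlex.join quotes the '*' values so a shell would not expand them.
    print(shlex.join(cmd))
```

Each list could also be passed to subprocess.run directly, which avoids the shell altogether.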

I realise that some of these models of RHAS might not be supported, but I was not sure which ones are OK and which ones not.

WHAT I FOUND
The results are summarised in a spreadsheet called Comparison_by_LSJermiin.xlsx (attached; the log file can be provided if needed).

Comparison_by_LSJermiin.xlsx

Each line in the file contains some estimates and observations for a model of SE:

Columns A and D list the model of SE,
Columns C and E list the log likelihood of the best model of SE found using the default and advanced modes, respectively,
Column G lists the difference in log likelihood values (a positive value implies that the advanced mode returned a more likely estimate),
Column F lists the number of parameters, and
Column C lists the warnings issued using the default mode.

Focusing on the results obtained using the E,I,G,I+G,R,I+R,H,I+H RHAS models (i.e., lines 3-417 of the attached file), several features are clear:

Most differences in log likelihood are positive. This implies that the optimal estimate inferred in advanced mode led to an improvement in the fit between tree, model, and data. This was as I had expected.
Some differences in log likelihood are negative. This was not expected but can occur if the optimisation procedure is caught in a local optimum.
Of the 31 cases with a negative difference, 27 are associated with a warning (WARN: ...) and with the following RHAS model: I+H[n]. Interestingly, when this warning was issued, the difference in log likelihood could be quite large (i.e., it ranged from -79.241 to 136.644).

These observations suggest that

The E,I,G,I+G,R,I+R,H RHAS models work well for the S models considered
The I+H RHAS model does not work for some of the S models considered

Focusing on the results obtained using the *G,I*G,*R,I*R,*H,I*H RHAS models (i.e., lines 418-812), several features are clear:

Of the 395 comparisons, 176 returned a positive difference in log likelihood. Of these cases, 41 came with a warning. Critically, the difference in log likelihood for these 176 models ranged from 7.914 to 29139.381 (based on the same number of parameters). This makes these models of RHAS very attractive.
Of the 395 comparisons, 85 returned a negative difference in log likelihood. Of these cases, 33 came with a warning. Critically, the difference in log likelihood for these 85 models ranged from -11.372 to -17463.21 (based on the same number of parameters). Although negative differences are possible, the proportion and size of these values raise concern about these RHAS models (or the implementation of these RHAS models).
As for the remaining 135 models of SE, the model was not considered by the default mode or the advanced mode. Of these 135 cases, 53 came with a warning (issued while using the default mode; I didn't count the number issued while using the advanced mode).
Looking at the cases where warnings were issued and/or negative differences in log likelihood occurred, it appears that every RHAS model considered (i.e., *G, I*G, *R[n], I*R[n], *H[n], I*H[n]) gave cause for concern.
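A tally of the kind reported above (positive, negative, and missing comparisons, each with its count of warnings) could be produced along the following lines. This is only a sketch: the Row type, the classify helper, the model names, and all numbers are hypothetical and do not come from the spreadsheet, whose values would have to be read in first.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Row(NamedTuple):
    model: str
    lnl_default: Optional[float]    # None if the model was not evaluated
    lnl_advanced: Optional[float]   # None if the model was not evaluated
    warned: bool                    # warning issued in default mode


def classify(row: Row) -> str:
    """Label a comparison by the sign of (advanced - default) log likelihood."""
    if row.lnl_default is None or row.lnl_advanced is None:
        return "missing"
    diff = row.lnl_advanced - row.lnl_default
    if diff > 0:
        return "positive"
    if diff < 0:
        return "negative"
    return "zero"


# Hypothetical rows, for illustration only.
rows = [
    Row("HKY+F*G4", -131500.0, -131480.0, False),
    Row("HKY+F*R4", -131400.0, -131450.0, True),
    Row("GTR+F*H4", None, None, True),
    Row("GTR+F+I*G4", -131300.0, -131300.0, False),
]

counts = Counter(classify(r) for r in rows)
warned = Counter(classify(r) for r in rows if r.warned)

for outcome in ("positive", "negative", "zero", "missing"):
    print(f"{outcome}: {counts[outcome]} models, {warned[outcome]} with a warning")
```

Keeping "missing" as its own outcome makes it easy to check that the three groups add up to the number of comparisons.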

In summary, I am now quite worried about using models where model parameters are unlinked. I realise that some of the models might not have been tested properly or may not even be supported by IQ-TREE 2, but given the lack of clear guidelines about 'tried-and-tested' combinations of S and RHAS, I thought it better to test all those listed above.

RECOMMENDATION
I would really like to see:

The manual states which of the following RHAS models have been tested and, importantly, which ones should not be used: E, I, G, I+G, R, I+R, H, I+H, *G, I*G, *R, I*R, *H, and I*H [this would be the first and easy solution]
The software lists the valid (i.e., the tried-and-tested) RHAS models whenever the iqtree2 --help command is used [this would be the second and easy solution]
The not yet tried-and-tested RHAS models be implemented in the software (e.g., I+H[n] is a realistic RHAS model) [this would be the third and more time-consuming solution]

As always, I'll be happy to help with running further tests.

roblanf (Maintainer) · 2024-06-11T05:35:43Z

I like the idea of having tried and tested models, but I think including them in the manual is likely a problem, because it will require constant updating and is then liable to get out of sync with the software itself. However, we pretty much already have a list of tried and tested models, and that's what's included by default in -m MF. These could easily be listed via --help, and/or in the logfile when using ModelFinder.

Implementing and testing new optimisers for new models like +I+H[n] is something that would need to be funded by a grant. It takes a lot of developer time to get that right, and a lot of time to then prove that it works well enough for biological inference.

lsjermiin (Author) · 2024-06-13T08:24:45Z

Let me follow up on the issue of tried-and-tested RHAS models. The output of the --help command lists the following RHAS models as options:

RATE HETEROGENEITY AMONG SITES:
-m ...+I         A proportion of invariable sites
-m ...+G[n]      Discrete Gamma model with n categories (default n=4)
-m ...*G[n]      Discrete Gamma model with unlinked model parameters
-m ...+I+G[n]    Invariable sites plus Gamma model with n categories
-m ...+R[n]      FreeRate model with n categories (default n=4)
-m ...*R[n]      FreeRate model with unlinked model parameters
-m ...+I+R[n]    Invariable sites plus FreeRate model with n categories
-m ...+Hn        Heterotachy model with n classes
-m ...*Hn        Heterotachy model with n classes and unlinked parameters

so it is fair to assume that these nine RHAS models are tried-and-tested (I certainly assumed so). This brings me to several related issues:

1. When I used the -MF option (i.e., without the --mrate option), I only got results from five RHAS models: I, G, I+G, R, and I+R. This is a subset of the models above. Does this mean that the *G, *R, H, and *H models should not be considered as tried-and-tested?

2. The results included in the spreadsheet attached to the original message suggest that optimisation is not working properly for the *G, *R, and *H RHAS models. It may be sensible to remove these models from the list above (until a computational solution has been implemented for these models).

3. As for the H model, there was no indication suggesting that it is not working properly, so should it be included in the list of models considered when the -MF option is used on its own (i.e., without the --mrate option)? I would welcome that.

4. In my original survey, I included RHAS models not included in the list above (i.e., I+H, I*G, I*R, I*H). I did so because I wanted to know if they work (I had hoped so). However, they returned odd results, so I will stop using them, even though they are interesting from a process point of view.

5. Finally, I compared four model-selection procedures and got the following optimal models:

* -MF : GTR+F+I+R4 (BIC: 262633.915)
* -MF --mtree : GTR+F+I+R4 (BIC: 262629.659)
* -MF --mrate E,I,G,I+G,R,I+R,H : GTR+F+H4 (BIC: 262104.525)
* -MF --mrate E,I,G,I+G,R,I+R,H --mtree : GTR+F+H4 (BIC: 262105.060)

These results suggest, as I have seen many times previously, that we should include as many of the RHAS models as possible during model selection, and that we should encourage users to use the --mtree option because the tree topology determines whether a given site is fast-evolving or slow-evolving. One option would be to use the -MF option with the inclusion of the H model and to create a new option called -MFA, which combines the -MF and --mtree options.

6. I agree it is time-consuming and expensive to incorporate and test the accuracy of new RHAS models, but it wouldn’t take a lot of time to revise the code so that:

* The -MF option included the following RHAS models: E, I, G, I+G, R, I+R, and H
* The *G, *R, and *H models be removed from the list above; my survey suggests they give misleading results, which is extremely unfortunate because I’d love to see them work efficiently and accurately

Enough for now.

All the best,
Lars
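The four BIC values quoted in the comparison of model-selection procedures can be put side by side as differences from the lowest value. The sketch below only does arithmetic on the numbers reported in that comment; nothing is rerun, the variable names are my own, and the procedure labels are the shorthand used in the comment rather than exact command lines.

```python
# BIC values copied from the comparison of the four procedures in the comment.
reported = {
    "-MF": ("GTR+F+I+R4", 262633.915),
    "-MF --mtree": ("GTR+F+I+R4", 262629.659),
    "-MF --mrate E,I,G,I+G,R,I+R,H": ("GTR+F+H4", 262104.525),
    "-MF --mrate E,I,G,I+G,R,I+R,H --mtree": ("GTR+F+H4", 262105.060),
}

# Lower BIC is better; find the procedure with the lowest reported value.
best_procedure = min(reported, key=lambda proc: reported[proc][1])
best_bic = reported[best_procedure][1]

# Difference of each procedure's BIC from the lowest one.
delta_bic = {proc: bic - best_bic for proc, (_model, bic) in reported.items()}

for proc, delta in sorted(delta_bic.items(), key=lambda item: item[1]):
    model = reported[proc][0]
    print(f"{proc:40s} {model:12s} delta BIC = {delta:8.3f}")
```

On these numbers the two GTR+F+H4 results differ by about 0.5 BIC units, while both are more than 500 units below either GTR+F+I+R4 result.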

bqminh (Maintainer) · 2024-06-14T14:08:32Z

To be honest, I object to including the +H model in the default ModelFinder. For most users this model is an over-parameterisation, unless they have a very long alignment.
I also object to encouraging the use of the -mtree option. Model selection is generally insensitive to the tree used. Doing a tree search for every single model is likely a waste of computing and energy.

1 reply

roblanf Jun 18, 2024
Maintainer

I agree here. The +H models are a special case that need to be (very) carefully examined whenever they are used. Ditto the tree search for every model. Given the information in the literature on the relatively minor differences that model selection makes to inference, I lean a long way in the other direction. We should test a small and sensible set of models, in a way that is quick and efficient. (While, of course, allowing users to do more computationally demanding things if they so choose, with options like -mtree).

To put it another way, I'd need to see some overwhelming evidence that phylogenetic inference is dramatically improved by things like -mtree, on a huge range of typical empirical datasets, before I would be convinced that it should be a default. The kind of thing that would start to convince me is if a large fraction of datasets showed differences in highly supported nodes when using that option.

lsjermiin (Author) · 2024-06-28T13:01:07Z

Just back from leave. OK, but excluding the +H model means that model selection only considers homotachous models of RHAS. For short alignments (with, say, 300 sites of amino acids or 900 sites of nucleotides) that may be OK, but for longer alignments, I doubt it.

What is the evidence behind saying that model “selection is generally insensitive to the tree used”? The original paper suggested otherwise.

lsjermiin (Author) · 2024-06-28T13:10:44Z

What literature are you referring to? Most of the literature that I am aware of on model selection only compares homotachous models of RHAS, so your argument does not convince me. The same applies to the --mtree option.

Lars
