Possible mislabelling of Q.bird substitution model #186

lsjermiin · 2024-05-02T09:03:49Z

Hi,
I'm analysing an alignment of 488 nuclear-encoded amino acids of mammalian origin (48 sequences; 120,548 sites) and I am using ModelFinder with an extended set of substitution models. Surprisingly, Q.bird is favoured over Q.mammal, which seems rather odd, given that the data are of mammalian origin.

This leads to my question: are some of these substitution models mislabelled?

Regards,
Lars

bqminh (Maintainer) · 2024-05-02T13:18:38Z

I'm sure it's not mislabeling. There have been some discussions about that in the Google group. Q.bird and Q.mammal are actually quite similar (see the PCA in Figure 3 of the QMaker paper: https://academic.oup.com/sysbio/article/70/5/1046/6146362). I'd guess that for your dataset Q.bird happens to fit better than Q.mammal (maybe only slightly), but that might not be the case for other datasets.

Suggestion: because you seem to have a lot of data, I'd recommend estimating a Q matrix for your own dataset and using it to infer a tree. This is what we suggested in QMaker.
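For readers who have not done this before, here is a minimal sketch of what that two-step estimation can look like. It is written from memory of the IQ-TREE tutorial on estimating amino-acid substitution models, so please check the flags against the current documentation. `my_loci` is a hypothetical directory of single-locus alignments, and `run` and `qmaker` are illustrative helpers that only print the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# $1 is a hypothetical directory (or partition file) of per-locus alignments.
qmaker() {
  local prefix="$1"

  # Step 1: choose a starting model and infer a separate tree for each locus.
  run iqtree2 -seed 1 -T AUTO -S "$prefix" -mset LG,WAG,JTT -cmax 4 -pre "$prefix"

  # Step 2: estimate one reversible matrix jointly across all loci,
  # keeping the step-1 trees fixed and starting the optimisation from LG.
  run iqtree2 -seed 1 -T AUTO -S "$prefix.best_model.nex" -te "$prefix.treefile" \
    --model-joint GTR20+FO --init-model LG -pre "$prefix.GTR20"
}

plan="$(qmaker my_loci)"
printf '%s\n' "$plan"
```

The tutorial then extracts the estimated matrix from the report of the second run and supplies it as a user-defined model for tree inference.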

1 reply

roblanf May 3, 2024
Maintainer

I double-checked and @bqminh is right: no mislabelling. I'll post the code I used, because it can help with checking any model one might estimate.

To check, I randomly sampled 100 loci from the datasets we used to estimate the mammal and bird models and ran ModelFinder on those loci like this (the NEXUS file contains both the partitions and the sequences):

iqtree2 -s alignment.nex -S alignment.nex --prefix 100 -T 32 -m TESTONLY --subsample 100 --subsample-seed 1

Then I counted how many of the 100 loci each model fitted best, like this:

grep '^ *[^ ]\+:' 100.best_scheme.nex | awk -F: '{print $1}' | awk '{print $NF}' | cut -d'+' -f1 | sort | uniq -c | sort -nr
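To make the stages of that pipeline easier to follow, here is a small sketch that runs it on a toy file. The toy file is only a guess at the general shape of a `best_scheme.nex` file; the locus names, coordinates and model assignments are made up, and `count_best_models` is an illustrative wrapper. GNU grep is assumed for the `\+` in the pattern.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy stand-in for 100.best_scheme.nex (all names and models are made up).
scheme="$(mktemp)"
cat > "$scheme" <<'EOF'
#nexus
begin sets;
  charset locus1 = alignment.nex: 1-300;
  charset locus2 = alignment.nex: 301-650;
  charset locus3 = alignment.nex: 651-900;
  charset locus4 = alignment.nex: 901-1200;
  charpartition mymodels =
    Q.bird+G4: locus1,
    Q.bird+I+G4: locus2,
    Q.mammal+G4: locus3,
    Q.bird+F+G4: locus4;
end;
EOF

count_best_models() {
  grep '^ *[^ ]\+:' "$1" |     # keep lines of the form "<model>: <locus>"
    awk -F: '{print $1}' |     # drop everything after the first colon
    awk '{print $NF}' |        # strip the indentation, keep the model string
    cut -d'+' -f1 |            # drop +G4, +I, +F, ... to leave the base matrix
    sort | uniq -c | sort -nr  # tally each matrix and rank by count
}

counts="$(count_best_models "$scheme")"
printf '%s\n' "$counts"   # Q.bird (count 3) is listed before Q.mammal (count 1)
rm -f "$scheme"
```

The `charset` lines are skipped because the first word on those lines is followed by a space rather than a colon, so only the model assignments are counted.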

The results. For the 100 bird loci:

     57 Q.bird
     23 Q.mammal
     10 Q.plant
      4 JTTDCMut
      2 JTT
      2 Dayhoff
      1 WAG
      1 LG

And for the 100 mammal loci:

     56 Q.mammal
     23 Q.bird
     12 Q.plant
      7 JTT
      1 Q.insect
      1 Blosum62

So, the models are best fit to the right datasets. But as Minh points out, the bird model is second best on the mammal loci (by the number of loci for which it is the best fit), and vice versa.

Rob

Answer selected by roblanf

bqminh (Maintainer) · 2024-05-02T13:23:41Z

PS: Buy me a coffee if you think it's helpful :-) https://buymeacoffee.com/bqminh

lsjermiin (Author) · 2024-05-02T19:55:37Z

Hi Minh,

Ah, yes, the two models are placed quite close to one another in Fig. 3 of that paper. The difference between Q.bird and Q.mammal is huge (2973.918 BIC), which is partly why I was perplexed.

Your suggestion is sensible. I ran the analysis last year and got a vastly improved estimate. I also tried the NQ models for the same data, but IQ-TREE 2 aborted with an error. I uploaded the error to GitHub on 12 Oct 2023, but I don't think anyone has found the bug yet.

All the best,
Lars
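For scale, here is a rough back-of-the-envelope conversion of that BIC difference. It assumes the two compared models have the same number of free parameters k (for example, Q.bird and Q.mammal combined with identical rate-heterogeneity and frequency settings, which the thread does not state) and uses the 120,548 sites from the original post.

```latex
\begin{align*}
\mathrm{BIC} &= -2\ln L + k\ln n \\
\Delta\mathrm{BIC} &= \mathrm{BIC}_{\text{Q.mammal}} - \mathrm{BIC}_{\text{Q.bird}}
  = 2\,\bigl(\ln L_{\text{Q.bird}} - \ln L_{\text{Q.mammal}}\bigr)
  \quad \text{(the $k\ln n$ terms cancel)} \\
\ln L_{\text{Q.bird}} - \ln L_{\text{Q.mammal}} &= \frac{2973.918}{2} \approx 1486.96 \\
\frac{1486.96}{120{,}548\ \text{sites}} &\approx 0.0123\ \text{log-likelihood units per site}
\end{align*}
```

Under those assumptions, the gap is large in total but modest per site.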

1 reply

bqminh May 3, 2024
Maintainer

Re the bug: did you post the bug report in GitHub issues or in the Google group? If it was the Google group, can you please create a GitHub issue and send the data file to @thomaskf? (I haven't had time to check the Google group frequently, but this GitHub discussion looks pretty neat.) Note that we can answer questions pretty quickly, but bugs take a lot more time.
