
CAMI2 human microbiome benchmarking #147

Open · u-xixi opened this issue Mar 23, 2023 · 2 comments

u-xixi commented Mar 23, 2023

Hi vamb devs,

I have some small questions regarding the benchmarking in the vamb publication. I know this work was probably done some years back and the details blur; hopefully I'm not bothering you too much. I am benchmarking a few binners and used the CAMI2 human microbiome datasets with the pooled gold standard assembly (gsa_pooled). The total number of sequences in gsa_pooled (~210k) is less than half of the sum of the separate gsa (~520k).

Using AMBER to evaluate, I recovered far fewer genomes. For instance, AMBER reports 29 near-complete (NC) genomes for the GI samples in my vamb binning, but you got nearly 100 strains in the paper.

This was my command: `vamb --outdir $out_dir --fasta $gsa_pooled -m 2000 --bamfiles $(ls ${in_dir}*.bam) -o C --minfasta 30000`, following your recommendation of filtering out contigs under 2000 bp. I set `--minfasta` because vamb also recovers small bins, which is nice for users but maybe less advantageous for some evaluation metrics.

I know that you did a separate assembly for each sample in that paper and that this pooled assembly is not really the most recommended way to use vamb, but I still want to fully understand the results. May I ask: (1) How did you count the NC strains in the paper? AMBER certainly measures this differently from your method. (2) Could other factors play a role, such as an unrealistic contig length distribution in the pooled gsa? (3) How did you run the other tools, such as MetaBAT, to compare with vamb? They are not typically designed for separate assemblies, as I understand.

Thanks in advance,
Xixi

jakobnissen (Member) commented

Dear Xixi

You can reproduce our findings most easily with the CodeOcean capsule linked in the original paper. This capsule used Vamb 2.0.1. The technique for benchmarking and counting NC genomes is described in the paper (method section, "Benchmarking"), and can also be found in the code at vamb/benchmark.py for version 2.0.1.

I haven't run AMBER myself on the dataset, nor do I have much experience with it. From reading the paper, there are a few differences between AMBER and our method:

  • AMBER assigns each bin to exactly one genome based on the number of true positives, then computes recall and precision based on that genome/bin pair. Vamb computes all genome/bin pairs to check whether a genome has been reconstructed in any bin (see the sketch after this list).
  • AMBER takes as false negatives all of the genome positions not covered by a contig in the given bin. Vamb takes as false negatives only those genome positions not covered by a contig in the bin AND which are covered by a contig of another bin.
  • From the looks of it, AMBER appears to rely on mapping, whereas Vamb simply used the source positions of the contigs from the simulated dataset.
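
To make the first bullet concrete, here is a toy sketch of the two counting strategies. It is not the actual code of AMBER or vamb/benchmark.py; the inputs (`breadth`, a hypothetical per-bin map of how many bases each genome contributes, and `genome_sizes`) and the thresholds are illustrative assumptions, and the second bullet (how false negatives are defined) is not modelled here.

```python
from typing import Dict, Set

# Toy input (hypothetical): breadth[bin][genome] = bases of that genome covered by the bin.

def reconstructed_any_pair(breadth: Dict[str, Dict[str, int]],
                           genome_sizes: Dict[str, int],
                           min_recall: float = 0.9,
                           max_contamination: float = 0.05) -> Set[str]:
    """Vamb-like counting: a genome counts if *any* bin passes the thresholds for it."""
    found = set()
    for per_genome in breadth.values():
        bin_size = sum(per_genome.values())
        for genome, covered in per_genome.items():
            recall = covered / genome_sizes[genome]
            contamination = 1 - covered / bin_size
            if recall >= min_recall and contamination <= max_contamination:
                found.add(genome)
    return found

def reconstructed_best_bin(breadth: Dict[str, Dict[str, int]],
                           genome_sizes: Dict[str, int],
                           min_recall: float = 0.9,
                           max_contamination: float = 0.05) -> Set[str]:
    """AMBER-like counting (roughly): each bin is first assigned to its majority genome,
    and recall/contamination are evaluated only for that single pairing."""
    found = set()
    for per_genome in breadth.values():
        genome = max(per_genome, key=per_genome.get)  # the bin's one assigned genome
        covered = per_genome[genome]
        recall = covered / genome_sizes[genome]
        contamination = 1 - covered / sum(per_genome.values())
        if recall >= min_recall and contamination <= max_contamination:
            found.add(genome)
    return found
```

With strict thresholds the two often agree, but when bins and genomes do not pair up one-to-one (for example, closely related strains sharing bins) the counts can diverge noticeably.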

It's probably not unrealistic contig lengths. We benchmarked on the sample-wise gold standard assemblies, which also have an unrealistic contig length distribution, and got excellent results.

We ran MetaBAT2 with the same approach as Vamb - on single-sample assemblies. In Supplementary Figure 18, we show the result of running Vamb with the normal MetaBAT2 workflow. We also ran MetaBAT2 and then used Vamb's samplewise bin-splitting approach, but these results were not in the paper. The general picture is that Vamb is slightly better than MetaBAT2, and binsplitting makes it significantly better.
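
For reference, the bin-splitting idea itself is simple enough to sketch. The following is a minimal illustration, not Vamb's implementation, and it assumes contig names of the form `<sample><separator><contig>` (e.g. `S1C42` with `-o C`):

```python
from collections import defaultdict

def binsplit(bins: dict, separator: str = "C") -> dict:
    """Split each joint bin into one bin per sample, using the sample prefix
    encoded in every contig name before the separator."""
    out = {}
    for bin_name, contigs in bins.items():
        by_sample = defaultdict(list)
        for contig in contigs:
            sample = contig.split(separator, 1)[0]
            by_sample[sample].append(contig)
        for sample, sample_contigs in by_sample.items():
            out[f"{sample}{separator}{bin_name}"] = sample_contigs
    return out

# One joint bin with contigs from two samples becomes two sample-specific bins:
print(binsplit({"bin1": ["S1C10", "S1C42", "S2C7"]}))
# {'S1Cbin1': ['S1C10', 'S1C42'], 'S2Cbin1': ['S2C7']}
```

Because contigs from different samples never end up in the same final bin, closely related strains from different samples stay separated, which ties into the point about conflation below.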

I'm not sure why you get such relatively poor results for Vamb in your run. How do these compare to MetaBAT2? It's possible this is due to the many closely related genomes that are conflated when not using binsplitting - a major factor for the CAMI2 datasets.

u-xixi (Author) commented Mar 24, 2023

Hey Jakob,

thanks for the reply.

My overall impression was that the MetaBAT2 output was not as surprising. The number of recovered genomes is a bit lower, but does not contrast with your results that much. I would guess it has something to do with the pooled contigs. MetaBAT2 also seems to have quite consistent performance across all the datasets I used, multi-sample or single-sample: low contamination and a decent number of recovered genomes.

Vamb was comparable to MetaBAT2 in airway, skin and oral, but the results are poorer in GI and urog. The latter two are significantly smaller, as you probably also noticed.

I second your guess on the closely related genomes. The gsa_pooled has the contigs conflated, and that affects the performance of the tools. I'll just run them the binsplitting way to see the difference. And thanks for pointing out the CodeOcean capsule, but I think you attached the link of another project. This was what you put in the paper.

Another small finding is that there is little difference between the number of genomes with <5% and <10% contamination for vamb and MetaBAT2, but not for MaxBin. However, with the way AMBER computes the overall purity, MaxBin could score even better. But I tested with some other datasets that highlight the problem of closely related genomes: MaxBin always has more contamination and could not outperform MetaBAT2. I haven't tested vamb on them, because they are single samples that would not have a chance to use the binsplitting feature.

I guess in the end I can only say it depends on what the user pursues in the results. Purity and more high-quality bins are good, but may not always be well rewarded by the benchmarking metrics.
