
Key Limitations

Gavin Douglas edited this page Jun 17, 2019 · 8 revisions

There are several limitations to keep in mind when analyzing PICRUSt2 output, which are mainly related to predictions being limited to the gene contents of existing reference genomes.

  • The accuracy on any given sample type will depend heavily on the availability of appropriate reference genomes. You can partially assess this problem by computing the per-ASV and sample-weighted nearest-sequenced taxon index (NSTI) values, which will give you a rough idea of how well represented your ASVs are in the reference database (see tutorial). However, 16S rRNA gene sequences do not typically enable resolution of strain variation within a species. Strains of a prokaryotic species can vary remarkably in gene content, so the predictions should always be taken with a grain of salt.

  • A related issue is that certain environments are better represented by reference genomes than others. For instance, PICRUSt2 is expected to perform better on 16S sequences from the human gut than on those from the cow rumen, even if the actual 16S sequences themselves are very similar. The reason for this is that many important rumen-specific enzymes will be missing from the default reference genomes. One potential solution to this problem is to create a custom reference database of genomes specific to your environment of interest for making predictions. It is worth noting that our validations on non-human-associated environments indicate that the overall predictions perform better than random, but we nonetheless expect that many niche-specific functions will be poorly represented.

  • By default, input sequences with NSTI values above 2 will be excluded from the analysis. This could affect some samples much more than others, so it is worth checking what proportion of the community relative abundance was excluded per sample (typically this is extremely little).

  • PICRUSt can only predict genes that are in the input function tables (which correspond to KEGG orthologs and Enzyme Commission numbers by default). Although these gene families are useful, they typically represent a small proportion of metagenome genetic variation.

PICRUSt output that maps gene families to putative functions or pathways is purely based on the particular input reference used. Therefore, any gaps or inaccuracies in pathway annotation or assignments of gene function will still be present. As an example, many KEGG Orthology groups are listed as participating in pathways not found in bacteria or otherwise not reflective of true function. In many cases this is simply due to bacteria containing (distant) homologs of enzymes with important roles in, for example, mammalian pathways. Therefore, it is worth carefully checking KEGG pathway annotations to ensure that they are reasonable for your system.

  • Since the input data for the standard PICRUSt workflow is 16S rRNA gene sequences, any eukaryotic or viral contributions to the metagenome will not be predicted. Therefore, it is best to think of PICRUSt as predicting the portion of the full metagenome contributed by the organisms targeted by your primers.

Biased primers may result in inaccurate predictions. Only genes from organisms amplified by your primers will be included. The main use for PICRUSt is in taking 16S rRNA data and predicting a metagenome from that data using evolutionary modelling of how gene content has changed relative to sequenced genomes. Therefore, PICRUSt can only predict the portion of the metagenome that is contributed by the set of organisms picked up by your primers. The PICRUSt validation datasets used universal 515f/806r V4 16S rRNA primers designed to minimize (though not eliminate) bias across bacterial/archaeal taxonomy (Caporaso et al 2011, Walters et al 2011). If the primers used don’t amplify an organism, then that organism’s contribution to the metagenome is not predicted. As an example, many popular 16S rRNA primers, including 27F/338R, did not work well for amplifying Verrucomicrobia (see Bergmann et al 2011). If your sample had a large proportion of Verrucomicrobia in it and you used such a primer set, then the metagenome predicted by PICRUSt would underestimate the genes contributed by Verrucomicrobia.
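
The two NSTI checks mentioned earlier (the sample-weighted NSTI and the share of each sample's abundance lost to the default cutoff of 2) can be sketched in a few lines of plain Python. This is a minimal illustration and not PICRUSt2's own implementation: the function names, the toy ASV counts, and the NSTI values below are all hypothetical, and in practice you would parse these values from your ASV abundance table and the per-ASV NSTI values reported for your run.

```python
# Hypothetical helpers illustrating the NSTI checks; not part of PICRUSt2.
NSTI_CUTOFF = 2.0  # default exclusion threshold described above


def weighted_nsti(abundances, nsti):
    """Abundance-weighted mean NSTI across the ASVs of one sample."""
    total = sum(abundances.values())
    if total == 0:
        return float("nan")  # undefined for an empty sample
    weighted_sum = sum(count * nsti[asv] for asv, count in abundances.items())
    return weighted_sum / total


def excluded_fraction(abundances, nsti, cutoff=NSTI_CUTOFF):
    """Fraction of a sample's abundance from ASVs with NSTI above the cutoff."""
    total = sum(abundances.values())
    if total == 0:
        return 0.0
    excluded = sum(
        count for asv, count in abundances.items() if nsti[asv] > cutoff
    )
    return excluded / total


# Toy data (assumed for illustration only): per-ASV NSTI and per-sample counts.
nsti = {"ASV1": 0.02, "ASV2": 0.5, "ASV3": 2.5}
samples = {
    "sampleA": {"ASV1": 90, "ASV2": 10, "ASV3": 0},
    "sampleB": {"ASV1": 50, "ASV2": 25, "ASV3": 25},
}

for name, counts in samples.items():
    print(
        name,
        "weighted NSTI:", round(weighted_nsti(counts, nsti), 3),
        "excluded fraction:", round(excluded_fraction(counts, nsti), 3),
    )
```

In this toy example, sampleA loses nothing to the cutoff while sampleB loses a quarter of its abundance. That is the kind of per-sample imbalance worth checking before comparing predictions across samples.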