Skip to content

CZID mNGS Pipeline Stage #3: Reporting and Visualization

Ryan Lim edited this page Jan 19, 2022 · 1 revision

Reports

The report for each sample returns a number of metrics for each taxonomic ID. The following table defines the metrics presented in the Sample Report.

Report Metric Definition
Score Experimental ranking score for prioritizing microbes based on abundance within the sample (rPM) as well as compared to control samples (z-score); computed as ((abs(genus NT Z) * species NT Z * species NT rPM) + (abs(genus NR Z) * species NR Z * species NR rPM))
Z Z-score statistic, used for evaluating prevalence of microbes in your sample as compared to background contaminants. The Z-score computed based on the specified background model.
rPM Number of reads aligning to the taxon in the NCBI NT/NR database, per million reads sequenced
r Number of reads aligning to the taxon in the NCBI NT/NR database
contig Number of assembled contigs aligning to the taxon in the NCBI NT/NR database
contig r Total number of reads aligning to all assembled contigs for this taxon
%id Average percent-identity of alignments to NCBI NT/NR
L Average length of the local alignment for all contigs and reads assigned to this taxon
log(1/E) Average log10 transformed expect value (e-value) of alignments to NCBI NT/NR

Background Models

The z-score provides an estimate of the relative abundance of a taxon in a sample, given the prevalence of those samples in the background. The background model can be specified at the top of the report viewer.

NT v. NR Counts

The NCBI NT database contains a collection of nucleotide sequences from several sources, including GenBank, RefSeq, TPA and PDB. Genome, gene and transcript sequence data provide the foundation for biomedical research and discovery. The NCBI NR database is the non-redundant protein database. For interpreting results, NT generally provides higher sensitivity. However, for viruses which relatively high mutation rates, it is possible that reads will not map to NT but instead to NR.


Samples Table

The samples table contains a series of metrics to summarize the quality of individual samples. These include relevant metadata fields as well as QC metrics obtained during the Host Filtration and QC steps of the pipeline. The following table defines each of the values in the Samples Table. Default columns are listed first, but the user may choose to view additional columns, marked with an asterisk*.

Samples Table Metric Definition
Sample The user-defined sample name.
Uploaded On The date on which the sample was initially uploaded to IDseq.
Host User-selected organism from which this sample was collected; this value is selected by the user at sample upload and dictates which genomes are used for inital host subtraction pipeline steps.
Location User-defined location from which the sample was collected.
Passed Filters The percentage of reads that came out of step (8) of the host filtration and QC steps as compared to what went in at step (1).
Passed QC The percentage of reads that came out of PriceSeq, step (3) of the host filtration and QC steps, compared to what went in to Trimmomatic, step (2).
Total Reads* The total number of reads uploaded.
DCR* Duplicate Compression Ratio; The ratio of sequences present prior to running cd-hit-dup (duplicate removal) vs after duplicate removal. High values indicate the presence of more duplicate reads, indicating lower library complexity.
ERCC reads* The total number of reads aligning to ERCC (External RNA Controls Consortium) sequences.
Notes* User-supplied notes.
Nucleotide Type* User-selected metadata field indicating the nucleotide type (RNA, DNA).
Sample Type* User-supplied metadata field indicating the sample type.
SubSampled Fraction* After host filtration and QC, the remaining reads are subsampled to 1 million fragments (2 million paired reads). This field indicates the ratio of subsampled reads to total reads passing host filtration and QC steps.
Total Runtime* The total time required by the IDseq pipeline to process .fastq files into IDseq reports.

ERCC quality interpretation

The ERCC reads and total reads are all displayed in "Details" tab in the browser. Total reads can also be obtained from the project sample_table.csv. If you need to programatically obtain the ERCC reads, they can be gotten from the reads_per_gene.star.tab results file as follows:

grep "^ERCC" reads_per_gene.star.tab | awk '{sum += $2} END {print 2*sum}'

That is, filter to all rows starting with "ERCC", sum the second column, then multiply by 2.

The total input RNA (in picograms) for a sample can be estimated using ERCC spike-ins by:

input_pg = ercc_pg / ercc_reads * (total_reads - total_ercc_reads)

How does a user view the raw data about the ERCC results? Please ask IDseq support for the gene counts for the project. This includes the ERCC information.

What is the best way to get a table of ERCC counts for all samples in a project? ERCC counts are in the project sample table, column total_ercc_reads.


Phylogenetic Tree Visualization

How is the phylogenetic tree built? The tree topology is estimated in a reference-free manner using the kSNP3 package. The kSNP method is explained in more detail in this manuscript. It identifies k-mers that are shared between samples, except for the central character in the k-mer. Differences in the central characters of k-mers are counted as SNPs separating the samples. We use k=13 for viruses and k=19 for bacteria. Once a distance matrix has been estimated in this way, the phylogenetic tree is built using the principle of maximum parsimony.

How should branch lengths be interpreted? It is a peculiarity of the kSNP3 method that branch lengths only have relative meanings, not absolute ones. Branch lengths should not be used for estimation of divergence times, only for estimation of tree topology.

How are NCBI references chosen? For each sample, we recover the NCBI accession that had the most reads mapped to it during GSNAP alignment (within the chosen species/genus). We include that NCBI accession on the tree for context. In principle, one NCBI reference will be added per sample, but it is possible that multiple samples share the same top NCBI match.


FAQ

Q: Why might I be seeing many reads to NR but few reads to NT?

A: Especially for viral mutations, you may see many NR matches (still generating matched proteins) but no NT matches.

Q: Why might I be seeing many reads to NT but few reads to NR?

A: You may notice the unusual profile of, say, 100K+ reads aligning to a virus by NT and nothing by NR. This can be explained by all these reads mapping to a non-coding region of a virus (which is a little weird in the context of viruses, but a real result. Meanwhile, this is a more common case for bacteria and eukaryotes). In this example, you may see a separate set of hits with opposite pattern (0-very low NT aligned reads & crazy high # reads aligned to NR) but this is only seen on IDseq report page (for both reads and contig results). The contig summary file indicates large contigs with high % sequence identity at nt level and high % contig covered. In these cases you may want to focus on the one with a significantly lower e-value (indicating higher significance). [Thanks AK]

Q: Why don't my reads NT_r sum up to the total expected after host filtering?

A: There are three reasons why this may occur:

  1. The reads didn't align at all and therefore are in the unaligned_fasta.
  2. The reads align to a blacklist accession/taxid (cloning vectors, etc.) and therefore are given -1 in the hitsummary2 and won't be counted towards total.
  3. The reads align to deuterosomes and therefore are filtered out prior to displaying the report.