-
Notifications
You must be signed in to change notification settings - Fork 194
Transcriptomic analyses
This wiki describes how to use vg rna
and related tools for transcriptomic analyses.
The full-name options in vg rna
(e.g. --transcripts
) do not work in vg v1.49.0 and older. For these versions please use instead the corresponding single letter options in vg rna
(e.g. -n
).
The splicing structure of a gene can be represented as a graph, similar to how genomic variant information can be represented using a pangenome graph. Here nodes and edges correspond to exons and splice-junctions, respectively. With transcripts represented as paths through the graph. Without the introns and intergenic regions these are also known as splicing graphs.
A spliced reference graph can be combined with a pangenome graph to produce a spliced pangenome graph containing both transcriptomic splicing information and genomic variant information.
Paths through this graph can still represent haplotypes and transcripts, but also now haplotype-specific transcripts.
vg rna
can be used to create a spliced pangenome graph by adding splice-junctions to an existing pangenome graph:
vg rna -p --threads <threads> --transcripts annotation.[gtf|gff3] graph.pg > spliced_graph.pg
Another approach to create a spliced pangenome graph is to use vg autoindex
. vg autoindex
will create a spliced pangenome graph and indexes needed for mapping using a best-practice pipeline:
vg autoindex --threads <threads> --workflow mpmap --prefix <output-prefix> --ref-fasta reference.fasta --vcf variants.vcf.gz --tx-gff annotation.[gtf|gff3]
See the Automatic indexing for read mapping and downstream inference wiki for more information on vg autoindex
.
By default, all reference contigs in the annotation (column 1) must be part of the graph as embedded reference paths. It is also possible to use haplotypes in a Graph Burrows-Wheeler Transform (GBWT) index as reference paths:
vg rna -p --threads <threads> --transcripts annotation.[gtf|gff3] --use-hap-ref --haplotypes haplotypes.gbwt graph.pg > spliced_graph.pg
For more information on how to construct a GBWT index of haplotypes see the VG GBWT Subcommand wiki describing vg gbwt
.
A pangenome graph in GFA format constructed using for example the Minigraph-Cactus pipeline version < v.2.3.0 should first be converted to GBZ format before running vg rna
:
vg gbwt -p --gbz-format --graph-name graph.gbz --gfa-input graph.gfa
The above step is not necessary with Minigraph-Cactus versions >= v2.3.0 as they can produce GBZ output directly with the --gbz
option.
In both cases, the GBZ graph can then be annotated with splicing information as follows:
vg rna -p --threads <threads> --transcripts annotation.[gtf|gff3] --use-hap-ref --gbz-format graph.gbz > spliced_graph.pg
Graph format: vg rna
supports any of the handle graph implementations and will use the same format for the graph output as the input (except for GBZ). It is, however, recommended that the PackedGraph format is used as it strikes a good balance between memory usage and graph edit speed. A graph can be converted to the PackedGraph format using vg convert -p
. If the graph and haplotype input is in GBZ format use --gbz-format
. When using GBZ the output graph will be in PackedGraph format.
Transcript annotation: vg rna
supports both the GTF and GFF3 transcript annotation format. By default only lines with the exon feature (column 4) will be parsed. This can be changed using --feature-type
. The attribute tag (column 9) that will be used as the transcript id/name can be changed using --transcript-tag
(default: transcript_id).
Intron database: Besides transcripts, a database of introns can also be added as splice-junctions to a graph. This can be done using --introns
. The input format is BED with the start and end being the intron boundaries. If no strand (column 6) is given it will default to +.
Transcript paths: Reference transcript paths can be added as embedded paths to the graph using --add-ref-paths
. Reference transcript paths are transcripts that follow the reference paths defined in the annotation (column 1).
Splicing graph: By default vg rna
will construct a spliced pangenome graph that includes the intergenic and intronic regions. If only the exonic regions (splicing graph) are of interest use --remove-non-gene
. Note that all existing embedded paths will be deleted (including the reference). It is therefore recommended that reference transcript paths are added to the graph using --add-ref-paths
(see above).
In a similar manner to the spliced pangenome graph we can also combine transcripts and haplotypes to construct a pantranscriptome. A pantranscriptome is a population-level transcriptomic reference consisting of a collection of haplotype-specific transcripts, which can be represented as paths in a spliced pangenome graph.
In addition to constructing spliced pangenome graphs, vg rna
can also be used to create pantranscriptomes. One way to create a pantranscriptome is to project (lift-over) reference transcript paths onto haplotypes in a GBWT index. This can be done using the following command:
vg rna -p --threads <threads> --transcripts annotation.[gtf|gff3] --haplotypes haplotypes.gbwt --write-gbwt pantranscriptome.gbwt --write-info pantranscriptome.txt graph.pg > spliced_graph.pg
The resulting pantranscriptome which consists of projected haplotype-specific transcript paths is written to a GBWT index using --write-gbwt
. Information on the transcript and haplotype origins of each transcript path in the pantranscriptome is written to a TSV file using --write-info
. Both files are needed for downstream analyses using rpvg (see Downstream analyses section).
It is often more efficient to run vg rna
twice if a GBWT index with many haplotypes is used as input for pantranscriptome construction using projection. In this three step pipeline a spliced pangenome graph is first constructed. Next, a haplotype GBWT index is created against this spliced pangenome graph using vg gbwt
. Finally, a pantranscriptome is constructed using the spliced pangenome graph and haplotype GBWT index. The reason for using this approach is that the GBWT index needs to be updated when splice-junctions are added to the graph, which can be really slow.
This three step process is automated by the vg autoindex
command (use rpvg
workflow to create a pantranscriptome):
vg autoindex --threads <threads> --workflow mpmap --workflow rpvg --prefix <output-prefix> --ref-fasta reference.fasta --vcf variants.vcf.gz --tx-gff annotation.[gtf|gff3]
It is also possible to create a pantranscriptome containing only reference haplotype-specific transcript paths:
vg rna -p --threads <threads> --transcripts annotation.[gtf|gff3] --use-hap-ref --haplotypes haplotypes.gbwt --write-gbwt pantranscriptome.gbwt --write-info pantranscriptome.txt graph.pg > spliced_graph.pg
Example using a graph and haplotypes in GBZ format as input:
vg rna -p --threads <threads> --transcripts annotation.[gtf|gff3] --use-hap-ref --gbz-format --write-gbwt pantranscriptome.gbwt --write-info pantranscriptome.txt graph.gbz > spliced_graph.pg
Projection: vg rna
supports both haplotypes that are stored in a GBWT index and haplotypes that are embedded as paths in the graph. To project (lift-over) transcripts onto all paths embedded in the graph use --proj-embed-paths
.
Collapsing: By default vg rna
will collapse identical haplotype-specific versions of the same transcript into a single transcript path. This can be changed using --path-collapse
, which can take three different values: "no" disables collapsing (keep all versions), "haplotype" is the default and "all" enables collapsing across both different transcripts and haplotypes (compares all-against-all).
Transcript paths: Projected haplotype-specific transcript paths can be added as embedded paths to the spliced pangenome graph using --add-hap-paths
. This is generally not recommended for really large path sets.
Pantranscriptome: A bidirectional pantranscriptome GBWT index, storing each transcript path in both directions, can be created using --gbwt-bidirectional
. This can make downstream analyses using rpvg (see Downstream analyses section) more efficient if the reads are not strand-specific. By default vg rna
will output both reference and projected transcript paths as part of the pantranscriptome. Reference transcript paths can be excluded using --out-exclude-ref
.
All of the standard tools in the vg toolkit also work on spliced pangenome graphs. However, some tools have been optimized or designed specifically for transcriptomic analyses.
To map RNA-seq reads to a spliced pangenome graph we recommend using vg mpmap
as it has been specifically optimized for RNA-seq data. More information on how to run vg mpmap
with RNA-seq data can be found in the Multipath alignments and vg mpmap wiki.
rpvg can be used to infer the expression of the transcripts in a pantranscriptome. While not specifically part of the vg toolkit it is made by the same developers and uses the output from vg rna
and vg mpmap
. rpvg takes as input a spliced pangenome graph, read alignments in either gam or gamp format and a pantranscriptome GBWT index with transcript paths.