File Types
Adam Novak edited this page Dec 9, 2022 · 15 revisions
The vg ecosystem uses a lot of file formats. Some are new and not consistently used yet, and some are old and still required for some less-popular operations.
Some of these are described in more detail at File Formats and Index Types.
These formats store genome references that define spaces in which genomics can be done.
Name | Description | Extension | Status | Notes |
---|---|---|---|---|
VG Protobuf | The original vg graph format | .vg | Conditionally useful | Can also be used to store paths without the nodes and edges they belong to. Default output format of vg construct in vg v1.40.0, since it can be generated incrementally. Can be concatenated with cat. Usually block-GZIP compressed, but some old files aren't. Consists of count-prefixed groups of length-prefixed Protobuf messages, where a string type tag takes the place of the first message in each group. |
GBZ | "GBZ" graph, a compressed format storing a graph as traversed by sample haplotypes | .gbz | Recommended | Stores not only the graph but also large numbers of haplotypes, so you don't need an additional GBWT file. Internally, stores a GBWT and a GBWTGraph. Can't store edges that are not followed. |
GFA | Graphical Fragment Assembly: a text-based format for storing graphs and their embedded paths. | .gfa | Recommended for interchange | vg uses GFA 1.x and doesn't really support GFA 2. |
HashGraph | Graph format based on a hashtable, from libbdsg. | .hg, .vg | Recommended | Default output format of many vg subcommands as of v1.40.0. |
PackedGraph | Graph format based on succinct data structures, from libbdsg. | .pg, .vg | Recommended for large graphs | This format can store graphs in less space than HashGraph, but is also slower and more complicated. |
Memory-Mapped PackedGraph | A version of PackedGraph that can be incrementally read from disk | .mpg? | Experimental | Might not actually be adopted; GBZ solves a different but often more important problem. |
ODGI (vg flavor) | "Optimized Dynamic Genome/Graph Implementation" format. | .odgi | Removed | vg used to support a version implemented in libbdsg, which was NOT wire-compatible with the version implemented in the odgi project, and so was removed. |
XG | Compressed, immutable graph format. Doesn't really stand for anything. | .xg | Conditionally useful | PackedGraph may be better, but many tools reference "xg files" as historically this was the only practical format for whole-genome graphs. |
GBWTGraph | Supplemental information to turn a GBWT into a graph. | .gg | Conditionally useful | Stores node sequences only; defines a graph when used together with a GBWT file of haplotypes. |
VG JSON | This is the VG Protobuf format, with the Protobuf Graph objects represented as JSON. | .json | Conditionally useful | Useful for exporting small graphs for analysis with jq, or importing graphs from tools that can't use libbdsg or libvgio. Generally GFA should be used instead. |
Indexed VG Protobuf | This is the VG Protobuf format, stored in a sorted order with an auxiliary index file for random access. | .sorted.vg | Deprecated | Was never very popular, and Memory-Mapped PackedGraph is intended as a replacement. |
FASTA | "FASTA" format for storing DNA sequences | .fa, .fasta, .fna | Recommended for linear references | This is a linear genome reference format that vg construct can consume. |
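
The VG Protobuf framing described in the table (count-prefixed groups of length-prefixed messages, with a string type tag in place of the first message) can be illustrated with a small reader. This is only a sketch under stated assumptions, not libvgio's implementation: it assumes counts and lengths are Protobuf-style base-128 varints, that the group count includes the type tag, and that the stream has already been decompressed. The function names and the tags used in the demo are hypothetical.

```python
import io


def read_varint(stream):
    """Read one base-128 varint; return None at a clean end of stream."""
    result = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            if shift == 0:
                return None
            raise EOFError("stream ended inside a varint")
        result |= (byte[0] & 0x7F) << shift
        if not (byte[0] & 0x80):
            return result
        shift += 7


def read_groups(stream):
    """Yield (type_tag, [message_bytes, ...]) for each group.

    Assumption: the count covers the tag plus the messages that follow it.
    """
    while True:
        count = read_varint(stream)
        if count is None:
            return
        items = []
        for _ in range(count):
            length = read_varint(stream)
            if length is None:
                raise EOFError("stream ended inside a group")
            data = stream.read(length)
            if len(data) != length:
                raise EOFError("stream ended inside a message")
            items.append(data)
        if items:
            # The first item is the type tag; the rest are still-encoded messages.
            yield items[0].decode("ascii"), items[1:]


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low = value & 0x7F
        value >>= 7
        if value:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


def encode_group(tag, messages):
    """Build one group (demo helper, used here only to make test data)."""
    items = [tag.encode("ascii")] + list(messages)
    body = b"".join(encode_varint(len(item)) + item for item in items)
    return encode_varint(len(items)) + body


# Demo with made-up tags and payloads, not real vg data.
demo = io.BytesIO(
    encode_group("VG", [b"\x01\x02", b"x" * 300]) + encode_group("GAM", [])
)
groups = list(read_groups(demo))
```

Because each group is self-delimiting, a reader like this never needs to know the total file length in advance, which fits the note that such files can be concatenated with cat. Decoding the message bytes themselves would require the Protobuf schema and is outside this sketch.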
These formats store short or long reads from DNA sequencing machines, and can describe how they fit into references.
Name | Description | Extension | Status | Notes |
---|---|---|---|---|
GAM Protobuf | Graph Alignment/Map, vg's main format for aligned reads | .gam | Recommended | |
GAF | Graph Alignment Format, a text-based format for aligned reads | .gaf | Recommended for interchange | |
Sorted GAM | GAM file with reads sorted by graph node ID. Useful for random access with an index. | .sorted.gam | Recommended | |
GAM JSON | JSON version of the Protobuf GAM format. | .json | Conditionally useful | Used for analyzing reads with jq. |
GAMP Protobuf | Multi-path alignment version of GAM | .gamp | Recommended | |
GAMP JSON | JSON version of GAMP format | .json | Conditionally useful | |
BAM | Binary Alignment/Map format for alignments against a linear reference | .bam | Recommended | |
SAM | Sequence/Alignment Map format, a text-based version of BAM | .sam | Recommended | |
FASTQ | Version of FASTA with per-base quality scores. Used for unaligned reads. | .fq, .fastq | Recommended | |
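
To make "FASTA with per-base quality scores" concrete, here is a minimal FASTQ reader sketch. It assumes the common four-lines-per-record layout (no wrapped sequence lines) and Phred+33 quality encoding; neither assumption comes from this page, and the function name is hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ for: " + header)
        # The read name is the header text up to the first space.
        name = header[1:].partition(" ")[0]
        # Assumption: Phred+33 encoding, so '!' is quality 0 and 'I' is quality 40.
        yield name, sequence, [ord(char) - 33 for char in quality]


# Demo on an in-memory record rather than a real sequencing file.
records = list(read_fastq(io.StringIO("@read1 demo\nGATTACA\n+\nIIIIII!\n")))
```

Each base gets exactly one quality value, which is the only structural difference from a FASTA record that this sketch relies on.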
These formats can describe individual people or other organisms and how their genomes fit into or differ from references.
Name | Description | Extension | Status | Notes |
---|---|---|---|---|
GBWT | Graph Burrows-Wheeler Transform file, storing haplotypes for samples | .gbwt | Conditionally useful | It sometimes makes more sense to use a GBZ. |
GBZ | See above under Reference Formats | | | |
VCF | Variant Call Format file, storing sample genotypes and haplotypes against a linear reference | .vcf, .vcf.gz | Recommended | Not all VCF 4.3 features are supported by vg. |
Pack File | Stores read information as counts of visited graph elements | .cx | Recommended | |
Pileup Protobuf | Stores read information as counts of visited graph elements | .pileup? | Deprecated | |
Pileup JSON | JSON version of the Pileup Protobuf format | .json | Deprecated | |
Locus Protobuf | Stores genotypes against a graph reference | .loci | Experimental | |
Locus JSON | JSON version of the Locus Protobuf format | .json | Deprecated | |
These formats store other kinds of information, or are precomputed indexes to speed up operations on other data.
Name | Description | Extension | Status | Notes |
---|---|---|---|---|
Distance Index (v1) | Index for computing distances between points in a graph | .dist | Recommended | Used in vg giraffe |
Distance Index (v2) | Index for computing distances between points in a graph | .dist | Experimental | |
GCSA | Generalized Compressed Suffix Array, version 2, for finding substrings in a graph | .gcsa | Recommended | Used in vg map and vg mpmap |
Minimizer Index | Used to find "minimizer" substrings in a graph | .min | Recommended | Used in vg giraffe |
BED | Browser Extensible Data format, used for defining regions | .bed | Recommended | |
Dot | GraphViz input format | .dot | Conditionally useful | vg view -d can export graphs in Dot format for visualization with GraphViz's dot tool. |
Snarl Protobuf | Hierarchical decomposition of a graph into variable sites, called "snarls" | .snarls | Recommended | |
Snarl JSON | JSON representation of Protobuf snarl data | .json | Conditionally useful | |
SnarlTraversal Protobuf | Binary representation of possible paths through snarls | .trav? | Conditionally useful | |
SnarlTraversal JSON | Text representation of possible paths through snarls | .json | Conditionally useful | |
Node ID Translation | Recorded information about changes made to nodes while modifying a graph | .trans | Conditionally useful | |
VG Protobuf Index | Index over a sorted VG Protobuf file | .vgi | Experimental | |
GAM Index | Index over a sorted GAM Protobuf file | .gai, .gam.index | Recommended | Useful for vg chunk to fetch out reads for a particular region |
FASTA Index | Index over a FASTA file for random access | .fai | Recommended | |
BAM Index | Index over a sorted BAM file for random access | .bai | Recommended | |
Tabix VCF Index | Index over a sorted, compressed VCF file for random access | .tbi | Recommended | |
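
Several extensions in these tables are shared between formats (.vg, .dist, and especially .json), so a file name alone cannot always identify a format. The sketch below shows one way to look up candidate formats by extension. The dictionary is a hand-copied subset of the tables on this page, and the function name is hypothetical; vg itself does not necessarily work this way.

```python
# Subset of the extension-to-format associations listed in the tables above.
EXTENSION_FORMATS = {
    ".vg": ["VG Protobuf", "HashGraph", "PackedGraph"],
    ".sorted.vg": ["Indexed VG Protobuf"],
    ".gbz": ["GBZ"],
    ".gfa": ["GFA"],
    ".hg": ["HashGraph"],
    ".pg": ["PackedGraph"],
    ".xg": ["XG"],
    ".gg": ["GBWTGraph"],
    ".gam": ["GAM Protobuf"],
    ".sorted.gam": ["Sorted GAM"],
    ".gamp": ["GAMP Protobuf"],
    ".gbwt": ["GBWT"],
    ".vcf": ["VCF"],
    ".vcf.gz": ["VCF"],
    ".dist": ["Distance Index (v1)", "Distance Index (v2)"],
    ".gcsa": ["GCSA"],
    ".min": ["Minimizer Index"],
    ".snarls": ["Snarl Protobuf"],
    ".gai": ["GAM Index"],
    ".gam.index": ["GAM Index"],
}


def candidate_formats(filename):
    """Return the formats that the file's extension could indicate.

    The longest matching suffix wins, so "x.sorted.vg" maps to the indexed
    format rather than to the plain ".vg" candidates.
    """
    name = filename.lower()
    matches = [ext for ext in EXTENSION_FORMATS if name.endswith(ext)]
    if not matches:
        return []
    return EXTENSION_FORMATS[max(matches, key=len)]


# A ".vg" file could be any of three formats, so content sniffing would still be needed.
plain_vg = candidate_formats("graph.vg")
sorted_vg = candidate_formats("graph.sorted.vg")
```

Returning a list rather than a single name keeps the ambiguity visible to the caller instead of silently picking one format.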