File Types
Adam Novak edited this page May 27, 2022 · 15 revisions
The vg ecosystem uses many file formats. Some are new and not yet used consistently, and some are old but still required for a few less-popular operations.
Some of these are described in more detail at Index Types.
These formats store genome references, which define the coordinate spaces in which genomic analysis is done.
Name | Description | Extension | Status | Notes |
---|---|---|---|---|
VG Protobuf | The original vg graph format | `.vg` | Conditionally useful | Can also be used to store paths without the nodes and edges they belong to. Default output format of `vg construct` in vg v1.40.0, since it can be generated incrementally. Can be concatenated with `cat`. Usually block-GZIP compressed, but some old files aren't. |
GBZ | "GBZ" graph, a compressed format storing a graph as traversed by sample haplotypes | `.gbz` | Recommended | Stores not only the graph but also large numbers of haplotypes, so you don't need an additional GBWT file. Internally, stores a GBWT and a GBWTGraph. Can't store edges that are not followed. |
GFA | Graphical Fragment Assembly: a text-based format for storing graphs and their embedded paths. | `.gfa` | Recommended for interchange | vg uses GFA 1.x and doesn't really support GFA 2. |
HashGraph | Graph format based on a hash table, from libbdsg. | `.hg`, `.vg` | Recommended | Default output format of many vg subcommands as of v1.40.0. |
PackedGraph | Graph format based on succinct data structures, from libbdsg. | `.pg`, `.vg` | Recommended for large graphs | This format can store graphs in less space than HashGraph, but is also slower and more complicated. |
Memory-Mapped PackedGraph | A version of PackedGraph that can be incrementally read from disk | `.mpg` ? | Experimental | Might not actually be adopted; GBZ solves a different but often more important problem. |
ODGI (vg flavor) | "Optimized Dynamic Genome/Graph Implementation" format. | `.odgi` | Not recommended | vg uses the version implemented in libbdsg, which is NOT wire-compatible with the version implemented in the odgi project. |
GBWTGraph | Supplemental information to turn a GBWT into a graph. | `.gg` | Conditionally useful | Stores node sequences only; defines a graph when used together with a GBWT file of haplotypes. |
VG JSON | This is the VG Protobuf format, with the Protobuf Graph objects represented as JSON. | `.json` | Conditionally useful | Useful for exporting small graphs for analysis with `jq`, or importing graphs from tools that can't use `libbdsg` or `libvgio`. Generally GFA should be used instead. |
Indexed VG Protobuf | This is the VG Protobuf format, stored in a sorted order with an auxiliary index file for random access. | `.sorted.vg` | Deprecated | Was never very popular, and Memory-Mapped PackedGraph is intended as a replacement. |
FASTA | "FASTA" format for storing DNA sequences | `.fa`, `.fasta`, `.fna` | Recommended for linear references | This is a linear genome reference format that `vg construct` can consume. |
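As a rough illustration of what the text-based GFA interchange format holds, here is a minimal Python sketch that reads the segment (`S`), link (`L`), and path (`P`) records of a small GFA 1 graph and spells out an embedded path. The names `GfaGraph`, `parse_gfa1`, `path_sequence`, and `EXAMPLE_GFA` are hypothetical and are not part of vg or libbdsg. The sketch assumes all overlaps are `0M` and ignores optional tags and every other record type.

```python
from dataclasses import dataclass, field

# Translation table used to reverse-complement segments visited in "-" orientation.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


@dataclass
class GfaGraph:
    """Tiny in-memory view of a GFA 1 file (illustrative only, not a vg type)."""
    segments: dict = field(default_factory=dict)  # segment name -> sequence
    links: list = field(default_factory=list)     # (from, from_orient, to, to_orient)
    paths: dict = field(default_factory=dict)     # path name -> [(segment, orient), ...]


def parse_gfa1(text):
    """Parse the S, L, and P records of GFA 1 text; other records are skipped."""
    graph = GfaGraph()
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        record_type = fields[0]
        if record_type == "S":
            # S <name> <sequence>
            graph.segments[fields[1]] = fields[2]
        elif record_type == "L":
            # L <from> <from_orient> <to> <to_orient> <overlap>
            graph.links.append((fields[1], fields[2], fields[3], fields[4]))
        elif record_type == "P":
            # P <name> <comma-separated oriented segments> <overlaps>
            steps = [(step[:-1], step[-1]) for step in fields[2].split(",")]
            graph.paths[fields[1]] = steps
    return graph


def path_sequence(graph, name):
    """Spell out a path, assuming segments abut with no overlap (0M links)."""
    pieces = []
    for segment, orient in graph.paths[name]:
        sequence = graph.segments[segment]
        if orient == "-":
            sequence = sequence.translate(_COMPLEMENT)[::-1]
        pieces.append(sequence)
    return "".join(pieces)


# A made-up four-node graph with one SNP bubble and one embedded path.
EXAMPLE_GFA = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tGATT",
    "S\t2\tA",
    "S\t3\tC",
    "S\t4\tACA",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
    "P\tref\t1+,2+,4+\t*",
])

if __name__ == "__main__":
    graph = parse_gfa1(EXAMPLE_GFA)
    print(path_sequence(graph, "ref"))  # prints GATTAACA
```

For real graphs, vg's own tooling or libbdsg is the better choice; a hand-rolled parser like this is only suitable for inspecting small files.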
These formats store short or long reads from DNA sequencing machines, and can describe how they fit into references.
Name | Description | Extension | Status | Notes |
---|---|---|---|---|
GAM Protobuf | Graph Alignment/Map, vg's main format for aligned reads | `.gam` | Recommended | |
GAF | Graph Alignment Format, a text-based format for aligned reads | `.gaf` | Recommended for interchange | |
Sorted GAM | GAM file with reads sorted by graph node ID. Useful for random access with an index. | `.sorted.gam` | Recommended | |
GAM JSON | JSON version of the Protobuf GAM format. | `.json` | Conditionally useful | Used for analyzing reads with `jq`. |
GAMP Protobuf | Multi-path alignment version of GAM | `.gamp` | Recommended | |
GAMP JSON | JSON version of GAMP format | `.json` | Conditionally useful | |
BAM | Binary Alignment/Map format for alignments against a linear reference | `.bam` | Recommended | |
SAM | Sequence Alignment/Map format, a text-based version of BAM | `.sam` | Recommended | |
FASTQ | Version of FASTA with per-base quality scores. Used for unaligned reads. | `.fq`, `.fastq` | Recommended | |
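To give a feel for the text-based GAF format listed above, the following Python sketch splits one GAF line into its 12 mandatory tab-separated columns plus optional typed tags. `GafRecord`, `parse_gaf_line`, and the example read are hypothetical names and data chosen for illustration, not part of vg. The sketch assumes the path column uses oriented node IDs such as `>1>2<3`; a path column that names a stable reference path instead would yield an empty step list here.

```python
import re
from typing import NamedTuple, Optional


class GafRecord(NamedTuple):
    """The 12 mandatory GAF columns plus optional tags (illustrative only)."""
    query_name: str
    query_length: Optional[int]
    query_start: Optional[int]
    query_end: Optional[int]
    strand: str
    path: list            # [(orientation, node_id), ...], e.g. [(">", "1")]
    path_length: Optional[int]
    path_start: Optional[int]
    path_end: Optional[int]
    matches: Optional[int]
    block_length: Optional[int]
    mapq: Optional[int]
    tags: dict


# One oriented step: ">" (forward) or "<" (reverse), then a node ID.
_STEP = re.compile(r"([<>])([^<>]+)")


def _int_or_none(value):
    """Convert a numeric column, treating the "*" placeholder as missing."""
    return None if value == "*" else int(value)


def parse_gaf_line(line):
    """Parse a single GAF line into a GafRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError(f"expected at least 12 columns, got {len(fields)}")

    tags = {}
    for tag in fields[12:]:
        # Optional fields are SAM-style TAG:TYPE:VALUE triples.
        key, type_code, value = tag.split(":", 2)
        if type_code == "i":
            tags[key] = int(value)
        elif type_code == "f":
            tags[key] = float(value)
        else:
            tags[key] = value

    return GafRecord(
        query_name=fields[0],
        query_length=_int_or_none(fields[1]),
        query_start=_int_or_none(fields[2]),
        query_end=_int_or_none(fields[3]),
        strand=fields[4],
        path=_STEP.findall(fields[5]),
        path_length=_int_or_none(fields[6]),
        path_start=_int_or_none(fields[7]),
        path_end=_int_or_none(fields[8]),
        matches=_int_or_none(fields[9]),
        block_length=_int_or_none(fields[10]),
        mapq=_int_or_none(fields[11]),
        tags=tags,
    )


# A made-up 8 bp read aligned end to end across nodes 1, 2, and 4.
EXAMPLE_GAF_LINE = "read1\t8\t0\t8\t+\t>1>2>4\t8\t0\t8\t8\t8\t60\tNM:i:0"

if __name__ == "__main__":
    record = parse_gaf_line(EXAMPLE_GAF_LINE)
    print(record.query_name, record.path, record.mapq)
```

Because GAF is plain text, a parser like this is enough for quick inspection; GAM and GAMP are Protobuf-based and need vg or libvgio to read.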
These formats can describe individual people or other organisms and how their genomes fit into or differ from references.
Name | Description | Extension | Status | Notes |
---|---|---|---|---|
GBWT | ||||
GBZ | ||||
VCF | ||||
Pack File | ||||
Pileup Protobuf | ||||
Pileup JSON | ||||
Locus Protobuf | ||||
Locus JSON | ||||
These formats store other kinds of information, or are precomputed indexes to speed up operations on other data.
Name | Description | Extension | Status | Notes |
---|---|---|---|---|
Distance Index (v1) | ||||
Distance Index (v2) | ||||
GCSA | ||||
Minimizer Index | ||||
BED | ||||
Snarl Protobuf | ||||
Snarl JSON | ||||
SnarlTraversal Protobuf | ||||
SnarlTraversal JSON | ||||
Node ID Translation | ||||
VG Protobuf Index | ||||
GAM Index | ||||
FASTA Index | ||||
BAM Index | ||||
Tabix VCF Index | ||||