
File Types

Adam Novak edited this page May 27, 2022 · 15 revisions

Glossary of vg-related File Types

The vg ecosystem uses many file formats. Some are new and not yet used consistently; others are old but still required for a few less common operations.

Some of these are described in more detail at Index Types.

Reference Formats

These formats store the genome references that define the coordinate spaces in which genomic analyses are performed.

| Name | Description | Extension | Status | Notes |
|---|---|---|---|---|
| VG Protobuf | The original vg graph format | .vg | Conditionally useful | Can also be used to store paths without the nodes and edges they belong to. Default output format of vg construct in vg v1.40.0, since it can be generated incrementally. Can be concatenated with cat. Usually block-GZIP compressed, but some old files aren't. |
| GBZ | "GBZ" graph, a compressed format storing a graph as traversed by sample haplotypes | .gbz | Recommended | Stores not only the graph but also large numbers of haplotypes, so you don't need an additional GBWT file. Internally, stores a GBWT and a GBWTGraph. Can't store edges that are not followed. |
| GFA | Graphical Fragment Assembly: a text-based format for storing graphs and their embedded paths. | .gfa | Recommended for interchange | vg uses GFA 1.x and doesn't really support GFA 2. |
| HashGraph | Graph format based on a hashtable, from libbdsg. | .hg, .vg | Recommended | Default output format of many vg subcommands as of v1.40.0. |
| PackedGraph | Graph format based on succinct data structures, from libbdsg. | .pg, .vg | Recommended for large graphs | This format can store graphs in less space than HashGraph, but is also slower and more complicated. |
| Memory-Mapped PackedGraph | A version of PackedGraph that can be incrementally read from disk | .mpg? | Experimental | Might not actually be adopted; GBZ solves a different but often more important problem. |
| ODGI (vg flavor) | "Optimized Dynamic Genome/Graph Implementation" format. | .odgi | Not recommended | vg uses the version implemented in libbdsg, which is NOT wire-compatible with the version implemented in the odgi project. |
| GBWTGraph | Supplemental information to turn a GBWT into a graph. | .gg | Conditionally useful | Stores node sequences only; defines a graph when used together with a GBWT file of haplotypes. |
| VG JSON | This is the VG Protobuf format, with the Protobuf Graph objects represented as JSON. | .json | Conditionally useful | Useful for exporting small graphs for analysis with jq, or importing graphs from tools that can't use libbdsg or libvgio. Generally GFA should be used instead. |
| Indexed VG Protobuf | This is the VG Protobuf format, stored in a sorted order with an auxiliary index file for random access. | .sorted.vg | Deprecated | Was never very popular, and Memory-Mapped PackedGraph is intended as a replacement. |
| FASTA | "FASTA" format for storing DNA sequences | .fa, .fasta, .fna | Recommended for linear references | This is a linear genome reference format that vg construct can consume. |
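
Because some old VG Protobuf files are not block-GZIP compressed, it can be handy to check a file before handing it to other tools. The sketch below is not part of vg; it assumes that "block-GZIP" means the BGZF layout used by htslib (a gzip member whose extra field carries a `BC` subfield), and the function names `classify_compression` and `classify_file` are made up for illustration. It only inspects the first block and says nothing about which graph format is inside.

```python
import struct

GZIP_MAGIC = b"\x1f\x8b"
FEXTRA_FLAG = 0x04  # gzip header flag bit: an "extra" field is present


def classify_compression(header: bytes) -> str:
    """Classify the leading bytes of a file as 'bgzf', 'gzip', or 'uncompressed'.

    Assumption: BGZF blocks are gzip members whose extra field contains a
    subfield with identifiers 'B', 'C' and a 2-byte payload.
    """
    if len(header) < 2 or header[:2] != GZIP_MAGIC:
        return "uncompressed"
    # Byte 3 holds the gzip flags; bytes 10-11 hold the extra-field length.
    if len(header) < 12 or not (header[3] & FEXTRA_FLAG):
        return "gzip"
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    pos = 0
    # Walk the subfields: 2 identifier bytes, a 2-byte length, then the payload.
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return "bgzf"
        pos += 4 + slen
    return "gzip"


def classify_file(path: str) -> str:
    """Read the start of a file and classify its compression."""
    with open(path, "rb") as handle:
        # 64 bytes covers the usual BGZF header; longer extra fields are not handled.
        return classify_compression(handle.read(64))
```

For example, `classify_file("graph.vg")` (a hypothetical path) would return `"uncompressed"` for one of the old raw Protobuf files.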

Read and Alignment Formats

These formats store short or long reads from DNA sequencing machines, and can describe how they fit into references.

| Name | Description | Extension | Status | Notes |
|---|---|---|---|---|
| GAM Protobuf | Graph Alignment/Map, vg's main format for aligned reads | .gam | Recommended | |
| GAF | Graph Alignment Format, a text-based format for aligned reads | .gaf | Recommended for interchange | |
| Sorted GAM | GAM file with reads sorted by graph node ID. Useful for random access with an index. | .sorted.gam | Recommended | |
| GAM JSON | JSON version of the Protobuf GAM format. | .json | Conditionally useful | Used for analyzing reads with jq. |
| GAMP Protobuf | Multi-path alignment version of GAM | .gamp | Recommended | |
| GAMP JSON | JSON version of GAMP format | .json | Conditionally useful | |
| BAM | Binary Alignment/Map format for alignments against a linear reference | .bam | Recommended | |
| SAM | Sequence/Alignment Map format, a text-based version of BAM | .sam | Recommended | |
| FASTQ | Version of FASTA with per-base quality scores. Used for unaligned reads. | .fq, .fastq | Recommended | |
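
Since GAF is plain text, it is easy to read without any vg libraries. The following is a minimal sketch, not vg's own parser: it assumes the 12 mandatory tab-separated columns described in the GAF specification (query name, length, start, end, strand, path, path length, path start, path end, matches, alignment block length, mapping quality) and an aligned record with numeric fields. The names `GafRecord` and `parse_gaf_line` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# One oriented step in a GAF path: '>' (forward) or '<' (reverse), then a segment name.
STEP_PATTERN = re.compile(r"([<>])([^\s<>]+)")


@dataclass
class GafRecord:
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    path: str
    path_length: int
    path_start: int
    path_end: int
    matches: int
    block_length: int
    mapq: int
    tags: List[str]  # optional SAM-style tags, left unparsed

    def steps(self) -> List[Tuple[str, bool]]:
        """Return (segment name, is_reverse) per oriented step.

        Returns an empty list if the path column has no '>' or '<' markers.
        """
        return [(name, orient == "<") for orient, name in STEP_PATTERN.findall(self.path)]


def parse_gaf_line(line: str) -> GafRecord:
    """Parse one aligned GAF record; raises ValueError on short or non-numeric lines."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError(f"expected at least 12 columns, got {len(fields)}")
    return GafRecord(
        query_name=fields[0],
        query_length=int(fields[1]),
        query_start=int(fields[2]),
        query_end=int(fields[3]),
        strand=fields[4],
        path=fields[5],
        path_length=int(fields[6]),
        path_start=int(fields[7]),
        path_end=int(fields[8]),
        matches=int(fields[9]),
        block_length=int(fields[10]),
        mapq=int(fields[11]),
        tags=fields[12:],
    )
```

Records for unaligned reads may carry placeholder values in the numeric columns, which this sketch rejects rather than interprets.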

Sample Information Formats

These formats can describe individual people or other organisms and how their genomes fit into or differ from references.

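| Name | Description | Extension | Status | Notes |
|---|---|---|---|---|
| GBWT | | | | |
| GBZ | | | | |
| VCF | | | | |
| Pack File | | | | |
| Pileup Protobuf | | | | |
| Pileup JSON | | | | |
| Locus Protobuf | | | | |
| Locus JSON | | | | |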

Miscellaneous Formats

These formats store other kinds of information, or are precomputed indexes to speed up operations on other data.

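| Name | Description | Extension | Status | Notes |
|---|---|---|---|---|
| Distance Index (v1) | | | | |
| Distance Index (v2) | | | | |
| GCSA | | | | |
| Minimizer Index | | | | |
| BED | | | | |
| Snarl Protobuf | | | | |
| Snarl JSON | | | | |
| SnarlTraversal Protobuf | | | | |
| SnarlTraversal JSON | | | | |
| Node ID Translation | | | | |
| VG Protobuf Index | | | | |
| GAM Index | | | | |
| FASTA Index | | | | |
| BAM Index | | | | |
| Tabix VCF Index | | | | |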