Indexing Huge Datasets

Jouni Siren edited this page Dec 3, 2018 · 14 revisions

General

The index construction guidelines work best with datasets that are no larger than the 1000 Genomes Project (2500 human samples). This page describes techniques that may be useful when trying to index larger datasets.

GBWT construction for many samples

Many of these techniques use standalone GBWT tools in addition to vg. See the GBWT wiki for further documentation.

Haplotype identifiers

  • Version 1.11.0 or later

By default, GBWT stores haplotype identifiers at every 1024th position on each path. The sampling interval can be changed with option -n N of vg index -G. This is a time/space trade-off: each stored identifier typically takes 3 to 4 bytes of memory, while the time required to determine the haplotypes containing a particular subpath is proportional to the distance between stored identifiers. In a 1000GP GBWT with the default interval, the stored identifiers take roughly as much space as the GBWT itself. With more samples, the interval should also grow proportionally.

In the standalone tool deps/gbwt/build_gbwt, the corresponding option is -s N.
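As a sketch, building a GBWT with a ten times sparser sampling of haplotype identifiers might look like the following. The interval and file names here are illustrative assumptions, not recommendations:

```shell
# Build a GBWT for graph.vg, storing haplotype identifiers every 10240
# positions instead of the default 1024 (interval chosen for illustration).
vg index -G graph.gbwt -n 10240 -v variants.vcf.gz graph.vg

# The corresponding standalone construction would use -s instead of -n:
# deps/gbwt/build_gbwt -s 10240 -o graph parse_chr1
```

A ten-fold interval cuts the identifier overhead to roughly a tenth at the cost of ten times slower haplotype lookups.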

VCF parsing

  • Version 1.11.0 or later

Parsing a massive VCF file can take hours or even days. Because the parse does not depend on GBWT construction parameters other than -P, parsing the file only once and storing the intermediate data on disk can save time when experimenting with different GBWT construction parameters. Replacing option -G FILE with -e FILE makes vg index write the intermediate data to files with prefix FILE instead of building a GBWT index.

File FILE_contig will contain the reference path and the paths for each alternate allele of VCF contig contig, while files FILE_contig_X_Y will contain the phasing information for samples X to X+Y-1. These files can then be used as input to the standalone GBWT construction tool deps/gbwt/build_gbwt.

In the following example, we index graph.vg, which contains VCF contigs chr1 and chr2. The output is written to graph.gbwt.

vg index -v variants.vcf.gz -e parse graph.vg
build_gbwt -p -o graph parse_chr1 parse_chr2

Merging GBWT indexes over the same contig

  • Version 1.13.0 or later (PR #2002; November 28, 2018)

The default GBWT merging algorithm (vg gbwt -m) is only suitable for merging small indexes, as it is slower than direct construction. The fast algorithm (vg index -f) only works with GBWT indexes over different contigs. To merge GBWT indexes over the same contig quickly, we need the parallel algorithm from the standalone tool deps/gbwt/merge_gbwt.

In the following, we first parse graph.vg, which contains VCF contig chr1, and write the phasing information in batches of 200 samples. Then we build GBWT indexes for each batch using 8 parallel jobs. Finally we merge the partial indexes parse_chr1_*.gbwt into a single index graph.gbwt.

vg index -v variants.vcf.gz -e parse graph.vg
ls parse_chr1_* | parallel -j 8 "build_gbwt -p -P {} -o {} parse_chr1"
merge_gbwt -p -o graph parse_chr1_*.gbwt

The thread identifiers are determined by the order of the partial indexes in the argument list of merge_gbwt. Thread databases cannot be merged at the moment.

GCSA construction for a dense graph

  • Version 1.12.0 or later