Skip to content

Releases: vgteam/vg

Resubmission

23 Apr 09:51
@ekg ekg
8224e9d
Compare
Choose a tag to compare

This release roughly coincides with resubmission of the vg paper. There have been a lot of changes since v1.6.0 and it would be difficult to provide a specific summary, but several items deserve mention.

  • We improve vg's mapping runtime,
  • integrate the GBWT,
  • and refresh some critical components of the indexing process (pruning and kmer generation).
  • We explore haplotype-based alignment scoring using the GBWT,
  • and enable using it as the base thread storage system for all the tools in vg.
  • We greatly improved the accuracy of the long read alignment algorithm in vg map.

Submission

12 Dec 16:48
@ekg ekg
1f8d1fa
Compare
Choose a tag to compare

This checkpoints the version that was used to generate experimental results in the preprint of our manuscript. (It will be linked here when made public.)

Attached is a stripped static binary that is portable across all 64-bit Linux systems that we have access to.

Paper freeze

20 Mar 15:54
@ekg ekg
Compare
Choose a tag to compare

Recent developments in vg have focused on improving the mapper's performance via simulations.

As a result of this work, our simulations show that when aligning reads simulated from the 1000G variation graph, vg is equivalent to bwa mem in performance on the linear reference, and bests it considerably when aligning against the pangenome:

image

Experiments are completed with scripts/map-sim. The above example can be reproduced by:

vg/scripts/map-sim human.10M hs37d5.fa ref/index pan/index 32 "-s 571 -n 500000 -e 0.01 -i 0.002 -l 150 -p 500 -v 50" " "

Runtime is between 5 and 10x that of bwa mem.

Sendai

19 Feb 02:57
@ekg ekg
Compare
Choose a tag to compare

There have been 286 commits since v1.3.0, so it is hard to summarize everything that's happened. Here are some of the highlights:

  • RDF export (@JervenBolleman)
  • improvements in build process (@edawson)
  • chop/unchop procedures (@ekg)
  • maintain paths in graph normalization (@ekg)
  • graph unrolling/unfolding/dagification (@ekg)
  • new GFA parsing (@edawson)
  • improvements to MSGA (@ekg)

mapping ranks

04 Nov 15:08
@ekg ekg
Compare
Choose a tag to compare

Since v1.2.0:

  • Improvements in MSGA allow construction of the GA4GH bakeoff regions (however, problems with alignment are breaking the paths that are included).
  • Mappings now include ranks, so that it's easy to obtain subsets of paths and merge subsets of graphs without needing to keep the entire path around.
  • Better validation of paths by using ranks on mappings.

The changes in format necessitated changes in xg, and any indexes from older versions will be invalidated with this update.

giant leaps

23 Oct 10:58
@ekg ekg
Compare
Choose a tag to compare

Notable improvements in this release:

  1. MSGA -- a kind of alignment-driven overlap-based assembler that can align large genomes; vg's answer to cactus.
  2. call -- variant calling on the graph
  3. index -- index huge graphs!
  4. map -- map against huge graphs!

Indexing improvements

Assuming you've started with a big graph (and that node ids 1000000000 and 1000000001 are outside of your node space), let's start by reducing the complexity of the graph so as to bound memory requirements during GCSA2 construction. We break the graph where we cross more than (approximately) 3 bifurcating edges in 16bp, and when fragments are less than 100bp in total sequence we remove them. This effectively masks out high-complexity regions of the graph. Despite the complexity reduction, the resulting system is sufficient for seeding global mapping:

vg mod -pl 16 -t 16 -e 3 $chr.vg \
    | vg mod -S -l 32 - >$chr.prune.vg
vg kmers -gB -k 16 -F -H 1000000000 -T 1000000001 -t 16 \
    $chr.prune.vg >$chr.graph

Now, we feed the resulting graphs, which binary formats for de Bruijn-like graphs, into GCSA2's build_gcsa tool:

build_gcsa -d 1 -o wg $((seq 1 22; echo X; echo Y) | parallel -j 1 "echo -n ' {}.graph'")

The resulting GCSA index is wg.gcsa.

Now, we build an xg index:

vg index -x wg.xg $((seq 1 22; echo X; echo Y) | parallel -j 1 "echo -n ' {}.vg'")

We can then drive our mapper using these indexes, and it only requires 20G of RAM!

zcat NA12878.x10.13502_7.fq.gz | head -4000000 | vg map -x wg.xg -g wg.k16e3l100d1.gcsa -if - -t 32 -k 27 -GFX 1.5 -T 50 -l 12 -S 5 >aln.gam

I'll generate a more-detailed writeup for the wiki, but hopefully this satisfies the very-curious.

MSGA

You can build graphs of sets of sequences by feeding them in in fasta format using vg msga.

vg msga -f ref.fa >ref.vg

Fun!

✩✭✮✫✯✨

28 Aug 15:20
@ekg ekg
Compare
Choose a tag to compare

Since v1.0.0:

🖬 111606B of patches.

  • multi mapping (thanks to @adamnovak )
  • gcsa2 integration (nod to @jltsiren for helping me debug my many buggy attempts)

vg is now capable of generating GCSA and xg indexes of graphs! By default it will index the forward and reverse complement of any variation graph you give it, and allow you to construct queries of up to . At present, you can query the indexes using vg find, but stay tuned because mapping using the new indexes is top priority for v1.2.0.

You can try it out with this executable and the tarball (or your own copy of the repo):

test git:(master) ✗ time vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -m 50 >z.vg -p
 loading variants for z         [==============================================]100.0%
 parsing variants               [==============================================]100.0%
 enforcing node size limit      [==============================================]100.0%
 planning construction          [==============================================]100.0%
 constructing graph             [==============================================]100.0%
 merging remaining graphs       [==============================================]100.0%
 joining graphs                 [==============================================]100.0%
 topologically sorting          [==============================================]100.0%
 compacting ids                 [==============================================]100.0%
 saving graph                   [==============================================]100.0%
vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -m 50 -p > z.vg  5.99s user 0.32s system 123% cpu 5.114 total

And then we'll index it with xg and gcsa2, using an initial kmer size of 16 and 3 doubling steps, which will yield a GCSA index that lets us find the positions of sequences of up to 16 * 2^3 = 128 in the graph. Meanwhile the xg index provides near-optimal query performance over elements of the graph (nodes, edges, paths, and path-relative positions) using minimal space.

➜  test git:(master) ✗ time vg index -x z.vg.xg -d z.vg.gcsa -g -k 16 -X 3 -V z.vg         
GCSA::prefixDoubling(): Initial path length 16
  mergePaths(): 3370686 -> 3265575 paths
  mergePaths(): 2988126 unique, 277449 unsorted, 0 nondeterministic paths
GCSA::prefixDoubling(): Step 1 (path length 16 -> 32)
  joinPaths(): 3265575 -> 3756204 paths (8280486 ranks)
  mergePaths(): 3756204 -> 3741775 paths
  mergePaths(): 3607232 unique, 113082 unsorted, 21461 nondeterministic paths
GCSA::prefixDoubling(): Step 2 (path length 32 -> 64)
  joinPaths(): 3741775 -> 6328908 paths (21039927 ranks)
  mergePaths(): 6328908 -> 6314604 paths
  mergePaths(): 6080484 unique, 61311 unsorted, 172809 nondeterministic paths
GCSA::prefixDoubling(): Step 3 (path length 64 -> 128)
  joinPaths(): 6314604 -> 8253191 paths (36664975 ranks)
  mergePaths(): 8253191 -> 8252375 paths
  mergePaths(): 7786268 unique, 39280 unsorted, 426827 nondeterministic paths
GCSA::mergeByLabel(): 8252375 -> 8017312 paths
GCSA::build(): 10511607 edges
GCSA::sample(): 2558469 samples at 2376049 positions
Index verification complete.
vg index -x z.vg.xg -d z.vg.gcsa -g -k 16 -X 3 -V z.vg  234.44s user 1.80s system 310% cpu 1:15.97 total

We can find things in it using a few features of vg find. Let's simulate a perfect read using vg sim and then look it up in the index:

test git:(master) ✗ vg sim -f -n 1 -l 64 z.vg    
TCACTTAACAATGTCACTTCCAGTGGGTGTCTCCCACCTGAGGAGGGTGATACATGTACAAGGC
➜  test git:(master) ✗ vg find -g z.vg.gcsa -S TCACTTAACAATGTCACTTCCAGTGGGTGTCTCCCACCTGAGGAGGGTGATACATGTACAAGGC
58533:0

Also, we can also pick up pieces of the graph from the xg index:

test git:(master) ✗ vg find -x z.vg.xg -p z:200-300 | vg mod -o - | vg view - | head
H       HVN:Z:1.0
S       25      CCACATTTCCATTCACCCATCCTC
P       25      z       +       24M
L       25      +       26      +       0M
S       26      C
P       26      z       +       1M
L       26      +       28      +       0M
S       28      CATCCATCAACCCTCCAATCC
P       28      z       +       21M
L       28      +       29      +       0M

Indexing takes about the same amount of time as using the old strategy:

test git:(master) ✗ time vg index -s -k 27 -e 7 z.vg
vg index -s -k 27 -e 7 z.vg  140.62s user 1.22s system 222% cpu 1:03.76 total

But uses less disk and RAM.

1.2M    z.vg            # the graph itself
32M     z.vg.index  # old index
2.6M    z.vg.xg       # xg index
18M     z.vg.gcsa   # 128mer gcsa

And the comparison isn't apples-to-apples. We've made something altogether much more powerful. We can query the graph for not just 27-mers on the forward strand but all 128-mers, both on the forward and reverse strands!

Both the .gcsa and .xg indexes use approximately the same amount of memory on disk and in ram, so we'd need about 21M to load them into ram. This is about 17 times the size of the input graph. Extrapolating, we should expect to use about 50G of for the whole genome.

Now, if you're itching to try to build such indexes for a human whole-genome variation graph, you will need to mask out some small regions of apparent high complexity (likely misalignments or highly divergent haplotypes):

(NB: I haven't tried this. But, when I do, this is how I will 😉)

# for each chromosome, reduce the graph complexity by masking
# complex regions yielding a lot of short subgraphs
# and retaining the original id space
for chr in $(seq 1 22; echo X; echo Y);
do
    vg mod -pl 16 -e 6 $chr.vg | vg mod -S -l 50 - >>in.vg
done
# now use these graphs as input to the indexer
# using only 2 doubling steps, and indexing only the forward strand
time vg index -x wg.xg -d wg.gcsa -g -k 16 -X 2 -V -F in.vg

Reboot!

image

first binary distribution

25 Aug 11:28
@ekg ekg
Compare
Choose a tag to compare

Let's call this version 1, if for no particular reason other than that I have a static linux 64-bit binary to distribute. Linux users should be able to download and run the attached executable after a quick chmod +x ./vg.