
Multipath alignments and vg mpmap

Jordan Eizenga edited this page Apr 5, 2019 · 30 revisions

This wiki page describes the multipath alignment, a graph-based alignment concept that we sometimes work with in vg, and explains how to produce and work with multipath alignments using the vg software toolkit.

The multipath alignment concept

Most of the sequence-to-graph alignment field has focused on aligning a sequence to a single path through a graph. For instance, this is the alignment concept behind the GAM file format that is produced by vg map. In some applications, however, it can be useful to work with a more general alignment concept, which we call a multipath alignment. In the same way that a genome reference graph is a compressed representation of a collection of genomes, a multipath alignment is a compressed representation of a collection of alignments to these genomes. We allow the multipath alignment to bifurcate and rejoin so that it can align the same part of a sequence to multiple paths through the graph. Another way to think about this is that the multipath alignment is itself a graph whose nodes are partial alignments. If you concatenate the partial alignments along a path in the multipath alignment's graph, the result is a sequence-to-path alignment like those contained in a GAM record.
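To make the "graph of partial alignments" idea concrete, here is a minimal Python sketch. It is a toy model only, not how vg represents the data: the labels, the edges dictionary, and the function name enumerate_alignments are all hypothetical. It walks every path from a source to a sink and concatenates the partial alignments along it, so each result stands for one sequence-to-path alignment.

```python
def enumerate_alignments(partials, edges, sources):
    """Yield every source-to-sink path as a list of partial-alignment labels."""
    def walk(i, prefix):
        prefix = prefix + [partials[i]]
        successors = edges.get(i, [])
        if not successors:
            # A partial alignment with no successors ends the alignment.
            yield prefix
            return
        for j in successors:
            yield from walk(j, prefix)

    for s in sources:
        yield from walk(s, [])


# Hypothetical toy data: the alignment bifurcates over two alleles and rejoins.
partials = ["left flank", "variant allele 1", "variant allele 2", "right flank"]
edges = {0: [1, 2], 1: [3], 2: [3]}

paths = list(enumerate_alignments(partials, edges, sources=[0]))
for p in paths:
    print(" + ".join(p))
```

The single multipath alignment here stands for two sequence-to-path alignments, one through each allele.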

A diagrammatic illustration

This diagram demonstrates the multipath alignment concept. The read (top) is aligned to a graph (middle) as a sequence-to-path alignment (bottom left) and as a multipath alignment (bottom right). Notice that the multipath alignment aligns the same part of the read to multiple places in the graph, all of which it considers plausible. Also notice that the sequence-to-path alignment corresponds to a single path through the multipath alignment.

Multipath alignments in vg

In vg, multipath alignments are stored in the GAMP (Graph Alignment MultiPath) format, which uses the extension .gamp. Like many formats in vg, GAMP is a Protocol Buffer-based format with a schema defined in vg.proto. For easy investigation, it can be converted to a JSON object using vg view. Let's take a look at an example (with some manual formatting).

# -K for GAMP input, -j for JSON output
> vg view -K -j example.gamp 
{"sequence":"GGGGTTTCACCGTGTTAGCCAGGATGGTC",
 "quality":"GyEhISEhGyEhISEhISEhISEhFhYbDwYGDwYGDw8=",
 "name":"NAME-OF-READ",
 "sample_name":"NAME-OF-SAMPLE",
 "read_group":"NAME-OF-GROUP",
 "start":[0],
 "subpath":[
    {"path":{"mapping":[{"position":{"node_id":"1613"},"edit":[{"from_length":3,"to_length":3}],"rank":"1"}]},
     "next":[1,2],
     "score":8
    },
    {"path":{"mapping":[{"position":{"node_id":"1615"},"edit":[{"from_length":1,"to_length":1,"sequence":"G"}],"rank":"2"}]},
     "next":[3],
     "score":-4
    },
    {"path":{"mapping":[{"position":{"node_id":"1614"},"edit":[{"from_length":1,"to_length":1}],"rank":"2"}]},
     "next":[3],
     "score":1
    },
    {"path":{"mapping":[{"position":{"node_id":"1616"},"edit":[{"from_length":5,"to_length":5}],"rank":"1"},
                        {"position":{"node_id":"1617"},"edit":[{"from_length":1,"to_length":1}],"rank":"2"},
                        {"position":{"node_id":"1619"},"edit":[{"from_length":1,"to_length":1}],"rank":"3"},
                        {"position":{"node_id":"1621"},"edit":[{"from_length":5,"to_length":5}],"rank":"4"},
                        {"position":{"node_id":"1622"},"edit":[{"from_length":1,"to_length":1}],"rank":"5"},
                        {"position":{"node_id":"1624"},"edit":[{"from_length":2,"to_length":2}],"rank":"6"},
                        {"position":{"node_id":"1625"},"edit":[{"from_length":1,"to_length":1}],"rank":"7"},
                        {"position":{"node_id":"1627"},"edit":[{"from_length":9,"to_length":9}],"rank":"8"}]
             },
      "score":30
     }
  ]
}

Some of the fields will be familiar to users of the SAM/BAM file formats: sequence, name, sample_name, and read_group all have analogous fields there. The quality field holds the base qualities, but the JSON rendering expresses the values as a Base64-encoded string rather than as the Phred-scaled ASCII characters used in SAM. The rest of the fields encode the topology of the multipath alignment. Each of the partial alignments corresponds to one subpath record. Each subpath contains three pieces of information:

  • A path that expresses a sequence-to-path alignment of some portion of the read
  • A score for that part of the alignment
  • A list of next subpaths that could be concatenated to the end of this partial alignment (i.e. the edges of the multipath alignment graph), referred to by their 0-based index in the array of subpaths.

Finally, the optional field start lists all of the subpaths, again referred to by 0-based index, that could be the first subpath in a path of partial alignments (i.e. the source nodes in the multipath alignment graph).
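The structure described above can be traversed with a few lines of Python. The following is a sketch under stated assumptions, not vg's own implementation: it assumes that the score of a full alignment is the sum of the scores of its subpaths, that a subpath with no next field is a sink, and that a missing score means 0. The function name best_path is hypothetical, and the record is a trimmed copy of the example above that keeps only the fields the traversal needs.

```python
import json
from functools import lru_cache

# Trimmed copy of the example record: only topology and scores.
record = json.loads("""
{"start": [0],
 "subpath": [
   {"next": [1, 2], "score": 8},
   {"next": [3], "score": -4},
   {"next": [3], "score": 1},
   {"score": 30}
 ]}
""")


def best_path(mp_aln):
    """Return (score, subpath indexes) of the highest-scoring path."""
    subpaths = mp_aln["subpath"]

    @lru_cache(maxsize=None)
    def best_from(i):
        # Assumption: sinks simply lack a "next" field in the JSON rendering.
        successors = subpaths[i].get("next", [])
        tail_score, tail = max((best_from(j) for j in successors),
                               default=(0, ()))
        # Assumption: an absent "score" field means a score of 0.
        return subpaths[i].get("score", 0) + tail_score, (i,) + tail

    # "start" is optional; fall back to subpaths that have no incoming edge.
    starts = mp_aln.get("start")
    if starts is None:
        targets = {j for sp in subpaths for j in sp.get("next", [])}
        starts = [i for i in range(len(subpaths)) if i not in targets]
    return max(best_from(s) for s in starts)


score, path = best_path(record)
print(score, path)  # 39 (0, 2, 3)
```

Under these assumptions, the best alignment in the example passes through node 1614 (subpath 2, a match scoring 1) rather than node 1615 (subpath 1, a mismatch scoring -4).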

In this example there was only one of these records in the GAMP file, but in general this is not the case. A GAMP file typically consists of a series of these records, each indicating a multipath alignment of one sequence to a graph.
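A file with many records can be processed one record at a time. The sketch below assumes that the JSON output of vg view contains one record per line (the example above was reformatted by hand), and it decodes the Base64 quality string into integer values. The helper names read_records and base_qualities are hypothetical, and the input is a stand-in list of lines rather than real vg output.

```python
import base64
import json


def read_records(lines):
    """Parse one multipath alignment per non-empty line of JSON text."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def base_qualities(record):
    """Decode the Base64 "quality" string into a list of integer values."""
    return list(base64.b64decode(record.get("quality", "")))


# Stand-in for a stream of JSON lines; fields copied from the record above.
lines = ['{"sequence":"GGGGTTTCACCGTGTTAGCCAGGATGGTC",'
         '"quality":"GyEhISEhGyEhISEhISEhISEhFhYbDwYGDwYGDw8=",'
         '"name":"NAME-OF-READ"}']

for rec in read_records(lines):
    quals = base_qualities(rec)
    # One quality value per base of the read.
    print(rec["name"], len(rec["sequence"]), len(quals), quals[:3])
```

In practice the lines would come from a file or pipe, for example by iterating over sys.stdin.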
