
Path Metadata Model

Jouni Siren edited this page Nov 9, 2023 · 7 revisions

There is a PathMetadata interface which all PathHandleGraphs, and consequently all of the graph types and files used in vg, implement. This page explains the way in which we model path metadata, how that model is implemented in different graph implementations and formats, and how this affects end users of vg tools trying to do analyses.

See also: Changing References

Data Model

You can see an example of path metadata by running, from the repository's root directory:

vg paths --metadata -x test/graphs/rgfa_with_reference.rgfa

This will produce TSV data:

#NAME	SENSE	SAMPLE	HAPLOTYPE	LOCUS	PHASE_BLOCK	SUBRANGE
sample1#2#chr1#0	HAPLOTYPE	sample1	2	chr1	0	NO_SUBRANGE
CHM13#0#chr1#0	HAPLOTYPE	CHM13	0	chr1	0	NO_SUBRANGE
coolgene[1]	GENERIC	NO_SAMPLE_NAME	NO_HAPLOTYPE	coolgene	NO_PHASE_BLOCK	1
GRCh38#0#chr1	REFERENCE	GRCh38	0	chr1	NO_PHASE_BLOCK	NO_SUBRANGE
GRCh37#0#chr1#0	HAPLOTYPE	GRCh37	0	chr1	0	NO_SUBRANGE
sample1#1#chr1#0	HAPLOTYPE	sample1	1	chr1	0	NO_SUBRANGE
coolgene[7]	GENERIC	NO_SAMPLE_NAME	NO_HAPLOTYPE	coolgene	NO_PHASE_BLOCK	7

Formatted as a table, that is:

#NAME             SENSE      SAMPLE          HAPLOTYPE     LOCUS     PHASE_BLOCK     SUBRANGE
sample1#2#chr1#0  HAPLOTYPE  sample1         2             chr1      0               NO_SUBRANGE
CHM13#0#chr1#0    HAPLOTYPE  CHM13           0             chr1      0               NO_SUBRANGE
coolgene[1]       GENERIC    NO_SAMPLE_NAME  NO_HAPLOTYPE  coolgene  NO_PHASE_BLOCK  1
GRCh38#0#chr1     REFERENCE  GRCh38          0             chr1      NO_PHASE_BLOCK  NO_SUBRANGE
GRCh37#0#chr1#0   HAPLOTYPE  GRCh37          0             chr1      0               NO_SUBRANGE
sample1#1#chr1#0  HAPLOTYPE  sample1         1             chr1      0               NO_SUBRANGE
coolgene[7]       GENERIC    NO_SAMPLE_NAME  NO_HAPLOTYPE  coolgene  NO_PHASE_BLOCK  7
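
For downstream analysis, the TSV can be loaded into simple records. The following Python sketch is one possible way to do that. It is not part of vg. It assumes the column names and the NO_* placeholder strings are exactly those shown in the example output above, and the function name read_path_metadata is made up for illustration.

```python
import csv
import io

# Placeholder strings that appear in the example output for unset fields.
UNSET = {"NO_SAMPLE_NAME", "NO_HAPLOTYPE", "NO_PHASE_BLOCK", "NO_SUBRANGE"}

# Header and two rows copied from the example output above.
SAMPLE_TSV = "\n".join([
    "#NAME\tSENSE\tSAMPLE\tHAPLOTYPE\tLOCUS\tPHASE_BLOCK\tSUBRANGE",
    "sample1#2#chr1#0\tHAPLOTYPE\tsample1\t2\tchr1\t0\tNO_SUBRANGE",
    "coolgene[1]\tGENERIC\tNO_SAMPLE_NAME\tNO_HAPLOTYPE\tcoolgene\tNO_PHASE_BLOCK\t1",
])


def read_path_metadata(handle):
    """Read path metadata TSV into a list of dicts (hypothetical helper).

    The leading '#' is stripped from the first header name, and
    placeholder values for unset fields become None.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        rows.append({
            key.lstrip("#"): (None if value in UNSET else value)
            for key, value in row.items()
        })
    return rows


rows = read_path_metadata(io.StringIO(SAMPLE_TSV))
for row in rows:
    print(row["NAME"], row["SENSE"], row["SAMPLE"], row["SUBRANGE"])
```

To read real output instead of the embedded sample, the command's output could be saved to a file and passed to read_path_metadata as an open file handle.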

From this, we can see that every path has:

  1. A name, which is a string that uniquely identifies the path. This can be in PanSN format, and may have an additional trailing #-delimited or []-enclosed field.
  2. A sense. A path can be exactly one of haplotype sense (representing a haplotype that a particular individual has for part of a contig), reference sense (representing a path taken as part of a haploid or diploid linear reference like GRCh38 or CHM13), or generic sense (representing something else, like a gene or an aligned read).
  3. A sample. For haplotypes, this is the identifier for the sampled individual, like NA19239 or HG003. For references, this is the name of the reference assembly, like GRCh38. For generic paths, this is unset.
  4. A haplotype number, identifying which haplotype of a sample the path belongs to. For haplotype paths, this would be 1 or 2 in a diploid organism. For reference paths, this is meant to be 0 in a haploid reference, and 1 or 2 as appropriate in a diploid reference. For generic paths, this is unset.
  5. A locus name. This indicates the chromosome or contig, within an assembly, which the path relates to. For a haplotype path derived from a VCF, this would be the VCF contig name that the haplotype is on, like chr1. For a haplotype path derived from an assembly, this would be the assembly contig name, like JAHALY010000007.1. For a reference path, this is the name of the contig within the reference assembly being expressed. For a generic path, this is the name of the thing that the generic path represents, such as a gene name or user-provided string.
  6. A phase block. For haplotype paths, this is used when a contig is not phased end to end. In that case, there will be multiple haplotype paths on the contig with different phase block values, with the paths cut apart where phasing is unknown. For reference and generic paths, this is unset; for those paths, you should instead use subrange when multiple pieces of some longer path are present.
  7. A subrange, which has a start and an optional end coordinate. Positions are 0-based, start-inclusive, and end-exclusive. When this field is used, the path in the graph is part of some larger path that is not entirely in the graph. Multiple paths in the graph can have the same values for all the other metadata fields, as long as their subranges do not overlap. This field is only used for reference and generic paths; it is always unset for haplotype paths. For haplotype paths, the phase block field does much the same thing and should be used instead.
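
The relationship between a path name and its metadata fields can be illustrated with a small Python sketch. This is not vg's implementation. It only mirrors the pattern visible in the example output: four #-delimited fields are treated as a haplotype path, three as a reference path, and anything else as a generic path whose locus is the whole name. The "[start-end]" form for a subrange with an end coordinate is an assumption (the example only shows "[start]"), and the names PathInfo and parse_path_name are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple


class PathInfo(NamedTuple):
    """Metadata fields recovered from a path name (illustrative only)."""
    sense: str
    sample: Optional[str]
    haplotype: Optional[int]
    locus: str
    phase_block: Optional[int]
    subrange: Optional[Tuple[int, Optional[int]]]


# Optional trailing "[start]" or, by assumption, "[start-end]".
SUBRANGE = re.compile(r"^(?P<base>.*)\[(?P<start>\d+)(?:-(?P<end>\d+))?\]$")


def parse_path_name(name: str) -> PathInfo:
    """Split a path name into metadata fields, following the example output.

    This sketch does not enforce the rule that haplotype paths leave
    the subrange unset.
    """
    subrange = None
    match = SUBRANGE.match(name)
    if match:
        end = match.group("end")
        subrange = (int(match.group("start")), int(end) if end is not None else None)
        name = match.group("base")

    parts = name.split("#")
    if len(parts) == 4 and parts[1].isdigit() and parts[3].isdigit():
        # sample#haplotype#locus#phase_block
        return PathInfo("HAPLOTYPE", parts[0], int(parts[1]), parts[2],
                        int(parts[3]), subrange)
    if len(parts) == 3 and parts[1].isdigit():
        # sample#haplotype#locus
        return PathInfo("REFERENCE", parts[0], int(parts[1]), parts[2],
                        None, subrange)
    # Anything else is treated as a generic path named by its locus.
    return PathInfo("GENERIC", None, None, name, None, subrange)


for example in ["sample1#2#chr1#0", "GRCh38#0#chr1", "coolgene[1]"]:
    print(example, parse_path_name(example))
```

In practice the metadata reported by vg paths --metadata should be preferred over re-deriving fields from names, since a graph implementation may store metadata separately from the name.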