-
Notifications
You must be signed in to change notification settings - Fork 194
SV Genotyping and variant calling
While this pipeline will genotype all types of variation, it has been mostly tested on and used for Structural Variation.
SVs can be encoded in a variation graph in the same manner as any other type of variant. Large insertions and deletions can likewise be encoded in VCF in the same way as smaller variants, by specifying the relevant sequences in the REF and ALT fields. The VCF format also contains support for "symbolic alleles" for SVs. Of these, vg construct
supports <INS>, <DEL>
and <INV>
.
Graphs with symbolic alleles can be constructed with vg construct -S
:
vg construct -f -S -I insertions.fa -a -r ref.fasta -v vars.vcf.gz > graph.vg
insertions.fa
is a FASTA file containing the sequences of insertions from the VCF (i.e. the allele fields in the VCF match the FASTA identifiers of the relevant sequence).
Symbolic SV support is very sensitive to a variety of VCF INFO fields and, in part due to the looseness of the specification, often doesn't work on a given VCF. It is therefore recommended to just convert symbolic VCFs to explicit VCFs before construction, when possible, to avoid these pitfalls.
The required input is
- a graph, ideally in
.gbz
format, but any vg-supported format such as.vg
,.gfa
or.xg
will do. - a read alignment in either
.gaf.gz
,.gaf
or.gam
format. Tools to produce these includevg giraffe
,vg map
,vg mpmap
andGraphAligner
.
# Compute the read support (use -a instead of -g for .gaf.gz input)
vg pack -x whole-genome.gbz -g whole-genome.gam -o whole-genome.pack -Q 5
# Compute the snarls (using fewer threads with `-t` can reduce memory at the cost of increased runtime)
vg snarls whole-genome.gbz > whole-genome.snarls
# Genotype the graph (add -a to genotype all sites including 0/0)
# The -z option restricts possible alleles to haplotypes in the GBZ which is usually faster and more accurate but only applies to GBZ input
vg call whole-genome.gbz -r whole-genome.snarls -k whole-genome.pack -s <sample> -z > genotypes.vcf
If an insertion fasta file was needed to construct
with -I
, it must be passed to call
with -i
.
If the input graph was created with vg construct -a
, then it can be used to genotype the original VCF file (as opposed to the above approach which may shift around variants due to the round trip between the graph and VCF representations).
This only works if the graph was constructed with vg construct -a -v whole-genome-vars.vcf.gz
.
# create an XG index (-L option must be used)
vg index whole-genome.vg -L -x whole-genome.xg
# Compute the read support (use -a instead of -g for .gaf.gz input)
vg pack -x whole-genome.gbz -g whole-genome.gam -o whole-genome.pack -Q 5
# Compute the snarls (using fewer threads with `-t` can reduce memory at the cost of increased runtime)
vg snarls whole-genome.gbz > whole-genome.snarls
# Genotype the graph
# the VCF passed passed in with -v must be the same as used to construct the graph!
vg call whole-genome.gbz -r whole-genome.snarls -k whole-genome.pack -s <sample> -v whole-genome-vars.vcf.gz > genotypes.vcf
If an insertion fasta file was needed to construct
with -I
, it must be passed to call
with -i
.
These commands only work when the graph is constructed with vg construct -a
from the VCF in question. A graph made with vg autoindex
will not work. To use this pipeline with giraffe
, you therefore need to index the graph yourself. In both cases it is crucial that the INPUT VCF IS PHASED:
vg construct -a -r ref.fasta -v test.vcf > test.vg
vg index test.vg -L -x test.xg
vg index test.vg -v test.vcf -G test.gbwt
vg gbwt -x test.xg test.gbwt -g test.gbz --gbz-format
vg index test.gbz -j test.dist
vg minimizer test.gbz -d test.dist -o test.min
You can now use test.gbz
, test.dist
and test.min
as input for vg giraffe
and test.xg
as input for vg call
.
Warning de novo variant calling as described in this section does not work for SVs. And even for small variants like SNPs and indels you will often get better performance by using
vg surject
to project your mappings into a BAM file, then using the BAM file with a linear caller such as DeepVariant or FreeBayes.
This is a simpler but slower method than that outlined in the next section. The results should be identical.
Calling variants from the graph requires augmenting the graph from the reads. This step cannot be performed on an XG index, as its read-only. In addition, VG-format graphs are too inefficient. Until the entire toolset is updated to use a more efficient format, the first step is to convert the graph into packed-graph format.
# convert the graph (can be vg or xg)
vg convert whole-genome.xg -p > whole-genome.pg
Now the graph can be called by augmenting it before running vg call
. It is important to use vg augment -m
to set a minimum coverage. The value should be chosen based on the average coverage of the GAM (obtainable by vg depth
). Typically, a value of about 10% of the depth is reasonable. An adaptive version that chooses it automatically is in the works. We will assume 30X coverage and use -m 4
below.
# augment the graph, filtering out mappings and bases with quality < 5, and breakpoints with coverage < 4
vg augment whole-genome.pg whole-genome.gam -m 4 -q 5 -Q 5 -A whole-genome-aug.gam > whole-genome-aug.pg
# compute the snarls
vg snarls whole-genome-aug.pg > whole-genome-aug.snarls
# compute the support
vg pack -x whole-genome-aug.pg -g whole-genome-aug.gam -o whole-genome-aug.pack
# call variants on every path in the graph
vg call whole-genome-aug.pg -r whole-genome-aug.snarls -k whole-genome-aug.pack -s <sample> > whole-genome-calls.vcf
vg augment
can take upwards of 10 hours on human-genome sized graphs. When resources are available it can be faster to parallelize this step by splitting the graph into components. Below, we will assume a graph created using vg construct
in which case each reference path will correspond to a connected component. This is not a requirement, however. Connected components can be computed by vg chunk
without using paths by using -C
instead of -M
.
Splitting the graph into components (chromosomes in this case) can be done as follows. -M
makes a component for each path in the graph. A subset can be selected by, ex, replacing -M
with -C -p chr1 -p chr2
vg chunk -x whole-genome.xg -M -O pg
If the paths in the graph are of the form 1 2 3 4 ... X Y
, this command will result in files chunk_1.pg, chunk_2.pg ... chunk_Y.pg
being created in the current directory. And each one can be called (using the whole-genome GAM) as described in the previous section, in parallel if desired.
# augment the graph, filtering out mappings and bases with quality < 5, and breakpoints with coverage < 4
# the -s option is crucial
vg augment chunk_1.pg whole-genome.gam -s -m 4 -q 5 -Q 5 -A chunk_1_aug.gam > chunk_1_aug.pg
# compute the snarls
vg snarls chunk_1_aug.pg > chunk_1_aug.snarls
# compute the support
vg pack -x chunk_1_aug.pg -g chunk_1_aug.gam -o chunk_1_aug.pack
# call variants on the component
vg call chunk_1_aug.pg -r chunk_1_aug.snarls -k chunk_1_aug.pack -s <sample> > 1.vcf
On a distributed computing system, it may be inefficient to copy the whole-genome GAM to each augment job. If desired, the GAM can be split into components as well by passing it to vg chunk
by adding -g -a whole-genome.gam
to its invocation. This will create chr_1.gam, chr_2.gam etc.
that can be used in place of whole-genome.gam in the above steps.