Giraffe best practices
Use a recent version of vg. Pangenome tools are more complex and less mature than the software tools you are likely used to. A new vg release usually appears about every 6 weeks and fixes issues that people have identified.
Giraffe uses a GBZ graph that combines the graph with haplotype information. The graph must be provided with option -Z / --gbz-name. Using other graph types with Giraffe is not recommended.
In addition to the graph, Giraffe also needs a distance index and a minimizer index. These can be specified with options -d / --dist-name and -m / --minimizer-name. Specifying the index files explicitly is recommended in scripts and especially in production use.
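As a minimal sketch of naming every index explicitly in a script, the helper below assembles a Giraffe argument list from the options described above. The function name giraffe_command and the file names are hypothetical; the command is only built and printed here, not executed.

```python
def giraffe_command(gbz, dist, minimizer, fastq):
    """Build a vg giraffe argument list with graph and indexes named explicitly.

    The helper name and the example file names are illustrative assumptions.
    """
    return [
        "vg", "giraffe",
        "-Z", str(gbz),        # GBZ graph
        "-d", str(dist),       # distance index
        "-m", str(minimizer),  # minimizer index
        "-f", str(fastq),      # FASTQ reads (may be gzip-compressed)
    ]


cmd = giraffe_command("graph.gbz", "graph.dist", "graph.min", "reads.fq.gz")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run once the paths point at real files.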
If the indexes are not specified and the graph is named graph.gbz or graph.giraffe.gbz, Giraffe will guess that the indexes are in files graph.dist and graph.min. If these files do not exist, Giraffe will try to rebuild the indexes. This can cause issues when running multiple Giraffe jobs using the same indexes.
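One way to guard against an accidental rebuild is to check for the index files before launching any jobs. The sketch below mirrors the naming convention described above; it is an approximation written for illustration, not Giraffe's own lookup code, and the helper names are hypothetical.

```python
from pathlib import Path


def guessed_index_paths(gbz_path):
    """Return the (distance, minimizer) index paths implied by the graph name.

    Approximates the documented convention: graph.gbz or graph.giraffe.gbz
    maps to graph.dist and graph.min in the same directory.
    """
    path = Path(gbz_path)
    # Check the longer suffix first so ".giraffe.gbz" is stripped as a whole.
    for suffix in (".giraffe.gbz", ".gbz"):
        if path.name.endswith(suffix):
            base = path.name[: -len(suffix)]
            return path.parent / (base + ".dist"), path.parent / (base + ".min")
    raise ValueError(f"not a GBZ file name: {gbz_path}")


def missing_indexes(gbz_path):
    """List the expected index files that do not exist yet."""
    return [p for p in guessed_index_paths(gbz_path) if not p.exists()]
```

A job script could refuse to start when missing_indexes returns a non-empty list, so that concurrent jobs never try to build the same files.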
The distance index is a memory-mapped file. As of vg 1.48.0, the file will be opened in read+write mode by default. This can cause issues in HPC clusters and other distributed environments, where multiple computers try to access the same distance index file. To avoid this, make the file read-only.
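Making the file read-only can be done with ordinary file permissions. The following sketch clears the write bits on a file while leaving the other permission bits alone; the helper name make_read_only is hypothetical, and whether permissions alone are enough on a given shared filesystem is an assumption to verify locally.

```python
import os
import stat


def make_read_only(path):
    """Remove write permission for user, group, and others on one file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    os.chmod(path, mode & ~write_bits)
```

For example, make_read_only("graph.dist") would be run once, after the index has been built and before any mapping jobs start.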
Giraffe relies on distance index annotations in the minimizer index. Without them, mapping speed will be slow. Minimizer indexes built using vg autoindex always contain them. If you build the minimizer index manually using vg minimizer, you must specify the distance index using option -d / --distance-index.
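A manual build that includes the distance index might be assembled as in the sketch below. Only the -d option comes from the text above; the -o output option and the graph as a trailing positional argument are assumptions that should be checked against vg minimizer --help for your vg version. The helper name is hypothetical.

```python
def minimizer_command(gbz, dist, output):
    """Build a vg minimizer argument list that passes the distance index.

    Assumptions: "-o" names the output file and the GBZ graph is the final
    positional argument. Verify both against your installed vg version.
    """
    return [
        "vg", "minimizer",
        "-d", str(dist),    # distance index, needed for the annotations
        "-o", str(output),  # assumed output option
        str(gbz),           # assumed positional graph argument
    ]


minimizer_cmd = minimizer_command("graph.gbz", "graph.dist", "graph.min")
print(" ".join(minimizer_cmd))
```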
The reads can be either in FASTQ format (option -f / --fastq-in) or in GAM format (-G / --gam-in). A FASTQ file can be gzip-compressed.
By default, Giraffe does single-end mapping. For paired-end mapping, either specify two FASTQ files or use option -i / --interleaved with a single interleaved input file.
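The three input modes can be summarized in a small helper that produces the read-related arguments. This is a sketch under the assumption that two FASTQ files are given by repeating -f; the helper name read_input_args and the file names are hypothetical.

```python
def read_input_args(fastq1, fastq2=None, interleaved=False):
    """Return the read-input arguments for single-end or paired-end mapping.

    Assumes that a second FASTQ file is passed with a second "-f".
    """
    if fastq2 is not None and interleaved:
        raise ValueError("use either two FASTQ files or one interleaved file")
    args = ["-f", str(fastq1)]
    if fastq2 is not None:
        args += ["-f", str(fastq2)]  # paired-end, two files
    elif interleaved:
        args.append("-i")            # paired-end, one interleaved file
    return args                      # otherwise single-end
```

For example, read_input_args("reads_1.fq.gz", "reads_2.fq.gz") yields two -f options, while read_input_args("reads.fq.gz", interleaved=True) adds -i.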
When doing paired-end mapping, Giraffe assumes that the first few thousand reads are a representative sample of the overall file and uses them to estimate the fragment length distribution. If the input file is sorted, this assumption does not hold, and you must specify the fragment length distribution using options --fragment-mean and --fragment-stdev.
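If fragment lengths are already known from another source, such as a previous run on unsorted reads from the same library, the two options can be derived from them. The sketch below uses the sample standard deviation from Python's statistics module; the helper name, the one-decimal formatting, and the example lengths are illustrative assumptions.

```python
import statistics


def fragment_args(fragment_lengths):
    """Turn known fragment lengths into --fragment-mean / --fragment-stdev."""
    if len(fragment_lengths) < 2:
        raise ValueError("need at least two fragment lengths")
    mean = statistics.mean(fragment_lengths)
    stdev = statistics.stdev(fragment_lengths)  # sample standard deviation
    return [
        "--fragment-mean", f"{mean:.1f}",
        "--fragment-stdev", f"{stdev:.1f}",
    ]


# Made-up lengths, for illustration only.
print(fragment_args([300, 400, 500]))
```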