Giraffe best practices

Jouni Siren edited this page May 28, 2023 · 8 revisions

Use a recent version of vg. Pangenome tools are more complex and less mature than the software tools you are likely used to. A new vg release usually comes out about every 6 weeks and fixes issues that people have identified.

Inputs

Graphs

Giraffe uses a GBZ graph that combines the graph with haplotype information. The graph must be provided with option -Z / --gbz-name. Using other graph types with Giraffe is not recommended.

Distance index and minimizer index

In addition to the graph, Giraffe also needs a distance index and a minimizer index. These can be specified with options -d / --dist-name and -m / --minimizer-name. Specifying the index files explicitly is recommended in scripts and especially in production use.
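As a minimal sketch of what naming every index explicitly could look like in a script, the Python helper below assembles a Giraffe argument list from the options described above. The function name and the file names are hypothetical, and actually running the command (and deciding where its output goes) is left to the caller.

```python
import shlex


def giraffe_command(gbz, dist, minimizer, fastq):
    """Build a vg giraffe argument list that names the graph and both indexes."""
    return [
        "vg", "giraffe",
        "--gbz-name", gbz,              # GBZ graph (-Z)
        "--dist-name", dist,            # distance index (-d)
        "--minimizer-name", minimizer,  # minimizer index (-m)
        "--fastq-in", fastq,            # reads (-f)
    ]


# Hypothetical file names, used only for illustration.
cmd = giraffe_command("graph.gbz", "graph.dist", "graph.min", "reads.fq.gz")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.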

If the indexes are not specified and the graph is named graph.gbz or graph.giraffe.gbz, Giraffe will guess that the indexes are in files graph.dist and graph.min. If these files do not exist, Giraffe will try to rebuild the indexes. This can cause issues when running multiple Giraffe jobs using the same indexes.
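One way to guard against an unexpected rebuild is to check for the guessed index files before submitting any jobs. The sketch below only mirrors the naming rule stated above; it is not Giraffe's own code, the function names are hypothetical, and raising an error for other file names is an assumption made for this example.

```python
from pathlib import Path


def guessed_index_paths(gbz_path):
    """Return the (distance, minimizer) index paths implied by the graph name."""
    path = Path(gbz_path)
    # Check the longer suffix first so graph.giraffe.gbz maps to "graph".
    for suffix in (".giraffe.gbz", ".gbz"):
        if path.name.endswith(suffix):
            base = path.name[: -len(suffix)]
            return path.parent / (base + ".dist"), path.parent / (base + ".min")
    # Assumption for this sketch: other names are treated as an error.
    raise ValueError(f"{gbz_path} does not end in .gbz")


def missing_indexes(gbz_path):
    """List the guessed index files that do not exist yet."""
    return [p for p in guessed_index_paths(gbz_path) if not p.exists()]
```

A launcher script could refuse to start any Giraffe job while `missing_indexes(...)` is non-empty, so that the indexes are built once, up front, rather than by the mapping jobs.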

The distance index is a memory-mapped file. As of vg 1.48.0, the file is opened in read+write mode by default. This can cause issues on HPC clusters and in other distributed environments, where multiple computers try to access the same distance index file. To avoid this, make the file read-only.
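Making the file read-only can be done with `chmod a-w` on the command line. The Python sketch below does the same thing from a script by clearing the write permission bits; the function name is hypothetical.

```python
import os
import stat


def make_read_only(path):
    """Clear the user, group, and other write bits; keep all other bits."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    os.chmod(path, mode & ~write_bits)
```

For example, `make_read_only("graph.dist")` could be called once after the distance index has been built and before any mapping jobs are submitted.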

Giraffe relies on distance index annotations in the minimizer index. Without them, mapping will be slow. Minimizer indexes built using vg autoindex always contain them. If you build the minimizer index manually using vg minimizer, you must specify the distance index using option -d / --distance-index.
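A manual build might be scripted roughly as follows. Only `-d / --distance-index` comes from the text above; the `--output-name` option and the positional graph argument are assumptions about the `vg minimizer` command line and should be checked against `vg minimizer --help` for your version. The function and file names are hypothetical.

```python
def minimizer_command(gbz, dist, output):
    """Build a vg minimizer argument list that passes the distance index."""
    return [
        "vg", "minimizer",
        "--distance-index", dist,  # from the text: needed for the annotations
        "--output-name", output,   # assumption: output option of vg minimizer
        gbz,                       # assumption: graph given as positional argument
    ]


cmd = minimizer_command("graph.gbz", "graph.dist", "graph.min")
```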

Reads

The reads can be either in FASTQ format (option -f / --fastq-in) or in GAM format (-G / --gam-in). A FASTQ file can be gzip-compressed.

By default, Giraffe does single-end mapping. For paired-end mapping, either specify two FASTQ files or use option -i / --interleaved with a single interleaved input file.
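The three input modes can be summarized in a small helper that produces the read-related options. This is a sketch under the assumption that two FASTQ files are given by repeating `--fastq-in`; the function name and file names are hypothetical.

```python
def read_options(fastqs, interleaved=False):
    """Return the Giraffe read options for single-end or paired-end input."""
    if interleaved:
        if len(fastqs) != 1:
            raise ValueError("interleaved input needs exactly one FASTQ file")
        return ["--fastq-in", fastqs[0], "--interleaved"]
    if len(fastqs) not in (1, 2):
        raise ValueError("expected one (single-end) or two (paired-end) FASTQ files")
    options = []
    for fastq in fastqs:
        # Assumption: the option is repeated once per file.
        options += ["--fastq-in", fastq]
    return options
```

With this helper, `read_options(["r1.fq.gz", "r2.fq.gz"])` describes paired-end mapping from two files, and `read_options(["pairs.fq.gz"], interleaved=True)` describes paired-end mapping from one interleaved file.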

When doing paired-end mapping, Giraffe assumes that the first few thousand reads are a representative sample of the whole file and uses them to estimate the fragment length distribution. If the input file is sorted, this assumption does not hold, and you must specify the fragment length distribution using options --fragment-mean and --fragment-stdev.
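If you already have fragment lengths from a trusted source, such as an earlier run on an unsorted sample of the same library, the two options can be derived from them. The sketch below assumes such a list of lengths is available; where it comes from is up to you, and the function name is hypothetical.

```python
import statistics


def fragment_options(fragment_lengths):
    """Turn observed fragment lengths into --fragment-mean/--fragment-stdev options."""
    mean = statistics.fmean(fragment_lengths)
    stdev = statistics.stdev(fragment_lengths)  # sample standard deviation
    return [
        "--fragment-mean", f"{mean:.1f}",
        "--fragment-stdev", f"{stdev:.1f}",
    ]
```

The returned list can be appended to the rest of the Giraffe arguments when mapping a sorted input file.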
