BayesTyper performs genotyping of all types of variation (including SNPs, indels and complex structural variants) based on an input set of variants and read k-mer counts. Internally, BayesTyper uses exact alignment of k-mers to a graph representation of the input variants and reference sequence in combination with a probabilistic model of k-mer counts to do genotyping.
-
28 January 2019: Patch (v1.4.1)
- Updating to this patch is highly recommended since it fixes a bug introduced in v1.4 that resulted in
bayesTyper genotype
occasionally crashing on larger datasets (see release notes).
- Updating to this patch is highly recommended since it fixes a bug introduced in v1.4 that resulted in
-
18 October 2018: New release (v1.4) featuring:
- Sparsity estimation: Fixed bug when estimating the sparsity parameter used for the population prior. This fix should result in better estimates for complex clusters.
- Ploidy input file: The ploidy of each chromosome for each gender (female and male) can now be specified using
--chromosome-ploidy-file
in bayesTyper genotype. Ploidy levels 0, 1 (haploid) and 2 (diploid) are supported. Human ploidy levels are assumed if no file is given (see wiki for more details). - Genomic parameter estimation: Genomic parameters are now estimated using either haploid or diploid k-mers. The ploidy level with the highest number of informative k-mers is used for estimation.
- Noise parameter estimation: Noise parameters are now estimated using SNVs across all supported ploidy levels. In addition, SNVs in clusters are now also used in parameter estimation.
- Error handling: Incorrect inputs now produces more informative error messaging.
Because it allows you to obtain accurate genotypes spanning from SNVs/short indels to complex structural variants and hence provides a more complete picture of the genome as compared with standard methods - without sacrificing accuracy.
Standard methods for genotyping (e.g. GATK-HaplotypeCaller, Platypus and Freebayes) start from an alignment of reads (e.g. by BWA-MEM) and then
- Call candidate variants.
- Perform local realignment of reads anchored to a particular variable region to candidate variant haplotypes.
- Estimate genotypes (and thus final variant calls).
This approach can result in a bias towards the reference sequence since reads informative for a particular variant may have been left either unaligned (because of too large an edit distance to the reference) or have aligned better elsewhere in the reference.
The variant graph approach used by BayesTyper ensures that the resulting calls are not biased towards the reference sequence by effectively realigning all reads (or more specifically their k-mers) when genotyping candidate variants. In our recent paper, we show how this approach significantly improves both sensitivity and genotyping accuracy for most variant types - especially non-SNVs (please see citation below).
-
Download the latest static Linux x86_64 build (k=55) found under releases.
- To build from source, please refer to the build wiki for detailed build instructions. Note that the k-mer size is determined compile time. Hence, if you want k ≠ 55 you need to compile it yourself (or post a feature request and we will see what we can do).
-
Download the BayesTyper human data bundle (GRCh37 and GRCh38) containing reference sequences preprocessed for BayesTyper (i.e. canonical and decoy chromosomes) together with a reference matched variation prior database.
The BayesTyper genotyping process occurs in two stages:
- Generation of variant candidates (using other tools)
- Genotyping based on variant candidates (using BayesTyper)
As indicated, BayesTyper does not find candidate variants on its own. Instead, users can combine the variant discovery strategies suitable for their study as it will depend on the study design (e.g. coverage, number of samples etc.) as well as the available resources.
Below we outline an example strategy, where candiates are obtained using
- GATK-HaplotypeCaller (without all the pre-processing of alignments and post-processing of variants)
- Platypus
- Manta
- The variation prior (part of the BayesTyper data bundle, see Installation)
The complete workflow (i.e. BAM(s) to genotypes) outlined below is further provided as a snakemake workflow for easy deployment of BayesTyper. Please refer to the snakemake wiki for detailed instructions on how to set up and execute the workflow on your data.
Important: This workflow should work well for most cases. If you prefer to use another approach, please note that the candidate variant set should ideally contain at least 100,000 candidate SNVs that are not clustered with structural variants (needed for accurate estimation of parameters).
Important: Please note that it is currently only possible to genotype 30 samples at the time using BayesTyper. To run more samples, please execute BayesTyper in batches as described in the batching wiki. Batching is currently not supported by the snakemake workflow - please let us know if you require this feature by filing a feature request.
Important: Bayestyper supports uncompressed and gzip compressed vcf files. Please note that bgzip compression is currently not supported.
Important: If you intend to run on non-human data, please refer to the Running on non-human data wiki for more information.
Starting from a set of indexed, aligned reads (obtained e.g. using BWA-MEM):
-
For each sample, run HaplotypeCaller to get standard mapping-based candidates
- Running Base Quality Recalibration or doing joint genotyping is not required
- Faster alternative: Freebayes
-
For each sample, run Platypus to identify small and medium sized variants
-
For each sample, run Manta to identify candidates by de novo local assembly (important for detecting larger deletions and insertions). Convert allele IDs (e.g. <DEL>) in the Manta output to sequences using
bayesTyperTools convertAllele
-
For each caller, left-align and normalize variants using
bcftools norm
(bcftools) -
Combine variants across all samples, callers and the variation prior using
bayesTyperTools combine -v GATK:<gatk_sample1>.vcf,GATK:<gatk_sample2>.vcf,PLATYPUS:<platypus_sample1>.vcf,PLATYPUS:<platypus_sample2>.vcf,MANTA:<manta_sample1>.vcf,...,prior:<prior>.vcf -o <candiate_variants_prefix> -z
- Note: The source tag before the colon (e.g. GATK) only serves to identify the origin of the calls in the final callset - it can take any value.
- Important:
bayesTyperTools combine
requires the vcf header to contain contig entries (e.g.##contig=<ID=8,length=146364022>
) for all reference sequences containing variants in the vcf; the contigs further need to appear in the same order in the header and for the variant entries.
-
Count k-mers
- Run KMC3 on each sample (set k=55 using
-k55
and include singleton k-mers using-ci1
)- Note: Default is fq(.gz) input - bam input is enabled using
-fbam
. - Important: If the reads are in bam format, make sure the file also contains the unmapped reads.
- Note: Default is fq(.gz) input - bam input is enabled using
- Create a read k-mer bloom filter for each sample from the KMC3 output using
bayesTyperTools makeBloom -k <kmc_output_prefix> -p <num_threads>
- Important: The resulting bloom filter (<sample_id>.bloomMeta and <sample_id>.bloomData) and the KMC3 output (<sample_id>.kmc_pre and <sample_id>.kmc_suf) must reside in the same directory and have the same prefix
- Run KMC3 on each sample (set k=55 using
-
Identify variant clusters:
bayesTyper cluster -v <candiate_variants_prefix>.vcf.gz -s <samples>.tsv -g <ref_build>_canon.fa -d <ref_build>_decoy.fa -p <num_threads>
- Note: This partitions the candidate variants into units written to separate directories (bayestyper_unit_1, bayestyper_unit_2, ...) - each containing between 5M and 10M variants. These can be genotyped independently e.g. on a cluster (supported by the snakemake workflow) followed by a simple concatenation operation using
bcftools concat
(see below). - Important: The
<samples>.tsv
file should contain one sample per row with columns <sample_id>, <sex> and <kmc_output_prefix> and no header (example). - Important: Data that is common to all units and necessary for genotyping is written to the bayestyper_cluster_data directory.
- Note: Human reference files (canon/decoy) are provided in the BayesTyper data bundle (see Installation).
- Note: This partitions the candidate variants into units written to separate directories (bayestyper_unit_1, bayestyper_unit_2, ...) - each containing between 5M and 10M variants. These can be genotyped independently e.g. on a cluster (supported by the snakemake workflow) followed by a simple concatenation operation using
-
Genotype variant clusters:
bayesTyper genotype -v bayestyper_unit_<unit_id>/variant_clusters.bin -c bayestyper_cluster_data -s <samples>.tsv -g <ref_build>_canon.fa -d <ref_build>_decoy.fa -o bayestyper_unit_<unit_id>/bayestyper -z -p <num_threads>
- Note: The genotype command also applies the default BayesTyper hard-filters by setting the variant FILTER status and the sample specific allele filter (SAF) format attribute. Please refer to the filter wiki for information about the filters used, how to changes the defaults up front and how to refilter the data after running the genotyping step.
- Important: The filtering procedure only filters the genotypes ("./.") and hence, downstream tools should prefilter on the variant quality and FILTER=="PASS" (e.g. using bcftools) to obtain the filtered calls.
-
Concatenate units using bcftools:
bcftools concat -O z -o <output_prefix>.vcf.gz bayestyper_unit_1/bayestyper.vcf.gz bayestyper_unit_2/bayestyper.vcf.gz ...
- Important: Unit arguments to bcftools should be in ascending order (unit_1 unit_2 ...) for the output to be properly sorted
Number of samples | Coverage | Number of variant alleles | Max allele length (nts) | Number of threads | Wall time (h, single node) | Wall time (h, multiple nodes)* | Max memory (GB) |
---|---|---|---|---|---|---|---|
3 | 13x | 21.4M | 500,000 | 28 | 5-6 | 2-3 | 41 |
3 | 13x | 64.4M | 500,000 | 28 | 17-18 | 4-5 | 42 |
13 | 50x | 11.7M | 10,000 | 28 | 31-32 | 16-17 | 66 |
13 | 50x | 61.1M | 10,000 | 28 | 92-93 | 15-16 | 62 |
* bayesTyper genotype
can be distributed across nodes on a cluster - between 2 and 11 nodes were used in this benchmark.
The time estimates are for running bayesTyper cluster
and bayesTyper genotype
only. Expect <1h combined run-time per sample for counting k-mers by KMC3 and bloom filter creation by bayesTyperTools. All runs were done on a 64-bit Intel Xeon 2.00 GHz machine with 128 GB of memory using v1.3.
Sibbesen JA*, Maretty L*, The Danish Pan-Genome Consortium & Krogh A: Accurate genotyping accross variant classes and lengths using variant graphs. Nature Genetics, 2018. link. *Equal contributors.
- Sequencing and de novo assembly of 150 genomes from Denmark as a population reference. Nature, 2017 (link)
- A high-coverage Neandertal genome from Vindija Cave in Croatia. Science, 2017 (link)
- Analysis of 62 hybrid assembled human Y chromosomes exposes rapid structural changes and high rates of gene conversion. PLOS Genetics, 2017 (link)
- Assembly and analysis of 100 full MHC haplotypes from the Danish population, Genome Research, 2017 (link)
Please let us know if you use BayesTyper in your publication - then we will put it on the list.
Please post an issue if you have questions regarding how to run BayesTyper, if you want to report bugs or request new features. You can also reach us at jasi at binf dot ku dot dk or lasse dot maretty at clin dot au dot dk.
We thank the developers of the third-party libraries used by BayesTyper: