Skip to content

Commit

Permalink
Merge pull request #13 from TRON-Bioinformatics/support-indels
Browse files Browse the repository at this point in the history
Support indels
  • Loading branch information
priesgo authored Nov 10, 2021
2 parents 6a1374a + e630700 commit 3b46385
Show file tree
Hide file tree
Showing 21 changed files with 1,442 additions and 184 deletions.
122 changes: 84 additions & 38 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,14 +6,14 @@
[![Run unit tests](https://github.com/TRON-Bioinformatics/vafator/actions/workflows/unit_tests.yml/badge.svg?branch=master)](https://github.com/TRON-Bioinformatics/vafator/actions/workflows/unit_tests.yml)
[![License](https://img.shields.io/badge/license-MIT-green)](https://opensource.org/licenses/MIT)

Annotate the variants in a VCF file with technical annotations from one or more BAM files.

Install from PyPI (`pip install vafator`) or from bioconda (`conda install bioconda::vafator`).
VAFator annotates the variants in a VCF file with technical annotations from multiple BAM files.
Supports annotating somatic variant calls with the annotations from the normal and the tumor samples; although
it can also be used for germline variant calls.

Annotations:

* **Allele frequency (AF)**: ratio of reads supporting the alternate allele. When more than one alternate allele is present then one value per alternate allele is provided.
* **Allele count (AC)**: count of reads supporting the alternate allele. When more than one alternate allele is present then one value per alternate allele is provided.
* **Allele frequency (AF)**: ratio of reads supporting the alternate allele.
* **Allele count (AC)**: count of reads supporting the alternate allele.
* **Depth of coverage (DP)**: number of reads covering the position of the variant

Outputs a VCF with the following annotations in the INFO field for tumor and normal:
Expand All @@ -22,44 +22,90 @@ chr1 12345 . A G . PASS tumor_af=0.0;tumor_ac=
chr2 12345 . A G,T . PASS tumor_af=0.2,0.2;tumor_ac=2,2;tumor_dp=10;normal_af=0.0,0.0;normal_ac=0,0;normal_dp=10
```

Both tumor and normal BAMs are optional, it can annotate only with the tumor BAM and viceversa.
**NOTE**: notice that VAFator does not annotate samples in the FORMAT field


## How to install

Install from PyPI (`pip install vafator`) or from bioconda (`conda install bioconda::vafator`).


## How to run

Run as follows:
```
vafator --input-vcf /path/yo/your.vcf \
--output-vcf /path/yo/your_vafator.vcf \
--normal-bams /path/to/your_normal.bam \
--tumor-bams /path/to/your_primary_tumor.bam,/path/to/your_metastasis_tumor.bam
```

Both tumor and normal BAMs are optional, it can annotate only with the tumor BAMs or with the normal BAMs.

If more than one BAM is provided for either the tumor or the normal then the annotations are calculated across all BAMs
and for also each of them separately (eg: `tumor_af` provides the allele frequency across all tumor BAMs, `tumor_af_1`
and `tumor_af_2` provide the allele frequency on the first and second BAM respectively).

Also, a `--prefix` parameter can be used to run VAFator multiple times over the same VCF without overwriting its
annotations. For instance, first `vafator [...] --prefix primary --tumor-bams /path/to/your_primary_tumor.bam` and
then `vafator [...] --prefix metastasis --tumor-bams /path/to/your_metastasis_tumor.bam`.

Use the parameters `--mapping-quality` and `--base-call-quality` to define the minimum quality values for each read.
All reads with quality values velow these thresholds will be filtered out.

Overlapping reads from read pairs are not double counted. The read with the highest base call quality is chosen.

## Filter for multi-allelic variants

Multi-allelic variants are those that have more than one alternative allele (e.g.: A>C,G).
This tool allows to select the allele with the highest allele frequency and filter out the lower frequency allele.

Run it as follows:
Run as follows:
```
$ vafator --help
usage: vafator [-h] --input-vcf INPUT_VCF --output-vcf OUTPUT_VCF
[--normal-bams NORMAL_BAMS [NORMAL_BAMS ...]]
[--tumor-bams TUMOR_BAMS [TUMOR_BAMS ...]]
[--mapping-quality MAPPING_QUALITY]
[--base-call-quality BASE_CALL_QUALITY]
[--prefix BASE_CALL_QUALITY]
optional arguments:
-h, --help show this help message and exit
--input-vcf INPUT_VCF
The VCF to annotate (default: None)
--output-vcf OUTPUT_VCF
The annotated VCF (default: None)
--normal-bams NORMAL_BAMS [NORMAL_BAMS ...]
Whitespace-separated list of normal BAMs to analyse
(default: [])
--tumor-bams TUMOR_BAMS [TUMOR_BAMS ...]
Whitespace-separated list of tumor BAMs to analyse
(default: [])
--mapping-quality MAPPING_QUALITY
All reads with a mapping quality lower or equal than
this threshold will be filtered out (default: 0)
--base-call-quality BASE_CALL_QUALITY
All bases with a base call quality lower or equal than
this threshold will be filtered out (default: 29)
--prefix PREFIX
When provided the annotations are preceded by this prefix, otherwise the annotations
are named as tumor_af, normal_af, tumor_ac, normal_ac, tumor_dp and normal_dp
Copyright (c) 2019-2021 TRON gGmbH (See LICENSE for licensing details)
multiallelics-filter --input-vcf /path/to/your_vafator.vcf \
--output-vcf /path/to/your_vafator_filtered.vcf \
--tumor-sample-name <SAMPLE>
```

The above will look for the annotation `<SAMPLE>_af` and for multi-allelic variants it will filter out those with lower
frequencies. Beware, that if the multi-allelic variants are split into more than one line in the VCF nothing will be
filtered out.

## Run as a Nextflow pipeline

VAFator is available as a Nextflow pipeline for convenience.

Run as follows:
```
nextflow run tron-bioinformatics/vafator -r 1.0.0 -profile conda --input_files /path/to/your.tsv
```

where `--input_files` expects four tab-separated columns **without a header**:

| Sample name | VCF | Tumor BAMs | Normal BAMs |
|-----------------|------------------------|-------------------------|-------------------------|
| sample_1 | /path/to/sample_1.vcf | /path/to/sample_1_tumor_1.bam,/path/to/sample_1_tumor_2.bam | /path/to/sample_1_normal.bam |
| sample_2 | /path/to/sample_2.vcf | /path/to/sample_2_tumor.bam | /path/to/sample_1_normal.bam |

Optional parameters:

* `--output`: the folder where to publish output
* `--skip_multiallelic_filter`: skip the filtering of multiallelics by frequency in the tumor (only highest frequency
variant at the same position is kept)
* `--base_call_quality`: minimum base call quality, reads with values below will be filtered out (default: 30)
* `--mapping_quality`: minimum mapping quality, reads with values below will be filtered out (default: 1)
* `--prefix`: when provided the annotations are preceded by this prefix (e.g.: <PREFIX>_tumor_ac, <PREFIX>_tumor_af, etc),
otherwise the annotations are named as tumor_af, normal_af, tumor_ac, normal_ac, tumor_dp and normal_d

## Support for indels

VAFator provides equivalent annotations for indels. Depth of coverage and allele frequency are calculated on the
position immediately before the indel. Only insertions and deletions as recorded in the CIGAR matching the respective
coordinates and sequence from the VCF file are taken into account. Any read supporting a similar but not equal indel
will not be counted.
Also, multiallelic mutations are not supported for indels.


## Support for MNVs

Not supported
9 changes: 0 additions & 9 deletions environment.yml

This file was deleted.

12 changes: 8 additions & 4 deletions nextflow.config
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,9 @@ params.cpus = 1
params.memory = "4g"

profiles {
conda { process.conda = "$baseDir/environment.yml" }
conda {
params.enable_conda = true
}
debug { process.beforeScript = 'echo $HOSTNAME' }
test {
params.cpus = 1
Expand All @@ -25,7 +27,7 @@ process.shell = ['/bin/bash', '-euo', 'pipefail']
cleanup = true
conda.createTimeout = '1 h'

VERSION = '0.4.3'
VERSION = '1.0.0'

manifest {
name = 'TRON-Bioinformatics/vafator'
Expand Down Expand Up @@ -58,8 +60,10 @@ Input:
Optional input:
* output: the folder where to publish output
* skip_multiallelic_filter: skip the filtering of multiallelics by frequency in the tumor (only highest frequency variant at the same position is kept)
* base_call_quality: threshold for the base call quality, only base calls with a higher value are considered (default: 29)
* mapping_quality: threshold for the mapping quality, only reads with a higher value are considered (default: 0)
* base_call_quality: minimum base call quality, reads with values below will be filtered out (default: 30)
* mapping_quality: minimum mapping quality, reads with values below will be filtered out (default: 1)
* prefix: when provided the annotations are preceded by this prefix (e.g.: <PREFIX>_tumor_ac, <PREFIX>_tumor_af, etc),
otherwise the annotations are named as tumor_af, normal_af, tumor_ac, normal_ac, tumor_dp and normal_d
Output:
* Annotated VCF file
Expand Down
5 changes: 5 additions & 0 deletions nf_modules/vafator.nf
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ params.mapping_quality = false
params.base_call_quality = false
params.skip_multiallelic_filter = false
params.prefix = false
params.enable_conda = false


process VAFATOR {
Expand All @@ -13,6 +14,8 @@ process VAFATOR {
tag "${name}"
publishDir "${params.output}/${name}", mode: "copy"

conda (params.enable_conda ? "bioconda::vafator=1.0.0" : null)

input:
tuple val(name), file(vcf), val(normal_bams), val(tumor_bams)

Expand All @@ -39,6 +42,8 @@ process MULTIALLELIC_FILTER {
tag "${name}"
publishDir "${params.output}/${name}", mode: "copy"

conda (params.enable_conda ? "bioconda::vafator=1.0.0" : null)

input:
tuple val(name), file(vcf)

Expand Down
3 changes: 2 additions & 1 deletion requirements.txt
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
pandas==1.3.3
pysam==0.17.0
cyvcf2==0.30.11
cyvcf2==0.30.11
logzero==1.7.0
2 changes: 1 addition & 1 deletion vafator/__init__.py
Original file line number Diff line number Diff line change
@@ -1 +1 @@
VERSION='0.4.0'
VERSION='1.0.0'
Loading

0 comments on commit 3b46385

Please sign in to comment.