Skip to content

Commit

Permalink
Merge pull request #44 from TRON-Bioinformatics/ambiguous-bases
Browse files Browse the repository at this point in the history
Add ambiguous bases annotations
  • Loading branch information
priesgo authored Jan 30, 2023
2 parents 6163281 + 1e9be78 commit 63982b9
Show file tree
Hide file tree
Showing 11 changed files with 229 additions and 131 deletions.
19 changes: 10 additions & 9 deletions .github/workflows/integration_tests.yml
100644 → 100755
Original file line number Diff line number Diff line change
Expand Up @@ -7,27 +7,28 @@ jobs:
runs-on: ubuntu-20.04
strategy:
matrix:
python-version: [ 3.7, 3.8, 3.9, 3.10 ]
python-version: [ '3.7', '3.8', '3.9', '3.10' ]

steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
- uses: conda-incubator/setup-miniconda@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
auto-update-conda: true
channels: defaults,conda-forge,bioconda
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
sudo apt-get update
sudo apt-get --assume-yes install build-essential libcurl4-openssl-dev libz-dev liblzma-dev bedtools
sudo apt-get --assume-yes install build-essential libcurl4-openssl-dev libz-dev liblzma-dev
python -m pip install --upgrade pip
pip install setuptools wheel
# this is needed by cyvcf2 and pysam
conda install --yes htslib=1.14
# this is needed by pybedtools
sudo apt-get --assume-yes install bedtools
# install python requirements
pip install -r requirements.txt
- name: Install vafator
run: |
python setup.py bdist_wheel
pip install dist/*
pip install dist/vafator-*.whl
- name: Run integration tests
run: |
make clean integration_tests
14 changes: 10 additions & 4 deletions .github/workflows/unit_tests.yml
Original file line number Diff line number Diff line change
Expand Up @@ -7,16 +7,22 @@ jobs:
runs-on: ubuntu-20.04
strategy:
matrix:
python-version: [ 3.7, 3.8, 3.9, 3.10 ]
python-version: [ '3.7', '3.8', '3.9', '3.10' ]

steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install setuptools wheel
pip install setuptools wheel coverage
pip install virtualenv tox==3.26.0 tox-wheel==1.0.0 tox-gh-actions
- name: Build, install and run unit tests
run: |
tox
pip install -r requirements.txt
coverage run -m unittest discover vafator.tests
- name: Upload Coverage to Codecov
uses: codecov/codecov-action@v3
67 changes: 41 additions & 26 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
[![PyPI version](https://badge.fury.io/py/vafator.svg)](https://badge.fury.io/py/vafator)
[![Anaconda-Server Badge](https://anaconda.org/bioconda/vafator/badges/version.svg)](https://anaconda.org/bioconda/vafator)
[![Run unit tests](https://github.com/TRON-Bioinformatics/vafator/actions/workflows/unit_tests.yml/badge.svg?branch=master)](https://github.com/TRON-Bioinformatics/vafator/actions/workflows/unit_tests.yml)
[![codecov](https://codecov.io/gh/TRON-Bioinformatics/vafator/branch/master/graph/badge.svg?token=QLK84NI44G)](https://codecov.io/gh/TRON-Bioinformatics/vafator)
[![License](https://img.shields.io/badge/license-MIT-green)](https://opensource.org/licenses/MIT)


Expand All @@ -20,25 +21,28 @@ somatic variant calling, tumor evolution with multiple samples at different time
|-----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------|-------|-----------------|
| Allele frequency (AF) | Ratio of reads supporting the alternate allele | {sample}_af | float | A |
| Allele count (AC) | Count of reads supporting the alternate allele | {sample}_ac | int | A |
| Count ambiguous bases (N) | Count of ambiguous bases (N + all IUPAC ambiguity codes) overlapping the variant | {sample}_n | int | 1 |
| Depth of coverage (DP) | Count of reads covering the position of the variant | {sample}_dp | int | 1 |
| Expected allele frequency | Expected allele frequency assuming a multiplicity of the mutation m=1 (the number of DNA copies bearing a mutation) considering the given purity and ploidy/copy numbers | {sample}_eaf | float | 1 |
| Probability undetected | Probability that a given somatic mutation is undetected given the total coverage, supporting reads and expected allele frequency | {sample}_pu | float | A |
| Power | Power to detect a somatic mutation given the total coverage and expected allele frequency (Carter, 2012) | {sample}_pw | float | 1 |
| k | Minimum number of supporting reads such that the probability of observing k or more non-reference reads due to sequencing error is less than the defined false positive rate (FPR) (Carter, 2012) | {sample}_k | float | 1 |
| Mapping quality median | Median mapping quality of reads supporting each of the reference and the alternate alleles | {sample}_mq | float | R |
| Mapping quality rank sum test | Rank sum test comparing the MQ distribution between reads supporting the reference and the alternate. Values closer to zero indicate similar distributions | {sample}_rsmq | float | A |
| Mapping quality rank sum test p-value | Significance of the rank sum test comparing the MQ distribution between reads supporting the reference and the alternate. The null hypothesis is that there is no difference in the distributions | {sample}_rsmq_pv | float | A |
| Mapping quality rank sum test (§§) | Rank sum test comparing the MQ distribution between reads supporting the reference and the alternate. Values closer to zero indicate similar distributions. | {sample}_rsmq | float | A |
| Mapping quality rank sum test p-value (§§) | Significance of the rank sum test comparing the MQ distribution between reads supporting the reference and the alternate. The null hypothesis is that there is no difference in the distributions | {sample}_rsmq_pv | float | A |
| Base call quality median | Median base call quality of reads supporting each of the reference and the alternate alleles (not available for deletions) | {sample}_bq | float | R |
| Base call quality rank sum test | Rank sum test comparing the BQ distribution between reads supporting the reference and the alternate. Values closer to zero indicate similar distributions | {sample}_rsbq | float | A |
| Base call quality rank sum test p-value | Significance of the rank sum test comparing the BQ distribution between reads supporting the reference and the alternate. The null hypothesis is that there is no difference in the distributions | {sample}_rsbq_pv | float | A |
| Base call quality rank sum test (§§) | Rank sum test comparing the BQ distribution between reads supporting the reference and the alternate. Values closer to zero indicate similar distributions | {sample}_rsbq | float | A |
| Base call quality rank sum test p-value (§§) | Significance of the rank sum test comparing the BQ distribution between reads supporting the reference and the alternate. The null hypothesis is that there is no difference in the distributions | {sample}_rsbq_pv | float | A |
| Position median | Median position within reads supporting each of the reference and the alternate alleles (in indels this is the start position) | {sample}_pos | float | R |
| Position rank sum test | Rank sum test comparing the position distribution between reads supporting the reference and the alternate. Values closer to zero indicate similar distributions | {sample}_rspos | float | A |
| Position rank sum test p-value | Significance of the rank sum test comparing the position distribution between reads supporting the reference and the alternate. The null hypothesis is that there is no difference in the distributions | {sample}_rspos_pv | float | A |

| Position rank sum test (§§) | Rank sum test comparing the position distribution between reads supporting the reference and the alternate. Values closer to zero indicate similar distributions | {sample}_rspos | float | A |
| Position rank sum test p-value (§§) | Significance of the rank sum test comparing the position distribution between reads supporting the reference and the alternate. The null hypothesis is that there is no difference in the distributions | {sample}_rspos_pv | float | A |

§ cardinality is defined as in the VCF format specification: `A` refers to one value per alternate allele,
`R` refers to one value per possible allele (including the reference), `1` refers to one value.

§§ rank sum tests require at least one read supporting the reference and one read supporting the alternate


VAFator uses cyvcf2 (Pederson, 2017) to read/write VCF files and pysam (https://github.com/pysam-developers/pysam) to read BAM files.
Both libraries are cython wrappers around HTSlib (Bonfield, 2021).

Expand All @@ -64,9 +68,26 @@ vafator --input-vcf /path/yo/your.vcf \
```

This will add annotations for each of the three samples `normal`, `primary` and `metastasis`: `normal_ac`,
`normal_dp`, `normal_af`, `normal_pw`, `primary_ac`, `primary_dp`, etc.
`normal_dp`, `normal_af`, `normal_pw`, `primary_ac`, `primary_dp`, etc.

### Support for indels

VAFator provides equivalent annotations for indels. Depth of coverage and allele frequency are calculated on the
position immediately before the indel. Only insertions and deletions as recorded in the CIGAR matching the respective
coordinates and sequence from the VCF file are taken into account. Any read supporting a similar but not identical indel
is not counted.

**NOTE**: multiallelic mutations are not supported for indels, the indel in the multiallelic position will be
annotated with null values. This problem can be circumvented by using the Nextflow normalization pipeline described above.

### Support for MNVs

Not supported at the moment when not decomposed.

If running the nextflow pipeline indicated above, MNVs and complex variants are by default decomposed and hence
correctly annotated by VAFator.

## Support for technical replicates
### Support for technical replicates

If more than one BAM for the same sample is provided then the annotations are calculated across all BAMs
and for also each of them separately (eg: `primary_af` provides the allele frequency across all primary tumor BAMs,
Expand Down Expand Up @@ -169,6 +190,17 @@ The statistic value will be close to zero for similar distributions and further
The significance value corresponds to the null hypothesis of similar distributions.
No multiple test correction is applied over this p-value.

### Ambiguous bases

Some reads may contain ambiguous bases with high base call quality scores.
The count of all reads passing the quality thresholds that contain an
ambiguous base overlapping the mutation are annotated.
All IUPAC ambiguity codes are taken into account.

Also, these reads supporting ambiguous bases are not taken into account in the depth of coverage (DP)
as they may dilute the VAF values. In order to include those into the depth of coverage use the flag
`--include-ambiguous-bases`. Only SNVs are supported.

## Understanding the output

The output is a VCF with the some new annotations in the INFO field for the provided sample names.
Expand Down Expand Up @@ -208,23 +240,6 @@ nextflow run tron-bioinformatics/tronflow-vcf-postprocessing -r 2.2.0 -profile c
See https://github.com/TRON-Bioinformatics/tronflow-vcf-postprocessing for more details


## Support for indels

VAFator provides equivalent annotations for indels. Depth of coverage and allele frequency are calculated on the
position immediately before the indel. Only insertions and deletions as recorded in the CIGAR matching the respective
coordinates and sequence from the VCF file are taken into account. Any read supporting a similar but not identical indel
is not counted.

**NOTE**: multiallelic mutations are not supported for indels, the indel in the multiallelic position will be
annotated with null values. This problem can be circumvented by using the Nextflow normalization pipeline described above.

## Support for MNVs

Not supported at the moment when not decomposed.

If running the nextflow pipeline indicated above, MNVs and complex variants are by default decomposed and hence
correctly annotated by VAFator.

## Bibliography

- Pedersen, B. S., & Quinlan, A. R. (2017). cyvcf2: fast, flexible variant analysis with Python. Bioinformatics, 33(12), 1867–1869. https://doi.org/10.1093/BIOINFORMATICS/BTX057
Expand Down
2 changes: 1 addition & 1 deletion requirements.txt
100644 → 100755
Original file line number Diff line number Diff line change
Expand Up @@ -3,4 +3,4 @@ pysam~=0.19.1
cyvcf2~=0.30.14
logzero~=1.7.0
pybedtools~=0.9.0
scipy~=1.8.1
scipy>=1.0.0,<2.0.0
16 changes: 0 additions & 16 deletions tox.ini

This file was deleted.

5 changes: 4 additions & 1 deletion vafator/__init__.py
Original file line number Diff line number Diff line change
@@ -1 +1,4 @@
VERSION='2.1.0'
VERSION='2.2.0'


AMBIGUOUS_BASES = ['N', 'M', 'R', 'W', 'S', 'Y', 'K', 'V', 'H', 'D', 'B']
Loading

0 comments on commit 63982b9

Please sign in to comment.