Commit

Merge pull request #172 from eastgenomics/release-2.7.1
Release_2.7.1 (#172)

Co-Authored-By: Jethro Rainford <[email protected]>
Co-Authored-By: mattgarner <[email protected]>
mattgarner and jethror1 authored Feb 16, 2024
2 parents b1daf06 + 3d342bb commit 4e56485
Showing 16 changed files with 6,695 additions and 280 deletions.
37 changes: 37 additions & 0 deletions .github/workflows/pytest.yml
@@ -0,0 +1,37 @@
name: pytest
on: [push]

jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.8
uses: actions/setup-python@v1
with:
python-version: 3.8
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install pipenv codecov
pip install -r requirements.txt
pipenv install --dev
- name: Build bcftools
run: |
wget https://github.com/samtools/bcftools/releases/download/1.18/bcftools-1.18.tar.bz2
tar xf bcftools-1.18.tar.bz2
cd bcftools-1.18
./configure --prefix=/usr/local/
sudo make
sudo make install
- name: Build htslib
run: |
wget https://github.com/samtools/htslib/releases/download/1.18/htslib-1.18.tar.bz2
tar xf htslib-1.18.tar.bz2
cd htslib-1.18
./configure --prefix=/usr/local/
sudo make
sudo make install
- name: Test with pytest
run: |
pytest -vv --cov resources/home/dnanexus/generate_workbook/
9 changes: 6 additions & 3 deletions Readme.md
@@ -2,9 +2,11 @@

# egg_generate_workbook (DNAnexus Platform App)

![pytest](https://github.com/eastgenomics/eggd_generate_variant_workbook/actions/workflows/pytest.yml/badge.svg)

## What does this app do?

Generate an Excel workbook from VEP annotated vcf(s)
Generates an Excel workbook from vcf(s)

## What are typical use cases for this app?

@@ -21,7 +23,7 @@ This app may be executed as a standalone app.

**File inputs (required)**:

- `--vcfs`: VEP annotated vcf(s)
- `--vcfs`: vcf file(s) to write to Excel workbook sheets

**Other Inputs (optional):**

@@ -84,7 +86,7 @@ This app may be executed as a standalone app.

`--human_filter` (`string`): String to add to summary sheet with humanly readable form of the given filter string. No checking is done of this matching the actual filter(s) used.

`--acmg` (`bool`): Adds extra sheet to workbook with reporting criteria against ACMG classifications
`--acmg` (`int`): Number of extra sheet(s) to be added to workbook with reporting criteria against ACMG classifications

`--panel` (`string`): Name of panel to display in summary sheet.

@@ -98,6 +100,7 @@ This app may be executed as a standalone app.

`--split_hgvs` (`bool`): If true, the c. and p. changes in HGVSc and HGVSp will be split out into DNA and Protein columns respectively, without the transcript

`--lock_sheet` (`bool`): If true, all sheets in the variant workbook are locked for dias pipeline except specific cells
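
A minimal sketch of how the two changed options might be parsed, assuming an argparse interface like the one in `generate_workbook.py`. `build_parser` is a hypothetical helper, and `default=0` is an assumption that mirrors the `dxapp.json` default rather than anything confirmed in the script's own parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only --acmg and --lock_sheet."""
    parser = argparse.ArgumentParser()
    # --acmg now takes a count of sheets instead of acting as a flag;
    # default=0 is an assumption mirroring the dxapp.json default
    parser.add_argument(
        '--acmg', type=int, default=0,
        help='number of extra ACMG reporting template sheet(s)'
    )
    # --lock_sheet remains a plain on/off flag
    parser.add_argument(
        '--lock_sheet', action='store_true',
        help='lock all sheets except specific cells'
    )
    return parser


args = build_parser().parse_args(['--acmg', '2', '--lock_sheet'])
print(args.acmg, args.lock_sheet)  # prints: 2 True
```

Because `--acmg` is no longer a `store_true` flag, a bare `--acmg` with no value is rejected by argparse; callers that previously passed the flag alone would need to supply a number.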

**Example**:

17 changes: 12 additions & 5 deletions dxapp.json
@@ -3,8 +3,8 @@
"title": "eggd_generate_variant_workbook",
"summary": "Create Excel workbook from VEP annotated vcf",
"dxapi": "1.0.0",
"version": "2.6.0",
"whatsNew": "* v2.0.0 Rewrite of previous app to generate xlsx file from a VEP annotated VCF(s); * v2.0.1 Bug fix to correctly treat CHROM as string values; * v2.0.2 Bug fix for ACMG report template structure; * v2.0.3 Bug fixes for issues with hyperlinks, changed app name to eggd_generate_variant_workbook; * v2.1.0 Handle VCFs from GATK gCNV and Illumina TSO500, readability tweaks to variant sheets; * v2.1.1 Bug fix for typing of numeric values in hyperlinks; * v2.2.0 Added ability to pass in non VCF files (tsvs/csvs and images) to additional sheets, optional adding of links to DECIPHER with --decipher; * v2.3.0 Added conditional colouring of cells in variant sheets, new 'basic' summary sheet; * v2.4.0 Added handling for duplicate annotation in VEP fields (i.e. cosmic, CGC, etc..); * v2.5.0 Better parsing of CombinedVariantOutput files as additional files; * v2.6.0 Add variant counts as DNAnexus file details to the .xlsx workbook",
"version": "2.7.1",
"whatsNew": "* v2.0.0 Rewrite of previous app to generate xlsx file from a VEP annotated VCF(s); * v2.0.1 Bug fix to correctly treat CHROM as string values; * v2.0.2 Bug fix for ACMG report template structure; * v2.0.3 Bug fixes for issues with hyperlinks, changed app name to eggd_generate_variant_workbook; * v2.1.0 Handle VCFs from GATK gCNV and Illumina TSO500, readability tweaks to variant sheets; * v2.1.1 Bug fix for typing of numeric values in hyperlinks; * v2.2.0 Added ability to pass in non VCF files (tsvs/csvs and images) to additional sheets, optional adding of links to DECIPHER with --decipher; * v2.3.0 Added conditional colouring of cells in variant sheets, new 'basic' summary sheet; * v2.4.0 Added handling for duplicate annotation in VEP fields (i.e. cosmic, CGC, etc..); * v2.5.0 Better parsing of CombinedVariantOutput files as additional files; * v2.6.0 Add variant counts as DNAnexus file details to the .xlsx workbook; *v2.7.0 Handle pre-split and non VEP annotated VCFs, improvements to Dias reporting templates and Excel data validation; * v2.7.1 v.2.7.0 app was accidentally published on DNAnexus before testing; so a new version is created. Everything except version number is the same as v2.7.0",
"authorizedUsers": [
"org-emee_1"
],
@@ -210,10 +210,10 @@
{
"name": "acmg",
"label": "ACMG",
"class": "boolean",
"class": "int",
"optional": true,
"default": false,
"help": "Determines if to add extra sheet with ACMG reporting criteria",
"default": 0,
"help": "Determines number of extra sheet(s) with ACMG reporting criteria",
"group": "generate_workbook.py"
},
{
@@ -269,6 +269,13 @@
"class": "boolean",
"optional" : true,
"help": "If true, will add a column named 'rawChange' with a concatenation of columns formatted as {CHROM}:g.{POS}{REF}>{ALT}"
},
{
"name": "lock_sheet",
"label": "lock_sheet",
"class": "boolean",
"optional" : true,
"help": "If true, all sheets in the variant workbook are locked for dias pipeline except specific cells"
}
],
"outputSpec": [
6 changes: 6 additions & 0 deletions requirements.txt
@@ -6,6 +6,12 @@ filetype==1.1.0
jarowinkler==1.2.1
Levenshtein==0.20.2
numpy==1.23.2
pytest==7.0.1
pytest-cov==4.0.0
pytest-html==4.1.0
pytest-metadata==3.0.0
pytest-mock==3.11.1
pytest-subtests==0.11.0
python-dateutil==2.8.2
python-Levenshtein==0.12.2
pytz==2022.2.1
12 changes: 8 additions & 4 deletions resources/home/dnanexus/generate_workbook/generate_workbook.py
@@ -2,7 +2,6 @@
import os
from pathlib import Path
import re
import sys

from filetype import is_image

@@ -230,13 +229,18 @@ def parse_args(self) -> argparse.Namespace:
)
)
parser.add_argument(
'--acmg', action='store_true',
help='add extra ACMG reporting template sheet'
'--acmg', type=int,
help='add extra ACMG reporting template sheet(s)'
)
parser.add_argument(
'--job_id', required=False,
help='Job ID of eggd_generate_workbook to add to Dias summary'
)
parser.add_argument(
'--lock_sheet', action='store_true',
help='lock all sheets in the variant workbook in dias pipeline '
'except specific cells'
)
parser.add_argument(
'--workflow', default=('', ''), nargs=2,
help='Name and ID of workflow to display in summary'
@@ -424,7 +428,7 @@ def verify_images(self) -> None:

if self.args.image_sizes:
assert all(
[re.match('\d+:\d+', x) for x in self.args.image_sizes]
[re.match(r'\d+:\d+', x) for x in self.args.image_sizes]
), (
'Sizes for images specified not in correct format: '
f'{self.args.image_sizes}'
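
The hunk above replaces `'\d+:\d+'` with a raw string, which avoids Python's invalid-escape-sequence warning without changing what the pattern matches. A small sketch of the same check follows; `valid_size` is a hypothetical helper, and the `re.fullmatch` variant is shown only as a stricter alternative, not something this commit does.

```python
import re


def valid_size(size: str, strict: bool = False) -> bool:
    """Check a 'width:height' string such as '1920:1080'."""
    pattern = r'\d+:\d+'
    if strict:
        # fullmatch requires the whole string to fit the pattern
        return re.fullmatch(pattern, size) is not None
    # match only anchors at the start, so trailing text is accepted
    return re.match(pattern, size) is not None


print(valid_size('1920:1080'))                  # True
print(valid_size('1920:1080px'))                # True
print(valid_size('1920:1080px', strict=True))   # False
```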
94 changes: 56 additions & 38 deletions resources/home/dnanexus/generate_workbook/tests/test_columns.py
@@ -1,10 +1,7 @@
import argparse
from cgi import test
import os
from pathlib import Path
import subprocess
import sys
from unittest.mock import NonCallableMagicMock

import pytest

@@ -23,14 +20,30 @@ def read_test_vcf(vcf_file):
"""
# initialise vcf class with a valid argparse input to allow calling .read()
vcf_handler = vcf(argparse.Namespace(
add_name=False, analysis='', clinical_indication='', exclude=None,
filter=None, include=None, keep=False, merge=False,
add_name=False,
add_classification_column=False,
analysis='',
clinical_indication='',
exclude=None,
filter=None,
include=None,
keep=False,
merge=False,
add_comment_column=False,
out_dir='',
output='',
panel='', print_columns=False, print_header=False, reads='',
rename=None, reorder=None, sample='', sheets=['variants'],
summary=None, usable_reads='', vcfs=[vcf_file], workflow=('', '')
panel='',
print_columns=False,
print_header=False,
reads='',
rename=None,
reorder=None,
sample='',
sheets=['variants'],
summary=None,
usable_reads='',
vcfs=[vcf_file],
workflow=('', '')
))
vcf_df = vcf_handler.read(vcf_file)

@@ -53,7 +66,8 @@ def read_column_from_vcf(vcf, column) -> list:
list : column data read from vcf
"""
output = subprocess.run(
f"grep -v '^#' {vcf} | cut -f{column}", shell=True, capture_output=True
f"grep -v '^#' {vcf} | cut -f{column}",
shell=True, capture_output=True, check=True
)

return output.stdout.decode().splitlines()
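
Adding `check=True` makes `subprocess.run` raise `subprocess.CalledProcessError` on a non-zero exit status instead of returning silently. The sketch below illustrates that behaviour with a deliberately failing child process; it uses `sys.executable` rather than the `grep | cut` pipeline so it runs without a test VCF, and `exit_code_of` is a hypothetical helper.

```python
import subprocess
import sys


def exit_code_of(code: int) -> int:
    """Run a child Python process that exits with `code`; return the status seen."""
    try:
        subprocess.run(
            [sys.executable, '-c', f'import sys; sys.exit({code})'],
            capture_output=True, check=True
        )
    except subprocess.CalledProcessError as err:
        # check=True turns a non-zero exit status into an exception
        return err.returncode
    return 0
```

With `shell=True` and a pipeline, the status that gets checked is normally that of the last command in the pipe, so a failing `grep` earlier on may still go unnoticed unless the shell is configured otherwise (for example with bash's `pipefail` option).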
@@ -165,7 +179,7 @@ def test_parsed_correct_columns_from_info_records(self) -> None:
(
f"cut -f8 {self.test_vcf} | grep -oh "
f"';[A-Za-z0-9\_\-\.]*=' | sort | uniq",
), shell=True, capture_output=True
), shell=True, capture_output=True, check=True
)

# get cleaned list that should be df column names
@@ -187,7 +201,7 @@ def test_parsed_correct_gnomAD_AF_values(self):
(
f"grep -v '^#' {self.test_vcf} | grep -oh "
f"'gnomAD_AF=[0-9\.e\-]*;' | sort | uniq"
), shell=True, capture_output=True
), shell=True, capture_output=True, check=True
)

# clean up values
@@ -219,7 +233,7 @@ class TestFormatSample():
# get list of FORMAT fields from VCF FORMAT column
output = subprocess.run((
f"grep -v '^#' {test_vcf} | cut -f9 | sort | uniq"
), shell=True, capture_output=True)
), shell=True, capture_output=True, check=True)

format_fields = sorted(output.stdout.decode().split())[0].split(':')

@@ -232,7 +246,7 @@ class TestFormatSample():
# get all SAMPLE values from vcf
output = subprocess.run((
f"grep -v '^#' {test_vcf} | cut -f10"
), shell=True, capture_output=True)
), shell=True, capture_output=True, check=True)

sample_strings_vcf = output.stdout.decode().splitlines()

@@ -259,6 +273,7 @@ class TestVEPHandling():
"""
# test vcf standard sample
test_vcf = os.path.join(TEST_DATA_DIR, "HD753-unittest_annotated.split.vcf")

# run dataframe through splitColumns.info() to split out INFO column
vcf_df = read_test_vcf(vcf_file=test_vcf)
vcf_df = splitColumns().split(vcf_df)
@@ -274,45 +289,48 @@ def test_parsed_correct_COSMICcMuts_values(self):
(
f"grep -v '^#' {self.test_vcf} | grep -oh "
f"'COSMICcMuts=[A-Z0-9&\.]*;' | sort | uniq"
), shell=True, capture_output=True
), shell=True, capture_output=True, check=True
)

# clean up values
stdout = output.stdout.decode().splitlines()
stdout = sorted(list([
x.replace(';', '').replace('COSMICcMuts=', '') for x in stdout
]))
stdout = [' & '.join(set(x.split("&"))) for x in stdout]

# get COSMICcMuts values from dataframe
df_values = sorted(list(self.vcf_df['CSQ_COSMICcMuts'].unique().tolist()))
assert all([str(x) == str(y) for x, y in zip(stdout, df_values)]), (
"COSMICcMuts values in VCF do not match those in dataframe"
)

def test_parsed_correct_COSMICncMuts_values(self):
"""
Test values read into dataframe for COSMICncMuts match the values
above from the VCF
"""
# read COSMICncMuts values from vcf
output = subprocess.run(
(
f"grep -v '^#' {self.test_vcf} | grep -oh "
f"'COSMICncMuts=[A-Z0-9&\.]*;' | sort | uniq"
), shell=True, capture_output=True
)

# clean up values
stdout = output.stdout.decode().splitlines()
stdout = sorted(list([
x.replace(';', '').replace('COSMICncMuts=', '') for x in stdout
]))
stdout = [' & '.join(set(x.split("&"))) for x in stdout]
# get COSMICncMuts values from dataframe
df_values = sorted(list(self.vcf_df['CSQ_COSMICncMuts'].unique().tolist()))

assert all([str(x) == str(y) for x, y in zip(stdout, df_values)]), (
"COSMICncMuts values in VCF do not match those in dataframe"
)
"""
Test values read into dataframe for COSMICncMuts match the values
above from the VCF
"""
# read COSMICncMuts values from vcf
output = subprocess.run(
(
f"grep -v '^#' {self.test_vcf} | grep -oh "
f"'COSMICncMuts=[A-Z0-9&\.]*;' | sort | uniq"
), shell=True, capture_output=True, check=True
)

# clean up values
stdout = output.stdout.decode().splitlines()
stdout = sorted(list([
x.replace(';', '').replace('COSMICncMuts=', '') for x in stdout
]))
stdout = [' & '.join(set(x.split("&"))) for x in stdout]

# get COSMICncMuts values from dataframe
df_values = sorted(list(self.vcf_df['CSQ_COSMICncMuts'].unique().tolist()))

assert all([str(x) == str(y) for x, y in zip(stdout, df_values)]), (
"COSMICncMuts values in VCF do not match those in dataframe"
)
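
The COSMIC tests above collapse duplicate `&`-separated entries with `set` and compare the two lists pairwise through `zip`. Two properties of that approach are worth keeping in mind: the iteration order of a `set` of strings is not guaranteed, and `zip` stops at the shorter list, so a length mismatch would not fail the assertion. A hedged sketch of an order-preserving, length-aware variant follows; both helper names are hypothetical, and it assumes the dataframe keeps entries in first-seen order, which the diff does not confirm.

```python
def dedupe_annotation(value: str) -> str:
    """Collapse repeated '&'-separated entries, keeping first-seen order."""
    return ' & '.join(dict.fromkeys(value.split('&')))


def same_values(expected: list, observed: list) -> bool:
    """Compare element-wise as strings, failing if the lengths differ."""
    return (
        len(expected) == len(observed)
        and all(str(x) == str(y) for x, y in zip(expected, observed))
    )


print(dedupe_annotation('COSV1&COSV1&COSV2'))  # COSV1 & COSV2
print(same_values(['a'], ['a', 'b']))          # False
```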


if __name__ == "__main__":