Merge pull request #172 from eastgenomics/release-2.7.1

Release_2.7.1 (#172) Co-Authored-By: Jethro Rainford <[email protected]> Co-Authored-By: mattgarner <[email protected]>
eastgenomics · Feb 16, 2024 · 4e56485
2 parents b1daf06 + 3d342bb
commit 4e56485
Showing 16 changed files with 6,695 additions and 280 deletions.
diff --git a/.github/workflows/pytest.yml b/.github/workflows/pytest.yml
@@ -0,0 +1,37 @@
+name: pytest
+on: [push]
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Python 3.8
+      uses: actions/setup-python@v1
+      with:
+        python-version: 3.8
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install pipenv codecov
+        pip install -r requirements.txt
+        pipenv install --dev
+    - name: Build bcftools
+      run: |
+        wget https://github.com/samtools/bcftools/releases/download/1.18/bcftools-1.18.tar.bz2
+        tar xf bcftools-1.18.tar.bz2
+        cd bcftools-1.18
+        ./configure --prefix=/usr/local/
+        sudo make
+        sudo make install
+    - name: Build htslib
+      run: |
+        wget https://github.com/samtools/htslib/releases/download/1.18/htslib-1.18.tar.bz2
+        tar xf htslib-1.18.tar.bz2
+        cd htslib-1.18
+        ./configure --prefix=/usr/local/
+        sudo make
+        sudo make install
+    - name: Test with pytest
+      run: |
+        pytest -vv --cov resources/home/dnanexus/generate_workbook/
diff --git a/Readme.md b/Readme.md
@@ -2,9 +2,11 @@
 
 # egg_generate_workbook (DNAnexus Platform App)
 
+![pytest](https://github.com/eastgenomics/eggd_generate_variant_workbook/actions/workflows/pytest.yml/badge.svg)
+
 ## What does this app do?
 
-Generate an Excel workbook from VEP annotated vcf(s)
+Generates an Excel workbook from vcf(s)
 
 ## What are typical use cases for this app?
 
@@ -21,7 +23,7 @@ This app may be executed as a standalone app.
 
 **File inputs (required)**:
 
-- `--vcfs`: VEP annotated vcf(s)
+- `--vcfs`: vcf file(s) to write to Excel workbook sheets
 
 **Other Inputs (optional):**
 
@@ -84,7 +86,7 @@ This app may be executed as a standalone app.
 
 `--human_filter` (`string`): String to add to the summary sheet giving a human-readable form of the given filter string. No check is made that this matches the actual filter(s) used.
 
-`--acmg` (`bool`): Adds extra sheet to workbook with reporting criteria against ACMG classifications
+`--acmg` (`int`): Number of extra sheet(s) to add to the workbook with reporting criteria against ACMG classifications
 
 `--panel` (`string`): Name of panel to display in summary sheet.
 
@@ -98,6 +100,7 @@ This app may be executed as a standalone app.
 
 `--split_hgvs` (`bool`): If true, the c. and p. changes in HGVSc and HGVSp will be split out into DNA and Protein columns respectively, without the transcript
 
+`--lock_sheet` (`bool`): If true, locks all sheets in the variant workbook for the Dias pipeline, except for specific cells
 
 **Example**:
 

diff --git a/dxapp.json b/dxapp.json
@@ -3,8 +3,8 @@
   "title": "eggd_generate_variant_workbook",
   "summary": "Create Excel workbook from VEP annotated vcf",
   "dxapi": "1.0.0",
-  "version": "2.6.0",
-  "whatsNew": "* v2.0.0 Rewrite of previous app to generate xlsx file from a VEP annotated VCF(s); * v2.0.1 Bug fix to correctly treat CHROM as string values; * v2.0.2 Bug fix for ACMG report template structure; * v2.0.3 Bug fixes for issues with hyperlinks, changed app name to eggd_generate_variant_workbook; * v2.1.0 Handle VCFs from GATK gCNV and Illumina TSO500, readability tweaks to variant sheets; * v2.1.1 Bug fix for typing of numeric values in hyperlinks; * v2.2.0 Added ability to pass in non VCF files (tsvs/csvs and images) to additional sheets, optional adding of links to DECIPHER with --decipher; * v2.3.0 Added conditional colouring of cells in variant sheets, new 'basic' summary sheet;  * v2.4.0 Added handling for duplicate annotation in VEP fields (i.e. cosmic, CGC, etc..); * v2.5.0 Better parsing of CombinedVariantOutput files as additional files; * v2.6.0 Add variant counts as DNAnexus file details to the .xlsx workbook",
+  "version": "2.7.1",
+  "whatsNew": "* v2.0.0 Rewrite of previous app to generate xlsx file from a VEP annotated VCF(s); * v2.0.1 Bug fix to correctly treat CHROM as string values; * v2.0.2 Bug fix for ACMG report template structure; * v2.0.3 Bug fixes for issues with hyperlinks, changed app name to eggd_generate_variant_workbook; * v2.1.0 Handle VCFs from GATK gCNV and Illumina TSO500, readability tweaks to variant sheets; * v2.1.1 Bug fix for typing of numeric values in hyperlinks; * v2.2.0 Added ability to pass in non VCF files (tsvs/csvs and images) to additional sheets, optional adding of links to DECIPHER with --decipher; * v2.3.0 Added conditional colouring of cells in variant sheets, new 'basic' summary sheet;  * v2.4.0 Added handling for duplicate annotation in VEP fields (i.e. cosmic, CGC, etc..); * v2.5.0 Better parsing of CombinedVariantOutput files as additional files; * v2.6.0 Add variant counts as DNAnexus file details to the .xlsx workbook; * v2.7.0 Handle pre-split and non VEP annotated VCFs, improvements to Dias reporting templates and Excel data validation; * v2.7.1 v2.7.0 was accidentally published on DNAnexus before testing, so a new version was created. Everything except the version number is the same as v2.7.0",
   "authorizedUsers": [
     "org-emee_1"
   ],
@@ -210,10 +210,10 @@
     {
       "name": "acmg",
       "label": "ACMG",
-      "class": "boolean",
+      "class": "int",
       "optional": true,
-      "default": false,
-      "help": "Determines if to add extra sheet with ACMG reporting criteria",
+      "default": 0,
+      "help": "Number of extra sheet(s) to add with ACMG reporting criteria",
       "group": "generate_workbook.py"
     },
     {
@@ -269,6 +269,13 @@
       "class": "boolean",
       "optional" : true,
       "help": "If true, will add a column named 'rawChange' with a concatenation of columns formatted as {CHROM}:g.{POS}{REF}>{ALT}"
+    },
+    {
+      "name": "lock_sheet",
+      "label": "lock_sheet",
+      "class": "boolean",
+      "optional" : true,
+      "help": "If true, locks all sheets in the variant workbook for the Dias pipeline, except for specific cells"
     }
   ],
   "outputSpec": [

diff --git a/requirements.txt b/requirements.txt
@@ -6,6 +6,12 @@ filetype==1.1.0
 jarowinkler==1.2.1
 Levenshtein==0.20.2
 numpy==1.23.2
+pytest==7.0.1
+pytest-cov==4.0.0
+pytest-html==4.1.0
+pytest-metadata==3.0.0
+pytest-mock==3.11.1
+pytest-subtests==0.11.0
 python-dateutil==2.8.2
 python-Levenshtein==0.12.2
 pytz==2022.2.1

diff --git a/resources/home/dnanexus/generate_workbook/generate_workbook.py b/resources/home/dnanexus/generate_workbook/generate_workbook.py
@@ -2,7 +2,6 @@
 import os
 from pathlib import Path
 import re
-import sys
 
 from filetype import is_image
 
@@ -230,13 +229,18 @@ def parse_args(self) -> argparse.Namespace:
             )
         )
         parser.add_argument(
-            '--acmg', action='store_true',
-            help='add extra ACMG reporting template sheet'
+            '--acmg', type=int,
+            help='add extra ACMG reporting template sheet(s)'
         )
         parser.add_argument(
             '--job_id', required=False,
             help='Job ID of eggd_generate_workbook to add to Dias summary'
         )
+        parser.add_argument(
+            '--lock_sheet', action='store_true',
+            help='lock all sheets in the variant workbook in dias pipeline '
+                 'except specific cells'
+        )
         parser.add_argument(
             '--workflow', default=('', ''), nargs=2,
             help='Name and ID of workflow to display in summary'
@@ -424,7 +428,7 @@ def verify_images(self) -> None:
 
         if self.args.image_sizes:
             assert all(
-                [re.match('\d+:\d+', x) for x in self.args.image_sizes]
+                [re.match(r'\d+:\d+', x) for x in self.args.image_sizes]
             ), (
                 'Sizes for images specified not in correct format: '
                 f'{self.args.image_sizes}'
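
The two changes to `generate_workbook.py` above can be tried in isolation. The following is a minimal sketch and not the app's own code: `build_parser` and `valid_image_sizes` are hypothetical helpers, `default=0` is an assumption that mirrors the `dxapp.json` default (the `add_argument` call in the diff sets none), and the `strict` option using `fullmatch` is an illustrative addition rather than something the app does.

```python
import argparse
import re

# Raw string, so \d reaches the regex engine without an invalid-escape warning
SIZE_PATTERN = re.compile(r'\d+:\d+')


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical cut-down parser covering only the two changed options."""
    parser = argparse.ArgumentParser(description='sketch of the changed inputs')
    # --acmg now takes a count of sheets instead of acting as a flag;
    # default=0 is an assumption taken from dxapp.json
    parser.add_argument(
        '--acmg', type=int, default=0,
        help='number of extra ACMG reporting template sheet(s)'
    )
    # --lock_sheet stays a plain flag: absent means False, present means True
    parser.add_argument(
        '--lock_sheet', action='store_true',
        help='lock all sheets in the workbook except specific cells'
    )
    return parser


def valid_image_sizes(sizes, strict=False) -> bool:
    """
    Return True if every size string starts with <digits>:<digits>.

    With strict=True (illustrative addition) the whole string must match.
    """
    matcher = SIZE_PATTERN.fullmatch if strict else SIZE_PATTERN.match
    return all(matcher(size) for size in sizes)


if __name__ == '__main__':
    args = build_parser().parse_args(['--acmg', '2', '--lock_sheet'])
    print(args.acmg, args.lock_sheet)
    print(valid_image_sizes(['1920:1080', '800:600']))
```

Because `re.match` only anchors at the start of the string, a value such as `10:20abc` passes the non-strict check; `fullmatch` would reject it.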

diff --git a/resources/home/dnanexus/generate_workbook/tests/test_columns.py b/resources/home/dnanexus/generate_workbook/tests/test_columns.py
@@ -1,10 +1,7 @@
 import argparse
-from cgi import test
 import os
-from pathlib import Path
 import subprocess
 import sys
-from unittest.mock import NonCallableMagicMock
 
 import pytest
 
@@ -23,14 +20,30 @@ def read_test_vcf(vcf_file):
     """
     # initialise vcf class with a valid argparse input to allow calling .read()
     vcf_handler = vcf(argparse.Namespace(
-        add_name=False, analysis='', clinical_indication='', exclude=None,
-        filter=None, include=None, keep=False, merge=False,
+        add_name=False,
+        add_classification_column=False,
+        analysis='',
+        clinical_indication='',
+        exclude=None,
+        filter=None,
+        include=None,
+        keep=False,
+        merge=False,
         add_comment_column=False,
         out_dir='',
         output='',
-        panel='', print_columns=False, print_header=False, reads='',
-        rename=None, reorder=None, sample='', sheets=['variants'],
-        summary=None, usable_reads='', vcfs=[vcf_file], workflow=('', '')
+        panel='',
+        print_columns=False,
+        print_header=False,
+        reads='',
+        rename=None,
+        reorder=None,
+        sample='',
+        sheets=['variants'],
+        summary=None,
+        usable_reads='',
+        vcfs=[vcf_file],
+        workflow=('', '')
     ))
     vcf_df = vcf_handler.read(vcf_file)
 
@@ -53,7 +66,8 @@ def read_column_from_vcf(vcf, column) -> list:
     list : column data read from vcf
     """
     output = subprocess.run(
-        f"grep -v '^#' {vcf} | cut -f{column}", shell=True, capture_output=True
+        f"grep -v '^#' {vcf} | cut -f{column}",
+        shell=True, capture_output=True, check=True
     )
 
     return output.stdout.decode().splitlines()
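
With `check=True`, `subprocess.run` raises `CalledProcessError` on a non-zero exit status instead of quietly returning whatever was captured. A minimal sketch of that behaviour, assuming a POSIX shell is available; `read_lines` is a hypothetical helper and not part of the test suite:

```python
import subprocess


def read_lines(command: str) -> list:
    """
    Run a shell command and return its stdout as a list of lines.

    Raises subprocess.CalledProcessError if the shell exits non-zero.
    """
    output = subprocess.run(
        command, shell=True, capture_output=True, check=True
    )
    return output.stdout.decode().splitlines()


if __name__ == "__main__":
    print(read_lines("printf 'chr1\\nchr2\\n'"))

    try:
        read_lines("exit 3")
    except subprocess.CalledProcessError as error:
        print(f"command failed with exit status {error.returncode}")
```

For a pipeline such as `grep ... | cut ...`, the shell reports the exit status of the last command only (unless an option such as `pipefail` is set), so `check=True` would not catch a failure in an earlier stage.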
@@ -165,7 +179,7 @@ def test_parsed_correct_columns_from_info_records(self) -> None:
             (
                 f"cut -f8 {self.test_vcf} | grep -oh "
                 r"';[A-Za-z0-9\_\-\.]*=' | sort | uniq",
-            ), shell=True, capture_output=True
+            ), shell=True, capture_output=True, check=True
         )
 
         # get cleaned list that should be df column names
@@ -187,7 +201,7 @@ def test_parsed_correct_gnomAD_AF_values(self):
             (
                 f"grep -v '^#' {self.test_vcf} | grep -oh "
                 r"'gnomAD_AF=[0-9\.e\-]*;' | sort | uniq"
-            ), shell=True, capture_output=True
+            ), shell=True, capture_output=True, check=True
         )
 
         # clean up values
@@ -219,7 +233,7 @@ class TestFormatSample():
     # get list of FORMAT fields from VCF FORMAT column
     output = subprocess.run((
         f"grep -v '^#' {test_vcf} | cut -f9 | sort | uniq"
-    ), shell=True, capture_output=True)
+    ), shell=True, capture_output=True, check=True)
 
     format_fields = sorted(output.stdout.decode().split())[0].split(':')
 
@@ -232,7 +246,7 @@ class TestFormatSample():
     # get all SAMPLE values from vcf
     output = subprocess.run((
         f"grep -v '^#' {test_vcf} | cut -f10"
-    ), shell=True, capture_output=True)
+    ), shell=True, capture_output=True, check=True)
 
     sample_strings_vcf = output.stdout.decode().splitlines()
 
@@ -259,6 +273,7 @@ class TestVEPHandling():
     """
     # test vcf standard sample
     test_vcf = os.path.join(TEST_DATA_DIR, "HD753-unittest_annotated.split.vcf")
+
     # run dataframe through splitColumns.info() to split out INFO column
     vcf_df = read_test_vcf(vcf_file=test_vcf)
     vcf_df = splitColumns().split(vcf_df)
@@ -274,45 +289,48 @@ def test_parsed_correct_COSMICcMuts_values(self):
             (
                 f"grep -v '^#' {self.test_vcf} | grep -oh "
                 r"'COSMICcMuts=[A-Z0-9&\.]*;' | sort | uniq"
-            ), shell=True, capture_output=True
+            ), shell=True, capture_output=True, check=True
         )
+
         # clean up values
         stdout = output.stdout.decode().splitlines()
         stdout = sorted(list([
             x.replace(';', '').replace('COSMICcMuts=', '') for x in stdout
         ]))
         stdout = [' & '.join(set(x.split("&"))) for x in stdout]
+
         # get COSMICcMuts values from dataframe
         df_values = sorted(list(self.vcf_df['CSQ_COSMICcMuts'].unique().tolist()))
         assert all([str(x) == str(y) for x, y in zip(stdout, df_values)]), (
             "COSMICcMuts values in VCF do not match those in dataframe"
         )
 
     def test_parsed_correct_COSMICncMuts_values(self):
-            """
-            Test values read into dataframe for COSMICncMuts match the values
-            above from the VCF
-            """
-            # read COSMICncMuts values from vcf
-            output = subprocess.run(
-                (
-                    f"grep -v '^#' {self.test_vcf} | grep -oh "
-                    f"'COSMICncMuts=[A-Z0-9&\.]*;' | sort | uniq"
-                ), shell=True, capture_output=True
-            )
-
-            # clean up values
-            stdout = output.stdout.decode().splitlines()
-            stdout = sorted(list([
-                x.replace(';', '').replace('COSMICncMuts=', '') for x in stdout
-            ]))
-            stdout = [' & '.join(set(x.split("&"))) for x in stdout]
-            # get COSMICncMuts values from dataframe
-            df_values = sorted(list(self.vcf_df['CSQ_COSMICncMuts'].unique().tolist()))
-
-            assert all([str(x) == str(y) for x, y in zip(stdout, df_values)]), (
-                "COSMICncMuts values in VCF do not match those in dataframe"
-            )
+        """
+        Test values read into dataframe for COSMICncMuts match the values
+        above from the VCF
+        """
+        # read COSMICncMuts values from vcf
+        output = subprocess.run(
+            (
+                f"grep -v '^#' {self.test_vcf} | grep -oh "
+                r"'COSMICncMuts=[A-Z0-9&\.]*;' | sort | uniq"
+            ), shell=True, capture_output=True, check=True
+        )
+
+        # clean up values
+        stdout = output.stdout.decode().splitlines()
+        stdout = sorted(list([
+            x.replace(';', '').replace('COSMICncMuts=', '') for x in stdout
+        ]))
+        stdout = [' & '.join(set(x.split("&"))) for x in stdout]
+
+        # get COSMICncMuts values from dataframe
+        df_values = sorted(list(self.vcf_df['CSQ_COSMICncMuts'].unique().tolist()))
+
+        assert all([str(x) == str(y) for x, y in zip(stdout, df_values)]), (
+            "COSMICncMuts values in VCF do not match those in dataframe"
+        )
 
 
 if __name__ == "__main__":