eastgenomics · jethror1 · Mar 5, 2024 · Feb 19, 2024 · Feb 22, 2024 · Feb 23, 2024
diff --git a/.github/workflows/pytest.yaml b/.github/workflows/pytest.yaml
@@ -6,10 +6,10 @@ jobs:
     runs-on: ubuntu-latest
     steps:
     - uses: actions/checkout@v2
-    - name: Set up Python 3.8
+    - name: Set up Python 3.12.2
       uses: actions/setup-python@v1
       with:
-        python-version: 3.8
+        python-version: 3.12.2
     - name: Install dependencies
       run: |
         python -m pip install --upgrade pip

diff --git a/README.md b/README.md
@@ -1,6 +1,7 @@
 # gene_annotation2bed
 
 ## Purpose
+
 To provide bed files for custom gene-level annotation with VEP.
 This custom script processes a list of ids (HGNC, transcript) or coordinates with associated annotation, into a comprehensive bed file for the corresponding refseq transcripts for each ID entry.
 
@@ -16,10 +17,14 @@ grep -v "#" myData.bed | sort -k1,1 -k2,2n -k3,3n -t$'\t' | bgzip -c > myData.be
 tabix -p bed myData.bed.gz
 ```
 
+When using VEP, the JSON should specify 'overlap' as in the VEP custom documentation when annotating gene/transcript level annotation. For more information on how to use custom annotation please see:
+[Internal Document which is helpful](https://cuhbioinformatics.atlassian.net/wiki/spaces/DV/pages/2605711365/VEP+Config+file).
+
 Workflow diagram showing TSV containing IDs and annotation to bed file and how it is used in VEP and visualised in IGV:
 ![Workflow diagram showing TSV containing IDs and annotation to bed file and how it is used in VEP and visualised in IGV using a VCF](https://raw.githubusercontent.com/eastgenomics/gene_annotation2bed/sprint_2/Workflow.png)
 
 ---
+
 ## What are typical use cases for this script?
 
 - Converting a list of HGNC ids + associated gene level annotation information
@@ -28,54 +33,68 @@ Workflow diagram showing TSV containing IDs and annotation to bed file and how i
   Or using exact coordinates to flag a regions such as TERT promoter.
 
 ---
+
 ## What data are required for this script to run?
 
 - List of ids and annotation information in TSV format.
-- Human Genome Reference (i.e. hs37d5).
-- RefSeq Transcripts file (gff3) from 001_reference.
+- Human Genome Reference (i.e. hs37d5) for generating IGV report.
+- RefSeq Transcripts file (gff3; `GCF_000001405.25_GRCh37.p13_genomic.gff`) and corresponding gff assembly report, i.e.`GCF_000001405.25_GRCh37.p13_assembly_report.txt` from 001_reference
+  or from [Refseq FTP Server](https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/).
 ![Image of refseq tsv structure, first 15 lines](images/image.png)
 
 ---
 ## Notes
 The current working logic of this script is to select only refseq transcripts with NM_ prefix.
+The bed returned is zero-based.
+
+---
+
+## Example Command
+
+```bash
+python gene_annotation2bed.py -ann annotation.tsv -o output_suffix -ref hg38 -f 5 --assembly_report assembly_report.txt -ref_igv ref_genome.fasta -gff your_file.gff -pickle pickle_file.pkl
+```
 
 ---
 
 ## What inputs are required for this app to run?
 
 ### Required
+
 - `-ann`, `--annotation_file` (`str`): Path to the annotation file (TSV), this file is essential for the app to execute successfully.
 - `-o`, `output` (`str`): Output file suffix, required for specifying the suffix for the generated output files.
 - `-build`, `--genome_build` (`str`): Reference genome build (hg19/hg38), choose either 'hg19' or 'hg38' based on your requirements.
 - `-f`, `--flanking` (`int`): Flanking size, an integer value representing the size of flanking regions for each gene, transcript or coordinates provided. Default = 0.
-- `-as`, `--assembly_summary` (`str`): Path to assembly summary file, necessary for the app to gather assembly information, this allows for the script to map between refseq accessions and chromosomes.
-- `-gff` (`str`): Path to GFF file containing all relevant transcripts for assay, available in 001_reference i.e. GCF_000001405.25_GRCh37.p13_genomic.gff.
+- `-gff` (`str`): Path to GFF file containing all relevant transcripts for assay, available in 001_reference i.e. `GCF_000001405.25_GRCh37.p13_genomic.gff`.
 
 ### Useful ones
 
 #### Files
+
 - `-ref_igv`, `--reference_file_for_igv` (`file`): Path to the Reference genome fasta file for igv_reports, used in generating IGV reports.
 
 ## Misc
-- `--pickle` or `-pkl` (`str`): Import GFF as a pickle file, this is for testing mostly to speed-up running, so gff isn't processed each time.
-
-## Example Command
 
-```bash
-python gene_annotation2bed.py -ann annotation.tsv -o output_suffix -ref hg38 -f 5 --assembly_summary assembly_summary.txt -ref_igv ref_genome.fasta --hgnc_dump_path hgnc_info.tsv -gff your_file.gff -pickle pickle_file.pkl
-```
+- `--pickle` or `-pkl` (`str`): Import GFF as a pickle file, this is for testing mostly to speed-up running, so gff isn't processed each time.
 
 ---
 
 ## Requirements
 
+To generate IGV reports:
+HTSlib is required for generating IGV report with the bed file to check the accuracy.
+This uses tabix and bgzip. Version: 1.19.1.
+
+General requirements see requirements.txt for more info on versions.
+
 - pysam
-- pandas
-- igv-reports (v)
 - numpy
-- re
+- pandas
+- igv-reports (v1.12.0)
+- re (std lib)
 
-install using `requirements.txt`. `pip install requirements.txt`
+Install using `requirements.txt`. `pip install requirements.txt`
+Alternatively you can use conda and the yml provided.
 
 ---
 

diff --git a/data/mixed_dataset.tsv b/data/mixed_dataset.tsv
@@ -0,0 +1,7 @@
+ID	annotation
+NM_000059	TSG
+76	Oncogene
+391	Oncogene
+NM_000546	TSG
+NM_000124	TSG
+chr17:1-100000	Oncogene
diff --git a/environment.yml b/environment.yml
@@ -0,0 +1,54 @@
+name: myenvtest
+channels:
+  - bioconda
+  - r
+  - conda-forge
+  - defaults
+dependencies:
+  - _libgcc_mutex=0.1=conda_forge
+  - _openmp_mutex=4.5=2_gnu
+  - bzip2=1.0.8=hd590300_5
+  - ca-certificates=2024.2.2=hbcca054_0
+  - ld_impl_linux-64=2.40=h41732ed_0
+  - libexpat=2.5.0=hcb278e6_1
+  - libffi=3.4.2=h7f98852_5
+  - libgcc-ng=13.2.0=h807b86a_5
+  - libgomp=13.2.0=h807b86a_5
+  - libnsl=2.0.1=hd590300_0
+  - libsqlite=3.45.1=h2797004_0
+  - libuuid=2.38.1=h0b41bf4_0
+  - libxcrypt=4.4.36=hd590300_1
+  - libzlib=1.2.13=hd590300_5
+  - ncurses=6.4=h59595ed_2
+  - openssl=3.2.1=hd590300_0
+  - pip=24.0=pyhd8ed1ab_0
+  - python=3.12.2=hab00c5b_0_cpython
+  - readline=8.2=h8228510_1
+  - setuptools=69.1.1=pyhd8ed1ab_0
+  - tk=8.6.13=noxft_h4845f30_101
+  - wheel=0.42.0=pyhd8ed1ab_0
+  - xz=5.2.6=h166bdaf_0
+  - pip:
+    - argcomplete==3.2.2
+    - certifi==2024.2.2
+    - charset-normalizer==3.3.2
+    - coverage==7.4.3
+    - idna==3.6
+    - igv-reports==1.12.0
+    - iniconfig==2.0.0
+    - intervaltree==3.1.0
+    - numpy==1.26.4
+    - packaging==23.2
+    - pandas==2.2.1
+    - pluggy==1.4.0
+    - pysam==0.22.0
+    - pytest==8.0.2
+    - pytest-cov==4.1.0
+    - python-dateutil==2.9.0
+    - pytz==2024.1
+    - requests==2.31.0
+    - six==1.16.0
+    - sortedcontainers==2.4.0
+    - tzdata==2024.1
+    - urllib3==2.2.1
+prefix: /home/rswilson1/anaconda3/envs/myenvtest