Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fixes and final changes #10

Merged
merged 25 commits into from
Mar 5, 2024
Merged
Show file tree
Hide file tree
Changes from 22 commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
321f59c
Added printing out HGNC_ids which weren't found in merge and won't be…
RSWilson1 Feb 19, 2024
f7f7f43
Added handling for no IGV fasta provided.
RSWilson1 Feb 22, 2024
8b6a316
Update gene_annotation2bed.py and README.md with clearer exacmples an…
RSWilson1 Feb 23, 2024
e4d21e7
- Changed lost id printing for tests
RSWilson1 Feb 27, 2024
293404b
fixed if else logic for lost ids
RSWilson1 Feb 27, 2024
0a46a68
Removed old arg for handling gene symbols
RSWilson1 Feb 27, 2024
9a56623
Removed handling of assembly file.
RSWilson1 Feb 29, 2024
a3951b2
changed creating empty cols
RSWilson1 Mar 1, 2024
539eef9
Updated requirements docs and readme
RSWilson1 Mar 1, 2024
bec0b56
Fixing small changes to get tests passing.
RSWilson1 Mar 1, 2024
8c66631
Fixed genome_build supplied for tests
RSWilson1 Mar 1, 2024
edd3d1a
Updated readme with latest changes and formatting
RSWilson1 Mar 4, 2024
22d3941
Fixed tabix for igv reports and removed redundant arguments.
RSWilson1 Mar 4, 2024
449f9f9
updated py version in yaml
RSWilson1 Mar 4, 2024
6151380
updated comment
RSWilson1 Mar 4, 2024
963f28b
Updated doc string for script
RSWilson1 Mar 4, 2024
07caaf3
Updated dtypes
RSWilson1 Mar 4, 2024
865943c
Update dtypes to reduce memory usage by ~2Gb for GRCh37.
RSWilson1 Mar 4, 2024
98c7c46
Updated dtypes to improve memory usage and changed gff2pandas and tes…
RSWilson1 Mar 5, 2024
72b4982
Remove redundant prints
RSWilson1 Mar 5, 2024
1827ba5
Removed print
RSWilson1 Mar 5, 2024
9791357
Example of a mixed dataset
RSWilson1 Mar 5, 2024
69751da
Changed dtype for hgnc_id now works
RSWilson1 Mar 5, 2024
f412ddb
PEP8
RSWilson1 Mar 5, 2024
1b8599c
PEP8
RSWilson1 Mar 5, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions .github/workflows/pytest.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -6,10 +6,10 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.8
- name: Set up Python 3.12.2
uses: actions/setup-python@v1
with:
python-version: 3.8
python-version: 3.12.2
- name: Install dependencies
run: |
python -m pip install --upgrade pip
Expand Down
47 changes: 33 additions & 14 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# gene_annotation2bed

## Purpose

To provide bed files for custom gene-level annotation with VEP.
This custom script processes a list of ids (HGNC, transcript) or coordinates with associated annotation, into a comprehensive bed file for the corresponding refseq transcripts for each ID entry.

Expand All @@ -16,10 +17,14 @@ grep -v "#" myData.bed | sort -k1,1 -k2,2n -k3,3n -t$'\t' | bgzip -c > myData.be
tabix -p bed myData.bed.gz
```

When using VEP, the JSON should specify 'overlap' as in the VEP custom documentation when annotating gene/transcript level annotation. For more information on how to use custom annotation please see:
[Internal Document which is helpful](https://cuhbioinformatics.atlassian.net/wiki/spaces/DV/pages/2605711365/VEP+Config+file).

Workflow diagram showing TSV containing IDs and annotation to bed file and how it is used in VEP and visualised in IGV:
![Workflow diagram showing TSV containing IDs and annotation to bed file and how it is used in VEP and visualised in IGV using a VCF](https://raw.githubusercontent.com/eastgenomics/gene_annotation2bed/sprint_2/Workflow.png)

---

## What are typical use cases for this script?

- Converting a list of HGNC ids + associated gene level annotation information
Expand All @@ -28,54 +33,68 @@ Workflow diagram showing TSV containing IDs and annotation to bed file and how i
Or using exact coordinates to flag a regions such as TERT promoter.

---

## What data are required for this script to run?

- List of ids and annotation information in TSV format.
- Human Genome Reference (i.e. hs37d5).
- RefSeq Transcripts file (gff3) from 001_reference.
- Human Genome Reference (i.e. hs37d5) for generating IGV report.
- RefSeq Transcripts file (gff3; `GCF_000001405.25_GRCh37.p13_genomic.gff`) and corresponding gff assembly report, i.e.`GCF_000001405.25_GRCh37.p13_assembly_report.txt` from 001_reference
or from [Refseq FTP Server](https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/).
![Image of refseq tsv structure, first 15 lines](images/image.png)

---
## Notes
The current working logic of this script is to select only refseq transcripts with NM_ prefix.
The bed returned is zero-based.

---

## Example Command

```bash
python gene_annotation2bed.py -ann annotation.tsv -o output_suffix -ref hg38 -f 5 --assembly_report assembly_report.txt -ref_igv ref_genome.fasta -gff your_file.gff -pickle pickle_file.pkl
```

---

## What inputs are required for this app to run?

### Required

- `-ann`, `--annotation_file` (`str`): Path to the annotation file (TSV), this file is essential for the app to execute successfully.
- `-o`, `output` (`str`): Output file suffix, required for specifying the suffix for the generated output files.
- `-build`, `--genome_build` (`str`): Reference genome build (hg19/hg38), choose either 'hg19' or 'hg38' based on your requirements.
- `-f`, `--flanking` (`int`): Flanking size, an integer value representing the size of flanking regions for each gene, transcript or coordinates provided. Default = 0.
- `-as`, `--assembly_summary` (`str`): Path to assembly summary file, necessary for the app to gather assembly information, this allows for the script to map between refseq accessions and chromosomes.
- `-gff` (`str`): Path to GFF file containing all relevant transcripts for assay, available in 001_reference i.e. GCF_000001405.25_GRCh37.p13_genomic.gff.
- `-gff` (`str`): Path to GFF file containing all relevant transcripts for assay, available in 001_reference i.e. `GCF_000001405.25_GRCh37.p13_genomic.gff`.

### Useful ones

#### Files

- `-ref_igv`, `--reference_file_for_igv` (`file`): Path to the Reference genome fasta file for igv_reports, used in generating IGV reports.

## Misc
- `--pickle` or `-pkl` (`str`): Import GFF as a pickle file, this is for testing mostly to speed-up running, so gff isn't processed each time.

## Example Command

```bash
python gene_annotation2bed.py -ann annotation.tsv -o output_suffix -ref hg38 -f 5 --assembly_summary assembly_summary.txt -ref_igv ref_genome.fasta --hgnc_dump_path hgnc_info.tsv -gff your_file.gff -pickle pickle_file.pkl
```
- `--pickle` or `-pkl` (`str`): Import GFF as a pickle file, this is for testing mostly to speed-up running, so gff isn't processed each time.

---

## Requirements

To generate IGV reports:
HTSlib is required for generating IGV report with the bed file to check the accuracy.
This uses tabix and bgzip. Version: 1.19.1.

General requirements see requirements.txt for more info on versions.

- pysam
- pandas
- igv-reports (v)
- numpy
- re
- pandas
- igv-reports (v1.12.0)
- re (std lib)

install using `requirements.txt`. `pip install requirements.txt`
Install using `requirements.txt`. `pip install requirements.txt`
Alternatively you can use conda and the yml provided.

---

Expand Down
7 changes: 7 additions & 0 deletions data/mixed_dataset.tsv
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
ID annotation
NM_000059 TSG
76 Oncogene
391 Oncogene
NM_000546 TSG
NM_000124 TSG
chr17:1-100000 Oncogene
54 changes: 54 additions & 0 deletions environment.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
name: myenvtest
channels:
- bioconda
- r
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=2_gnu
- bzip2=1.0.8=hd590300_5
- ca-certificates=2024.2.2=hbcca054_0
- ld_impl_linux-64=2.40=h41732ed_0
- libexpat=2.5.0=hcb278e6_1
- libffi=3.4.2=h7f98852_5
- libgcc-ng=13.2.0=h807b86a_5
- libgomp=13.2.0=h807b86a_5
- libnsl=2.0.1=hd590300_0
- libsqlite=3.45.1=h2797004_0
- libuuid=2.38.1=h0b41bf4_0
- libxcrypt=4.4.36=hd590300_1
- libzlib=1.2.13=hd590300_5
- ncurses=6.4=h59595ed_2
- openssl=3.2.1=hd590300_0
- pip=24.0=pyhd8ed1ab_0
- python=3.12.2=hab00c5b_0_cpython
- readline=8.2=h8228510_1
- setuptools=69.1.1=pyhd8ed1ab_0
- tk=8.6.13=noxft_h4845f30_101
- wheel=0.42.0=pyhd8ed1ab_0
- xz=5.2.6=h166bdaf_0
- pip:
- argcomplete==3.2.2
- certifi==2024.2.2
- charset-normalizer==3.3.2
- coverage==7.4.3
- idna==3.6
- igv-reports==1.12.0
- iniconfig==2.0.0
- intervaltree==3.1.0
- numpy==1.26.4
- packaging==23.2
- pandas==2.2.1
- pluggy==1.4.0
- pysam==0.22.0
- pytest==8.0.2
- pytest-cov==4.1.0
- python-dateutil==2.9.0
- pytz==2024.1
- requests==2.31.0
- six==1.16.0
- sortedcontainers==2.4.0
- tzdata==2024.1
- urllib3==2.2.1
prefix: /home/rswilson1/anaconda3/envs/myenvtest
Loading
Loading