Annotation Pipeline for BT&S

This repository will make up the evidence for the Bioinformatics Software and Tools module of the ARU/Sanger BSc Bioinformatics Year 2 Course.

Installation of pipeline

This can be achieved through the use of:

conda create --file environment.yml

However, this can take quite some time and personally was faster installing it all separately.

Usage

Local Machine Usage 1 - Clone repo

2 - cd annotation-pipeline

3 - Download data into the raw_data folders (links below).

4 - python3 config-writer.py REFERENCE SAMPLE_1 SAMPLE_2 CHR_OF_INTEREST

5 - snakemake --configfile config.yaml --cores 10

This pipeline will require the igv.sh file to have been set to 15gb. (-Xmx15g compared to -Xmx4g) this is in order to produce the .genome file.

If this can't be used then please use the built-in database for Human (hg38).

snpEff is set to -Xmx4g in order to annotate the vcf.

After completion:

6 - Run IGV

In my case this was bash {location of installation}/igv.sh

7 - OPTION A - build .genome file

Genomes > Create .genome

The fasta file = the reference genome

The Gene file is the .gff file downloaded earlier

7 - OPTION B - use pre-installed Human (hg38)

8 - Load annotated file

File > load from file

Navigate to folder s13 witch should contain something akin to: chr10.filt.annotated.vcf.gz

9 - Navigate to location chr10:94,760,000-94,860,000 which will centre on the gene CYP2C19.

Usage - Cluster implementation

snakemake --configfile config.yaml --cores 10 --cluster-config
./cluster.yaml --cluster "bsub -q {cluster.queue} -oo {cluster.output} 
-eo {cluster.error} -M {cluster.memory} -R {cluster.resources} -J {cluster.jobname}" 
-j 10 --use-conda

Currently in Repo

environment.yml - a list of packages used in this project

Data

Files should be downloaded into a {project dir}/raw_data/{ reference | sample_data }/{Downloaded file}.

Sample Data comes from the Utah family platinum read set.

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR622/SRR622461/SRR622461_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR622/SRR622461/SRR622461_2.fastq.gz

Mapped against GRCH38.p15:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz

We also used:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gff.gz

This was used in conjunction with the reference genome to build a .genome file to better compare the results of this pipeline when visualising the end product with IGV.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Annotation Pipeline for BT&S

Installation of pipeline

Usage

Usage - Cluster implementation

Currently in Repo

Data

Files

README.md

Latest commit

History

README.md

File metadata and controls

Annotation Pipeline for BT&S

Installation of pipeline

Usage

Usage - Cluster implementation

Currently in Repo

Data