Merge branch 'main' into nextflow_pipeline
paulcao-brown committed Oct 18, 2023
2 parents f25700f + c009960 commit a44f74f
Showing 18 changed files with 55 additions and 1,109 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -2,6 +2,12 @@
*.fastq
*.sam
*.bam
*.sif
*.simg
slurm*
/0_data/gisaid.tsv
/0_data/gisaid.fasta
/3_results/

### BELOW IS FOR R projects. Feel free to delete if you don't have an R project ###
# History files
1 change: 1 addition & 0 deletions 0_data/README.txt
@@ -0,0 +1 @@
This is the directory into which GISAID data will be downloaded.
30 changes: 30 additions & 0 deletions Pipeline.txt
@@ -0,0 +1,30 @@
### Fully functional

nextclade run --input-dataset nextclade_dataset --output-json results/nextclade.json --output-csv results/nextclade.csv --output-tsv results/nextclade.tsv --output-tree results/nextclade.auspice.json --input-qc-config src/qcRulesConfig.json 50seq_test.fasta > results/nextclade.log

nextalign run --genemap=src/genemap.gff --genes=E,M,N,ORF10,ORF14,ORF1a,ORF1b,ORF3a,ORF6,ORF7a,ORF7b,ORF8,ORF9b,S --output-all=results/nextalign --input-ref=src/reference.fasta 50seq_test.fasta

#awk '{print $1,$2,$5,$6,$7,$8,$27}' FS='\t' OFS='\t' /gpfs/data/ris3/0_data/gisaid_20220926/metadataCombined.tsv > metadata.tsv

python scripts/nextstrain-diagnostics.py --alignment results/nextalign/nextalign.aligned.fasta --reference src/reference.gb --metadata metadata.tsv --output-diagnostics results/nextstrain-diagnostics.tsv --output-flagged results/nextstrain-diagnostics-flagged.tsv --output-exclusion-list results/nextstrain-diagnostics-exclusion.txt

python scripts/qc.py
python src/mutations.py
python src/concern.py
Rscript src/num-sequences.R
Rscript src/num-voc-voi.R
Rscript src/top-lineages.R
Rscript src/ridoh-report.R
Rscript src/figures.R


###

Still need to install IQ-TREE and get it running

# Tree
$BIN/iqtree2 -s results/nextalign-references/ri_sequences_qc_references.aligned.fasta --prefix results/iqtree2 -st DNA -m GTR+F --mem 8G


#### Once complete, all paths here need to be updated to be accurate within the directory framework
#### Also need to merge with the download step so it is all one fluid process
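
Chained together, the steps above amount to a single wrapper script. The sketch below is not the repository's `run.sh`; it assumes the same relative paths used above (`results/`, `src/`, `scripts/`), assumes `metadata.tsv` already exists, and takes the input FASTA as its first argument.

```
#!/usr/bin/env bash
# Sketch of a single wrapper around the steps above -- NOT the repository's
# run.sh. Assumes the same relative paths (results/, src/, scripts/) and an
# existing metadata.tsv; the input FASTA is the first argument.
set -euo pipefail

FASTA="$1"
mkdir -p results

# Clade assignment and QC with Nextclade
nextclade run --input-dataset nextclade_dataset \
  --output-json results/nextclade.json --output-csv results/nextclade.csv \
  --output-tsv results/nextclade.tsv --output-tree results/nextclade.auspice.json \
  --input-qc-config src/qcRulesConfig.json "$FASTA" > results/nextclade.log

# Codon-aware alignment with Nextalign
nextalign run --genemap=src/genemap.gff \
  --genes=E,M,N,ORF10,ORF14,ORF1a,ORF1b,ORF3a,ORF6,ORF7a,ORF7b,ORF8,ORF9b,S \
  --output-all=results/nextalign --input-ref=src/reference.fasta "$FASTA"

# Diagnostics and per-sequence QC
python scripts/nextstrain-diagnostics.py \
  --alignment results/nextalign/nextalign.aligned.fasta \
  --reference src/reference.gb --metadata metadata.tsv \
  --output-diagnostics results/nextstrain-diagnostics.tsv \
  --output-flagged results/nextstrain-diagnostics-flagged.tsv \
  --output-exclusion-list results/nextstrain-diagnostics-exclusion.txt
python scripts/qc.py
python src/mutations.py
python src/concern.py

# Summary tables, report, and figures
for r in num-sequences num-voc-voi top-lineages ridoh-report figures; do
  Rscript "src/${r}.R"
done
```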
34 changes: 18 additions & 16 deletions README.md
@@ -1,20 +1,22 @@
# Covid19 Analysis Pipeline

Template for analyses repositories. For more information see https://compbiocore-brown.slab.com/posts/data-organisation-for-analysis-repos-fdi2cddd. Folders that should be present in all such repositories are:
## Directory Structure

* **0_data:** an empty directory into which sequences and metadata from GISAID are downloaded for analysis.
* **1_scripts:** contains the shell scripts that run the pipeline (```/covid19_analysis/1_scripts```). The Singularity image can be pulled directly to Oscar or to your local machine from the `1_scripts` directory using ```singularity pull covid19.sif docker://ericsalomaki/covid_new_pango:05092023```.
* **2_metadata:** contains the ```Dockerfile``` used to create the container for running the pipeline, a GFF file, a QC rules file, and the reference FASTA and GenBank files.
* **3_results:** will be created while the pipeline is running; results will be written to ```/covid19_analysis/3_results/${YYYYMMDD}``` (see the layout sketch below).

* **metadata:** contains the directory
```2_metadata``` which has a GFF file, QC rules file, and the reference fasta file and is located on oscar at ```/gpfs/data/ris3/2_metadata```; and the ```Dockerfile``` that was used to initially create the container for running the pipeline
* **scripts:** contains shell scripts to run the pipeline as reflected in ```/gpfs/data/ris3/1_scripts``` the singularity image which is also located in ```/gpfs/data/ris3/1_scripts``` can be pulled directly to oscar using ```singularity pull covid19.sif docker://ghcr.io/compbiocore/covid12162022:latest```
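
The layout described above is roughly:

```
covid19_analysis/
├── 0_data/      # GISAID sequences and metadata are downloaded here
├── 1_scripts/   # shell scripts and the pulled covid19.sif Singularity image
├── 2_metadata/  # Dockerfile, GFF file, QC rules, reference FASTA and GenBank files
└── 3_results/   # created at run time; results land in 3_results/${YYYYMMDD}
```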

## Running Pipeline via Oscar Slurm Batch Submission

To run the covid pipeline, navigate to ```/gpfs/data/ris3/1_scripts/``` and run:
To run the covid pipeline, navigate to ```/PATH/TO/CLONED/REPO/covid19_analysis/1_scripts/``` and run:
```
sbatch /gpfs/data/ris3/0_data/gisaid_20220926/run_slurm.sh /gpfs/data/ris3/PATH/TO/SEQUENCE/DATA
sbatch run_slurm.sh /ABSOLUTE/PATH/TO/SEQUENCE/DATA/covid_sequences.fasta
```
Results will be produced in ```/gpfs/data/ris3/3_results/${YYYYMMDD}```
Results will be produced in ```/covid19_analysis/3_results/${YYYYMMDD}```

A run with ~20,000 input sequences takes roughly 8 hours to complete
A run with ~20,000 input sequences takes roughly 30 minutes on Oscar (24 threads, 128G RAM allocated) to complete the primary Pangolin analyses and produce figures; however, the IQ-TREE analysis will run for several days. IQ-TREE writes checkpoints, so if the job is incomplete the analysis can be resumed beyond the allocated time if necessary.
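
If the IQ-TREE step exceeds its allocation, re-running the same command with the same `--prefix` typically resumes it from the `*.ckp.gz` checkpoint. A minimal sketch, reusing the command from `Pipeline.txt`:

```
# Re-running with the same --prefix resumes from results/iqtree2.ckp.gz;
# add -redo only if you want to discard the checkpoint and start over.
$BIN/iqtree2 -s results/nextalign-references/ri_sequences_qc_references.aligned.fasta \
    --prefix results/iqtree2 -st DNA -m GTR+F --mem 8G
```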


## Running Pipeline via Oscar Interactive Session
@@ -23,38 +25,38 @@ To run this pipeline in an interactive session, first enter a screen `screen -S JOBNAME`

Navigate to the `1_scripts` directory:
```
cd /gpfs/data/ris3/1_scripts
cd /PATH/TO/CLONED/REPO/covid19_analysis/1_scripts
```

Enter the singularity container and mount the parent directory:

```
singularity exec -B /gpfs/data/ris3/ /gpfs/data/ris3/1_scripts/covid12162022_latest.sif bash
singularity exec -B /ABSOLUTE/PATH/TO/CLONED/REPO/covid19_analysis/ /PATH/TO/CLONED/REPO/covid19_analysis/1_scripts/covid19.sif bash
```

Once inside the container, run:

```
bash run.sh /gpfs/data/ris3/PATH/TO/SEQUENCE/DATA
bash run.sh /ABSOLUTE/PATH/TO/SEQUENCE/DATA/covid_sequences.fasta
```

To detach from the screen press `Ctrl + a`, then `d`; to return, use `screen -r JOBNAME`
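
A typical `screen` workflow around the steps above looks like the sketch below (JOBNAME is whatever session name you chose):

```
screen -S JOBNAME    # start a named session
# ... run the container and pipeline as above ...
# press Ctrl + a, then d, to detach
screen -ls           # list detached sessions
screen -r JOBNAME    # reattach later
```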

Results will be produced in `/gpfs/data/ris3/3_results/${YYYYMMDD}`
Results will be produced in `/PATH/TO/CLONED/REPO/covid19_analysis/3_results/${YYYYMMDD}`

## Example Usage
```
sbatch /gpfs/data/ris3/0_data/gisaid_20220926/run_slurm.sh /gpfs/data/ris3/0_data/gisaid_20220926/sequenceData.fasta
sbatch /PATH/TO/CLONED/REPO/covid19_analysis/1_scripts/run_slurm.sh /PATH/TO/CLONED/REPO/covid19_analysis/0_data/sequenceData.fasta
```

# CBC Project Information

```
title: Covid19 docker container
title: Covid19 analysis pipeline
tags:
analysts:
git_repo_url:
resources_used: Pangolin, Nextclade, Nextalign
git_repo_url: https://github.com/compbiocore/covid19_analysis
resources_used: Pangolin, Nextclade, Nextalign, IQ-Tree, R
summary:
project_id:
```
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Binary file removed metadata/.DS_Store
18 changes: 0 additions & 18 deletions metadata/2_metadata/genemap.gff

This file was deleted.

22 changes: 0 additions & 22 deletions metadata/2_metadata/qcRulesConfig.json

This file was deleted.
