Merge branch 'main' into nextflow_pipeline
paulcao-brown committed Oct 18, 2023
2 parents f25700f + c009960 commit a44f74f
Showing 18 changed files with 55 additions and 1,109 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -2,6 +2,12 @@
*.fastq
*.sam
*.bam
*.sif
*.simg
slurm*
/0_data/gisaid.tsv
/0_data/gisaid.fasta
/3_results/

### BELOW IS FOR R projects. Feel free to delete if you don't have an R project ###
# History files
1 change: 1 addition & 0 deletions 0_data/README.txt
@@ -0,0 +1 @@
This is the directory into which GISAID data will be downloaded.
30 changes: 30 additions & 0 deletions Pipeline.txt
@@ -0,0 +1,30 @@
### Fully functional

nextclade run --input-dataset nextclade_dataset --output-json results/nextclade.json --output-csv results/nextclade.csv --output-tsv results/nextclade.tsv --output-tree results/nextclade.auspice.json --input-qc-config src/qcRulesConfig.json 50seq_test.fasta > results/nextclade.log

nextalign run --genemap=src/genemap.gff --genes=E,M,N,ORF10,ORF14,ORF1a,ORF1b,ORF3a,ORF6,ORF7a,ORF7b,ORF8,ORF9b,S --output-all=results/nextalign --input-ref=src/reference.fasta 50seq_test.fasta

#awk '{print $1,$2,$5,$6,$7,$8,$27}' FS='\t' OFS='\t' /gpfs/data/ris3/0_data/gisaid_20220926/metadataCombined.tsv > metadata.tsv

python scripts/nextstrain-diagnostics.py --alignment results/nextalign/nextalign.aligned.fasta --reference src/reference.gb --metadata metadata.tsv --output-diagnostics results/nextstrain-diagnostics.tsv --output-flagged results/nextstrain-diagnostics-flagged.tsv --output-exclusion-list results/nextstrain-diagnostics-exclusion.txt

python scripts/qc.py
python src/mutations.py
python src/concern.py
Rscript src/num-sequences.R
Rscript src/num-voc-voi.R
Rscript src/top-lineages.R
Rscript src/ridoh-report.R
Rscript src/figures.R


###

Still need to install IQ-TREE and get it running

# Tree
$BIN/iqtree2 -s results/nextalign-references/ri_sequences_qc_references.aligned.fasta --prefix results/iqtree2 -st DNA -m GTR+F --mem 8G


#### Once complete, all paths here need to be updated to be accurate within the directory framework
#### Also need to merge with the download step so it is all one fluid process
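
Chained together, the steps above amount to a single wrapper script. The sketch below is not the repository's `run.sh`; it assumes the same relative paths used above (`results/`, `src/`, `scripts/`), assumes `metadata.tsv` already exists, and takes the input FASTA as its first argument.

```
#!/usr/bin/env bash
# Sketch of a single wrapper around the steps above -- NOT the repository's
# run.sh. Assumes the same relative paths (results/, src/, scripts/) and an
# existing metadata.tsv; the input FASTA is the first argument.
set -euo pipefail

FASTA="$1"
mkdir -p results

# Clade assignment and QC with Nextclade
nextclade run --input-dataset nextclade_dataset \
  --output-json results/nextclade.json --output-csv results/nextclade.csv \
  --output-tsv results/nextclade.tsv --output-tree results/nextclade.auspice.json \
  --input-qc-config src/qcRulesConfig.json "$FASTA" > results/nextclade.log

# Codon-aware alignment with Nextalign
nextalign run --genemap=src/genemap.gff \
  --genes=E,M,N,ORF10,ORF14,ORF1a,ORF1b,ORF3a,ORF6,ORF7a,ORF7b,ORF8,ORF9b,S \
  --output-all=results/nextalign --input-ref=src/reference.fasta "$FASTA"

# Diagnostics and per-sequence QC
python scripts/nextstrain-diagnostics.py \
  --alignment results/nextalign/nextalign.aligned.fasta \
  --reference src/reference.gb --metadata metadata.tsv \
  --output-diagnostics results/nextstrain-diagnostics.tsv \
  --output-flagged results/nextstrain-diagnostics-flagged.tsv \
  --output-exclusion-list results/nextstrain-diagnostics-exclusion.txt
python scripts/qc.py
python src/mutations.py
python src/concern.py

# Summary tables, report, and figures
for r in num-sequences num-voc-voi top-lineages ridoh-report figures; do
  Rscript "src/${r}.R"
done
```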
34 changes: 18 additions & 16 deletions README.md
@@ -1,20 +1,22 @@
# Covid19 Analysis Pipeline

Template for analyses repositories. For more information see https://compbiocore-brown.slab.com/posts/data-organisation-for-analysis-repos-fdi2cddd. Folders that should be present in all such repositories are:
## Directory Structure

* **0_data:** an empty directory into which sequences and metadata from GISAID are downloaded for analysis.
* **1_scripts:** contains the shell scripts that run the pipeline (```/covid19_analysis/1_scripts```). The Singularity image can be pulled directly to Oscar or to your local machine from the `1_scripts` directory using ```singularity pull covid19.sif docker://ericsalomaki/covid_new_pango:05092023```.
* **2_metadata:** contains the ```Dockerfile``` used to create the container for running the pipeline, a GFF file, a QC rules file, and the reference FASTA and GenBank files.
* **3_results:** will be created while the pipeline is running; results will be written to ```/covid19_analysis/3_results/${YYYYMMDD}``` (see the layout sketch below).

* **metadata:** contains the directory
```2_metadata``` which has a GFF file, QC rules file, and the reference fasta file and is located on oscar at ```/gpfs/data/ris3/2_metadata```; and the ```Dockerfile``` that was used to initially create the container for running the pipeline
* **scripts:** contains shell scripts to run the pipeline as reflected in ```/gpfs/data/ris3/1_scripts``` the singularity image which is also located in ```/gpfs/data/ris3/1_scripts``` can be pulled directly to oscar using ```singularity pull covid19.sif docker://ghcr.io/compbiocore/covid12162022:latest```
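
The layout described above is roughly:

```
covid19_analysis/
├── 0_data/      # GISAID sequences and metadata are downloaded here
├── 1_scripts/   # shell scripts and the pulled covid19.sif Singularity image
├── 2_metadata/  # Dockerfile, GFF file, QC rules, reference FASTA and GenBank files
└── 3_results/   # created at run time; results land in 3_results/${YYYYMMDD}
```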

## Running Pipeline via Oscar Slurm Batch Submission

To run the covid pipeline, navigate to ```/gpfs/data/ris3/1_scripts/``` and run:
To run the covid pipeline, navigate to ```/PATH/TO/CLONED/REPO/covid19_analysis/1_scripts/``` and run:
```
sbatch /gpfs/data/ris3/0_data/gisaid_20220926/run_slurm.sh /gpfs/data/ris3/PATH/TO/SEQUENCE/DATA
sbatch run_slurm.sh /ABSOLUTE/PATH/TO/SEQUENCE/DATA/covid_sequences.fasta
```
Results will be produced in ```/gpfs/data/ris3/3_results/${YYYYMMDD}```
Results will be produced in ```/covid19_analysis/3_results/${YYYYMMDD}```

A run with ~20,000 input sequences takes roughly 8 hours to complete
A run with ~20,000 input sequences takes roughly 30 minutes on Oscar (24 threads, 128G RAM allocated) to complete the primary Pangolin analyses and produce figures; however, the IQ-TREE analysis will run for several days. IQ-TREE writes checkpoints, so if the job is incomplete the analysis can be resumed beyond the allocated time if necessary.
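
If the IQ-TREE step exceeds its allocation, re-running the same command with the same `--prefix` typically resumes it from the `*.ckp.gz` checkpoint. A minimal sketch, reusing the command from `Pipeline.txt`:

```
# Re-running with the same --prefix resumes from results/iqtree2.ckp.gz;
# add -redo only if you want to discard the checkpoint and start over.
$BIN/iqtree2 -s results/nextalign-references/ri_sequences_qc_references.aligned.fasta \
    --prefix results/iqtree2 -st DNA -m GTR+F --mem 8G
```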


## Running Pipeline via Oscar Interactive Session
@@ -23,38 +25,38 @@ To run this pipeline in an interactive session, first enter a screen `screen -S JOBNAME`

Navigate to the `1_scripts` directory:
```
cd /gpfs/data/ris3/1_scripts
cd /PATH/TO/CLONED/REPO/covid19_analysis/1_scripts
```

Enter the singularity container and mount the parent directory:

```
singularity exec -B /gpfs/data/ris3/ /gpfs/data/ris3/1_scripts/covid12162022_latest.sif bash
singularity exec -B /ABSOLUTE/PATH/TO/CLONED/REPO/covid19_analysis/ /PATH/TO/CLONED/REPO/covid19_analysis/1_scripts/covid19.sif bash
```

Once inside the container, run:

```
bash run.sh /gpfs/data/ris3/PATH/TO/SEQUENCE/DATA
bash run.sh /ABSOLUTE/PATH/TO/SEQUENCE/DATA/covid_sequences.fasta
```

To detach from the screen press `Ctrl + a`, then `d`; to return, use `screen -r JOBNAME`
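
A typical `screen` workflow around the steps above looks like the sketch below (JOBNAME is whatever session name you chose):

```
screen -S JOBNAME    # start a named session
# ... run the container and pipeline as above ...
# press Ctrl + a, then d, to detach
screen -ls           # list detached sessions
screen -r JOBNAME    # reattach later
```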

Results will be produced in `/gpfs/data/ris3/3_results/${YYYYMMDD}`
Results will be produced in `/PATH/TO/CLONED/REPO/covid19_analysis/3_results/${YYYYMMDD}`

## Example Usage
```
sbatch /gpfs/data/ris3/0_data/gisaid_20220926/run_slurm.sh /gpfs/data/ris3/0_data/gisaid_20220926/sequenceData.fasta
sbatch /PATH/TO/CLONED/REPO/covid19_analysis/1_scripts/run_slurm.sh /PATH/TO/CLONED/REPO/covid19_analysis/0_data/sequenceData.fasta
```

# CBC Project Information

```
title: Covid19 docker container
title: Covid19 analysis pipeline
tags:
analysts:
git_repo_url:
resources_used: Pangolin, Nextclade, Nextalign
git_repo_url: https://github.com/compbiocore/covid19_analysis
resources_used: Pangolin, Nextclade, Nextalign, IQ-Tree, R
summary:
project_id:
```
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Binary file removed metadata/.DS_Store
18 changes: 0 additions & 18 deletions metadata/2_metadata/genemap.gff

This file was deleted.

22 changes: 0 additions & 22 deletions metadata/2_metadata/qcRulesConfig.json

This file was deleted.
