Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Str conda #232

Open
wants to merge 19 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
19 commits
Select commit Hold shift + click to select a range
8af0718
yaml for EHDN report rule with required R and python packages
aarthi-mohan Apr 29, 2024
478acf8
added conda env for rule EHdn_report
aarthi-mohan Apr 29, 2024
c28b23c
readme for str scripts
aarthi-mohan Aug 12, 2024
54e5ff3
move str scripts to script/str directory
aarthi-mohan Aug 14, 2024
431436e
fixed a bug: missing 2nd argument for 1000G counts per STR and correc…
aarthi-mohan Aug 16, 2024
072caf9
ehdn: split script into new rules, and add env
aarthi-mohan Aug 16, 2024
d2f96be
moved hardcoded paths in various EHDN script to config_hpf.yaml
aarthi-mohan Aug 26, 2024
e9f1f6d
split the ehdn_report rule into outlier detection and annotation with…
aarthi-mohan Aug 26, 2024
2322850
added missing scripts reqd for ehdn
aarthi-mohan Aug 26, 2024
13380b8
yaml for dbscan outlier step + small var fix + removed hard-coded pat…
aarthi-mohan Aug 28, 2024
2701159
minor fix and added missing package to yaml
aarthi-mohan Aug 30, 2024
8d8f458
Merge remote-tracking branch 'origin/master' into str-conda
aarthi-mohan Sep 23, 2024
33e6e26
added and updated references from "arun" folder to "ccm_dccforge/dccd…
aarthi-mohan Oct 2, 2024
de9d8ed
changed hard-coded paths to arguments; removed dates from out filenames
aarthi-mohan Oct 2, 2024
3e2b5d3
split dbscan rule into two: dbscan outlier and merge epxansion
aarthi-mohan Oct 2, 2024
586bf49
summary of the changes done
aarthi-mohan Oct 2, 2024
db5429a
fixed output path
aarthi-mohan Oct 2, 2024
2a3f247
updated changes
aarthi-mohan Oct 15, 2024
32ed7af
updated md
aarthi-mohan Oct 15, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 10 additions & 1 deletion config_hpf.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -23,10 +23,11 @@ tools:
cre: "~/cre"
crg: "~/crg"
crg2: "~/crg2"
ehdn: "/hpf/largeprojects/ccmbio/arun/Tools/EHDN.TCAG/ExpansionHunterDenovo-v0.7.0"
ehdn: "/hpf/largeprojects/ccm_dccforge/dccdipg/Common/crg2-non-conda-tools/EHDN.TCAG/ExpansionHunterDenovo-v0.7.0"
mity: "/hpf/largeprojects/ccmbio/ajain/mity/mity_package/bin"
melt: "/hpf/largeprojects/ccm_dccforge/dccdipg/Common/crg2-non-conda-tools/MELT/MELTv2.2.2/MELT.jar"
orad: "/hpf/largeprojects/ccm_dccforge/dccdipg/Common/crg2-non-conda-tools/orad_2_6_1/orad"
annovar: "/hpf/largeprojects/ccm_dccforge/dccdipg/Common/crg2-non-conda-tools/annovar"


ref:
Expand Down Expand Up @@ -123,6 +124,14 @@ annotation:
1000g: "/hpf/largeprojects/ccm_dccforge/dccdipg/Common/annotation/ExpansionHunter/1000G_EH_v1.0.tsv"
ehdn:
files: "/hpf/largeprojects/ccm_dccforge/dccdipg/Common/annotation/ExpansionHunterDenovo/"
g1k_outlier: "/hpf/largeprojects/ccm_dccforge/dccdipg/Common/annotation/ExpansionHunterDenovo/1000G_outlier"
g1k_manifest: "/hpf/largeprojects/ccm_dccforge/dccdipg/Common/annotation/ExpansionHunterDenovo/manifest.1000G.txt"
g1k_samples: "/hpf/largeprojects/ccm_dccforge/dccdipg/Common/annotation/ExpansionHunterDenovo/1000G.samples.txt"
trf: "/hpf/largeprojects/ccm_dccforge/dccdipg/Common/annotation/ExpansionHunterDenovo/UCSC_simple_repeats_hg19_coord_motif.tsv"
omim: "/hpf/largeprojects/ccm_dccforge/dccdipg/Common/annotation/ExpansionHunterDenovo/OMIM_hgnc_join_omim_phenos_2023-06-22.tsv"
gnomad: "/hpf/largeprojects/ccm_dccforge/dccdipg/Common/annotation/ExpansionHunterDenovo//gnomad.v2.1.1.lof_metrics.by_gene.txt"
annovar_db: "/hpf/largeprojects/ccm_dccforge/dccdipg/Common/annotation/annovar/humandb/"

cre:
database_path: "/hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/variation/"
melt:
Expand Down
45 changes: 45 additions & 0 deletions documentation/str.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
# ExpansionHunterDenovo version:
EHDN v0.7.0 is not available from GitHub or conda anymore, and the 1000G profiles from Brett Trost are based on this, so keeping this version; tool path in 'config.yaml' is updated to
`config['tools']['ehdn']: "/hpf/largeprojects/ccm_dccforge/dccdipg/Common/crg2-non-conda-tools/EHDN.TCAG/ExpansionHunterDenovo-v0.7.0"`

# Description of scripts used before the changes:
## BASH and R scripts:

1. scripts are fetched from `crg/str` repo
2. ~/crg/crg.ehdn.sh: runs EHDN with many checks/searches for the relevant bam files based on run type: exome/genome and cre/crg/crg2 pipeline directory structures
3. ~/crg/ehdn_report.sh: main script that uses all the R packages from ~/crg/str, 1000G EHDN profiles, and generates script
- ~/crg/str/DBSCAN.EHdn.parallel.R: library(dbscan), library(ggplot2), library(data.table), library(parallel), library(doSNOW)
- ~/crg/str//mergeExpansions.R: library(data.table), library(GenomicRanges), library(ggplot2), library(cowplot), library(Biostrings)


## Python scripts:
1. ~/crg/str/compare_anchored_irrs.py: depends on "core" package present locally
2. ~/crg/str/find_outliers.py: numpy, matplotlib
3. ~/crg/str/generate_EH_genotype_table.generic.py: argparse, glob, BTlib(local), collections, pandas, path
4. ~/crg/str/combine_counts.py: argparse, json, core(local)
5. ~/crg/str/BTlib.py: re, statistics, string, random, itertools, copy, collections, numpy, pandas, copy, docx
6. ~/crg/str/add_gene+threshold_to_EH_column_headings2.py: argparse
7. ~/crg/str/format_for_annovar.py: collections, re
8. ~/crg/str/format_from_annovar.py: pandas, xlsxwriter


# Summary of changes related to issue 142:
1. Moved all required scripts from `~/crg` and `~/cre` to `~/crg2/scripts/str`
2. Removed hard-coded paths from str related scripts and added them to `config.yaml` to be passed as arguments via snakemake params.
3. Moved following annotations and tools to /hpf/largeprojects/ccm_dccforge/dccdipg/Common/:
annotation/ExpansionHunterDenovo/UCSC_simple_repeats_hg19_coord_motif.tsv
crg2-non-conda-tools/EHDN.TCAG/ExpansionHunterDenovo-v0.7.0
annotation/ExpansionHunterDenovo/1000G_JSON
updated annotation/ExpansionHunterDenovo/manifest.1000G.txt
4. Split the single "EHdn_report" rule into four, with separate env for each rule
- EHDN_mark_outliers:
- EHDN_DBSCAN_outlier
- EHDN_merge_expansions
- EHDN_annovar
4. Created `envs/ehdn-dbscan.yaml` for rules "EHDN_DBSCAN_outlier" and "EHDN_merge_expansions". The version of the R-packages are not the same as in `ccmmarvin`. If I constrain them to versions available in 'ccmmarvin`, then conda fails to create env due to conflicts. The versions installed on ccmmarvin were done manually in 2020, not via conda,
5. Fixed minor bugs, and removed unwanted yaml.
6. Edited above R scripts to remove hard-coded file paths, add command-line arguments, and removed suffixing output with date, as the Snakemake rule requires the output names be known before execution (using dynamic only works if all outputs from a rule are dynamic)
7. Tested with NA12878



12 changes: 12 additions & 0 deletions envs/ehdn-dbscan.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
channels:
- bioconda
- conda-forge
dependencies:
- r
- r-dbscan=1.1-8
- r-ggplot2
- r-data.table
- r-doSNOW=1.0.19
- r-cowplot
- bioconductor-genomicranges
- bioconductor-biostrings
24 changes: 16 additions & 8 deletions envs/ehdn-report.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2,12 +2,20 @@ channels:
- bioconda
- conda-forge
dependencies:
- python=3.7.8
- python-docx=0.8.10
- pandas=1.1.2
- numpy=1.19.1
- scipy=1.5.2
- xlsxwriter=1.3.7
- r-base=3.4.1
- r-dbscan=1.1.5
- r-ggplot2
- r-data.table
- r-doSNOW
- r-cowplot
- bioconductor-genomicranges
- bioconductor-genomeinfodb
- bioconductor-genomeinfodbdata
- bioconductor-biostrings
- r-ggplot2=3.1.0
- r-data.table=1.11.4
- r-doSNOW=1.0.20
- r-cowplot=0.9.3
- bioconductor-genomicranges=1.32.7
- bioconductor-genomeinfodb=1.16.0
- bioconductor-genomeinfodbdata=1.1.0
- bioconductor-biostrings=2.48.0

3 changes: 2 additions & 1 deletion envs/ehdn.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3,4 +3,5 @@ channels:
- conda-forge
dependencies:
- python =3.7.7
- scipy =1.7.3
- scipy =1.7.3
- numpy
10 changes: 10 additions & 0 deletions envs/ehdn_outlier.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
channels:
- bioconda
- conda-forge
dependencies:
- python=3.7.8
- python-docx=0.8.10
- pandas=1.1.2
- numpy
- matplotlib
- xlsxwriter=1.3.7
142 changes: 111 additions & 31 deletions rules/str.smk
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ rule EH:
bam = "str/EH/{family}_{sample}_realigned.bam"
params:
ref = config["ref"]["genome"],
sex = lambda w: "`sh {}/scripts/str_helper.sh decoy_rm/{}_{}.no_decoy_reads.bam`".format(workflow.basedir, w.family, w.sample),
sex = lambda w: "`sh {}/scripts/str/str_helper.sh decoy_rm/{}_{}.no_decoy_reads.bam`".format(workflow.basedir, w.family, w.sample),
catalog = config["annotation"]["eh"]["catalog"]
log:
"logs/str/{family}_{sample}-EH.log"
Expand All @@ -31,28 +31,28 @@ rule EH_report:
conda:
"../envs/eh-report.yaml"
shell:
'''
echo "generating multi-sample genotypes" > {log}
python {params.crg2}/scripts/generate_EH_genotype_table.generic.py str/EH > {output.tsv}
echo "annotating gene name & size threshold" > {log}
python {params.crg2}/scripts/add_gene+threshold_to_EH_column_headings2.py {output.tsv} {params.trf} > {output.annot}
echo "generating final xlsx file" > {log}
python {params.crg2}/scripts/eh_sample_report.py {output.annot} {params.g1000} {output.xlsx}
"""
echo "generating multi-sample genotypes" >> {log}
python {params.crg2}/scripts/str/generate_EH_genotype_table.generic.py str/EH > {output.tsv}
echo "annotating gene name & size threshold" >> {log}
python {params.crg2}/scripts/str/add_gene+threshold_to_EH_column_headings2.py {output.tsv} {params.trf} > {output.annot}
echo "generating final xlsx file" >> {log}
python {params.crg2}/scripts/str/eh_sample_report.py {output.annot} {params.g1000} {output.xlsx}
prefix=`echo {output.xlsx} | awk '{{split($1,a,".xlsx"); print a[1]; }}'`;
d=`date +%Y-%m-%d`
outfile="${{prefix}}.${{d}}.xlsx";
echo "Copying final report to filaname with timestamp: $outfile" > {log}
echo "Copying final report to filaname with timestamp: $outfile" >> {log}
cp {output.xlsx} $outfile
'''
"""

rule EHdn:
input: expand("decoy_rm/{family}_{sample}.no_decoy_reads.bam",family=config["run"]["project"], sample=samples.index)
output:
json = "str/EHDN/{family}_EHDN_str.tsv"
params:
crg = config["tools"]["crg"],
crg2 = config["tools"]["crg2"],
ehdn = config["tools"]["ehdn"],
ehdn_files = config["annotation"]["ehdn"]["files"],
g1k_manifest = config["annotation"]["ehdn"]["g1k_manifest"],
ref = config["ref"]["genome"],
# ref = config["ref"]["genome"],
# irr_mapq = config["params"]["EHDN"]["irr_mapq"],
Expand All @@ -63,27 +63,107 @@ rule EHdn:
"../envs/ehdn.yaml"
shell:
'''
EHDN={params.ehdn} EHDN_files={params.ehdn_files} ref={params.ref} script_dir={params.crg} sh {params.crg}/crg.ehdn.sh {wildcards.family} crg2
EHDN={params.ehdn} g1k_manifest={params.g1k_manifest} ref={params.ref} script_dir={params.crg2}/scripts/str sh {params.crg2}/scripts/str/crg.ehdn.sh {wildcards.family} crg2
'''

rule EHdn_report:
#rule EHdn_report:
# input: "str/EHDN/{family}_EHDN_str.tsv".format(family=config["run"]["project"])
# output: "report/str/{family}.EHDN.xlsx"
# log: "logs/str/{family}_EHdn_report.log"
# params:
# repdir = "report/str",
# crg2 = config["tools"]["crg2"],
# family = config["run"]["project"],
# outdir = "str/EHDN",
# conda: "../envs/ehdn-report.yaml"
# shell:
# '''
# export PATH="/hpf/largeprojects/ccmbio/naumenko/tools/bcbio/bin/:$PATH"
# sh {params.crg2}/scripts/str/ehdn_report.sh {params.family} {params.outdir}
# date=`date +%Y-%m-%d`;
# f={params.outdir}/outliers/{params.family}.EHDN.${{date}}.xlsx;
# if [ -f $f ]; then
# if [ ! -d {params.repdir} ]; then mkdir -p {params.repdir}; fi;
# mv $f {params.repdir}
# cp {params.repdir}/{params.family}.EHDN.${{date}}.xlsx {params.repdir}/{params.family}.EHDN.xlsx
# else exit; fi;
# '''

rule EHDN_mark_outliers:
input: "str/EHDN/{family}_EHDN_str.tsv".format(family=config["run"]["project"])
output: "report/str/{family}.EHDN.xlsx"
log: "logs/str/{family}_EHdn_report.log"
output: "str/EHDN/{family}_outliers.txt".format(family=config["run"]["project"])
params:
repdir = "report/str",
crg = config["tools"]["crg"],
family = config["run"]["project"],
outdir = "str/EHDN",
crg2 = config["tools"]["crg2"],
g1k_outlier = config["annotation"]["ehdn"]["g1k_outlier"]
conda:
"../envs/ehdn.yaml"
shell:
'''
PATH="/hpf/largeprojects/ccmbio/naumenko/tools/bcbio/bin/:$PATH"
sh {params.crg}/ehdn_report.sh {params.family} {params.outdir}
date=`date +%Y-%m-%d`;
f={params.outdir}/outliers/{params.family}.EHDN.${{date}}.xlsx;
if [ -f $f ]; then
if [ ! -d {params.repdir} ]; then mkdir -p {params.repdir}; fi;
mv $f {params.repdir}
cp {params.repdir}/{params.family}.EHDN.${{date}}.xlsx {params.repdir}/{params.family}.EHDN.xlsx
else exit; fi;
'''
"""
python {params.crg2}/scripts/str/find_outliers.py {input} {output}
cat {output} {params.g1k_outlier} > temp && mv temp {output}
"""

rule EHDN_DBSCAN_outlier:
input:
profile = "str/EHDN/{family}_EHDN_str.tsv".format(family=config["run"]["project"]),
outliers = "str/EHDN/{family}_outliers.txt".format(family=config["run"]["project"]),
output:
clean="str/EHDN/outliers/clean.samples.txt",
expansions="str/EHDN/outliers/EHdn.expansions.tsv",
tr="str/EHDN/outliers/samples.with.TRcount.txt"

params:
outdir="str/EHDN/outliers",
crg2=config["tools"]["crg2"],
trf=config["annotation"]["ehdn"]["trf"],
g1k_samples=config["annotation"]["ehdn"]["g1k_samples"]
conda: "../envs/ehdn-dbscan.yaml"
shell:
"""
Rscript {params.crg2}/scripts/str/DBSCAN.EHdn.parallel.R --infile {input.profile} --outpath {params.outdir} --outlierlist {input.outliers} --a1000g {params.g1k_samples} --exp {output.expansions}
echo "Finished running DBSCAN.EHdn.parallel.R";
echo "`ls {params.outdir}`";

"""

rule EHDN_merge_expansions:
input: "str/EHDN/outliers/EHdn.expansions.tsv", "str/EHDN/{family}_EHDN_str.tsv".format(family=config["run"]["project"])
output: "str/EHDN/outliers/merged.rare.expansions.tsv","str/EHDN/outliers/merged.rare.expansions.forannotation.tsv",
"str/EHDN/outliers/map.TRF.EHdn.0.66.tsv", "str/EHDN/outliers/merged.ehdn.tsv",
params:
outdir="str/EHDN/outliers",
crg2=config["tools"]["crg2"],
trf=config["annotation"]["ehdn"]["trf"],
g1k_samples=config["annotation"]["ehdn"]["g1k_samples"]
conda: "../envs/ehdn-dbscan.yaml"
shell:
"""
Rscript {params.crg2}/scripts/str/mergeExpansions.R --rscript {params.crg2}/scripts/str/ExpansionAnalysisFunctions.R --ehdn {input[1]} --trf {params.trf} --outlier {input[0]} --outpath {params.outdir}
"""


rule EHDN_annovar:
input:
merged_exp = "str/EHDN/outliers/merged.rare.expansions.tsv",
ehdn_exp = "str/EHDN/outliers/EHdn.expansions.tsv"
output:
merged_rare_exp = "str/EHDN/outliers/merged.rare.EHdn.expansion.tsv",
omim_out = "str/EHDN/outliers/merged.rare.EHdn.expansion-OMIM.hg19_multianno.txt",
gnomad_out = "str/EHDN/outliers/merged.rare.EHdn.expansion-gnoMAD.hg19_multianno.txt",
xlsx="report/str/{family}.EHDN.xlsx".format(family=config["run"]["project"])
params:
crg2 = config["tools"]["crg2"],
g1k_manifest = config["annotation"]["ehdn"]["g1k_manifest"],
annovar = config["tools"]["annovar"],
annovar_db = config["annotation"]["ehdn"]["annovar_db"],
omim = config["annotation"]["ehdn"]["omim"],
gnomad = config["annotation"]["ehdn"]["gnomad"],
prefix = "str/EHDN/outliers/merged.rare.EHdn.expansion"
conda: "../envs/eh-report.yaml"
shell:
"""
python {params.crg2}/scripts/str/format_for_annovar.py {input.ehdn_exp} {input.merged_exp} {params.g1k_manifest}
{params.annovar}/table_annovar.pl {output.merged_rare_exp} {params.annovar_db} -buildver hg19 -outfile {params.prefix}"-OMIM" -remove --onetranscript --otherinfo -protocol refGene -operation gx -nastring . -xreffile {params.omim}
{params.annovar}/table_annovar.pl {output.merged_rare_exp} {params.annovar_db} -buildver hg19 -outfile {params.prefix}"-gnoMAD" -remove --onetranscript --otherinfo -protocol refGene -operation gx -nastring . -xreffile {params.gnomad}
python {params.crg2}/scripts/str/format_from_annovar.py {output.gnomad_out} {output.omim_out} {output.xlsx}
"""
File renamed without changes.
Loading