-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Adding somatic neoepitope preparation workflow #495
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very good, thanks a lot for your efforts. I think that there are still a few points to be fixed, but otherwise it is great.
tests/snappy_pipeline/workflows/test_workflows_somatic_neoepitope_prediction.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last bit: ensure that the app can deal with columns selected by number rather than by name.
Perhaps it would be good to have the choice of 4 columns
- The gene or transcript id (with or without version)
- The TPMs
- If there is no value for the TPM column, then the counts and
- If there is no value for the TPM column, then the feature lengths.
If there if no TPM column and no feature length column, then the gtf is used to get the length of transcripts, and an approximate value for the gene lengths.
snappy_pipeline/workflows/somatic_neoepitope_prediction/__init__.py
Outdated
Show resolved
Hide resolved
snappy_pipeline/workflows/somatic_neoepitope_prediction/__init__.py
Outdated
Show resolved
Hide resolved
mapper="star", | ||
library_name=rna_library, | ||
) | ||
ext = {"expression", "bam", "bai"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this statement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nesting is also pretty deep, perhaps it can be restructured (by breaking it into smaller chunks / moving common stuff into dedicated functions) a bit to make it a bit more palatable
"{mapper}.{var_caller}.{anno_caller}.{tumor_library}" | ||
) | ||
# Need to change for work on many different tools | ||
key_ext = {"vcf": ".full.vcf.gz", "vcf_tbi": ".full.vcf.gz.tbi"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The choice between the full annotation .full.vcf.gz
and the selected transcript only .vcf.gz
may be left to the user.
tools_somatic_variant_calling: [] # deafult to those configured for somatic_variant_calling | ||
max_depth: "4000" | ||
preparation: | ||
format: 'star' # REQUIRED - The file format of the expression file to process. (stringtie,kallisto,cufflinks,custom) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed before, snappy
now outputs a 2-columns GeneCounts.tab
file: the gene id in the first, and the counts in the second. Therefore, I suggest that the defaults should be:
format: "custom"
id-column: 1
expression-column: 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some change suggestions from a different perspective than Eric's ;)
snappy_pipeline/workflows/somatic_neoepitope_prediction/Snakefile
Outdated
Show resolved
Hide resolved
mapper="star", | ||
library_name=rna_library, | ||
) | ||
ext = {"expression", "bam", "bai"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nesting is also pretty deep, perhaps it can be restructured (by breaking it into smaller chunks / moving common stuff into dedicated functions) a bit to make it a bit more palatable
snappy_pipeline/workflows/somatic_neoepitope_prediction/__init__.py
Outdated
Show resolved
Hide resolved
snappy_pipeline/workflows/somatic_neoepitope_prediction/__init__.py
Outdated
Show resolved
Hide resolved
snappy_pipeline/workflows/somatic_neoepitope_prediction/__init__.py
Outdated
Show resolved
Hide resolved
Co-authored-by: Till Hartmann <[email protected]>
…__.py Co-authored-by: Till Hartmann <[email protected]>
…m:bihealth/snappy-pipeline into 472-adding-neoepitope-prediction-pipeline
9664352
to
bf39678
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many comments here are general musings about how to improve snappy
in general. We should discuss them at some point, and involve Till as much as possible.
) | ||
yield "combine_vcf", prepare_tpl | ||
hla_typing = self.parent.sub_workflows["hla_typing"] | ||
hla_tpl = "output/optitype.{ngs_library}/out/optitype.{ngs_library}.txt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, the HLA caller tools shouldn't be hard-coded.
However, to keep it general, we should consider the possibility that some HLA caller tools require mapped data.
So there is a possible file naming issue there, and the code needs to do something like the pseudo-code below:
hla_caller_prefix = ""
# Test if the selected HLA caller gets its input from the ngs_mapping step
if "path_ngs_mapping" in self.w_config.step_config.hla_typing[self.config.hla_caller].keys():
hla_caller_prefix += f"{self.config.mapper}."
hla_caller_prefix += f"{self.config.hla_caller}"
This is ugly, likely to fail (I believe that path_ngs_mapping
is defined regardless of its need in the hla_typing
step model) and I don't like it.
Ideally, I think that each step should provide a facility to get the naming prefix, or something like that.
But perhaps there are better ideas, or perhaps we have to live with ugly solutions.
Ideas, comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why the rules for tools must be put under an if
statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have discussed with Till about this bug. The rule hla_typing_arcashla_run gets evaluated, even that arcashla is not defined. So this is a quick fix for this bug.
key, | ||
ngs_mapping( | ||
(rna_tpl + ext).format( | ||
mapper=self.w_config.step_config["ngs_mapping"].tools.rna[0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case there are many tools to quantify expression, I think it's better to give the choice of mapper to the user, rather than take the first one.
Ideally, I would have a rna_mapper
entry in the configuration, so the user can choose the one she likes. If it's empty, then snappy
takes the first rna mapper defined the the ngs_mapping
configuration step.
def _get_output_files_prepare(self): | ||
if self.config["preparation"]["mode"] == "gene": | ||
prefix = ( | ||
"work/prepare/{mapper}.{var_caller}.{anno_caller}.GX.{tumor_library}/out/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to store the prepared vcf in work/{mapper}.{var_caller}.{anno_caller}.(GX|TX).{tumor_library}/prepare
.
It is more like the rest of the pipeline, where the rule is on subdirectory per sample in work
.
The only exceptions are input_links
and docker containers, R packages, & any files used during the creation of every output.
(Actually, when I think of it, there should not be a input_links
directory, it would be better to have work/xxx.{library}/input_links
, what do you think)
@dictify | ||
def _get_output_files_predict(self): | ||
key_ext = {"all_epitopes": ".all_epitopes.tsv", "filtered_epitopes": ".filtered.tsv"} | ||
prefix = "work/predict/{mapper}.{var_caller}.{anno_caller}.{mode}.{tumor_library}.epitopes/out/MHC_Class_I/{tumor_library}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, work/{mapper}.{caller}.{anno_caller}.pvacseq.{tumor_library}/out/{mapper}.{caller}.{anno_caller}.pvacseq.{tumor_library}.{mode}.MHC_Class_I
might be a better prefix.
Another comment: I am not too keen on including all tools in directories & filenames. If we want to follow the rule strictly, we should add the HLA typing tool, and the RNA mapper. We might also add adapter trimming, for example.
I think it is unmanageable to add all the tools, and we might need to think about alternative naming schemes.
Ideas, comments?
else "" | ||
) | ||
|
||
op_dir = "/".join(snakemake.output.all_epitopes.split("/")[:-2]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace "/".join
with os.path.join
|
||
op_dir = "/".join(snakemake.output.all_epitopes.split("/")[:-2]) | ||
|
||
files_to_bind = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure that currently you need to bind the any directory. My understanding is that the pvacseq
command requires only files within the step directory, both for input and output.
__author__ = "Pham Gia Cuong" | ||
__email__ = "[email protected]" | ||
|
||
step = snakemake.config["pipeline_step"]["name"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
step
& config
should not be needed anymore, is that correct?
{maximum_transcript_support_level}" | ||
echo 'TMPDIR=/bindings/d2' > $TMPDIR/{snakemake.wildcards.tumor_library}.sh | ||
echo $cmd >> $TMPDIR/{snakemake.wildcards.tumor_library}.sh | ||
apptainer exec --home $PWD -B $TMPDIR:/bindings/d2 {bindings} {config[path_container]} bash /bindings/d2/{snakemake.wildcards.tumor_library}.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The path to the container should be passed to the wrapper as a parameter.
{NORMAL_VAF}\ | ||
{exclude_NAs}\ | ||
{maximum_transcript_support_level}" | ||
echo 'TMPDIR=/bindings/d2' > $TMPDIR/{snakemake.wildcards.tumor_library}.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the command is written to a temporary script? Isn't it possible to pass it as a string?
Adding a draft workflow for somatic neoepitope preparation for using pvactool later