Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dev-ratio] add deseq2 and gprofiler2 into the experimental branch #306

Closed
wants to merge 6 commits into from

Conversation

suzannejin
Copy link

@suzannejin suzannejin commented Oct 22, 2024

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/differentialabundance branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nf-test test main.nf.test -profile test,docker).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

Copy link

github-actions bot commented Oct 22, 2024

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 4343c6b

+| ✅ 320 tests passed       |+
#| ❔   7 tests were ignored |#
!| ❗   4 tests had warnings |!

❗ Test warnings:

  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 3.0.2
  • Run at 2024-10-23 17:14:38

@suzannejin suzannejin changed the title [dev-ratio] add gprofiler2 into the experimental branch [dev-ratio] add deseq2 and gprofiler2 into the experimental branch Oct 22, 2024
@@ -490,7 +490,7 @@ process {
// However, later we need to better handle this, maybe by a bit of groovy scripting to
// overwrite the repeated parameters (?)

withName: "PROPR"{
withName: PROPR {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
withName: PROPR {
withName: 'PROPR' {

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency I'd use ' throughout the config

Comment on lines +629 to +632
(meta.diff_method ? "-from-${meta.diff_method}" : ''),
(meta.args_diff ? "-${meta.args_diff.replace('--','').replace(' ', '_')}" : ''),
(meta.cor_method ? "-from-${meta.cor_method}" : ''),
(meta.args_cor ? "-${meta.args_cor.replace('--','').replace(' ', '_')}" : ''),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With splitting the toolsheet information into only the subworkflow relevant chunks, the data accessed here wouldn't be existing anymore. I'm not sure such an elaborate file naming is necessary in this case - could it not be reduced to the pathway name and args_enr? (I dont see a reason for 2 different pathways with identical args_diff, args_cor and args_enr to exist)

Copy link
Author

@suzannejin suzannejin Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but if you have two pathways:

  1. deseq2 + gprofiler2
  2. deseq2 + gsea

Would you save same deseq2 output twice, but into two different pathway directories?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm talking only about the enrichment methods like the GREA process here, where you would access not only enr arguments but also need to pass over all other args from previous steps to build the filename in the way you suggest. So, what I mean is to use only args_enr and if this is not specific enough maybe additionally the pathway_name.
For deseq2 in the differential analysis step using args_diff to build the name is fine.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but now that we don't have the pathway name info inside the meta anymore, we don't have any way to distinguish between them and store them into different places (?)

pathway = "propd,propd_grea,propr,cor"

// analysis
pathway = "propd,propd_grea,propd_ora,propr,cor,deseq2_ora"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You think propd_fdr and pcorbshrink can be safely omitted from the test?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

propd_fdr is currently a bit useless for this test, because it is not giving any significant results for the test dataset, so it can be ignored.
But I would have to check other datasets for it to run and test

Then instead of pcorbshrink maybe I would ignore cor

Comment on lines +5 to +6
include { DESEQ2_DIFFERENTIAL as DESEQ2 } from '../../../modules/nf-core/deseq2/differential/main'
include { FILTER_DIFFTABLE as FILTER_DESEQ2 } from '../../../modules/local/filter_difftable'
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally prefer to use renaming only when necessary, i.e. when a module is included multiple times. That way you always directly know for nf-core modules in which folder the module code is located, without needing to check the include statement.

Comment on lines +62 to +77
ch_results_genewise_filtered
.filter { it[0]["enr_method"] == "gprofiler2" }
.combine(ch_gene_sets)
.combine(ch_counts)
.multiMap { meta_results, results, meta_gene_sets, gene_sets, meta_counts, counts ->
de : [meta_results, results]
gmt : [gene_sets]
background : [counts]
}
.set{ ch_enrichment_gprofiler2 }

GPROFILER2_GOST(
ch_enrichment_gprofiler2.de,
ch_enrichment_gprofiler2.gmt,
ch_enrichment_gprofiler2.background
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a reason for the multiMap here.

Why not something like

Suggested change
ch_results_genewise_filtered
.filter { it[0]["enr_method"] == "gprofiler2" }
.combine(ch_gene_sets)
.combine(ch_counts)
.multiMap { meta_results, results, meta_gene_sets, gene_sets, meta_counts, counts ->
de : [meta_results, results]
gmt : [gene_sets]
background : [counts]
}
.set{ ch_enrichment_gprofiler2 }
GPROFILER2_GOST(
ch_enrichment_gprofiler2.de,
ch_enrichment_gprofiler2.gmt,
ch_enrichment_gprofiler2.background
)
GPROFILER2_GOST(
ch_results_genewise_filtered.filter { it[0]["enr_method"] == "gprofiler2" },
ch_gene_sets.first()[1],
ch_counts.first()[1]
)

Comment on lines +33 to +45
// parse optional input files that affect the normalization
// TODO we should consider to put this kind of stuff in a separate data handling / preprocessing / normalization block
if (params.control_features) {
ch_control_features = Channel.of([ [ "id": params.study_name ], file(params.control_features, checkIfExists: true)]).first()
} else {
ch_control_features = [[],[]]
}
if (params.transcript_length_matrix) {
ch_transcript_lengths = Channel.of([ [ "id": params.study_name ], file(params.transcript_length_matrix, checkIfExists: true)]).first()
} else {
ch_transcript_lengths = [[],[]]
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is Deseq2 specific, wouldn't it be better to move it into the differential_analysis subworkflow?

Copy link
Author

@suzannejin suzannejin Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is good to call params within the subworkflows? I thought that should be avoided for a better encapsulation of the subworflow?
Also, not now, but in the future, I think it make sense to have two separated blocks with DESEQ2_NORM dealing the normalization of the data (so in the data processing block) that will use these control/transcript length features, and DESEQ2_DIFFERENTIAL running the differential analysis on the already normalized data

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I see your point, but I would argue that we have a better encapsulation, when a subworkflow can be so transparently changed (i.e. by adding a new module) that there is no need for making any changes to the calling workflow.

@suzannejin suzannejin closed this Oct 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants