Multi-tool functionality and subworkflows as hub of methods #385

Open
5 of 19 tasks
suzannejin opened this issue Dec 10, 2024 · 14 comments
Comments

@suzannejin

suzannejin commented Dec 10, 2024

Goals

  • Make it possible for the pipeline to run multiple combinations of tools (e.g. limma + gProfiler2, DESeq2 + GSEA) at once, by:
    • Defining a toolsheet
    • Handling the channels properly so that several combinations can run in parallel
  • Create subworkflows based on method class, as a place to easily add new methods of the same kind:
    • Differential subworkflow that calls DE methods
    • Enrichment subworkflow that calls functional analysis methods
  • Add new methods

Context

Some effort was made in the dev-ratio branch to explore these options.
Now the plan is to break the work down into small pieces, clean up the code, and open PRs to dev.

Steps needed

Other related features

@suzannejin suzannejin converted this from a draft issue Dec 10, 2024
@grst
Member

grst commented Dec 10, 2024

Regarding the "Toolsheet", how does that relate to what we proposed in #362?

@suzannejin suzannejin changed the title from "Multi-tool functionality" to "Multi-tool functionality and subworkflows as hub of methods" Dec 10, 2024
@suzannejin
Author

suzannejin commented Dec 10, 2024

Regarding the "Toolsheet", how does that relate to what we proposed in #362?

The toolsheet decides which DE and functional analysis methods to run. An example is here. This is the default toolsheet, where each row is a combination of tools that make sense together.

The idea is that the user can select, for example, --pathway deseq2_gsea,limma_gprofiler2, and the pipeline will then run both combinations at the same time with default parameters for each method (with the option to change the parameters via the toolsheet or command-line flags).

As for your question, the method option in the contrast file could be a way to match each contrast to the method that should run on it.
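
A minimal sketch of the selection step described above, written in Python rather than Nextflow for brevity. The column names (`pathway_name`, `diff_method`, `func_method`), the inline toolsheet contents, and the `select_pathways` helper are assumptions for illustration, not the pipeline's actual toolsheet or code.

```python
import csv
import io

# Hypothetical toolsheet: one row per tool combination (column names are assumed).
TOOLSHEET_CSV = """pathway_name,diff_method,func_method
deseq2_gsea,deseq2,gsea
limma_gprofiler2,limma,gprofiler2
limma_gsea,limma,gsea
"""


def select_pathways(toolsheet_text, pathway_arg):
    """Return the toolsheet rows named in a comma-separated --pathway value."""
    requested = [name.strip() for name in pathway_arg.split(",") if name.strip()]
    rows = {
        row["pathway_name"]: row
        for row in csv.DictReader(io.StringIO(toolsheet_text))
    }
    missing = [name for name in requested if name not in rows]
    if missing:
        raise ValueError(f"Unknown pathway(s): {', '.join(missing)}")
    # Preserve the order given on the command line.
    return [rows[name] for name in requested]


selected = select_pathways(TOOLSHEET_CSV, "deseq2_gsea,limma_gprofiler2")
for row in selected:
    print(row["diff_method"], "->", row["func_method"])
```

Each selected row would then correspond to one independent run of a differential method followed by a functional analysis method.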

@grst
Member

grst commented Dec 10, 2024

I'm wondering if it wouldn't be more convenient to specify everything in yaml format? Essentially each list item would replace one row in your toolsheet and everything could be specified in one place. YAML seems the more natural choice to me in cases where you have a lot of empty columns in a CSV file otherwise and/or lists of things such as deseq2_gsea,limma_gprofiler2.

I'm also afraid that all the parameters for a differentialabundance run get scattered across too many places... nextflow params, contrasts file, toolsheet file, samplesheet... I'd rather reduce the number of places where to specify parameters.

Something like:

models: 
  - method: limma
    formula: ~ treatment + response
    contrasts:
      - id: treatment_a_vs_b
        type: simple
        comparison: ["treatment", "A", "B"]
    enrichment: 
      - gsea
      - gprofiler2    
  - method: propd
    permutations: 100
    contrasts:
      - id: treatment
        type: anova
        column: treatment
  - compositional: propr
    metric: rho

This obviously needs to be fleshed out in more detail. For this it would be important to understand which of the workflows depend on each other. I guess the compositional workflow is completely separate from the differential workflow. The enrichment workflow could be independent when working on the expression data, but it could also work off a ranked gene list generated by the differential workflow.
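
To illustrate how a nested specification like the YAML above might be turned into individual runs, here is a rough Python sketch. The dictionary stands in for what a YAML parser would return; the keys mirror the example rather than any agreed schema, the `compositional` entry is left out, and `expand_runs` is a hypothetical helper.

```python
# Hypothetical in-memory form of the nested config (keys mirror the example,
# not a real schema).
config = {
    "models": [
        {
            "method": "limma",
            "contrasts": [{"id": "treatment_a_vs_b"}],
            "enrichment": ["gsea", "gprofiler2"],
        },
        {
            "method": "propd",
            "contrasts": [{"id": "treatment"}],
        },
    ]
}


def expand_runs(cfg):
    """Flatten the config into one (method, contrast, enrichment) tuple per run."""
    runs = []
    for model in cfg["models"]:
        # A model without enrichment still yields one differential-only run.
        enrichments = model.get("enrichment") or [None]
        for contrast in model.get("contrasts", []):
            for enrichment in enrichments:
                runs.append((model["method"], contrast["id"], enrichment))
    return runs


for run in expand_runs(config):
    print(run)
```

In this sketch the enrichment step is always nested under a differential method, which corresponds to the case where enrichment works off the differential results; an enrichment run directly on expression data would need its own top-level entry.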

@suzannejin
Author

suzannejin commented Dec 10, 2024

I'm wondering if it wouldn't be more convenient to specify everything in yaml format?

I don't have a strong preference between YAML and CSV format. However, merging contrast with toolsheet into one file could become tricky. This is because, when there are many methods available, it is nice to have a 'default' toolsheet as a place to specify all the possible combinations of tools that really make sense to be together from the theoretical perspective. This file will always be there, in the pipeline github. Whereas the contrast file is data specific.

@grst
Member

grst commented Dec 10, 2024

it is nice to have a 'default' toolsheet as a place to specify all the possible combinations of tools that really make sense to be together from the theoretical perspective

What are the implications of this? Would you fail the pipeline if a user specifies an "invalid" combination?

@suzannejin
Author

suzannejin commented Dec 10, 2024

What are the implications of this? Would you fail the pipeline if a user specifies an "invalid" combination?

We don't have a plan for that yet, but one option is to raise a warning that it is an untested combination.

Indeed, for benchmark users, we considered the possibility of providing an extra toolsheet with all the rows one wants to benchmark.
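
The warning option mentioned above could look roughly like this in Python. The set of tested combinations and the `check_combination` helper are assumptions for illustration; in the pipeline the set would presumably be derived from the default toolsheet.

```python
import warnings

# Assumed set of combinations listed in the default toolsheet (illustrative only).
TESTED_COMBINATIONS = {("deseq2", "gsea"), ("limma", "gprofiler2")}


def check_combination(diff_method, func_method, tested=TESTED_COMBINATIONS):
    """Warn, rather than fail, when a combination is not in the tested set."""
    is_tested = (diff_method, func_method) in tested
    if not is_tested:
        warnings.warn(
            f"{diff_method} + {func_method} is not a tested combination",
            UserWarning,
            stacklevel=2,
        )
    return is_tested


check_combination("deseq2", "gsea")  # in the tested set, no warning
check_combination("limma", "gsea")   # emits a UserWarning, execution continues
```

Warning instead of raising keeps benchmark-style runs possible while still flagging combinations nobody has validated.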

@suzannejin
Author

I'm also afraid that all the parameters for a differentialabundance run get scattered across too many places... nextflow params, contrasts file, toolsheet file, samplesheet... I'd rather reduce the number of places where to specify parameters.

This is also a concern for us... but for the moment we have not found a better solution. It would be nice to brainstorm at some point, and you are very welcome to contribute if you find a better way :)

@grst
Member

grst commented Dec 11, 2024

This file will always be there, in the pipeline github. Whereas the contrast file is data specific.

Just to clarify again, this will only be in the pipeline and the user specifies the combination of tools using standard params, e.g. --pathway deseq2_gsea,limma_gprofiler2? Or will this be an additional input file for the user?

@suzannejin
Author

Just to clarify again, this will only be in the pipeline and the user specifies the combination of tools using standard params, e.g. --pathway deseq2_gsea,limma_gprofiler2? Or will this be an additional input file for the user?

We defined tools = "${projectDir}/assets/tools_samplesheet.csv" in nextflow.config.
In theory, users should not need to provide any additional toolsheet to run the pipeline, but we also don't want to stop them from doing so. Hence, one can still change the tools path to a custom toolsheet at their own risk. Do you think this will be a problem?

@grst
Member

grst commented Dec 11, 2024

No, it's all good then. All I wanted to know is that in a standard pipeline run, the user wouldn't be required to specify yet another config file.

As you said, we should still think about how to reduce the number of places where to specify parameters, but that's a topic for a separate issue.

@suzannejin
Author

Here I created a meta issue with all the steps/sub-issues needed to achieve what we agreed to do.
Let me know what you think and if you would add/modify anything :)

CC @mirpedrol @bjlang @JoseEspinosa @pinin4fjords @WackerO

@mirpedrol
Member

I'm wondering if it wouldn't be more convenient to specify everything in yaml format?

Since the toolsheet will be read with nf-schema, it can accept both CSV and YAML, so users can pick whichever is more convenient for them.

@suzannejin
Author

suzannejin commented Dec 12, 2024

I'm wondering if it wouldn't be more convenient to specify everything in yaml format?

Actually @mirpedrol, if it is in YAML format, does that mean it would be more flexible and would better allow optional methods/params to be defined?

@mirpedrol
Member

I would say they are equivalent if we use simple YAML (without nesting), so it comes down to user preference which one is easier to type.
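
As a small illustration of that equivalence, the following Python sketch renders each row of a CSV toolsheet as one item of a flat YAML list. The column names and the `csv_to_flat_yaml` helper are hypothetical, and the hand-written output only covers simple values that need no YAML quoting.

```python
import csv
import io

# Hypothetical two-row toolsheet; the column names are assumptions.
TOOLSHEET_CSV = """pathway_name,diff_method,func_method
deseq2_gsea,deseq2,gsea
limma_gprofiler2,limma,gprofiler2
"""


def csv_to_flat_yaml(csv_text):
    """Render each CSV row as one item of a flat (non-nested) YAML list."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for i, (key, value) in enumerate(row.items()):
            # The first key of each row opens a new list item.
            prefix = "- " if i == 0 else "  "
            lines.append(f"{prefix}{key}: {value}")
    return "\n".join(lines)


print(csv_to_flat_yaml(TOOLSHEET_CSV))
```

Every column becomes a key and every row becomes a list item, so neither format can express more than the other as long as no nesting is used.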

Projects
Status: ToDO - subworkflows and multi-tool
3 participants