Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor ext config as params #308

Closed
wants to merge 2 commits into from
Closed

Refactor ext config as params #308

wants to merge 2 commits into from

Conversation

bentsherman
Copy link

See nextflow-io/nextflow#4510 (comment)

tl;dr -- instead of implementing module config for the sole purpose of using ext as a shortcut to inject process inputs, let's just make them process inputs and control them with pipeline params as needed.

An interesting case with fetchngs is that the default args for SRATOOLS_FASTERQDUMP are overwritten by the subworkflow that uses it. For now I just left the default args in a comment, but it could also be hardcoded in the script body as a fallback value. Long term, it could be a default value with the declared input (as I have commented), likely as part of nextflow-io/nextflow#4553 .

Between this refactor and replacing publishDir with the workflow publish DSL, we wouldn't need the module config at all, at least for the use cases we have gathered so far.

Signed-off-by: Ben Sherman <[email protected]>
Copy link

github-actions bot commented Apr 24, 2024

nf-core lint overall result: Failed ❌

Posted for pipeline commit c67f91a

+| ✅ 154 tests passed       |+
#| ❔   5 tests were ignored |#
!| ❗   4 tests had warnings |!
-| ❌   4 tests failed       |-

❌ Test failures:

  • schema_params - Param aspera_cli_args from nextflow config not found in nextflow_schema.json
  • schema_params - Param sra_fastq_ftp_args from nextflow config not found in nextflow_schema.json
  • schema_params - Param sratools_fasterqdump_args from nextflow config not found in nextflow_schema.json
  • schema_params - Param sratools_pigz_args from nextflow config not found in nextflow_schema.json

❗ Test warnings:

  • files_exist - File not found: assets/multiqc_config.yml
  • files_exist - File not found: conf/igenomes.config
  • files_exist - File not found: .github/workflows/awstest.yml
  • files_exist - File not found: .github/workflows/awsfulltest.yml

❔ Tests ignored:

  • files_exist - File is ignored: conf/modules.config
  • files_unchanged - File ignored due to lint config: assets/sendmail_template.txt
  • actions_ci - actions_ci
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/fetchngs/fetchngs/.github/workflows/awstest.yml
  • multiqc_config - 'assets/multiqc_config.yml' not found

✅ Tests passed:

Run details

  • nf-core/tools version 2.13.1
  • Run at 2024-05-01 15:59:39

SRATOOLS_PREFETCH.out.sra,
ch_ncbi_settings,
ch_dbgap_key,
params.sratools_fasterqdump_args ?: '',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should always be defined, as it's in nextflow.config, so no need for an elvis operator, no?

Or is this being defensive in case of this being a shared subworkflow and pipeline authors forgetting to add the param into the Nextflow config file?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, the fallback isn't really needed here. I think my intent was that the process should define a default value for this input e.g. val args, defaultValue: '' (or String args = '' once we have static types 😄 ). But that isn't supported yet so I put it here instead

@nvnieuwk
Copy link

I personally do agree that the ext.args method could be improved (especially concerning defaults stated by the pipeline developer etc). But I'm not sure this solution is the best since you lose a lot of flexibility (especially the ability to have sample specific arguments). Unless you know of an easy way on how to do it this using your method?

@mahesh-panchal
Copy link
Member

mahesh-panchal commented Apr 25, 2024

As Nicolas said, one of the reasons to use this was because of being able to use the meta.key inside the args, and closures are not supported on the command line or -params-file. All the stuff that goes in using the ext directive (args, prefix, etc) is mostly because of being able to use closures. Otherwise we're forcing to provide parameters using -c and there the config might not resolve correctly because of nextflow-io/nextflow#2662

I did initially suggest also supplying args as params too to make it less confusing to the user, but it was left out. nf-core/rnaseq#701, but again it didn't support closures from the command line.

Edit: Slack chat also suggests appending the meta map controlled stuff to the args, but then that means multiMaping channel inputs.

@FriederikeHanssen
Copy link

This feels like circling back to the very attempt at nf-core/modules. Another issue was that the passing around of all modules_args through several places in the pipeline was a bit painful: https://github.com/nf-core/sarek/blob/8ae205c171fb53132b2205b231a779347341a9f0/main.nf

@mahesh-panchal
Copy link
Member

nf-core subworkflows are also written such that all inputs come in through the channels (i.e. take:), including any values like true/false, etc.
e.g. https://github.com/nf-core/modules/blob/master/subworkflows/nf-core/vcf_annotate_ensemblvep_snpeff/main.nf
This could potentially bloat subworkflow inputs a lot.

@bentsherman
Copy link
Author

bentsherman commented Apr 25, 2024

Thank you all for the historical context I was missing. It's interesting how similar it is to the old DSL1 approach. My only difference would be to treat each args setting as its own param instead of bundling them in a map.

one of the reasons to use this was because of being able to use the meta.key inside the args

I would handle dynamic args like this:

        ASPERA_CLI (
            ch_sra_reads.aspera,
            'era-fasp',
            ch_sra_reads.aspera.map { meta, fastq -> "${meta.key}" }
        )

You could customize it further with params, e.g. to toggle between static and dynamic args. I agree that setting a param to a closure is an anti-pattern, so I'm not claiming it would be as powerful as the ext config. But I would also ask how much flexibility do you actually need? Are you just adding a lot of flexibility because you can, or are you addressing real use cases that can't be solved in other ways?

This could potentially bloat subworkflow inputs a lot

I think it's better to be transparent about the complexity of the workflow, rather than try to hide it with shortcuts like ext. It forces you to confront the true complexity of your pipeline and consider more broadly whether the complexity is justified and how else you might address it in the pipeline code.

Admittedly, this sort of dynamic ext probably shouldn't have been allowed in the first place, and I can understand how it might be hard to like any other solution when ext config already exists. But at the end of the day, these ext args are inputs, not config. I know this because you (Mahesh) requested that ext settings invalidate the cache: nextflow-io/nextflow#2382 . Process directives normally don't affect the cache because they define how a task is executed rather than what is executed and produced.

I am also thinking about this in the context of all the other language improvements we are working on -- formal script and config grammar, record types, incorporating the params schema, workflow publish definition, etc. All of these efforts work together to make the language much simpler and make many bad patterns unnecessary. I suspect that the use of ext config is driven in part by a desire to avoid the pain and messiness of writing pipeline code... so if the pipeline code is made simpler, maybe the transition to params / inputs doesn't seem so bad.

@mahesh-panchal
Copy link
Member

I just want to say that supplying process args through params is something I think we should be doing. I'm just wary of making the pipelines more difficult to read, as that's a big factor in getting new potential contributors.

That tool specific parameters can come through ext.args is a barrier to new users, and new developers (at least to the people I talk to).

Currently setting ext.args through profiles like test requires "tricks" such as using the full process name so it has higher priority than the modules.config which is loaded after the profiles ( so params get resolved correctly ), or adding extra params like extra_star_align_args and supplying that to the ext.args ( so we end up making params anyway ).

Any workflow specific optional options (say, tool flags necessary for RNAseq data, but are not mandatory for the nf-core module ) can still be mistakenly changed by the user. Passing them as a process input would help mitigate this, since the workflow developer can hard code these extra options into the workflow. As you show, it's fairly clear using a map. I was thinking this might be more cumbersome and would lead to more multiMaps everywhere which makes the code harder to read.

But I would also ask how much flexibility do you actually need? Are you just adding a lot of flexibility because you can, or are you addressing real use cases that can't be solved in other ways?

At the time, I was just looking for ways to cut down the code, because it was very hard for me to follow what was going where. I'm sure we must have discussed passing these args as an input val too, but I can't find anything in the archives. Passing as a process input still provides the same flexibility as ext.args since we're referencing those variables anyway and the majority comes from the meta map (I was mistaken it being an issue). An issue is then ext.prefix and the other ext options that we use. There's no reason this can be addressed in the same way or referenced from the meta map in the module. I think the majority of prefixes are set by the developer and kept that way. There's little (no?) need for the user to change it. ( The only side note is ext.when but that was for stuff in ext.args as it wasn't visible in the workflow scope to use if or branch which makes it unnecessary if passed as a process input - although it's other use is running parts of a pipeline, but that can be overcome with more params ).

Sorry for the rambling. So in summary:
Some of these solutions came about because they didn't add code to the main workflow. It was already difficult enough to track inputs, and adding them into the process and subworkflow calls was adding that tracking difficulty back in and adding more code again. Transitioning to params / inputs never sounded like a bad thing to do. I think my main worry is the passing through all the subworkflows, and making input tracking harder to follow again ( which may not be the case ). But anything you can do to make writing a pipeline simpler is always welcome.

@bentsherman
Copy link
Author

It's a fair point about the readability of the pipeline code. I think that proper IDE tooling and better static type support will help a lot here, so many of these changes I'm pushing for are predicated on the overall developer experience being much better. I also have some ideas to simplify the channel logic, potentially by a lot, but that is longer term.

I'm working on a "visionary" PR for fetchngs which combines a bunch of different language improvements that are in development so that I can get a clearer picture of everything, and maybe also show you guys where I'm trying to go.

As for this PR, it's something you can implement in your pipelines today, if you want. It's totally up to you whether you'd rather have this complexity in the pipeline code / params or in the config. But I will say, while I don't think we will remove anything anytime soon, we aren't going to make the config approach any easier. The ext directive, config selectors, those things aren't gonna get any prettier, but the pipeline code will.

@mahesh-panchal
Copy link
Member

Sorry more ramblings related to this.

So one reason why

process {
    withName: 'BOWTIE_ALIGN' {
        ext.args = "--some-flag"
    }
}

is so nice to use is that I can see exactly where this input is applied.

Using params.extra_star_align_args, it's not clear where this is going unless I search through the code.

Concrete issue: nf-core/rnaseq#1111
Here, independently, two users wanted to add the same flag to input through --extra_salmon_quant_args, but weren't getting desired results.

I'm not familiar with the current state of nf-core/rnaseq, but as a contributor, I had an easy time to trace why the flag didn't work (and could provide a low effort solution). In the parameters as process inputs solution this may have been hard-coded as a string input to the process. This means the user has no way to change this unless they make an issue, make a PR, change the code themselves, or just leave it. I'm not sure how often this occurs, but more generally, many users also prefer to be able to find what they need through the documentation (sidenote: it would be monumental for a developer to code for every param possibility too). I see colleagues that will get as far as they can with a workflow, and if they can't achieve something, they'll run the tool outside the workflow with the flags they want. They won't make an issue, PR, or interact with developers to get the feature they need added because it's a time cost.

Another thing is parameter naming. The process selectors are nice because the code enforces a 1-to-1 relationship. It would be nice if we could achieve something similar through a schema or some config feature that could achieve a similar kind of behaviour. I would like to be able to see where my param is likely going based on it's name, and there be some mechanism to warn if there isn't a one to one correspondence like in the case of the --extra_salmon_quant_args issue above.

Sorry, there's a few things to untangle here.

@bentsherman
Copy link
Author

Process selectors are a convenient syntax sugar and a quick fix for this particular use case, but it's also very brittle -- the selector name is easy to get wrong, the priority resolution is easy to get wrong, dynamic behavior is very limited because you aren't in the pipeline code yet. I wonder whether the quick fix that it provides is offset by poor developer experience that it creates over the long term.

But I agree that the traceability is nice, and I would support making it easier to trace params through the pipeline code. I would like to do exactly this with the workflow DAG -- show each param and how it connects to processes, without necessarily showing all of the glue logic it has to go through. With the IDE tooling, I could see both (1) side-by-side code with DAG preview that updates as you type and (2) hover over a param to see the processes that it connects to.

Right now you have to run the pipeline (even if in preview mode) just to render the DAG. But if we can produce the DAG just from parsing the script, we can get better information (e.g. about the params) and do it in real-time (i.e. live preview).

@bentsherman bentsherman mentioned this pull request Apr 29, 2024
6 tasks
@mahesh-panchal
Copy link
Member

mahesh-panchal commented Apr 29, 2024

Can you document somewhere that

        ASPERA_CLI (
            ch_sra_reads.aspera,
            'era-fasp',
            ch_sra_reads.aspera.map { meta, fastq -> "${meta.key}" }
        )

doesn't unsync the inputs?

offset by poor developer experience that it creates over the long term

I'm not sure what you mean exactly by this, but you've also got to remember that this small change creates a lot of maintenance burden (also part of developer experience imo) on the contributors here. Short term, the core team has to organise the change across the modules, which includes documentation, tooling, and trying to think what might break. Then manage all the PR's to the modules and subworkflows as well as finally rolling it out in all the pipelines. Then there's the long term maintenance of dealing with anything that is missing from the conversion. It took several months and a lot of work from everyone to roll out what I proposed back then.

The possibility for passing through ext.args has to be removed ( DSL3? ) if you want to force the switch or more people really need to speak up in favour of passing through params given the work it entails. Thanks for taking the time to make the improvements. I like the ideas that are coming forward and I'm looking forward to seeing the DSL3 info.

@bentsherman
Copy link
Author

Here's my proof-of-concept for DSL2+ / DSL3 with fetchngs: #309

Code changes like this always take a long time, but I'm in no hurry. We aren't going to drop support for ext or process selectors any time soon. If nothing else, DSL3 can be a good point at which to change a bunch of stuff at once, and I don't expect it will be as radical as DSL2, more like a refinement.

@mahesh-panchal
Copy link
Member

Pieter Moris just (re-)discovered an interesting problem of passing command-line options on the command-line:
https://nfcore.slack.com/archives/CE6SDBX2A/p1715779745188559?thread_ts=1715768359.012549&cid=CE6SDBX2A

--fastp_extra_args '--trim_poly_x' # interpreted as --trim_poly_x: true at the nextflow level
--fastp_extra_args '--trim_poly_x -y' # both args are interpreted by fastp only

It's solved by using a params-file or config file, but there are a lot who will pass using the command-line.

@bentsherman
Copy link
Author

This is because the Nextflow CLI inserts true after the first param? That is tricky. Even the CLI v2 I expect will have this problem, though it should be avoidable by using =:

--fastp_extra_args='--trim_poly_x'

But I think this will be fully solved when we incorporate the parameter schema into core Nextflow. Then the CLI will know that this param is a string and take the following string argument instead of treating it as a boolean.

@bentsherman
Copy link
Author

Folding into #312

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants