Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added samshee #6749

Merged
merged 16 commits into from
Oct 16, 2024
Merged

Added samshee #6749

merged 16 commits into from
Oct 16, 2024

Conversation

nschcolnicov
Copy link
Contributor

Addressing nf-core/demultiplex#268

PR checklist

Closes #XXX

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

@nschcolnicov nschcolnicov marked this pull request as ready for review October 7, 2024 18:12

input:
tuple val(meta), path(samplesheet)
val(validator_schema) //optional
Copy link
Member

@grst grst Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's tricky to fully support samshee's range of inputs here.
Samshee can accept either

  • a schema name e.g. urn:samshee:illuminav2/v1
  • a json string, e.g. {"required": ["BCLConvert_Data"]}
  • a URI pointing to a local file, e.g. {"$ref": "file:./test.schema.json"}
  • a URI pointing to a remote file, e.g. '{"$ref": "https://dataportal.lit.eu/schemas/litngscoresamplesheet/v0.1/litngscoresamplesheet.schema.json"}'

plus it can accept multiple schemas. Especially when referencing a local file, value channels are not ideal, as when working with a cloud-based solution, the file needs to be provisioned by nextflow. And even when it's a remote URL, it might be preferable to have the download handled by nextflow.

Maybe it could work to have two input channels, one accepting value, and one accepting files, such that a call to the samshee module looks like

SAMSHEE(
    [['id': 'samplesheet01'], path("samplesheet.csv")],
    ['{"required": ["Data"]}', '{"$ref": "urn:samshee:illuminav2/v1"}'],
    [path("file:./test.schema.json"), path('https://dataportal.lit.eu/schemas/litngscoresamplesheet/v0.1/litngscoresamplesheet.schema.json')]
)

WDYT?

)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @grst I think this makes sense, I'll work on it

@grst
Copy link
Member

grst commented Oct 8, 2024

Here's a testcase. Note that this is a V1 samplesheet, see also my comment here: nf-core/demultiplex#268 (comment)

[Header],,,,,,,,
IEMFileVersion,4,,,,,,,
Experiment Name,241003_RNASeq_FFPE,,,,,,,
Date,03.10.2024,,,,,,,
Workflow,GenerateFASTQ,,,,,,,
Application,FASTQ Only,,,,,,,
Assay,TruSeq HT,,,,,,,
Chemistry,Amplicon,,,,,,,
,,,,,,,,
[Reads],,,,,,,,
Read1Cycles,102,,,,,,,
Read2Cycles,102,,,,,,,
Index1Cycles,8,,,,,,,
Index2Cycles,8,,,,,,,
,,,,,,,,
[Settings],,,,,,,,
,,,,,,,,
[Data],,,,,,,,
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project
101_xxxx,,,A1,S762,TTACCGAC,S512,CGTATTCG,
102_xxxx,,,B1,S713,TCGTCTGA,S586,TCAAGGAC,
103_xxxx,,,C1,S736,TTCCAGGT,S543,AAGCACTG,
104_xxxx,,,D1,S709,TACGGTCT,S575,GCAATGGA,

modules/nf-core/samshee/environment.yml Outdated Show resolved Hide resolved
modules/nf-core/samshee/main.nf Outdated Show resolved Hide resolved
modules/nf-core/samshee/tests/main.nf.test Outdated Show resolved Hide resolved
modules/nf-core/samshee/tests/nextflow.config Outdated Show resolved Hide resolved
modules/nf-core/samshee/tests/main.nf.test Outdated Show resolved Hide resolved
@nschcolnicov
Copy link
Contributor Author

nschcolnicov commented Oct 10, 2024

@grst Some updates on this:
Using --output-format sectioned on the samplesheet doesn't work on all v1 samplesheets. For example, it didn't work on the one you provided or the one from the test profile. However, it did work on the one from the test_full profile: https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/testdata/miseq_35147139/miseq_35147139_samplesheet.csv

Regarding having a different channel for local and remote json files, it shouldn't be needed since, as long as we stage the remote file, it will always be local

@nschcolnicov
Copy link
Contributor Author

nschcolnicov commented Oct 10, 2024

@grst I added the additional features you requested, let me know what you think! @atrigila Also implemented your suggestions, I'm waiting on the bioconda package https://github.com/bioconda/bioconda-recipes/pull/51315/checks?check_run_id=31375653510

Comment on lines 24 to 25
def arg_json_schema_validator = json_schema_validator ? "--schema '${json_schema_validator}'" : ""
def arg_name_schema_validator = name_schema_validator ? "--schema '${name_schema_validator}'" : ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could either merge these two into one channel, then the user would need to specify '{"$ref": "urn:samshee:illuminav2/v1"} or I'd ajust the second line as follows:

Suggested change
def arg_json_schema_validator = json_schema_validator ? "--schema '${json_schema_validator}'" : ""
def arg_name_schema_validator = name_schema_validator ? "--schema '${name_schema_validator}'" : ""
def arg_json_schema_validator = json_schema_validator ? "--schema '${json_schema_validator}'" : ""
def arg_name_schema_validator = name_schema_validator ? "--schema '{\"\$ref\": \"${name_schema_validator}\"'" : ""

I have a slight preference for the single channel version as it keeps the module simpler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with single channel version. I would suggest (untested):

def arg_schema = json_schema_validator ? "--schema '${json_schema_validator}'" :  (task.ext.args?.contains("--schema") ? "" : "")

Copy link
Contributor Author

@nschcolnicov nschcolnicov Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grst I also thought of having the user just specify just the name, but I thought it could add a layer of confusion. I did it this way so that they can just use exactly the same strings that are displayed in the samshee readme. Doing it this way we would need to add an explanation saying that they need to extract the string from the readme that comes after the "$ref:", do you think it would be easier for them to understand?

I went with the approach of not having the channels combined into a tuple to make it easier to avoid issues regarding optional inputs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough, let's go with a single channel for schemas specified as string then (+ the second one for paths).

def arg_json_schema_validator = json_schema_validator ? "--schema '${json_schema_validator}'" : ""
def arg_name_schema_validator = name_schema_validator ? "--schema '${name_schema_validator}'" : ""
def arg_file_schema_validator = file_schema_validator ? "--schema '{\"\$ref\": \"file:${file_schema_validator}\"}'" : ""
def arg_v1_schema = params.v1_schema ? "--output-format sectioned" : ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think a module should ever (?) access a global param. So this should either be task.ext.v1_schema, or just rely on the user to provide the "sectioned" flag via task.ext.args when necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I would supply it via task.ext.args

modules/nf-core/samshee/main.nf Show resolved Hide resolved
Comment on lines 6 to 8
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'docker://community.wave.seqera.io/library/pip_samshee:733e11f3377fc2e3' :
'community.wave.seqera.io/library/pip_samshee:733e11f3377fc2e3' }"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a comment with the latest guidelines on incorporating seqera containers to modules:

https://nfcore.slack.com/archives/CJRH30T6V/p1728901074405529

Check the container settings for singularity (steps 8-11), I think they require https.

Comment on lines 12 to 13
val(json_schema_validator) // optional
val(name_schema_validator) // optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All non-mandatory command-line tool non-file arguments MUST be provided as a string via the $task.ext.args variable. (https://nf-co.re/docs/guidelines/components/modules#optional-command-arguments)

Comment on lines 24 to 25
def arg_json_schema_validator = json_schema_validator ? "--schema '${json_schema_validator}'" : ""
def arg_name_schema_validator = name_schema_validator ? "--schema '${name_schema_validator}'" : ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with single channel version. I would suggest (untested):

def arg_schema = json_schema_validator ? "--schema '${json_schema_validator}'" :  (task.ext.args?.contains("--schema") ? "" : "")

def arg_json_schema_validator = json_schema_validator ? "--schema '${json_schema_validator}'" : ""
def arg_name_schema_validator = name_schema_validator ? "--schema '${name_schema_validator}'" : ""
def arg_file_schema_validator = file_schema_validator ? "--schema '{\"\$ref\": \"file:${file_schema_validator}\"}'" : ""
def arg_v1_schema = params.v1_schema ? "--output-format sectioned" : ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I would supply it via task.ext.args

Comment on lines 32 to 34
$arg_json_schema_validator \
$arg_name_schema_validator \
$arg_file_schema_validator \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace for a single arg_schema that will be populated depending on whether there is an input json file OR the ext.args contain --schema

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I follow, so users can use many of arg_json_schema_validator and arg_name_schema_validator and arg_file_schema_validator. Aren't they mutually exclusive?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't they mutually exclusive?

No, they are not. The user could for instance specify the '{"$ref": "urn:samshee:illuminav2/v1"} schema, and additionally validate against a custom schema provided as json file.

modules/nf-core/samshee/main.nf Outdated Show resolved Hide resolved
modules/nf-core/samshee/main.nf Outdated Show resolved Hide resolved
@nschcolnicov
Copy link
Contributor Author

nschcolnicov commented Oct 15, 2024

@grst @atrigila We will be passing all string values through the ext.args, so there is no longer a need to combine the channels. I also removed the custom ext.args since this is also not good practices, and we can organize all of these parameters through the modules.config.
Linting for the module fails due to the already reported issue about having pip in the environment.yml

Note:
I already started a PR on demultiplex, so I could test these changes, you can look here how this module would fit into the pipeline, as well as how we can set it up: https://github.com/nf-core/demultiplex/pull/275/files#diff-dfadb23810ec7ce446f121e68f79a112e361d6a833cce1701224a89295c7b8d6
I still need to merge this PR first and then install it into demultiplex using nf-core tooling, but just so that you get a peak.

Copy link
Member

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think file schemas are currently not respected, see my code suggestion.
LGTM otherwise!

modules/nf-core/samshee/main.nf Show resolved Hide resolved
Copy link
Member

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@atrigila atrigila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job! Thanks for addressing our points :)

Comment on lines 7 to 8
- pip: # FIXME https://github.com/nf-core/modules/issues/5814
- samshee==0.2.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ask in slack what we can do with this as we will need a core team member to approve this PR

description: "illumina v2 samplesheet"
pattern: "*.{csv}"
- - file_schema_validator:
type: string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't type: file here as well?

e.g. [ id:'test', lane:1 ]
- "*_formatted.csv":
type: file
description: "illumina v2 samplesheet"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No matter the input (v1 and v2) the output is always going to be a v2 samplesheet?

Comment on lines +14 to +18
params {
v1_schema = true
json_schema_validator = '{"required": ["Data"]}'
name_schema_validator = null
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I normally pass these via ext.args and using different configs, but this works anyway 👍 .

e.g. [ id:'test', lane:1 ]
- samplesheet:
type: file
description: "illumina v2 samplesheet"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only v2 samplesheet? or could it be v1 as well?

@nschcolnicov nschcolnicov added this pull request to the merge queue Oct 16, 2024
Merged via the queue into nf-core:master with commit 3c464e7 Oct 16, 2024
12 checks passed
@nschcolnicov nschcolnicov deleted the update_samshee branch October 16, 2024 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants