-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added ability to auto-create samplesheet for nf-core/phageannotator #543
base: dev
Are you sure you want to change the base?
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ran out of time before baby came back so couldn't test, but first pass: very clean code thank you very much @CarsonJM !
Couple of things missing (docs atm):
- Missing changelog update
- Missing update to README
- Missing input/output docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly more lucid feedback, sorry didn't think of these before
@nf-core-bot fix linting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-blocking thoughts:
- Q: I wonder if we should put this into an umbrella workflow rather than the if/else statement
- Q: The
test
profile should ideally betest_generatesamplesheet
rather than justtest_samplesheet
(someone might think it's to use an input samplesheet) - Q: Are you definitely sure people would always want to collapse all assemblies per sample (no cases they would want to keep it separate)
Co-authored-by: James A. Fellows Yates <[email protected]>
Regarding your questions:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK overall this looks good to me now - the only thing I don't like is the paths to work
directories for the files - I strongly feel that it should be the relevant subdirectory of params.outdir
+ filename.
Co-authored-by: James A. Fellows Yates <[email protected]>
// Create samplesheet for each sample using meta information | ||
ch_mag_id_samplesheets = ch_mag_metadata.collectFile() { meta -> | ||
[ "${meta.id}_phageannotator_samplesheet.csv", "sample,group,fastq_1,fastq_2,fasta" + '\n' + "${meta.id},${meta.group},${meta.fastq_1},${meta.fastq_2},${meta.fasta}" + '\n' ] | ||
} | ||
|
||
// Merge samplesheet across all samples for the pipeline | ||
ch_mag_id_samplesheets.collectFile(name: "phageannotator_samplesheet.csv", keepHeader:true, skip:1, storeDir:"${params.outdir}/downstream_samplesheets/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jfy133 Just to clarify, are you thinking I should add the storeDir:"${params.outdir}/downstream_samplsheets/"
to the first collectFile()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, more don't bind just fasta
to the string on the line above.
params.outdir + <path I don't remember> + fasta.name
or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the FastQ files, it's possible that the users don't specify to save them to outdir, so should part of the generate_downstream_samplesheet be to copy the files to outdir even if not specified elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or even just validate that if you turn on samplesheet creation, then you must also save the fastqs.
But yes, one way or another - running from explicitly saved fsstqs in outdir is preferable.
If you have cleanup = true
in a config, something many do but the user never sees, they won't have any fastq files in the work dir to point to in the sample sheet 😬. This is harder to control for rather than directing the user to make sure to save the fsstqs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me!
file(meta.fastq_1.toUriString(), checkIfExists: true).copyTo("${params.outdir}/downstream_samplesheets/fastq/${meta.fastq_1.name}") | ||
file(meta.fasta, checkIfExists: true).copyTo("${params.outdir}/downstream_samplesheets/fasta/${meta.fasta.name}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Asking for some advise about this, this is indeed very clean, but given it's outside a process/publishdir I'm a bit unsure. E.g. some people use symlinks from work to results dir rather than copy, so this would violate this.
Secondly this woudl potentially result in two copies of the same read files if one of the --save_*
parameters are given.
It maybe we have to come up with some complicated logic that instead picks the right 'final' directory for the proceessing reads, and it's that that you append to the beginning, and then add.
I think this would come with two steps:
-
Make sure one of
--save_clipped_reads
,--save_hostremoved_reads
,--save_phixremoved_reads
,--save_bbnorm_reads
are selected ifparams.generate_samplesheet
is trueSo some input validation code like:
if ( params.generate_samplesheet && ![params.saved_clipped, save.hostremoved <...>].any() ) { Nextflow.error('[nf-core/mag] ERROR: must at least save one XYX if --generate_samplesheet <...>') }
-
Then have to have some complicated logic to select which directory gets appended to the beginning of the file name (probably best in a case when but w/e), e.g.
if (params.save_clipped_reads && !params.save_phix && !params.save_host_removed && !params._save_bbnorm) {
samplehseet_reads_dir = /subdir/clipedreads/saved/in
} else if ( params.save_phix && !params.save_host_removed && !params._save_bbnorm){
samplesheet_reads_dir = /subdir/phixreads/saved/in
}
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally see what you're saying and was curious what thoughts would be about doing this outside a process. I am more than happy change it to what you're suggesting!
The reason I was giving this approach a try was I was hoping to find a generalizable approach that would work across pipelines/updates even if the save input logic you're talking about gets modified (also, I selfishly wanted to find a quicker approach haha) Also, I was hoping to force a copy because in the past I've symlinked to publishDir thinking I copied and had my reads disappear because my workDir is in a scratch directory 😅.
Still, I definitely have though about what you're saying and will make that change if you think it's best!
PR checklist
nf-core lint
).nextflow run . -profile test,docker --outdir <OUTDIR>
).docs/usage.md
is updated.docs/output.md
is updated.CHANGELOG.md
is updated.README.md
is updated (including new tool citations and authors/contributors).