Merge pull request #236 from atrigila/add_samplesheet_generation
apeltzer authored Aug 10, 2024
2 parents e3d6e94 + 16e6b1a commit 8353819
Showing 19 changed files with 224 additions and 40 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -18,6 +18,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#220](https://github.com/nf-core/demultiplex/pull/220) Added kraken2.
- [#221](https://github.com/nf-core/demultiplex/pull/221) Added checkqc_config to pipeline schema.
- [#225](https://github.com/nf-core/demultiplex/pull/225) Added test profile for multi-lane samples, updated handling of such samples and adapter trimming.
- [#236](https://github.com/nf-core/demultiplex/pull/236) Add samplesheet generation.

### `Changed`

11 changes: 11 additions & 0 deletions docs/output.md
@@ -21,6 +21,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [Falco](#falco) - Raw read QC
- [md5sum](#md5sum) - Creates an MD5 (128-bit) checksum of every fastq.
- [kraken2](#kraken2) - Kraken2 is a taxonomic sequence classifier that assigns taxonomic labels to sequence reads.
- [samplesheet](#samplesheet) - Samplesheet generation for downstream nf-core pipelines.
- [MultiQC](#multiqc) - Aggregate report describing results of the whole pipeline

### bcl-convert
@@ -204,6 +205,16 @@ Creates an MD5 (128-bit) checksum of every fastq.

[Kraken2](https://ccb.jhu.edu/software/kraken2/) is a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences. Kraken2 examines the k-mers within a query sequence and uses the information within those k-mers to query a database. That database maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain a given k-mer.

### Downstream pipeline samplesheet

<details markdown="1">
<summary>Output files</summary>

- `<outputdir>/samplesheet/`
  - `*.csv`: Samplesheet listing the generated FASTQ files, formatted for the selected downstream nf-core pipeline (default: `rnaseq` format).

</details>
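For illustration, with the default `rnaseq` format a single-end sample produces a one-row samplesheet along these lines (sample and file names here are invented for the example; the exact columns come from the `FASTQ_TO_SAMPLESHEET` module):

```csv title="Sample1.samplesheet.csv"
"sample","fastq_1","strandedness"
"Sample1","Sample1_S1_L001_R1_001.fastq.gz","auto"
```

Paired-end samples gain a `fastq_2` column, and the `atacseq` and `taxprofiler` formats replace `strandedness` with `replicate` and `fasta` columns respectively.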

### Adapter sequence removal from samplesheet

<details markdown="1">
41 changes: 29 additions & 12 deletions docs/usage.md
@@ -6,17 +6,23 @@
## Introduction

> [!IMPORTANT]
> It is relevant to distinguish between the _pipeline_ samplesheet and the _flowcell_ samplesheet before working with this pipeline.
>
> - The **_pipeline_ samplesheet** is a file provided as input to the nf-core pipeline itself. It contains the overall configuration for your run, specifying the paths to individual _flowcell_ samplesheets, flowcell directories, and other metadata required to manage multiple sequencing runs. This is the primary configuration file that directs the pipeline on how to process your data.
> - The **_flowcell_ samplesheet** is specific to a particular sequencing run. It is typically created by the sequencing facility and contains the sample information, including barcodes, lane numbers, and indexes. The typical name is `SampleSheet.csv`. Each demultiplexer may require a different format for this file, which must be adhered to for proper data processing.
## Pipeline samplesheet input

You will need to create a _pipeline_ samplesheet with information about the samples you would like to analyse before running the pipeline. Use the `--input` parameter to specify its location. It must be a comma-separated file with a header row and at least four columns: `id`, `samplesheet`, `lane`, `flowcell`, as shown in the examples below.

When using the demultiplexer fqtk, the _pipeline_ samplesheet must contain an additional column `per_flowcell_manifest`. The file referenced in `per_flowcell_manifest` must contain two header columns, `fastq` and `read_structure`. As shown in the [example](https://github.com/fulcrumgenomics/nf-core-test-datasets/blob/fqtk/testdata/sim-data/per_flowcell_manifest.csv) provided, each row must contain one fastq file name and the corresponding read structure.

```bash
--input '[path to pipeline samplesheet file]'
```

#### Example: Pipeline samplesheet

```csv title="samplesheet.csv"
id,samplesheet,lane,flowcell
DDMMYY_SERIAL_NUMBER_FC3,/path/to/SampleSheet3.csv,3,/path/to/sequencer/output3
```
| Column | Description |
| ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| `id` | Flowcell id |
| `samplesheet` | Full path to the _flowcell_ `SampleSheet.csv` file containing the sample information and indexes |
| `lane` | Optional lane number. When a lane number is provided, only the given lane will be demultiplexed |
| `flowcell` | Full path to the Illumina sequencer output directory (often referred to as the run directory) or a `tar.gz` file containing the contents of said directory |

An [example _pipeline_ samplesheet](https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv) has been provided with the pipeline.

Note that the run directory in the `flowcell` column must point to a `tar.gz` file for compatibility with the demultiplexers sgdemux and fqtk.


#### Example: Pipeline samplesheet for fqtk

```csv title="samplesheet.csv"
id,samplesheet,lane,flowcell,per_flowcell_manifest
DDMMYY_SERIAL_NUMBER_FC3,/path/to/SampleSheet3.csv,3,/path/to/sequencer/output3,
```
| Column | Description |
| ----------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| `id` | Flowcell id |
| `samplesheet` | Full path to the _flowcell_ `SampleSheet.csv` file containing the sample information and indexes |
| `lane` | Optional lane number. When a lane number is provided, only the given lane will be demultiplexed |
| `flowcell` | Full path to the Illumina sequencer output directory (often referred to as the run directory) or a `tar.gz` file containing the contents of said directory |
| `per_flowcell_manifest` | Full path to the flowcell manifest, containing the fastq file names and read structures |

### Flowcell samplesheet

Each demultiplexing software uses a distinct _flowcell_ samplesheet format. Below are examples for demultiplexer-specific _flowcell_ samplesheets. Please see the following examples to format the _flowcell_ `SampleSheet.csv`:

| Demultiplexer | Example _flowcell_ `SampleSheet.csv` Format |
| ---------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **sgdemux** | [sgdemux SampleSheet.csv](https://github.com/nf-core/test-datasets/blob/demultiplex/testdata/sim-data/out.sample_meta.csv) |
| **fqtk** | [fqtk SampleSheet.csv](https://github.com/fulcrumgenomics/nf-core-test-datasets/raw/fqtk/testdata/sim-data/fqtk_samplesheet.csv) |
| **bcl2fastq and bclconvert** | [bcl2fastq and bclconvert SampleSheet.csv](https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/b2fq-samplesheet.csv) |
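As a rough sketch only (the linked files above are authoritative, and the exact sections and columns vary by demultiplexer and instrument), a minimal bcl2fastq-style _flowcell_ `SampleSheet.csv` looks something like:

```csv title="SampleSheet.csv"
[Data]
Sample_ID,Sample_Name,index
Sample1,Sample1,ATTACTCG
Sample2,Sample2,TCCGGAGA
```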

## Running the pipeline

The typical command for running the pipeline is as follows:

```bash
nextflow run nf-core/demultiplex \
--input pipeline_samplesheet.csv \
--outdir results \
-profile docker
```

This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.
45 changes: 45 additions & 0 deletions modules/local/fastq_to_samplesheet/main.nf
@@ -0,0 +1,45 @@
process FASTQ_TO_SAMPLESHEET {
tag "$meta.id"

executor 'local'
memory 100.MB

input:
val meta
val pipeline
val strandedness

output:
tuple val(meta), path("*samplesheet.csv"), emit: samplesheet

exec:

// Add relevant fields to the map
def pipeline_map = [
sample : meta.samplename,
fastq_1 : meta.fastq_1
]

// Add fastq_2 if it's a paired-end sample
if (!meta.single_end) {
pipeline_map.fastq_2 = meta.fastq_2
}

// Add pipeline-specific entries
if (pipeline == 'rnaseq') {
pipeline_map << [ strandedness: strandedness ]
} else if (pipeline == 'atacseq') {
pipeline_map << [ replicate: 1 ]
} else if (pipeline == 'taxprofiler') {
pipeline_map << [ fasta: '' ]
}

// Create the samplesheet content
def samplesheet = pipeline_map.keySet().collect { '"' + it + '"' }.join(",") + '\n'
samplesheet += pipeline_map.values().collect { '"' + it + '"' }.join(",")

// Write samplesheet to file
def samplesheet_file = task.workDir.resolve("${meta.id}.samplesheet.csv")
samplesheet_file.text = samplesheet

}
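The `exec:` body above is plain Groovy. As an illustration only (this Python sketch is not part of the pipeline), the row-building and quoting logic behaves like this:

```python
def fastq_to_samplesheet(meta, pipeline, strandedness):
    # Core columns shared by all supported downstream pipelines
    row = {"sample": meta["samplename"], "fastq_1": meta["fastq_1"]}
    # fastq_2 is only emitted for paired-end samples
    if not meta.get("single_end", False):
        row["fastq_2"] = meta["fastq_2"]
    # Pipeline-specific extra columns
    if pipeline == "rnaseq":
        row["strandedness"] = strandedness
    elif pipeline == "atacseq":
        row["replicate"] = 1
    elif pipeline == "taxprofiler":
        row["fasta"] = ""
    # Quote every header and value, matching the Groovy string building
    header = ",".join(f'"{k}"' for k in row)
    values = ",".join(f'"{v}"' for v in row.values())
    return header + "\n" + values

meta = {"samplename": "Sample1", "single_end": True,
        "fastq_1": "Sample1_S1_L001_R1_001.fastq.gz"}
print(fastq_to_samplesheet(meta, "rnaseq", "auto"))
# prints:
# "sample","fastq_1","strandedness"
# "Sample1","Sample1_S1_L001_R1_001.fastq.gz","auto"
```

Note that the module writes this content to `${meta.id}.samplesheet.csv` in the task work directory rather than returning it.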
30 changes: 30 additions & 0 deletions modules/local/fastq_to_samplesheet/tests/main.nf.test
@@ -0,0 +1,30 @@
nextflow_process {

name "Test Process FASTQ_TO_SAMPLESHEET"
script "../main.nf"
process "FASTQ_TO_SAMPLESHEET"

tag "modules"
tag "modules_local"
tag "fastq_to_samplesheet"

test("Should run without failures") {

when {
process {
"""
input[0] = Channel.of([[id:'Sample1_S1_L001', samplename:'Sample1', fcid:'220422_M11111_0222_000000000-K9H97', lane:'1', empty:false, single_end:true, fastq_1:'Sample1_S1_L001_R1_001.fastq.gz']])
input[1] = 'rnaseq'
input[2] = 'auto'
"""
}
}

then {
assertAll(
{ assert process.success },
{ assert snapshot(process.out).match() }
)
}
}
}
45 changes: 45 additions & 0 deletions modules/local/fastq_to_samplesheet/tests/main.nf.test.snap
@@ -0,0 +1,45 @@
{
"Should run without failures": {
"content": [
{
"0": [
[
[
{
"id": "Sample1_S1_L001",
"samplename": "Sample1",
"fcid": "220422_M11111_0222_000000000-K9H97",
"lane": "1",
"empty": false,
"single_end": true,
"fastq_1": "Sample1_S1_L001_R1_001.fastq.gz"
}
],
"[Sample1_S1_L001].samplesheet.csv:md5,bc779a8b2302a093cbb04a118bb5c90f"
]
],
"samplesheet": [
[
[
{
"id": "Sample1_S1_L001",
"samplename": "Sample1",
"fcid": "220422_M11111_0222_000000000-K9H97",
"lane": "1",
"empty": false,
"single_end": true,
"fastq_1": "Sample1_S1_L001_R1_001.fastq.gz"
}
],
"[Sample1_S1_L001].samplesheet.csv:md5,bc779a8b2302a093cbb04a118bb5c90f"
]
]
}
],
"meta": {
"nf-test": "0.8.4",
"nextflow": "24.04.4"
},
"timestamp": "2024-08-09T22:00:18.282617632"
}
}
6 changes: 5 additions & 1 deletion nextflow.config
@@ -11,7 +11,7 @@ params {

// Options: Generic
input = null
demultiplexer = "bclconvert" // enum string [bclconvert, bcl2fastq, bases2fastq, fqtk, sgdemux, mkfastq]

// Options: trimming
trim_fastq = true // [true, false]
@@ -25,6 +25,10 @@ params {

// Kraken2 options
kraken_db = null // file .tar.gz

// Downstream Nextflow pipeline
downstream_pipeline = "default" // enum string [rnaseq, atacseq, taxprofiler, default]

// Options: CheckQC
checkqc_config = [] // file .yaml

9 changes: 7 additions & 2 deletions nextflow_schema.json
@@ -29,8 +29,13 @@
"kraken_db": {
"type": "string",
"format": "path",
"description": "Path to Kraken2 DB to use for screening"
},
"downstream_pipeline": {
"type": "string",
"description": "Name of downstream nf-core pipeline (one of: rnaseq, atacseq, taxprofiler or default). Used to produce the input samplesheet for that pipeline.",
"default": "default",
"enum": ["rnaseq", "atacseq", "taxprofiler", "default"]
}
}
},
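The `enum` field above is what constrains `--downstream_pipeline` to the four allowed values. A hypothetical, stripped-down Python validator (nf-core's real schema validation is far more thorough) behaves like this:

```python
# Hypothetical sketch of enum + default handling for a single parameter,
# mirroring the "downstream_pipeline" entry in nextflow_schema.json.
SCHEMA = {
    "downstream_pipeline": {
        "type": "string",
        "default": "default",
        "enum": ["rnaseq", "atacseq", "taxprofiler", "default"],
    }
}

def validate(params):
    """Apply defaults, then reject values outside the declared enum."""
    merged = {}
    for name, spec in SCHEMA.items():
        value = params.get(name, spec.get("default"))
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"--{name} must be one of {spec['enum']}, got {value!r}")
        merged[name] = value
    return merged

print(validate({}))  # falls back to {'downstream_pipeline': 'default'}
print(validate({"downstream_pipeline": "rnaseq"}))
```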
2 changes: 1 addition & 1 deletion tests/pipeline/bases2fastq.nf.test
@@ -19,7 +19,7 @@ nextflow_pipeline {
assertAll(
{ assert workflow.success },
{ assert snapshot(UTILS.removeNextflowVersion("$outputDir")).match("software_versions") },
{ assert workflow.trace.succeeded().size() == 8 },
{ assert snapshot(
// FIXME
// path("$outputDir/sim-data/DefaultSample_R1.fastq.gz.md5"),
2 changes: 1 addition & 1 deletion tests/pipeline/bcl2fastq.nf.test
@@ -20,7 +20,7 @@ nextflow_pipeline {
assertAll(
{ assert workflow.success },
{ assert snapshot(UTILS.removeNextflowVersion("$outputDir")).match("software_versions") },
{ assert workflow.trace.succeeded().size() == 6 },
{ assert snapshot(
path("$outputDir/multiqc/multiqc_data/bcl2fastq_lane_counts.txt"),
path("$outputDir/multiqc/multiqc_data/fastp_filtered_reads_plot.txt"),
2 changes: 1 addition & 1 deletion tests/pipeline/bclconvert.nf.test
@@ -19,7 +19,7 @@ nextflow_pipeline {
assertAll(
{ assert workflow.success },
{ assert snapshot(UTILS.removeNextflowVersion("$outputDir")).match("software_versions") },
{ assert workflow.trace.succeeded().size() == 6 },
{ assert snapshot(
path("$outputDir/multiqc/multiqc_data/bclconvert_lane_counts.txt"),
path("$outputDir/multiqc/multiqc_data/fastp_filtered_reads_plot.txt"),
2 changes: 1 addition & 1 deletion tests/pipeline/fqtk.nf.test
@@ -19,7 +19,7 @@ nextflow_pipeline {
assertAll(
{ assert workflow.success },
{ assert snapshot(UTILS.removeNextflowVersion("$outputDir")).match("software_versions") },
{ assert workflow.trace.succeeded().size() == 129 },
{ assert snapshot(path("$outputDir/test/demux-metrics.txt")).match("fqtk") },
{ assert new File("$outputDir/test/unmatched_1.fastp.fastq.gz").exists() },
{ assert new File("$outputDir/test/unmatched_2.fastp.fastq.gz").exists() },
2 changes: 1 addition & 1 deletion tests/pipeline/kraken.nf.test
@@ -21,7 +21,7 @@ nextflow_pipeline {
assertAll(
{ assert workflow.success },
{ assert snapshot(UTILS.removeNextflowVersion("$outputDir")).match("software_versions") },
{ assert workflow.trace.succeeded().size() == 9 },
{ assert snapshot(
path("$outputDir/multiqc/multiqc_data/bcl2fastq_lane_counts.txt"),
path("$outputDir/multiqc/multiqc_data/fastp_filtered_reads_plot.txt"),
12 changes: 6 additions & 6 deletions tests/pipeline/kraken.nf.test.snap
@@ -57,19 +57,19 @@
],
"meta": {
"nf-test": "0.8.4",
"nextflow": "24.04.4"
},
"timestamp": "2024-08-09T17:17:23.034777828"
},
"software_versions": {
"content": [
"{BCL2FASTQ={bcl2fastq=2.20.0.422}, FALCO={falco=1.2.1}, FASTP={fastp=0.23.4}, KRAKEN2={kraken2=2.1.3, pigz=2.8}, MD5SUM={md5sum=8.3}, Workflow={nf-core/demultiplex=v1.5.0dev}}"
],
"meta": {
"nf-test": "0.8.4",
"nextflow": "24.04.4"
},
"timestamp": "2024-08-09T17:17:22.999406989"
},
"multiqc": {
"content": [
@@ -80,8 +80,8 @@
],
"meta": {
"nf-test": "0.8.4",
"nextflow": "24.04.4"
},
"timestamp": "2024-08-09T17:17:23.014483899"
}
}