Merge pull request #236 from atrigila/add_samplesheet_generation
apeltzer authored Aug 10, 2024
2 parents e3d6e94 + 16e6b1a commit 8353819
Showing 19 changed files with 224 additions and 40 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -18,6 +18,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#220](https://github.com/nf-core/demultiplex/pull/220) Added kraken2.
- [#221](https://github.com/nf-core/demultiplex/pull/221) Added checkqc_config to pipeline schema.
- [#225](https://github.com/nf-core/demultiplex/pull/225) Added test profile for multi-lane samples, updated handling of such samples and adapter trimming.
- [#236](https://github.com/nf-core/demultiplex/pull/236) Add samplesheet generation.

### `Changed`

11 changes: 11 additions & 0 deletions docs/output.md
@@ -21,6 +21,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [Falco](#falco) - Raw read QC
- [md5sum](#md5sum) - Creates an MD5 (128-bit) checksum of every fastq.
- [kraken2](#kraken2) - Kraken2 is a taxonomic sequence classifier that assigns taxonomic labels to sequence reads.
- [samplesheet](#samplesheet) - Samplesheet generation for downstream nf-core pipelines.
- [MultiQC](#multiqc) - Aggregate report describing results of the whole pipeline

### bcl-convert
@@ -204,6 +205,16 @@ Creates an MD5 (128-bit) checksum of every fastq.

[Kraken2](https://ccb.jhu.edu/software/kraken2/) is a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences. Kraken2 examines the k-mers within a query sequence and uses the information within those k-mers to query a database. That database maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain a given k-mer.

### Downstream pipeline samplesheet

<details markdown="1">
<summary>Output files</summary>

- `<outputdir>/samplesheet/`
  - `*.csv`: Samplesheet listing the generated FASTQ files, formatted for the selected downstream nf-core pipeline (default: `rnaseq` format).

</details>
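For illustration, with the default `rnaseq` format a single-end sample produces a one-row samplesheet along these lines (sample and file names here are invented for the example; the exact columns come from the `FASTQ_TO_SAMPLESHEET` module):

```csv title="Sample1.samplesheet.csv"
"sample","fastq_1","strandedness"
"Sample1","Sample1_S1_L001_R1_001.fastq.gz","auto"
```

Paired-end samples gain a `fastq_2` column, and the `atacseq` and `taxprofiler` formats replace `strandedness` with `replicate` and `fasta` columns respectively.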

### Adapter sequence removal from samplesheet

<details markdown="1">
41 changes: 29 additions & 12 deletions docs/usage.md
@@ -6,17 +6,23 @@
## Introduction

> [!IMPORTANT]
> It is relevant to distinguish between the _pipeline_ samplesheet and the _flowcell_ samplesheet before working with this pipeline.
>
> - The **_pipeline_ samplesheet** is a file provided as input to the nf-core pipeline itself. It contains the overall configuration for your run, specifying the paths to individual _flowcell_ samplesheets, flowcell directories, and other metadata required to manage multiple sequencing runs. This is the primary configuration file that directs the pipeline on how to process your data.
> - The **_flowcell_ samplesheet** is specific to a particular sequencing run. It is typically created by the sequencing facility and contains the sample information, including barcodes, lane numbers, and indexes. The typical name is `SampleSheet.csv`. Each demultiplexer may require a different format for this file, which must be adhered to for proper data processing.
## Pipeline samplesheet input

You will need to create a _pipeline_ samplesheet with information about the samples you would like to analyse before running the pipeline. Use the `--input` parameter to specify its location. It must be a comma-separated file with a header row and at least four columns: `id`, `samplesheet`, `lane`, `flowcell`, as shown in the examples below.

When using the demultiplexer fqtk, the _pipeline_ samplesheet must contain an additional column `per_flowcell_manifest`. The file referenced in `per_flowcell_manifest` must contain two header columns, `fastq` and `read_structure`. As shown in the [example](https://github.com/fulcrumgenomics/nf-core-test-datasets/blob/fqtk/testdata/sim-data/per_flowcell_manifest.csv) provided, each row must contain one fastq file name and the corresponding read structure.

```bash
--input '[path to pipeline samplesheet file]'
```

#### Example: Pipeline samplesheet

```csv title="samplesheet.csv"
id,samplesheet,lane,flowcell
DDMMYY_SERIAL_NUMBER_FC3,/path/to/SampleSheet3.csv,3,/path/to/sequencer/output3
```
| Column | Description |
| ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| `id` | Flowcell id |
| `samplesheet` | Full path to the _flowcell_ `SampleSheet.csv` file containing the sample information and indexes |
| `lane` | Optional lane number. When a lane number is provided, only the given lane will be demultiplexed |
| `flowcell` | Full path to the Illumina sequencer output directory (often referred to as the run directory) or a `tar.gz` file containing the contents of said directory |

An [example _pipeline_ samplesheet](https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv) has been provided with the pipeline.

Note that the run directory in the `flowcell` column must point to a `tar.gz` file for compatibility with the demultiplexers sgdemux and fqtk.


#### Example: Pipeline samplesheet for fqtk

```csv title="samplesheet.csv"
id,samplesheet,lane,flowcell,per_flowcell_manifest
DDMMYY_SERIAL_NUMBER_FC3,/path/to/SampleSheet3.csv,3,/path/to/sequencer/output3,
```
| Column | Description |
| ----------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| `id` | Flowcell id |
| `samplesheet` | Full path to the _flowcell_ `SampleSheet.csv` file containing the sample information and indexes |
| `lane` | Optional lane number. When a lane number is provided, only the given lane will be demultiplexed |
| `flowcell` | Full path to the Illumina sequencer output directory (often referred to as the run directory) or a `tar.gz` file containing the contents of said directory |
| `per_flowcell_manifest` | Full path to the flowcell manifest, containing the fastq file names and read structures |

### Flowcell samplesheet

Each demultiplexing software uses a distinct _flowcell_ samplesheet format. Below are examples for demultiplexer-specific _flowcell_ samplesheets. Please see the following examples to format the _flowcell_ `SampleSheet.csv`:

| Demultiplexer | Example _flowcell_ `SampleSheet.csv` Format |
| ---------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **sgdemux** | [sgdemux SampleSheet.csv](https://github.com/nf-core/test-datasets/blob/demultiplex/testdata/sim-data/out.sample_meta.csv) |
| **fqtk** | [fqtk SampleSheet.csv](https://github.com/fulcrumgenomics/nf-core-test-datasets/raw/fqtk/testdata/sim-data/fqtk_samplesheet.csv) |
| **bcl2fastq and bclconvert** | [bcl2fastq and bclconvert SampleSheet.csv](https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/b2fq-samplesheet.csv) |
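As a rough sketch only (the linked files above are authoritative, and the exact sections and columns vary by demultiplexer and instrument), a minimal bcl2fastq-style _flowcell_ `SampleSheet.csv` looks something like:

```csv title="SampleSheet.csv"
[Data]
Sample_ID,Sample_Name,index
Sample1,Sample1,ATTACTCG
Sample2,Sample2,TCCGGAGA
```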

## Running the pipeline

The typical command for running the pipeline is as follows:

```bash
nextflow run nf-core/demultiplex \
--input pipeline_samplesheet.csv \
--outdir results \
-profile docker
```

This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.
45 changes: 45 additions & 0 deletions modules/local/fastq_to_samplesheet/main.nf
@@ -0,0 +1,45 @@
process FASTQ_TO_SAMPLESHEET {
tag "$meta.id"

executor 'local'
memory 100.MB

input:
val meta
val pipeline
val strandedness

output:
tuple val(meta), path("*samplesheet.csv"), emit: samplesheet

exec:

// Add relevant fields to the map
def pipeline_map = [
sample : meta.samplename,
fastq_1 : meta.fastq_1
]

// Add fastq_2 if it's a paired-end sample
if (!meta.single_end) {
pipeline_map.fastq_2 = meta.fastq_2
}

// Add pipeline-specific entries
if (pipeline == 'rnaseq') {
pipeline_map << [ strandedness: strandedness ]
} else if (pipeline == 'atacseq') {
pipeline_map << [ replicate: 1 ]
} else if (pipeline == 'taxprofiler') {
pipeline_map << [ fasta: '' ]
}

// Create the samplesheet content
def samplesheet = pipeline_map.keySet().collect { '"' + it + '"' }.join(",") + '\n'
samplesheet += pipeline_map.values().collect { '"' + it + '"' }.join(",")

// Write samplesheet to file
def samplesheet_file = task.workDir.resolve("${meta.id}.samplesheet.csv")
samplesheet_file.text = samplesheet

}
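The `exec:` body above is plain Groovy. As an illustration only (this Python sketch is not part of the pipeline), the row-building and quoting logic behaves like this:

```python
def fastq_to_samplesheet(meta, pipeline, strandedness):
    # Core columns shared by all supported downstream pipelines
    row = {"sample": meta["samplename"], "fastq_1": meta["fastq_1"]}
    # fastq_2 is only emitted for paired-end samples
    if not meta.get("single_end", False):
        row["fastq_2"] = meta["fastq_2"]
    # Pipeline-specific extra columns
    if pipeline == "rnaseq":
        row["strandedness"] = strandedness
    elif pipeline == "atacseq":
        row["replicate"] = 1
    elif pipeline == "taxprofiler":
        row["fasta"] = ""
    # Quote every header and value, matching the Groovy string building
    header = ",".join(f'"{k}"' for k in row)
    values = ",".join(f'"{v}"' for v in row.values())
    return header + "\n" + values

meta = {"samplename": "Sample1", "single_end": True,
        "fastq_1": "Sample1_S1_L001_R1_001.fastq.gz"}
print(fastq_to_samplesheet(meta, "rnaseq", "auto"))
# prints:
# "sample","fastq_1","strandedness"
# "Sample1","Sample1_S1_L001_R1_001.fastq.gz","auto"
```

Note that the module writes this content to `${meta.id}.samplesheet.csv` in the task work directory rather than returning it.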
30 changes: 30 additions & 0 deletions modules/local/fastq_to_samplesheet/tests/main.nf.test
@@ -0,0 +1,30 @@
nextflow_process {

name "Test Process FASTQ_TO_SAMPLESHEET"
script "../main.nf"
process "FASTQ_TO_SAMPLESHEET"

tag "modules"
tag "modules_local"
tag "fastq_to_samplesheet"

test("Should run without failures") {

when {
process {
"""
input[0] = Channel.of([[id:'Sample1_S1_L001', samplename:'Sample1', fcid:'220422_M11111_0222_000000000-K9H97', lane:'1', empty:false, single_end:true, fastq_1:'Sample1_S1_L001_R1_001.fastq.gz']])
input[1] = 'rnaseq'
input[2] = 'auto'
"""
}
}

then {
assertAll(
{ assert process.success },
{ assert snapshot(process.out).match() }
)
}
}
}
45 changes: 45 additions & 0 deletions modules/local/fastq_to_samplesheet/tests/main.nf.test.snap
@@ -0,0 +1,45 @@
{
"Should run without failures": {
"content": [
{
"0": [
[
[
{
"id": "Sample1_S1_L001",
"samplename": "Sample1",
"fcid": "220422_M11111_0222_000000000-K9H97",
"lane": "1",
"empty": false,
"single_end": true,
"fastq_1": "Sample1_S1_L001_R1_001.fastq.gz"
}
],
"[Sample1_S1_L001].samplesheet.csv:md5,bc779a8b2302a093cbb04a118bb5c90f"
]
],
"samplesheet": [
[
[
{
"id": "Sample1_S1_L001",
"samplename": "Sample1",
"fcid": "220422_M11111_0222_000000000-K9H97",
"lane": "1",
"empty": false,
"single_end": true,
"fastq_1": "Sample1_S1_L001_R1_001.fastq.gz"
}
],
"[Sample1_S1_L001].samplesheet.csv:md5,bc779a8b2302a093cbb04a118bb5c90f"
]
]
}
],
"meta": {
"nf-test": "0.8.4",
"nextflow": "24.04.4"
},
"timestamp": "2024-08-09T22:00:18.282617632"
}
}
6 changes: 5 additions & 1 deletion nextflow.config
@@ -11,7 +11,7 @@ params {

// Options: Generic
input = null
demultiplexer = "bclconvert" // enum string [bclconvert, bcl2fastq, bases2fastq, fqtk, sgdemux, mkfastq]

// Options: trimming
trim_fastq = true // [true, false]
@@ -25,6 +25,10 @@ params {

// Kraken2 options
kraken_db = null // file .tar.gz

// Downstream Nextflow pipeline
downstream_pipeline = "default" // enum string [rnaseq, atacseq, taxprofiler, default]

// Options: CheckQC
checkqc_config = [] // file .yaml

9 changes: 7 additions & 2 deletions nextflow_schema.json
@@ -29,8 +29,13 @@
"kraken_db": {
"type": "string",
"format": "path",
"description": "Path to Kraken2 DB to use for screening"
},
"downstream_pipeline": {
"type": "string",
"description": "Name of downstream nf-core pipeline (one of: rnaseq, atacseq, taxprofiler or default). Used to produce the input samplesheet for that pipeline.",
"default": "default",
"enum": ["rnaseq", "atacseq", "taxprofiler", "default"]
}
}
},
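The `enum` field above is what constrains `--downstream_pipeline` to the four allowed values. A hypothetical, stripped-down Python validator (nf-core's real schema validation is far more thorough) behaves like this:

```python
# Hypothetical sketch of enum + default handling for a single parameter,
# mirroring the "downstream_pipeline" entry in nextflow_schema.json.
SCHEMA = {
    "downstream_pipeline": {
        "type": "string",
        "default": "default",
        "enum": ["rnaseq", "atacseq", "taxprofiler", "default"],
    }
}

def validate(params):
    """Apply defaults, then reject values outside the declared enum."""
    merged = {}
    for name, spec in SCHEMA.items():
        value = params.get(name, spec.get("default"))
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"--{name} must be one of {spec['enum']}, got {value!r}")
        merged[name] = value
    return merged

print(validate({}))  # falls back to {'downstream_pipeline': 'default'}
print(validate({"downstream_pipeline": "rnaseq"}))
```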
2 changes: 1 addition & 1 deletion tests/pipeline/bases2fastq.nf.test
@@ -19,7 +19,7 @@ nextflow_pipeline {
assertAll(
{ assert workflow.success },
{ assert snapshot(UTILS.removeNextflowVersion("$outputDir")).match("software_versions") },
{ assert workflow.trace.succeeded().size() == 8 },
{ assert snapshot(
// FIXME
// path("$outputDir/sim-data/DefaultSample_R1.fastq.gz.md5"),
2 changes: 1 addition & 1 deletion tests/pipeline/bcl2fastq.nf.test
@@ -20,7 +20,7 @@ nextflow_pipeline {
assertAll(
{ assert workflow.success },
{ assert snapshot(UTILS.removeNextflowVersion("$outputDir")).match("software_versions") },
{ assert workflow.trace.succeeded().size() == 6 },
{ assert snapshot(
path("$outputDir/multiqc/multiqc_data/bcl2fastq_lane_counts.txt"),
path("$outputDir/multiqc/multiqc_data/fastp_filtered_reads_plot.txt"),
2 changes: 1 addition & 1 deletion tests/pipeline/bclconvert.nf.test
@@ -19,7 +19,7 @@ nextflow_pipeline {
assertAll(
{ assert workflow.success },
{ assert snapshot(UTILS.removeNextflowVersion("$outputDir")).match("software_versions") },
{ assert workflow.trace.succeeded().size() == 6 },
{ assert snapshot(
path("$outputDir/multiqc/multiqc_data/bclconvert_lane_counts.txt"),
path("$outputDir/multiqc/multiqc_data/fastp_filtered_reads_plot.txt"),
2 changes: 1 addition & 1 deletion tests/pipeline/fqtk.nf.test
@@ -19,7 +19,7 @@ nextflow_pipeline {
assertAll(
{ assert workflow.success },
{ assert snapshot(UTILS.removeNextflowVersion("$outputDir")).match("software_versions") },
{ assert workflow.trace.succeeded().size() == 129 },
{ assert snapshot(path("$outputDir/test/demux-metrics.txt")).match("fqtk") },
{ assert new File("$outputDir/test/unmatched_1.fastp.fastq.gz").exists() },
{ assert new File("$outputDir/test/unmatched_2.fastp.fastq.gz").exists() },
2 changes: 1 addition & 1 deletion tests/pipeline/kraken.nf.test
@@ -21,7 +21,7 @@ nextflow_pipeline {
assertAll(
{ assert workflow.success },
{ assert snapshot(UTILS.removeNextflowVersion("$outputDir")).match("software_versions") },
{ assert workflow.trace.succeeded().size() == 9 },
{ assert snapshot(
path("$outputDir/multiqc/multiqc_data/bcl2fastq_lane_counts.txt"),
path("$outputDir/multiqc/multiqc_data/fastp_filtered_reads_plot.txt"),
12 changes: 6 additions & 6 deletions tests/pipeline/kraken.nf.test.snap
@@ -57,19 +57,19 @@
],
"meta": {
"nf-test": "0.8.4",
"nextflow": "24.04.4"
},
"timestamp": "2024-08-09T17:17:23.034777828"
},
"software_versions": {
"content": [
"{BCL2FASTQ={bcl2fastq=2.20.0.422}, FALCO={falco=1.2.1}, FASTP={fastp=0.23.4}, KRAKEN2={kraken2=2.1.3, pigz=2.8}, MD5SUM={md5sum=8.3}, Workflow={nf-core/demultiplex=v1.5.0dev}}"
],
"meta": {
"nf-test": "0.8.4",
"nextflow": "24.04.4"
},
"timestamp": "2024-08-09T17:17:22.999406989"
},
"multiqc": {
"content": [
@@ -80,8 +80,8 @@
],
"meta": {
"nf-test": "0.8.4",
"nextflow": "24.04.4"
},
"timestamp": "2024-08-09T17:17:23.014483899"
}
}