Add samplesheet generation and improve usage documentation #236

Merged
merged 10 commits into from
Aug 10, 2024
Changes from 8 commits
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#220](https://github.com/nf-core/demultiplex/pull/220) Added kraken2.
- [#221](https://github.com/nf-core/demultiplex/pull/221) Added checkqc_config to pipeline schema.
- [#225](https://github.com/nf-core/demultiplex/pull/225) Added test profile for multi-lane samples, updated handling of such samples and adapter trimming.
- [#236](https://github.com/nf-core/demultiplex/pull/236) Added samplesheet generation.

### `Changed`

11 changes: 11 additions & 0 deletions docs/output.md
Expand Up @@ -21,6 +21,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [Falco](#falco) - Raw read QC
- [md5sum](#md5sum) - Creates an MD5 (128-bit) checksum of every fastq.
- [kraken2](#kraken2) - Kraken2 is a taxonomic sequence classifier that assigns taxonomic labels to sequence reads.
- [samplesheet](#samplesheet) - Samplesheet generation for downstream nf-core pipelines.
- [MultiQC](#multiqc) - aggregate report, describing results of the whole pipeline

### bcl-convert
Expand Down Expand Up @@ -204,6 +205,16 @@ Creates an MD5 (128-bit) checksum of every fastq.

[Kraken](https://ccb.jhu.edu/software/kraken2/) is a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences. Kraken examines the k-mers within a query sequence and uses the information within those k-mers to query a database. That database maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain a given k-mer.

### Downstream pipeline samplesheet

<details markdown="1">
<summary>Output files</summary>

- `<outputdir>/samplesheet/`
- `*.csv`: Samplesheet with the generated FASTQ files formatted according to the selected downstream nf-core pipeline. Default: rnaseq format.

</details>

### Adapter sequence removal from samplesheet

<details markdown="1">
41 changes: 29 additions & 12 deletions docs/usage.md
Expand Up @@ -6,17 +6,23 @@

## Introduction

## Samplesheet input
> [!IMPORTANT]
> It is important to distinguish between the _pipeline_ samplesheet and the _flowcell_ samplesheet before working with this pipeline.
>
> - The **_pipeline_ samplesheet** is a file provided as input to the nf-core pipeline itself. It contains the overall configuration for your run, specifying the paths to individual _flowcell_ samplesheets, flowcell directories, and other metadata required to manage multiple sequencing runs. This is the primary configuration file that directs the pipeline on how to process your data.
> - The **_flowcell_ samplesheet** is specific to a particular sequencing run. It is typically created by the sequencing facility and contains the sample information, including barcodes, lane numbers, and indexes. The typical name is `SampleSheet.csv`. Each demultiplexer may require a different format for this file, which must be adhered to for proper data processing.

You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with at least 4 columns, and a header row as shown in the examples below. The input samplesheet is a comma-separated file that contains four columns: `id`, `samplesheet`, `lane`, `flowcell`.
## Pipeline samplesheet input

When using the demultiplexer fqtk, the samplesheet must contain an additional column `per_flowcell_manifest`. The column `per_flowcell_manifest` must contain two headers `fastq` and `read_structure`. As shown in the [example](https://github.com/fulcrumgenomics/nf-core-test-datasets/blob/fqtk/testdata/sim-data/per_flowcell_manifest.csv) provided each row must contain one fastq file name and the correlating read structure.
You will need to create a _pipeline_ samplesheet with information about the samples you would like to analyse before running the pipeline. Use the `--input` parameter to specify its location. It must be a comma-separated file with a header row and at least four columns (`id`, `samplesheet`, `lane`, `flowcell`), as shown in the examples below.

When using the demultiplexer fqtk, the _pipeline_ samplesheet must contain an additional column `per_flowcell_manifest`. The file referenced in the `per_flowcell_manifest` column must have two headers, `fastq` and `read_structure`. As shown in the provided [example](https://github.com/fulcrumgenomics/nf-core-test-datasets/blob/fqtk/testdata/sim-data/per_flowcell_manifest.csv), each row must contain one FASTQ file name and the corresponding read structure.

```bash
--input '[path to samplesheet file]'
--input '[path to pipeline samplesheet file]'
```
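
The column requirements described above can be checked before launching a run. The following is a minimal Python sketch, not part of the pipeline: the function name `check_pipeline_samplesheet` and its messages are hypothetical. It assumes `lane` is optional and numeric, and that sgdemux and fqtk need a `tar.gz` in the `flowcell` column, as the usage text in this file states.

```python
import csv
import io

# The four columns the usage text requires in the pipeline samplesheet
REQUIRED_COLUMNS = ["id", "samplesheet", "lane", "flowcell"]
# Demultiplexers that the usage text says need a tar.gz in the flowcell column
TARBALL_ONLY = {"sgdemux", "fqtk"}


def check_pipeline_samplesheet(text, demultiplexer="bclconvert"):
    """Return a list of human-readable problems found in the samplesheet text."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = [f"missing column: {name}" for name in REQUIRED_COLUMNS if name not in header]
    if problems:
        return problems
    for line_number, row in enumerate(reader, start=2):  # the header is line 1
        lane = row["lane"] or ""
        flowcell = row["flowcell"] or ""
        # lane is optional, but a non-empty value should be a number
        if lane and not lane.isdigit():
            problems.append(f"line {line_number}: lane is not a number")
        if demultiplexer in TARBALL_ONLY and not flowcell.endswith(".tar.gz"):
            problems.append(f"line {line_number}: flowcell must be a tar.gz for {demultiplexer}")
    return problems


example = (
    "id,samplesheet,lane,flowcell\n"
    "DDMMYY_SERIAL_NUMBER_FC3,/path/to/SampleSheet3.csv,3,/path/to/sequencer/output3\n"
)
print(check_pipeline_samplesheet(example))
print(check_pipeline_samplesheet(example, demultiplexer="fqtk"))
```

The first call reports no problems; the second flags the row because the run directory is not a `tar.gz`. The check only inspects the text of the sheet and does not test whether the paths exist.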

### Full samplesheet
#### Example: Pipeline samplesheet

```csv title="samplesheet.csv"
id,samplesheet,lane,flowcell
Expand All @@ -29,17 +35,15 @@ DDMMYY_SERIAL_NUMBER_FC3,/path/to/SampleSheet3.csv,3,/path/to/sequencer/output3
| Column | Description |
| ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| `id` | Flowcell id |
| `samplesheet` | Full path to the `SampleSheet.csv` file containing the sample information and indexes |
| `samplesheet` | Full path to the _flowcell_ `SampleSheet.csv` file containing the sample information and indexes |
| `lane` | Optional lane number. When a lane number is provided, only the given lane will be demultiplexed |
| `flowcell` | Full path to the Illumina sequencer output directory (often referred to as the run directory) or a `tar.gz` file containing the contents of said directory |

An [example samplesheet](https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv) has been provided with the pipeline.
An [example _pipeline_ samplesheet](https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv) has been provided with the pipeline.

Note that the run directory in the `flowcell` column must lead to a `tar.gz` for compatibility with the demultiplexers sgdemux and fqtk.

Each demultiplexing software uses a distinct samplesheet format. Below are examples for demultiplexer-specific samplesheets. Please see the following examples to format `SampleSheet.csv` for [sgdemux](https://github.com/nf-core/test-datasets/blob/demultiplex/testdata/sim-data/out.sample_meta.csv), [fqtk](https://github.com/fulcrumgenomics/nf-core-test-datasets/raw/fqtk/testdata/sim-data/fqtk_samplesheet.csv), and [bcl2fastq and bclconvert](https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/b2fq-samplesheet.csv)

### Samplesheet for fqtk
#### Example: Pipeline samplesheet for fqtk

```csv title="samplesheet.csv"
id,samplesheet,lane,flowcell,per_flowcell_manifest
Expand All @@ -52,17 +56,30 @@ DDMMYY_SERIAL_NUMBER_FC3,/path/to/SampleSheet3.csv,3,/path/to/sequencer/output3,
| Column | Description |
| ----------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| `id` | Flowcell id |
| `samplesheet` | Full path to the `SampleSheet.csv` file containing the sample information and indexes |
| `samplesheet` | Full path to the _flowcell_ `SampleSheet.csv` file containing the sample information and indexes |
| `lane` | Optional lane number. When a lane number is provided, only the given lane will be demultiplexed |
| `flowcell` | Full path to the Illumina sequencer output directory (often referred to as the run directory) or a `tar.gz` file containing the contents of said directory |
| `per_flowcell_manifest` | Full path to the flowcell manifest, containing the fastq file names and read structures |
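
The manifest layout described for fqtk (a `fastq` header and a `read_structure` header, one FASTQ file name per row) can be illustrated with a short Python sketch. The function name `read_manifest`, the file names, and the read structure values below are placeholders chosen for illustration; they are not taken from the linked example file.

```python
import csv
import io

# The two headers the usage text requires in the per-flowcell manifest
MANIFEST_HEADERS = {"fastq", "read_structure"}


def read_manifest(text):
    """Parse a per-flowcell manifest and return (fastq, read_structure) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    missing = MANIFEST_HEADERS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"manifest is missing headers: {sorted(missing)}")
    pairs = []
    for row in reader:
        # Each row must name one FASTQ file and give its read structure
        if not row["fastq"] or not row["read_structure"]:
            raise ValueError(f"incomplete manifest row: {row}")
        pairs.append((row["fastq"], row["read_structure"]))
    return pairs


# Placeholder manifest content, for illustration only
manifest = "fastq,read_structure\nR1.fastq.gz,8B\nR2.fastq.gz,100T\n"
for fastq, structure in read_manifest(manifest):
    print(fastq, structure)
```

The sketch checks only that both fields are present; it does not validate the read structure syntax itself.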

### Flowcell samplesheet

Each demultiplexer expects its own _flowcell_ samplesheet format. Use the following examples to format the _flowcell_ `SampleSheet.csv` for your demultiplexer:

| Demultiplexer | Example _flowcell_ `SampleSheet.csv` Format |
| ---------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **sgdemux** | [sgdemux SampleSheet.csv](https://github.com/nf-core/test-datasets/blob/demultiplex/testdata/sim-data/out.sample_meta.csv) |
| **fqtk** | [fqtk SampleSheet.csv](https://github.com/fulcrumgenomics/nf-core-test-datasets/raw/fqtk/testdata/sim-data/fqtk_samplesheet.csv) |
| **bcl2fastq and bclconvert** | [bcl2fastq and bclconvert SampleSheet.csv](https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/b2fq-samplesheet.csv) |

## Running the pipeline

The typical command for running the pipeline is as follows:

```bash
nextflow run nf-core/demultiplex --input ./samplesheet.csv --outdir ./results -profile docker
nextflow run nf-core/demultiplex \
--input pipeline_samplesheet.csv \
--outdir results \
-profile docker
```

This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.
53 changes: 53 additions & 0 deletions modules/local/fastq_to_samplesheet/main.nf
@@ -0,0 +1,53 @@
process FASTQ_TO_SAMPLESHEET {
    tag "$meta.id"

    executor 'local'
    memory 100.MB

    input:
    val meta
    val pipeline
    val strandedness

    output:
    tuple val(meta), path("*samplesheet.csv"), emit: samplesheet

    exec:
    // Clone metadata and remove unnecessary keys
    def meta_clone = meta.clone()
    meta_clone.remove("id")
    meta_clone.remove("single_end")
    meta_clone.remove("fcid")
    meta_clone.remove("readgroup")
    meta_clone.remove("empty")
    meta_clone.remove("lane")

    // Add relevant fields to the map
    def pipeline_map = [
        sample  : meta.samplename,
        fastq_1 : meta.fastq_1
    ]

    // Add fastq_2 if it's a paired-end sample
    if (!meta.single_end) {
        pipeline_map.fastq_2 = meta.fastq_2
    }

    // Add pipeline-specific entries
    if (pipeline == 'rnaseq') {
        pipeline_map << [ strandedness: strandedness ]
    } else if (pipeline == 'atacseq') {
        pipeline_map << [ replicate: 1 ]
    } else if (pipeline == 'taxprofiler') {
        pipeline_map << [ fasta: '' ]
    }

    // Create the samplesheet content
    def samplesheet = pipeline_map.keySet().collect { '"' + it + '"' }.join(",") + '\n'
    samplesheet += pipeline_map.values().collect { '"' + it + '"' }.join(",")

    // Write samplesheet to file
    def samplesheet_file = task.workDir.resolve("${meta.id}.samplesheet.csv")
    samplesheet_file.text = samplesheet

}
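
To make the resulting file format easier to see, here is a rough Python mirror of the row-building logic in `FASTQ_TO_SAMPLESHEET`. It is a sketch for illustration, not code from this PR: the function name `build_samplesheet`, the sample metadata, and the `"auto"` strandedness default are assumptions, since the value passed as `strandedness` is not shown in this diff.

```python
def build_samplesheet(meta, pipeline="rnaseq", strandedness="auto"):
    """Return the two-line CSV text for one sample, following the process logic."""
    # Fields shared by every downstream pipeline
    row = {"sample": meta["samplename"], "fastq_1": meta["fastq_1"]}
    # fastq_2 is only added for paired-end samples
    if not meta.get("single_end"):
        row["fastq_2"] = meta["fastq_2"]
    # Pipeline-specific columns, matching the three supported values
    if pipeline == "rnaseq":
        row["strandedness"] = strandedness  # "auto" is an assumed default
    elif pipeline == "atacseq":
        row["replicate"] = 1
    elif pipeline == "taxprofiler":
        row["fasta"] = ""
    # Quote every header and value, then join with commas
    header = ",".join(f'"{key}"' for key in row)
    values = ",".join(f'"{value}"' for value in row.values())
    return header + "\n" + values


# Hypothetical metadata for one paired-end sample
meta = {
    "id": "Sample1",
    "samplename": "Sample1",
    "single_end": False,
    "fastq_1": "/path/to/Sample1_R1.fastq.gz",
    "fastq_2": "/path/to/Sample1_R2.fastq.gz",
}
print(build_samplesheet(meta))
```

As in the process above, each call yields one header line and one data line per sample, and every field is wrapped in double quotes.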
6 changes: 5 additions & 1 deletion nextflow.config
Expand Up @@ -11,7 +11,7 @@ params {

// Options: Generic
input = null
demultiplexer = "bclconvert" // [bclconvert, bcl2fastq, bases2fastq, fqtk, sgdemux, mkfastq]
demultiplexer = "bclconvert" // enum string [bclconvert, bcl2fastq, bases2fastq, fqtk, sgdemux, mkfastq]

// Options: trimming
trim_fastq = true // [true, false]
Expand All @@ -25,6 +25,10 @@ params {

// Kraken2 options
kraken_db = null // file .tar.gz

// Downstream Nextflow pipeline
downstream_pipeline = "rnaseq" // enum string [rnaseq, atacseq, taxprofiler]

// Options: CheckQC
checkqc_config = [] // file .yaml

9 changes: 7 additions & 2 deletions nextflow_schema.json
Expand Up @@ -29,8 +29,13 @@
"kraken_db": {
"type": "string",
"format": "path",
"default": null,
"description": "path to Kraken2 DB to use for screening"
"description": "Path to Kraken2 DB to use for screening"
},
"downstream_pipeline": {
"type": "string",
"description": "Name of downstream Nextflow pipeline (one of: rnaseq, atacseq or taxprofiler). Used to produce the input samplesheet for that pipeline.",
"default": "rnaseq",
"enum": ["rnaseq", "atacseq", "taxprofiler"]
}
}
},
2 changes: 1 addition & 1 deletion tests/pipeline/bases2fastq.nf.test
Expand Up @@ -19,7 +19,7 @@ nextflow_pipeline {
assertAll(
{ assert workflow.success },
{ assert snapshot(UTILS.removeNextflowVersion("$outputDir")).match("software_versions") },
{ assert workflow.trace.succeeded().size() == 7 },
{ assert workflow.trace.succeeded().size() == 8 },
{ assert snapshot(
// FIXME
// path("$outputDir/sim-data/DefaultSample_R1.fastq.gz.md5"),
2 changes: 1 addition & 1 deletion tests/pipeline/bcl2fastq.nf.test
Expand Up @@ -20,7 +20,7 @@ nextflow_pipeline {
assertAll(
{ assert workflow.success },
{ assert snapshot(UTILS.removeNextflowVersion("$outputDir")).match("software_versions") },
{ assert workflow.trace.succeeded().size() == 5 },
{ assert workflow.trace.succeeded().size() == 6 },
{ assert snapshot(
path("$outputDir/multiqc/multiqc_data/bcl2fastq_lane_counts.txt"),
path("$outputDir/multiqc/multiqc_data/fastp_filtered_reads_plot.txt"),
2 changes: 1 addition & 1 deletion tests/pipeline/bclconvert.nf.test
Expand Up @@ -19,7 +19,7 @@ nextflow_pipeline {
assertAll(
{ assert workflow.success },
{ assert snapshot(UTILS.removeNextflowVersion("$outputDir")).match("software_versions") },
{ assert workflow.trace.succeeded().size() == 5 },
{ assert workflow.trace.succeeded().size() == 6 },
{ assert snapshot(
path("$outputDir/multiqc/multiqc_data/bclconvert_lane_counts.txt"),
path("$outputDir/multiqc/multiqc_data/fastp_filtered_reads_plot.txt"),
2 changes: 1 addition & 1 deletion tests/pipeline/fqtk.nf.test
Expand Up @@ -19,7 +19,7 @@ nextflow_pipeline {
assertAll(
{ assert workflow.success },
{ assert snapshot(UTILS.removeNextflowVersion("$outputDir")).match("software_versions") },
{ assert workflow.trace.succeeded().size() == 104 },
{ assert workflow.trace.succeeded().size() == 129 },
{ assert snapshot(path("$outputDir/test/demux-metrics.txt")).match("fqtk") },
{ assert new File("$outputDir/test/unmatched_1.fastp.fastq.gz").exists() },
{ assert new File("$outputDir/test/unmatched_2.fastp.fastq.gz").exists() },
2 changes: 1 addition & 1 deletion tests/pipeline/kraken.nf.test
Expand Up @@ -21,7 +21,7 @@ nextflow_pipeline {
assertAll(
{ assert workflow.success },
{ assert snapshot(UTILS.removeNextflowVersion("$outputDir")).match("software_versions") },
{ assert workflow.trace.succeeded().size() == 8 },
{ assert workflow.trace.succeeded().size() == 9 },
{ assert snapshot(
path("$outputDir/multiqc/multiqc_data/bcl2fastq_lane_counts.txt"),
path("$outputDir/multiqc/multiqc_data/fastp_filtered_reads_plot.txt"),
12 changes: 6 additions & 6 deletions tests/pipeline/kraken.nf.test.snap
Expand Up @@ -57,19 +57,19 @@
],
"meta": {
"nf-test": "0.8.4",
"nextflow": "23.10.0"
"nextflow": "24.04.4"
},
"timestamp": "2024-08-05T22:49:12.12938394"
"timestamp": "2024-08-09T17:17:23.034777828"
},
"software_versions": {
"content": [
"{BCL2FASTQ={bcl2fastq=2.20.0.422}, FALCO={falco=1.2.1}, FASTP={fastp=0.23.4}, KRAKEN2={kraken2=2.1.3, pigz=2.8}, MD5SUM={md5sum=8.3}, Workflow={nf-core/demultiplex=v1.5.0dev}}"
],
"meta": {
"nf-test": "0.8.4",
"nextflow": "23.10.0"
"nextflow": "24.04.4"
},
"timestamp": "2024-08-01T22:34:15.140488001"
"timestamp": "2024-08-09T17:17:22.999406989"
},
"multiqc": {
"content": [
Expand All @@ -80,8 +80,8 @@
],
"meta": {
"nf-test": "0.8.4",
"nextflow": "23.10.0"
"nextflow": "24.04.4"
},
"timestamp": "2024-08-05T22:49:08.601265877"
"timestamp": "2024-08-09T17:17:23.014483899"
}
}
4 changes: 2 additions & 2 deletions tests/pipeline/mkfastq.nf.test
Expand Up @@ -19,9 +19,9 @@ nextflow_pipeline {
assertAll(
{ assert workflow.success },
{ assert snapshot(UTILS.removeNextflowVersion("$outputDir")).match("software_versions") },
{ assert workflow.trace.succeeded().size() == 6 },
{ assert workflow.trace.succeeded().size() == 7 },
// How many directories were produced?
{assert path("${outputDir}").list().size() == 4},
{assert path("${outputDir}").list().size() == 6},
// How many files were produced?
{assert path("$outputDir/220422_M11111_0222_000000000-K9H97_mkfastq/").list().size() == 2},
{assert path("$outputDir/multiqc/").list().size() == 3},
2 changes: 1 addition & 1 deletion tests/pipeline/sgdemux.nf.test
Expand Up @@ -19,7 +19,7 @@ nextflow_pipeline {
assertAll(
{ assert workflow.success },
{ assert snapshot(UTILS.removeNextflowVersion("$outputDir")).match("software_versions") },
{ assert workflow.trace.succeeded().size() == 103 },
{ assert workflow.trace.succeeded().size() == 128 },
{ assert snapshot(
path("$outputDir/sim-data/metrics.tsv"),
path("$outputDir/sim-data/per_project_metrics.tsv"),