Merge branch 'dev' into dev

ggabernet · Mar 11, 2024 · afb1b48
2 parents 2a55194 + 63b9692

Showing 26 changed files with 728 additions and 59 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -56,6 +56,7 @@ jobs:
             "test_fetchimgt",
             "test_assembled_hs",
             "test_assembled_mm",
+            "test_10x_sc",
             "test_clontech_umi",
             "test_nebnext_umi",
           ]

diff --git a/.gitignore b/.gitignore
@@ -11,3 +11,5 @@ package-lock.json
 .idea/
 nf-params.json
 .vscode/
+tests/
+test_flow/
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,6 +9,8 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 - [#294](https://github.com/nf-core/airrflow/pull/294) Merge template updates nf-core/tools v2.11.1
 - [#299](https://github.com/nf-core/airrflow/pull/299) Add profile for common NEB and TAKARA protocols
+- [#289](https://github.com/nf-core/airrflow/pull/289) Add possibility to merge multi-lane samples when starting from fastq files
+- [#289](https://github.com/nf-core/airrflow/pull/289) Add possibility to run cellranger for scVDJseq data
 
 ### `Fixed`
 

diff --git a/README.md b/README.md
@@ -20,7 +20,7 @@
 
 ## Introduction
 
-**nf-core/airrflow** is a bioinformatics best-practice pipeline to analyze B-cell or T-cell repertoire sequencing data. It makes use of the [Immcantation](https://immcantation.readthedocs.io) toolset. The input data can be targeted amplicon bulk sequencing data of the V, D, J and C regions of the B/T-cell receptor with multiplex PCR or 5' RACE protocol, or assembled reads (bulk or single cell).
+**nf-core/airrflow** is a bioinformatics best-practice pipeline to analyze B-cell or T-cell repertoire sequencing data. It makes use of the [Immcantation](https://immcantation.readthedocs.io) toolset. The input data can be targeted amplicon bulk sequencing data of the V, D, J and C regions of the B/T-cell receptor with multiplex PCR or 5' RACE protocol, single-cell VDJ sequencing using the 10xGenomics libraries, or assembled reads (bulk or single-cell).
 
 ![nf-core/airrflow overview](docs/images/airrflow_workflow_overview.png)
 
@@ -34,18 +34,25 @@ nf-core/airrflow allows the end-to-end processing of BCR and TCR bulk and single
 
 ![nf-core/airrflow overview](docs/images/metro-map-airrflow.png)
 
-1. QC and sequence assembly (bulk only)
-
-- Raw read quality control, adapter trimming and clipping (`Fastp`).
-- Filter sequences by base quality (`pRESTO FilterSeq`).
-- Mask amplicon primers (`pRESTO MaskPrimers`).
-- Pair read mates (`pRESTO PairSeq`).
-- For UMI-based sequencing:
-  - Cluster sequences according to similarity (optional for insufficient UMI diversity) (`pRESTO ClusterSets`).
-  - Build consensus of sequences with the same UMI barcode (`pRESTO BuildConsensus`).
-- Assemble R1 and R2 read mates (`pRESTO AssemblePairs`).
-- Remove and annotate read duplicates (`pRESTO CollapseSeq`).
-- Filter out sequences that do not have at least 2 duplicates (`pRESTO SplitSeq`).
+1. QC and sequence assembly
+
+- Bulk
+  - Raw read quality control, adapter trimming and clipping (`Fastp`).
+  - Filter sequences by base quality (`pRESTO FilterSeq`).
+  - Mask amplicon primers (`pRESTO MaskPrimers`).
+  - Pair read mates (`pRESTO PairSeq`).
+  - For UMI-based sequencing:
+    - Cluster sequences according to similarity (optional for insufficient UMI diversity) (`pRESTO ClusterSets`).
+    - Build consensus of sequences with the same UMI barcode (`pRESTO BuildConsensus`).
+  - Assemble R1 and R2 read mates (`pRESTO AssemblePairs`).
+  - Remove and annotate read duplicates (`pRESTO CollapseSeq`).
+  - Filter out sequences that do not have at least 2 duplicates (`pRESTO SplitSeq`).
+- Single cell
+  - cellranger vdj
+    - Assemble contigs
+    - Annotate contigs
+    - Call cells
+    - Generate clonotypes
 
 2. V(D)J annotation and filtering (bulk and single-cell)
 
@@ -115,6 +122,18 @@ nextflow run nf-core/airrflow \
 --outdir ./results
 ```
 
+A typical command to run the pipeline from **single cell raw fastq files** (10xGenomics) is:
+
+```bash
+nextflow run nf-core/airrflow -r dev \
+-profile <docker/singularity/podman/shifter/charliecloud/conda/institute> \
+--mode fastq \
+--input input_samplesheet.tsv \
+--library_generation_method sc_10x_genomics \
+--reference_10x reference/refdata-cellranger-vdj-GRCh38-alts-ensembl-5.0.0.tar.gz \
+--outdir ./results
+```
+
 A typical command to run the pipeline from **single-cell AIRR rearrangement tables or assembled bulk sequencing fasta** data is:
 
 ```bash

diff --git a/bin/check_samplesheet.py b/bin/check_samplesheet.py
@@ -124,11 +124,6 @@ def check_samplesheet(file_in, assembled):
                         )
                     )
         else:
-            if any(tab["single_cell"].tolist()):
-                print_error(
-                    "Some single cell column values are TRUE. The raw mode only accepts bulk samples. If processing single cell samples, please set the `--mode assembled` flag, and provide an AIRR rearrangement as input."
-                )
-
             for col in required_columns_raw:
                 if col not in header:
                     print("ERROR: Please check samplesheet header: {} ".format(",".join(header)))
@@ -165,9 +160,12 @@ def check_samplesheet(file_in, assembled):
 
         ## Check that sample ids are unique
         if len(tab["sample_id"]) != len(set(tab["sample_id"])):
-            print_error(
-                "Sample IDs are not unique! The sample IDs in the input samplesheet should be unique for each sample."
-            )
+            if assembled:
+                print_error(
+                    "Sample IDs are not unique! The sample IDs in the input samplesheet should be unique for each sample."
+                )
+            else:
+                print("WARNING: Sample IDs are not unique! FastQs with the same sample ID will be merged.")
 
         ## Check that pcr_target_locus is IG or TR
         for val in tab["pcr_target_locus"]:

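Reviewer note: with this change, duplicate `sample_id` values are fatal only for assembled input. For raw FASTQ input they produce a warning, because rows sharing an ID are treated as lanes to be merged. Below is a minimal sketch of that branching. It assumes a hypothetical helper named `check_unique_sample_ids` (not part of `check_samplesheet.py`), a plain list standing in for the pandas column, and a `ValueError` standing in for the script's own `print_error` call.

```python
import sys


def check_unique_sample_ids(sample_ids, assembled):
    """Return True if all IDs are unique.

    Hypothetical helper for illustration only: duplicates raise an error for
    assembled input and emit a warning (returning False) for raw FASTQ input.
    """
    if len(sample_ids) == len(set(sample_ids)):
        return True
    if assembled:
        # Stand-in for print_error() in the real script.
        raise ValueError("Sample IDs are not unique!")
    print(
        "WARNING: Sample IDs are not unique! "
        "FastQs with the same sample ID will be merged.",
        file=sys.stderr,
    )
    return False
```

One consequence worth keeping in mind: in raw mode a mistyped sample ID that collides with another sample is no longer caught here, so the warning is the only signal that two samples will be concatenated.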
diff --git a/bin/reveal_add_metadata.R b/bin/reveal_add_metadata.R
@@ -61,8 +61,12 @@ if (!("INPUTID" %in% names(opt))) {
 # Read metadata file
 metadata <- read.csv(opt$METADATA, sep = "\t", header = TRUE, stringsAsFactors = F)
 
+# Merging samples over multiple lanes introduces multiple rows per sample.
+# After dropping the filename_ columns we expect only one row per sample.
 metadata <- metadata %>%
-    filter(sample_id == opt$INPUTID)
+    dplyr::filter(sample_id == opt$INPUTID) %>%
+    dplyr::select(!starts_with("filename_")) %>%
+    dplyr::distinct()
 
 if (nrow(metadata) != 1) {
     stop("Expecting nrow(metadata) == 1; nrow(metadata) == ", nrow(metadata), " found")
@@ -81,10 +85,7 @@ internal_fields <-
         "id",
         "filetype",
         "valid_single_cell",
-        "valid_pcr_target_locus",
-        "filename_R1",
-        "filename_R2",
-        "filename_I1"
+        "valid_pcr_target_locus"
     )
 metadata <- metadata[, !colnames(metadata) %in% internal_fields]
 

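Reviewer note: the R change above filters the metadata to one sample, drops every `filename_*` column, and then deduplicates, so that a sample spread over several lanes collapses to a single metadata row. The same idea is sketched below in plain Python. The function name `collapse_sample_rows` and the example rows are hypothetical and only illustrate the logic; they are not part of the pipeline.

```python
def collapse_sample_rows(rows, sample_id):
    """Keep rows for one sample, drop filename_* columns, and deduplicate.

    Hypothetical illustration of filter -> select(!starts_with("filename_"))
    -> distinct; row order of first appearance is preserved.
    """
    unique_rows = []
    for row in rows:
        if row["sample_id"] != sample_id:
            continue
        trimmed = {k: v for k, v in row.items() if not k.startswith("filename_")}
        if trimmed not in unique_rows:
            unique_rows.append(trimmed)
    return unique_rows


# Two lanes of the same sample plus one unrelated sample (made-up values).
rows = [
    {"sample_id": "s1", "filename_R1": "s1_S1_L001_R1_001.fastq.gz", "tissue": "blood"},
    {"sample_id": "s1", "filename_R1": "s1_S1_L002_R1_001.fastq.gz", "tissue": "blood"},
    {"sample_id": "s2", "filename_R1": "s2_S1_L001_R1_001.fastq.gz", "tissue": "blood"},
]
collapsed = collapse_sample_rows(rows, "s1")
```

If the lanes of one sample carry conflicting metadata (for example different `tissue` values), more than one row survives the deduplication, and the existing `nrow(metadata) != 1` check in the R script stops with an error.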
diff --git a/conf/test_10x_sc.config b/conf/test_10x_sc.config
@@ -0,0 +1,28 @@
+/*
+ * -------------------------------------------------
+ *  Nextflow config file for running tests
+ * -------------------------------------------------
+ * Defines bundled input files and everything required
+ * to run a fast and simple test. Use as follows:
+ *   nextflow run nf-core/airrflow -profile test_10x_sc,<docker/singularity>
+ */
+
+params {
+    config_profile_name = 'Test 10xGenomics single cell data'
+    config_profile_description = 'Minimal test dataset to check pipeline function with raw single cell data from 10xGenomics'
+
+    // Limit resources so that this can run on GitHub Actions
+    max_cpus = 2
+    max_memory = 6.GB
+    max_time = 48.h
+
+    // params
+    mode = 'fastq'
+    library_generation_method = 'sc_10x_genomics'
+    clonal_threshold = 0
+
+
+    // Input data
+    input = 'https://raw.githubusercontent.com/nf-core/test-datasets/airrflow/testdata-sc/10x_sc_raw.tsv'
+    reference_10x = 'https://raw.githubusercontent.com/nf-core/test-datasets/airrflow/testdata-sc/refdata-cellranger-vdj-GRCh38-alts-ensembl-5.0.0.tar.gz'
+}
diff --git a/docs/usage.md b/docs/usage.md
@@ -39,6 +39,18 @@ nextflow run nf-core/airrflow \
 --outdir results
 ```
 
+A typical command to run the pipeline from **single cell raw fastq files** is:
+
+```bash
+nextflow run nf-core/airrflow -r dev \
+-profile <docker/singularity/podman/shifter/charliecloud/conda/institute> \
+--mode fastq \
+--input input_samplesheet.tsv \
+--library_generation_method sc_10x_genomics \
+--reference_10x reference/refdata-cellranger-vdj-GRCh38-alts-ensembl-5.0.0.tar.gz \
+--outdir ./results
+```
+
 A typical command for running the pipeline departing from **single-cell AIRR rearrangement tables or assembled bulk sequencing fasta** data is:
 
 ```bash
@@ -49,7 +61,7 @@ nextflow run nf-core/airrflow \
 --outdir results
 ```
 
-Check the section [Input samplesheet](#input-samplesheet) below for instructions on how to create the samplesheet, and the [Supported library generation protocols](#supported-bulk-library-generation-methods-protocols) section below for examples on how to run the pipeline for different bulk sequencing protocols.
+Check the section [Input samplesheet](#input-samplesheet) below for instructions on how to create the samplesheet, and the [Supported library generation protocols](#supported-bulk-library-generation-methods-protocols) section below for examples on how to run the pipeline for different bulk sequencing protocols and for the 10xGenomics single cell sequencing protocol.
 For more information about the parameters, please refer to the [parameters documentation](https://nf-co.re/airrflow/parameters).
 The command above will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.
 
@@ -111,7 +123,7 @@ If you wish to share such profile (such as upload as supplementary material for
 
 ## Input samplesheet
 
-### Fastq input samplesheet (bulk sequencing only)
+### Fastq input samplesheet (bulk sequencing)
 
 The required input file for processing raw BCR or TCR bulk targeted sequencing data is a sample sheet in TSV format (tab separated). The columns `sample_id`, `filename_R1`, `filename_R2`, `subject_id`, `species`, `tissue`, `pcr_target_locus`, `single_cell`, `sex`, `age` and `biomaterial_provider` are required. An example samplesheet is:
 
@@ -131,7 +143,7 @@ The required input file for processing raw BCR or TCR bulk targeted sequencing d
 - `biomaterial_provider`: Institution / research group that provided the samples.
 - `sex`: Subject biological sex (`female`, `male`, etc.).
 - `age`: Subject biological age.
-- `single_cell`: TRUE or FALSE. Fastq input samplesheet only supports a FALSE value.
+- `single_cell`: TRUE or FALSE.
 
 Other optional columns can be added. These columns will be available when building the contrasts for the repertoire comparison report. It is recommended that these columns also follow the AIRR nomenclature. Examples are:
 
@@ -143,6 +155,25 @@ Other optional columns can be added. These columns will be available when buildi
 
 The metadata specified in the input file will then be automatically annotated in a column with the same header in the tables generated by the pipeline.
 
+### Fastq input samplesheet (single cell sequencing)
+
+The required input file for processing raw BCR or TCR single cell targeted sequencing data is a sample sheet in TSV format (tab separated). The columns `sample_id`, `filename_R1`, `filename_R2`, `subject_id`, `species`, `tissue`, `pcr_target_locus`, `single_cell`, `sex`, `age` and `biomaterial_provider` are required. You can refer to the bulk fastq input section for documentation on the individual columns.
+An example samplesheet is:
+
+| sample_id | filename_R1                     | filename_R2                     | subject_id | species | pcr_target_locus | tissue | sex    | age | biomaterial_provider | single_cell | intervention   | collection_time_point_relative | cell_subset  |
+| --------- | ------------------------------- | ------------------------------- | ---------- | ------- | ---------------- | ------ | ------ | --- | -------------------- | ----------- | -------------- | ------------------------------ | ------------ |
+| sample01  | sample1_S1_L001_R1_001.fastq.gz | sample1_S1_L001_R2_001.fastq.gz | Subject02  | human   | IG               | blood  | NA     | 53  | sequencing_facility  | TRUE        | Drug_treatment | Baseline                       | plasmablasts |
+| sample02  | sample2_S1_L001_R1_001.fastq.gz | sample2_S1_L001_R2_001.fastq.gz | Subject02  | human   | TR               | blood  | female | 78  | sequencing_facility  | TRUE        | Drug_treatment | Baseline                       | plasmablasts |
+
+> FASTQ files must conform to the 10xGenomics cellranger naming conventions:<br>**`[SAMPLE-NAME]`_S1_L00`[LANE-NUMBER]`\_`[READ-TYPE]`\_001.fastq.gz**
+>
+> Read type is one of
+>
+> - `I1`: Sample index read (optional)
+> - `I2`: Sample index read (optional)
+> - `R1`: Read 1
+> - `R2`: Read 2
+
 ### Assembled input samplesheet (bulk or single-cell sequencing)
 
 The required input file for processing raw BCR or TCR bulk targeted sequencing data is a sample sheet in TSV format (tab separated). The columns `sample_id`, `filename`, `subject_id`, `species`, `tissue`, `single_cell`, `sex`, `age` and `biomaterial_provider` are required. All fields are explained in the previous section, with the only difference being that there is only one `filename` column for the assembled input samplesheet. The provided file will be different from assembled single-cell or bulk data:
@@ -380,7 +411,7 @@ This sequencing type requires setting `--library_generation_method race_5p_umi`
 
 #### Takara Bio SMARTer Human BCR
 
-The read configuration when sequenicng with the TAKARA Bio SMARTer Human BCR protocol is the following:
+The read configuration when sequencing with the TAKARA Bio SMARTer Human BCR protocol is the following:
 
 ![nf-core/airrflow](images/TAKARA_RACE_BCR.png)
 
@@ -449,6 +480,37 @@ The UMI barcodes are typically read from an index file but sometimes can be prov
 
 - No UMIs in R1 or R2 reads: if no UMIs are present in the samples, specify `--umi_length 0` to use the sans-UMI subworkflow.
 
+## Supported single cell library generation methods (protocols)
+
+When processing single cell sequencing data starting from raw `fastq` reads, the only `--library_generation_method` currently available is `sc_10x_genomics`, which supports 10xGenomics data.
+
+| Library generation methods | Description                                                                                                 | Name in pipeline | Commercial protocols |
+| -------------------------- | ----------------------------------------------------------------------------------------------------------- | ---------------- | -------------------- |
+| RT(RHP)+PCR                | Sequencing data produced from Chromium single cell 5'V(D)J libraries containing cellular barcodes and UMIs. | sc_10x_genomics  | 10xGenomics          |
+
+### 10xGenomics
+
+This sequencing type requires setting `--library_generation_method sc_10x_genomics`.
+The `cellranger vdj` command automatically uses the Chromium cellular barcodes and UMIs to assemble V(D)J transcripts per cell and to call paired clonotypes.
+An example command to process 10xGenomics raw FASTQ data with airrflow is provided below.
+
+```bash
+nextflow run nf-core/airrflow -r dev \
+-profile <docker/singularity/podman/shifter/charliecloud/conda/institute> \
+--mode fastq \
+--input input_samplesheet.tsv \
+--library_generation_method sc_10x_genomics \
+--reference_10x reference/refdata-cellranger-vdj-GRCh38-alts-ensembl-5.0.0.tar.gz \
+--outdir ./results
+```
+
+#### 10xGenomics reference
+
+`cellranger vdj` requires a reference, which can be provided with the `--reference_10x` parameter.
+
+- The 10xGenomics reference can be downloaded from the [download page](https://www.10xgenomics.com/support/software/cell-ranger/downloads).
+- To generate a V(D)J segment fasta file from IMGT to use as reference, follow the [cellranger docs](https://support.10xgenomics.com/single-cell-vdj/software/pipelines/latest/advanced/references#imgt).
+
 ## Core Nextflow arguments
 
 :::note

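Reviewer note: the new single cell samplesheet section in `docs/usage.md` states the FASTQ naming pattern `[SAMPLE-NAME]_S1_L00[LANE-NUMBER]_[READ-TYPE]_001.fastq.gz` with read types `I1`, `I2`, `R1` and `R2`. Below is a minimal sketch of a pre-flight check users could run on their file names. The function `parse_10x_fastq_name` is hypothetical and not part of the pipeline, and the regular expression assumes the pattern exactly as documented here (a literal `S1` and a single-digit lane number), which may be stricter than what cellranger itself accepts.

```python
import re

# Pattern transcribed from the documented convention; an assumption, not
# taken from cellranger's own validation code.
TENX_FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+)_S1_L00(?P<lane>\d)_(?P<read>I1|I2|R1|R2)_001\.fastq\.gz$"
)


def parse_10x_fastq_name(filename):
    """Return (sample, lane, read_type) or None if the name does not match."""
    match = TENX_FASTQ_PATTERN.match(filename)
    if match is None:
        return None
    return match.group("sample"), int(match.group("lane")), match.group("read")


parsed = parse_10x_fastq_name("sample1_S1_L001_R1_001.fastq.gz")
rejected = parse_10x_fastq_name("sample1_R1.fastq.gz")
```

Running such a check before launching the pipeline would surface misnamed files early, rather than partway through a run.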
diff --git a/modules.json b/modules.json
@@ -5,6 +5,21 @@
         "https://github.com/nf-core/modules.git": {
             "modules": {
                 "nf-core": {
+                    "cat/fastq": {
+                        "branch": "master",
+                        "git_sha": "02fd5bd7275abad27aad32d5c852e0a9b1b98882",
+                        "installed_by": ["modules"]
+                    },
+                    "cellranger/mkvdjref": {
+                        "branch": "master",
+                        "git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
+                        "installed_by": ["modules"]
+                    },
+                    "cellranger/vdj": {
+                        "branch": "master",
+                        "git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
+                        "installed_by": ["modules"]
+                    },
                     "custom/dumpsoftwareversions": {
                         "branch": "master",
                         "git_sha": "de45447d060b8c8b98575bc637a4a575fd0638e1",

diff --git a/modules/local/unzip_cellrangerdb.nf b/modules/local/unzip_cellrangerdb.nf
@@ -0,0 +1,29 @@
+process UNZIP_CELLRANGERDB {
+    tag "unzip_cellrangerdb"
+    label 'process_single'
+
+    conda "${moduleDir}/environment.yml"
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+        'https://depot.galaxyproject.org/singularity/ubuntu:20.04' :
+        'nf-core/ubuntu:20.04' }"
+
+    input:
+    path(archive)
+
+    output:
+    path("$unzipped")   , emit: unzipped
+    path "versions.yml", emit: versions
+
+    script:
+    unzipped = archive.toString() - '.tar.gz'
+    """
+    echo "${unzipped}"
+
+    tar -xzvf ${archive}
+
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        unzip_cellrangerdb: \$(echo \$(tar --version 2>&1) | sed 's/^.*(GNU tar) //; s/ Copyright.*\$//')
+    END_VERSIONS
+    """
+}
diff --git a/modules/nf-core/cat/fastq/environment.yml b/modules/nf-core/cat/fastq/environment.yml