Merge branch 'dev' into dev
ggabernet authored Mar 11, 2024
2 parents 2a55194 + 63b9692 commit afb1b48
Showing 26 changed files with 728 additions and 59 deletions.
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,7 @@ jobs:
"test_fetchimgt",
"test_assembled_hs",
"test_assembled_mm",
"test_10x_sc",
"test_clontech_umi",
"test_nebnext_umi",
]
Expand Down
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -11,3 +11,5 @@ package-lock.json
.idea/
nf-params.json
.vscode/
tests/
test_flow/
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,8 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

- [#294](https://github.com/nf-core/airrflow/pull/294) Merge template updates nf-core/tools v2.11.1
- [#299](https://github.com/nf-core/airrflow/pull/299) Add profile for common NEB and TAKARA protocols
- [#289](https://github.com/nf-core/airrflow/pull/289) Add possibility to merge multi-lane samples when starting from fastq files
- [#289](https://github.com/nf-core/airrflow/pull/289) Add possibility to run cellranger for scVDJseq data

### `Fixed`

Expand Down
45 changes: 32 additions & 13 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@

## Introduction

**nf-core/airrflow** is a bioinformatics best-practice pipeline to analyze B-cell or T-cell repertoire sequencing data. It makes use of the [Immcantation](https://immcantation.readthedocs.io) toolset. The input data can be targeted amplicon bulk sequencing data of the V, D, J and C regions of the B/T-cell receptor with multiplex PCR or 5' RACE protocol, or assembled reads (bulk or single cell).
**nf-core/airrflow** is a bioinformatics best-practice pipeline to analyze B-cell or T-cell repertoire sequencing data. It makes use of the [Immcantation](https://immcantation.readthedocs.io) toolset. The input data can be targeted amplicon bulk sequencing data of the V, D, J and C regions of the B/T-cell receptor with multiplex PCR or 5' RACE protocol, single-cell VDJ sequencing using the 10xGenomics libraries, or assembled reads (bulk or single-cell).

![nf-core/airrflow overview](docs/images/airrflow_workflow_overview.png)

Expand All @@ -34,18 +34,25 @@ nf-core/airrflow allows the end-to-end processing of BCR and TCR bulk and single

![nf-core/airrflow overview](docs/images/metro-map-airrflow.png)

1. QC and sequence assembly (bulk only)

- Raw read quality control, adapter trimming and clipping (`Fastp`).
- Filter sequences by base quality (`pRESTO FilterSeq`).
- Mask amplicon primers (`pRESTO MaskPrimers`).
- Pair read mates (`pRESTO PairSeq`).
- For UMI-based sequencing:
- Cluster sequences according to similarity (optional for insufficient UMI diversity) (`pRESTO ClusterSets`).
- Build consensus of sequences with the same UMI barcode (`pRESTO BuildConsensus`).
- Assemble R1 and R2 read mates (`pRESTO AssemblePairs`).
- Remove and annotate read duplicates (`pRESTO CollapseSeq`).
- Filter out sequences that do not have at least 2 duplicates (`pRESTO SplitSeq`).
1. QC and sequence assembly

- Bulk
- Raw read quality control, adapter trimming and clipping (`Fastp`).
- Filter sequences by base quality (`pRESTO FilterSeq`).
- Mask amplicon primers (`pRESTO MaskPrimers`).
- Pair read mates (`pRESTO PairSeq`).
- For UMI-based sequencing:
- Cluster sequences according to similarity (optional for insufficient UMI diversity) (`pRESTO ClusterSets`).
- Build consensus of sequences with the same UMI barcode (`pRESTO BuildConsensus`).
- Assemble R1 and R2 read mates (`pRESTO AssemblePairs`).
- Remove and annotate read duplicates (`pRESTO CollapseSeq`).
- Filter out sequences that do not have at least 2 duplicates (`pRESTO SplitSeq`).
- Single cell
- `cellranger vdj`
- Assemble contigs
- Annotate contigs
- Call cells
- Generate clonotypes

2. V(D)J annotation and filtering (bulk and single-cell)

Expand Down Expand Up @@ -115,6 +122,18 @@ nextflow run nf-core/airrflow \
--outdir ./results
```

A typical command to run the pipeline from **single-cell raw fastq files** (10xGenomics) is:

```bash
nextflow run nf-core/airrflow -r dev \
-profile <docker/singularity/podman/shifter/charliecloud/conda/institute> \
--mode fastq \
--input input_samplesheet.tsv \
--library_generation_method sc_10x_genomics \
--reference_10x reference/refdata-cellranger-vdj-GRCh38-alts-ensembl-5.0.0.tar.gz \
--outdir ./results
```

A typical command to run the pipeline from **single-cell AIRR rearrangement tables or assembled bulk sequencing fasta** data is:

```bash
Expand Down
14 changes: 6 additions & 8 deletions bin/check_samplesheet.py
Original file line number Diff line number Diff line change
Expand Up @@ -124,11 +124,6 @@ def check_samplesheet(file_in, assembled):
)
)
else:
if any(tab["single_cell"].tolist()):
print_error(
"Some single cell column values are TRUE. The raw mode only accepts bulk samples. If processing single cell samples, please set the `--mode assembled` flag, and provide an AIRR rearrangement as input."
)

for col in required_columns_raw:
if col not in header:
print("ERROR: Please check samplesheet header: {} ".format(",".join(header)))
Expand Down Expand Up @@ -165,9 +160,12 @@ def check_samplesheet(file_in, assembled):

## Check that sample ids are unique
if len(tab["sample_id"]) != len(set(tab["sample_id"])):
print_error(
"Sample IDs are not unique! The sample IDs in the input samplesheet should be unique for each sample."
)
if assembled:
print_error(
"Sample IDs are not unique! The sample IDs in the input samplesheet should be unique for each sample."
)
else:
print("WARNING: Sample IDs are not unique! FastQs with the same sample ID will be merged.")

## Check that pcr_target_locus is IG or TR
for val in tab["pcr_target_locus"]:
Expand Down
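Review note: the changed uniqueness check in `bin/check_samplesheet.py` can be summarised with a small standalone sketch. This is not the pipeline's code. The function name `check_sample_ids` and its return convention are hypothetical, and only the branching (an error for assembled input, a warning for FASTQ input) is taken from the hunk above.

```python
def check_sample_ids(sample_ids, assembled):
    """Illustrative only: mirrors the branching in the diff, not the actual script.

    Returns None when all IDs are unique, returns a warning string for FASTQ
    input with duplicate IDs, and raises ValueError for assembled input with
    duplicate IDs.
    """
    if len(sample_ids) == len(set(sample_ids)):
        return None
    if assembled:
        # Assembled input: duplicated sample IDs remain a hard error.
        raise ValueError(
            "Sample IDs are not unique! The sample IDs in the input "
            "samplesheet should be unique for each sample."
        )
    # FASTQ input: duplicates are now treated as multi-lane samples.
    return "WARNING: Sample IDs are not unique! FastQs with the same sample ID will be merged."


print(check_sample_ids(["sample01", "sample01"], assembled=False))
```

The design choice worth noting is that repeated sample IDs stay fatal for assembled input but are reinterpreted as lanes of the same sample for FASTQ input, so a typo in a FASTQ samplesheet would now merge two samples with only a warning.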
11 changes: 6 additions & 5 deletions bin/reveal_add_metadata.R
Original file line number Diff line number Diff line change
Expand Up @@ -61,8 +61,12 @@ if (!("INPUTID" %in% names(opt))) {
# Read metadata file
metadata <- read.csv(opt$METADATA, sep = "\t", header = TRUE, stringsAsFactors = F)

# Merging samples over multiple lanes introduces multi-rows per sample
# We expect only one row per sample
metadata <- metadata %>%
filter(sample_id == opt$INPUTID)
dplyr::filter(sample_id == opt$INPUTID) %>%
dplyr::select(!starts_with("filename_")) %>%
dplyr::distinct()

if (nrow(metadata) != 1) {
stop("Expecting nrow(metadata) == 1; nrow(metadata) == ", nrow(metadata), " found")
Expand All @@ -81,10 +85,7 @@ internal_fields <-
"id",
"filetype",
"valid_single_cell",
"valid_pcr_target_locus",
"filename_R1",
"filename_R2",
"filename_I1"
"valid_pcr_target_locus"
)
metadata <- metadata[, !colnames(metadata) %in% internal_fields]

Expand Down
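Review note: the new `dplyr` chain in `bin/reveal_add_metadata.R` collapses the per-lane rows of a merged sample into a single metadata row. A rough Python analogue is sketched below, assuming the samplesheet has already been read into a list of dictionaries. The helper name `select_sample_metadata` and the example rows are hypothetical.

```python
def select_sample_metadata(rows, sample_id):
    """Hypothetical analogue of filter() %>% select(!starts_with("filename_")) %>% distinct()."""
    # Keep rows for the requested sample and drop the per-lane filename_* columns.
    trimmed = [
        {key: value for key, value in row.items() if not key.startswith("filename_")}
        for row in rows
        if row["sample_id"] == sample_id
    ]
    # Collapse identical rows while preserving their order.
    unique = []
    for row in trimmed:
        if row not in unique:
            unique.append(row)
    # Like the R script, insist on exactly one remaining row.
    if len(unique) != 1:
        raise ValueError(f"Expecting exactly one metadata row, found {len(unique)}")
    return unique[0]


rows = [
    {"sample_id": "s1", "filename_R1": "s1_S1_L001_R1_001.fastq.gz", "subject_id": "Subject02"},
    {"sample_id": "s1", "filename_R1": "s1_S1_L002_R1_001.fastq.gz", "subject_id": "Subject02"},
]
print(select_sample_metadata(rows, "s1"))
```

If two lanes of the same sample carry different metadata (for example a different `subject_id`), more than one row survives and the check fails, which matches the `nrow(metadata) != 1` guard kept in the R script.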
28 changes: 28 additions & 0 deletions conf/test_10x_sc.config
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
/*
* -------------------------------------------------
* Nextflow config file for running tests
* -------------------------------------------------
* Defines bundled input files and everything required
* to run a fast and simple test. Use as follows:
* nextflow run nf-core/airrflow -profile test_10x_sc,<docker/singularity>
*/

params {
config_profile_name = 'Test 10xGenomics single cell data'
config_profile_description = 'Minimal test dataset to check pipeline function with raw single cell data from 10xGenomics'

// Limit resources so that this can run on GitHub Actions
max_cpus = 2
max_memory = 6.GB
max_time = 48.h

// params
mode = 'fastq'
library_generation_method = 'sc_10x_genomics'
clonal_threshold = 0


// Input data
input = 'https://raw.githubusercontent.com/nf-core/test-datasets/airrflow/testdata-sc/10x_sc_raw.tsv'
reference_10x = 'https://raw.githubusercontent.com/nf-core/test-datasets/airrflow/testdata-sc/refdata-cellranger-vdj-GRCh38-alts-ensembl-5.0.0.tar.gz'
}
70 changes: 66 additions & 4 deletions docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,18 @@ nextflow run nf-core/airrflow \
--outdir results
```

A typical command to run the pipeline from **single cell raw fastq files** is:

```bash
nextflow run nf-core/airrflow -r dev \
-profile <docker/singularity/podman/shifter/charliecloud/conda/institute> \
--mode fastq \
--input input_samplesheet.tsv \
--library_generation_method sc_10x_genomics \
--reference_10x reference/refdata-cellranger-vdj-GRCh38-alts-ensembl-5.0.0.tar.gz \
--outdir ./results
```

A typical command for running the pipeline departing from **single-cell AIRR rearrangement tables or assembled bulk sequencing fasta** data is:

```bash
Expand All @@ -49,7 +61,7 @@ nextflow run nf-core/airrflow \
--outdir results
```

Check the section [Input samplesheet](#input-samplesheet) below for instructions on how to create the samplesheet, and the [Supported library generation protocols](#supported-bulk-library-generation-methods-protocols) section below for examples on how to run the pipeline for different bulk sequencing protocols.
Check the section [Input samplesheet](#input-samplesheet) below for instructions on how to create the samplesheet, and the [Supported library generation protocols](#supported-bulk-library-generation-methods-protocols) section below for examples of how to run the pipeline for the different bulk sequencing protocols and for the 10xGenomics single-cell protocol.
For more information about the parameters, please refer to the [parameters documentation](https://nf-co.re/airrflow/parameters).
The command above will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.

Expand Down Expand Up @@ -111,7 +123,7 @@ If you wish to share such profile (such as upload as supplementary material for

## Input samplesheet

### Fastq input samplesheet (bulk sequencing only)
### Fastq input samplesheet (bulk sequencing)

The required input file for processing raw BCR or TCR bulk targeted sequencing data is a sample sheet in TSV format (tab separated). The columns `sample_id`, `filename_R1`, `filename_R2`, `subject_id`, `species`, `tissue`, `pcr_target_locus`, `single_cell`, `sex`, `age` and `biomaterial_provider` are required. An example samplesheet is:

Expand All @@ -131,7 +143,7 @@ The required input file for processing raw BCR or TCR bulk targeted sequencing d
- `biomaterial_provider`: Institution / research group that provided the samples.
- `sex`: Subject biological sex (`female`, `male`, etc.).
- `age`: Subject biological age.
- `single_cell`: TRUE or FALSE. Fastq input samplesheet only supports a FALSE value.
- `single_cell`: TRUE or FALSE.

Other optional columns can be added. These columns will be available when building the contrasts for the repertoire comparison report. It is recommended that these columns also follow the AIRR nomenclature. Examples are:

Expand All @@ -143,6 +155,25 @@ Other optional columns can be added. These columns will be available when buildi

The metadata specified in the input file will then be automatically annotated in a column with the same header in the tables generated by the pipeline.

### Fastq input samplesheet (single cell sequencing)

The required input file for processing raw BCR or TCR single cell targeted sequencing data is a sample sheet in TSV format (tab separated). The columns `sample_id`, `filename_R1`, `filename_R2`, `subject_id`, `species`, `tissue`, `pcr_target_locus`, `single_cell`, `sex`, `age` and `biomaterial_provider` are required. You can refer to the bulk fastq input section for documentation on the individual columns.
An example samplesheet is:

| sample_id | filename_R1 | filename_R2 | subject_id | species | pcr_target_locus | tissue | sex | age | biomaterial_provider | single_cell | intervention | collection_time_point_relative | cell_subset |
| --------- | ------------------------------- | ------------------------------- | ---------- | ------- | ---------------- | ------ | ------ | --- | -------------------- | ----------- | -------------- | ------------------------------ | ------------ |
| sample01  | sample1_S1_L001_R1_001.fastq.gz | sample1_S1_L001_R2_001.fastq.gz | Subject02  | human   | IG               | blood  | NA     | 53  | sequencing_facility  | TRUE        | Drug_treatment | Baseline                       | plasmablasts |
| sample02  | sample2_S1_L001_R1_001.fastq.gz | sample2_S1_L001_R2_001.fastq.gz | Subject02  | human   | TR               | blood  | female | 78  | sequencing_facility  | TRUE        | Drug_treatment | Baseline                       | plasmablasts |

> FASTQ files must conform to the 10xGenomics cellranger naming convention:<br>**`[SAMPLE-NAME]`\_S1\_L00`[LANE-NUMBER]`\_`[READ-TYPE]`\_001.fastq.gz**
>
> Read type is one of
>
> - `I1`: Sample index read (optional)
> - `I2`: Sample index read (optional)
> - `R1`: Read 1
> - `R2`: Read 2

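
The naming convention above can be checked before launching the pipeline. The following is a minimal sketch that follows the pattern literally as written in this section (`_S1_`, a single lane digit after `L00`, and the four read types). The names `FASTQ_NAME` and `parse_10x_fastq_name` are hypothetical, and cellranger itself may accept a wider range of names than this pattern does.

```python
import re

# Pattern taken literally from the documented convention:
# [SAMPLE-NAME]_S1_L00[LANE-NUMBER]_[READ-TYPE]_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S1_L00(?P<lane>\d)_(?P<read>I1|I2|R1|R2)_001\.fastq\.gz$"
)


def parse_10x_fastq_name(filename):
    """Return the sample name, lane and read type, or None if the name does not match."""
    match = FASTQ_NAME.match(filename)
    return match.groupdict() if match else None


print(parse_10x_fastq_name("sample1_S1_L001_R1_001.fastq.gz"))
print(parse_10x_fastq_name("sample1_R1.fastq.gz"))
```

A check like this could be run over the `filename_R1` and `filename_R2` columns of the samplesheet to catch misnamed files early.
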
### Assembled input samplesheet (bulk or single-cell sequencing)

The required input file for processing assembled BCR or TCR sequencing data is a sample sheet in TSV format (tab separated). The columns `sample_id`, `filename`, `subject_id`, `species`, `tissue`, `single_cell`, `sex`, `age` and `biomaterial_provider` are required. All fields are explained in the previous section, with the only difference being that there is only one `filename` column for the assembled input samplesheet. The file to provide differs between assembled single-cell and bulk data:
Expand Down Expand Up @@ -380,7 +411,7 @@ This sequencing type requires setting `--library_generation_method race_5p_umi`

#### Takara Bio SMARTer Human BCR

The read configuration when sequenicng with the TAKARA Bio SMARTer Human BCR protocol is the following:
The read configuration when sequencing with the TAKARA Bio SMARTer Human BCR protocol is the following:

![nf-core/airrflow](images/TAKARA_RACE_BCR.png)

Expand Down Expand Up @@ -449,6 +480,37 @@ The UMI barcodes are typically read from an index file but sometimes can be prov

- No UMIs in R1 or R2 reads: if no UMIs are present in the samples, specify `--umi_length 0` to use the sans-UMI subworkflow.

## Supported single cell library generation methods (protocols)

When processing single cell sequencing data starting from raw `fastq` reads, the only `--library_generation_method` currently available is `sc_10x_genomics`, which supports 10xGenomics data.

| Library generation methods | Description | Name in pipeline | Commercial protocols |
| -------------------------- | ----------------------------------------------------------------------------------------------------------- | ---------------- | -------------------- |
| RT(RHP)+PCR | sequencing data produced from Chromium single cell 5'V(D)J libraries containing cellular barcodes and UMIs. | sc_10x_genomics | 10xGenomics |

### 10xGenomics

This sequencing type requires setting `--library_generation_method sc_10x_genomics`.
The `cellranger vdj` command automatically uses the Chromium cellular barcodes and UMIs to perform sequence assembly and paired clonotype calling, and to assemble V(D)J transcripts per cell.
An example command for processing 10xGenomics raw FASTQ data with airrflow is provided below.

```bash
nextflow run nf-core/airrflow -r dev \
-profile <docker/singularity/podman/shifter/charliecloud/conda/institute> \
--mode fastq \
--input input_samplesheet.tsv \
--library_generation_method sc_10x_genomics \
--reference_10x reference/refdata-cellranger-vdj-GRCh38-alts-ensembl-5.0.0.tar.gz \
--outdir ./results
```

#### 10xGenomics reference

Processing 10xGenomics data with `cellranger vdj` requires a reference, which can be provided using the `--reference_10x` parameter.

- The 10xGenomics reference can be downloaded from the [download page](https://www.10xgenomics.com/support/software/cell-ranger/downloads).
- To generate a V(D)J segment fasta file as reference from IMGT one can follow the [cellranger docs](https://support.10xgenomics.com/single-cell-vdj/software/pipelines/latest/advanced/references#imgt).

## Core Nextflow arguments

:::note
Expand Down
15 changes: 15 additions & 0 deletions modules.json
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,21 @@
"https://github.com/nf-core/modules.git": {
"modules": {
"nf-core": {
"cat/fastq": {
"branch": "master",
"git_sha": "02fd5bd7275abad27aad32d5c852e0a9b1b98882",
"installed_by": ["modules"]
},
"cellranger/mkvdjref": {
"branch": "master",
"git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
"installed_by": ["modules"]
},
"cellranger/vdj": {
"branch": "master",
"git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
"installed_by": ["modules"]
},
"custom/dumpsoftwareversions": {
"branch": "master",
"git_sha": "de45447d060b8c8b98575bc637a4a575fd0638e1",
Expand Down
29 changes: 29 additions & 0 deletions modules/local/unzip_cellrangerdb.nf
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
process UNZIP_CELLRANGERDB {
tag "unzip_cellrangerdb"
label 'process_single'

conda "${moduleDir}/environment.yml"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/ubuntu:20.04' :
'nf-core/ubuntu:20.04' }"

input:
path(archive)

output:
path("$unzipped") , emit: unzipped
path "versions.yml", emit: versions

script:
unzipped = archive.toString() - '.tar.gz'
"""
echo "${unzipped}"
tar -xzvf ${archive}
cat <<-END_VERSIONS > versions.yml
"${task.process}":
unzip_cellrangerdb: \$(echo \$(tar --version 2>&1) | sed 's/^.*(GNU tar) //; s/ Copyright.*\$//')
END_VERSIONS
"""
}
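Review note: the `UNZIP_CELLRANGERDB` process derives its output path by removing `.tar.gz` from the archive name and then extracts the archive, so it assumes the tarball contains a top-level directory with that same name. The Python sketch below illustrates that assumption outside Nextflow. The helper `unpack_reference` is hypothetical and is not part of the pipeline.

```python
import tarfile
from pathlib import Path


def unpack_reference(archive, dest="."):
    """Hypothetical helper: extract a .tar.gz and return the directory the process would emit."""
    archive = Path(archive)
    # Same naming rule as `archive.toString() - '.tar.gz'` in the process:
    # remove the first occurrence of ".tar.gz" from the file name.
    unzipped = archive.name.replace(".tar.gz", "", 1)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    expected = Path(dest) / unzipped
    # The process output only resolves if the archive really contains this directory.
    if not expected.is_dir():
        raise FileNotFoundError(f"Archive did not contain a top-level directory named {unzipped}")
    return expected
```

For the reference used in the test profile, this rule yields `refdata-cellranger-vdj-GRCh38-alts-ensembl-5.0.0` as the expected directory name. An archive whose top-level directory is named differently would leave the declared `path("$unzipped")` output missing.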
7 changes: 7 additions & 0 deletions modules/nf-core/cat/fastq/environment.yml
