Workflow: Single-cell RNA-seq analysis

Running the script

The main workflow is currently run from within Rstudio. Project-specific parameters are adapted in the project_parameters code chunk.

Arguments

Input

`project_id`

ID or name of the project

`path_data`

R data.frame with information on the counts datasets. Each row corresponds to one dataset, and each column to one of the following information:

name: Uunique identifier for a dataset
type: 10x for sparse matrix-like datasets as generated by the 10x pipeline Cell Ranger, or smartseq2 for non-sparse counts tables
path: Path to the respective counts directory or file
stats: Path to a mapping statistics file, e.g. "metrics_summary.csv" generated by Cell Ranger, but can be any table or NA if not available

For dataset type 10x, there should be a path to a directory with the following files:

features.tsv.gz: Tab-separated file with feature id, feature name and type
barcodes.tsv.gz: File with the cell names
matrix.mtx.gz: Sparse matrix file
metadata.tsv.gz: Tab-separated cell metadata file with the cell barcode in the first column (optional)

For dataset type smartseq2, there should be:

Gzipped tab-separated counts file with the first column containing the gene id and the remaining columns containing the counts for the cells
Additional cell metadata in a tab-separated metadata.tsv.gz file that contains the cell barcode in the first column and that must be in the same directory (optional)

For dataset type smartseq2, cell names can contain plate information in the following format <sample>_<plate_number>_<row><column> where sample is a string, plate_number a two digit number, row a capital letter and column a number. In this case, plate information will be parsed and cells will be grouped into multiple samples based on the sample string in their cell name.

When providing multiple datasets, cell names are expected to be unique. If not, they will be made unique by adding a numerical suffix. Furthermore, it is highly recommended to use Ensembl ids as gene ids. If not, then it should be made sure that the gene ids are unique and only contain only letters, digits, hyphens and dots.

`downsample_cells_n`

Downsample data to n cells for each sample (mainly for tests); set to NULL to deactivate

`path_out`

Path to an output directory (will be created if it does not exist)

`file_known_markers`

Path to an Excel file with marker genes based on literature. There should be one list per column with the first row as header and the gene ids listed below. The expression of known marker genes will be visualised in the report.

`mart_dataset`

When using Ensembl ids as gene ids, the name of the Ensembl dataset for further information on the genes . For human, this will be hsapiens_gene_ensembl, for mouse mmusculus_gene_ensembl and for zebrafish drerio_gene_ensembl. For other species, there will be an analogous name.

`annot_version`

When using Ensembl ids as gene ids, the Ensembl version to use. This should match the version that was used for obtaining the raw counts.

`annot_main`

When using Ensembl ids as gene ids, the argument is an R named list that specifies which Ensembl attributes will be used as gene id, as gene symbol and as entrez accession, respectively.

`mart_attributes`

Gene annotation attributes to include when fetching gene annotation from Ensembl

`biomart_mirror`

biomaRt mirror to query Ensembl (www, useast, uswest or asia; if NULL, the mirror will be automatically selected)

`file_annot`

Instead of fetching the gene annotation from Ensembl servers, read it from an external file. Note that by default the gene annotation will be fetched only once and is then saved in a separate file. Use this argument only if there is no access to Ensembl at all.

`mt`

Prefix of mitochondrial genes, usually ^MT- for human, ^Mt- for mouse and ^mt- for zebrafish. The caret indicates the prefix.

Filtering

`cell_filter`

R named list with filters for cells. Filters can be specified based on numerical and categorical column names in the cell metadata. Numerical filters must have two values indicating the minimum and maximum. If there is no minimum or maximum, the respective value can be set to NA. Categorial filters consist of a character vector of allowed values. The following cell metadata columns are computed by default:

nCount_RNA: Total number of counts calculated based on the gene expression data
nFeature_RNA: Total number of detected genes calculated based on the gene expression data
percent_mt: Percent mitochondrial content calculated based on the gene expression data
nCount_ERCC: Total number of counts calculated based on the ERCC spike-in data (if available)
nFeature_ERCC: Total number of ERCCs calculated based on the ERCC spike-in data (if available)
percent_ercc: Percent ERCC content calculated based on the ERCC and gene expression data (if available)
orig.ident: Name of the sample

By default, top-level filters will be applied to all samples. To apply sample-specific filters, the R named list can contain sublists with the sample names and the respective sample-specific filters, which will overwrite the top-level filters. The names of the samples correspond to the dataset names or alternatively, if a dataset contains multiple samples (SmartSeq-2) to the dataset and sample name separated by a dot.

Here are some examples. The filters will keep:

list(nFeature_RNA=c(200, NA), percent_mt=c(NA, 20)): cells with at least 200 genes expressed and at most a mitochondrial content of 20%
list(nFeature_RNA=c(200, NA), orig.ident=c("sampleA", "sampleB"")): cells of sample A and B with at least 200 genes expressed
list(sampleA=list(nFeature_RNA=c(200, NA)), sampleB=list(nFeature_RNA=c(2000, NA))): cells of sample A with at least 200 genes and cells of sample B with at least 2000 genes
list(nFeature_RNA=c(200, NA), sampleB=list(nFeature_RNA=c(2000, NA))): cells of sample B with at least 2000 genes and cells of all other samples with at least 200 genes

Filtering can also be done based on user-provided cell metadata. This can be useful for example for subclustering. In a first pass, the big dataset is clustered into general clusters. In a second pass, the cluster information is provided as additional metadata to analyse only cells of a specific cluster.

`feature_filter`

R named list to filters for genes in the gene expression assay. The filter min_counts specifies the minimum number of counts a gene must have to be considered expressed. The filter min_cells specifies the minimum number of cells that must express a gene. Similarly, to the cell_filter, there can be sample-specific filtering, i.e. the R named list can contain sublists with the sample names and the respective sample-specific filters.

Here are some examples. The filters will keep:

list(min_counts=1, min_cells=3): genes with at least one count expressed in at least three cells
list(sampleA=list(min_counts=1, min_cells=3), sampleB=list(min_counts=1, min_cells=10)): genes with at least one count and expressed in at least three cells for sampleA and at least 10 cells for sampleB
list(sampleB=list(min_cells=10), min_counts=1, min_cells=3): genes with at least one count and expressed in at least 10 cells for sampleB and at least three cells for all other samples

`samples_to_drop`

Names of samples to drop after initial QC. This can be useful to remove technical controls (e.g. NC/negative control). The names of the samples correspond to the dataset names or, if a dataset contains multiple samples (SmartSeq-2) to the dataset and sample name separated by a dot.

`samples_min_cells`

Minimum number of cells a sample must have. This can be useful to remove samples which are part of a dataset but to few cells for analysis.

Normalisation and integration, clustering and dimensionality reduction

`norm`

Normalisation method to use for the analysis. Can be:

RNA: Counts are scaled to 10,000 and then natural log-transformed (default method by Seurat). Works well for all kinds of single cell data.
SCT: SCTransform (Hafemeister 2019). Uses regularized negative binomial regression to account for varying gene expression levels. Expects molecule counts.

The SCT method is recommended for analyses where all datasets are derived from UMI-based technologies that generate molecule counts. If at least one dataset is derived from read-based methods such as SmartSeq-2, it is better to use the RNA method. Also, at the moment, it is not possible to use different, sample-specific normalisation methods.

`cc_remove`

Whether to remove the effect of cell cycle genes prior to dimensionality reduction, visualisation and clustering. Gene expression in cycling cells can be dominated by a set of cell cycle-related genes which will mask the true relationship between cells. By default, a cell cycle score will be calculated for each cell based on a defined list of known cell cycle marker genes and the cell will be classified into G1, G2M or S phase. If TRUE, this information will then be used to regress out (that is remove) potential cell cycle-related contributions for each gene prior to further analysis. Note that this will likely work only for human and mouse.

`cc_remove_all`

If TRUE, cell cycle-related contributions for a gene will be assessed on the assigned cell cycle phases of the cells and all differences that likely originate from being in different cell cycle phases will be removed. However, when proliferation is an import aspect of the dataset, for example when analysing a developmental trajectory, this removal may mask important structures in the dataset. In this case, a more sensitive approach is to remove only the difference between proliferation cells in G2M and S phase (set to FALSE). Please read https://satijalab.org/seurat/v3.1/cell_cycle_vignette.html for more explanations.

`cc_rescore_after_merge`

When there are multiple datasets, should cell cycle-effects be scored on the individual datasets (FALSE) or on the combined dataset (TRUE). In general, cell cycle-scores are more reliable when calculated with more cells. However, there have been cases where calculation based on individual datasets worked better.

`vars_to_regress`

Additional variables that need to be regressed out prior to dimensionality reduction, visualisation and clustering. This can be useful when different individuals introduce an unwanted batch effect. Continuous variables such as the sequencing depth can be used as well. Variables must be names of the cell metadata.

`integrate_samples`

When there are multiple datasets, how to integrate and combine them. R named list with:

method:
- single: Default, if there is only one dataset after filtering and no integration is needed
- merge: Datasets are merged (concatenated), and then processed as a whole. Since this is the least "invasive" method, it is recommended to start with this method and then to check whether there are unwanted sample-specific effects.
- standard: This is the standard method for integrating samples as implemented by Seurat. Anchors are computed for all pairs of datasets based on a set of variable (hence informative) genes. These anchors are then used to sensibly group cells of different samples and to harmonize the gene expression between the samples based on the correct grouping.
- reference: One dataset is used as reference and anchors are computed for all other datasets. This method is less accurate but computationally faster (not yet implemented).
- reciprocal: Anchors are computed in PCA space instead based on the actual data. This method is even less accurate but computationally faster especially for very big datasets (not yet implemented).
reference_dataset: When using the method reference, which dataset is the reference? Can be numeric or name of the dataset.
dimensions: Number of dimensions to consider for integration. More dimensions will cause integration to be more pronounced on details.

Note that the integration methods standard, reference and reciprocal are only done on a set of variable (hence informative) genes, which will later be used for dimensionality reduction, visualisation and clustering. Finding marker genes or identifying genes with differential expression will still be done on the full non-integrated dataset but will include the sample information as additonal factor to consider. The integrated data will be stored in a separate assay.

`pc_n`

The number of principle components (PCs) used for dimensionaly reduction. After normalisation, scaling and integration (if requested), a principle component analysis (PCA) is done. To denoise the dataset and to reduce the complexity of the analysis, only the first n components are kept for further analysis such as clustering and visualisation. This parameter can be adjusted based on the Elbow plot in the report, which shows the percentage of variation explained by the PCs. A good number of PCs can be estimated based on where the plot reaches a plateau phase such that further PCs contribute only marginally to the variation.

`cluster_resolution`

The resolution of the clustering algorithm. This controls the granularity of the clustering, i.e. lower values will lead to fewer clusters and higher values will lead to more clusters. Usually between 0.2 and 2 (but can be higher).

Marker genes and genes with differential expression

`marker_padj`

Adjusted p-value for defining a marker gene

`marker_log2FC`

Minimum absolute log2 fold change for defining a marker gene

`marker_pct`

Minimum fraction of cells expressing a marker gene

`latent_vars`

Confounding variables to adjust for in finding markers and differentially expressed genes

`deg_contrasts`

Contrasts for testing differential expression. Can be a table in form of an R data.frame or an Excel file. The table needs to have the following columns:

condition_column: The categorial column in the cell metadata to test. Special columns are orig.ident for sample, seurat_clusters for cluster and Phase cell cycle phase.
condition_group1: The condition level(s) in group 1 (nominator). The group can contain multiple levels concatenated by the + character. An empty string means that all levels not in group 2 will be used.
condition_group2: The condition level(s) in group 2 (denominator). The group can contain multiple levels concatenated by the + character. An empty string means that all levels not in group 1 will be used.

These columns are optional:

subset_column: Prior to testing, subset cells based on this cell metadata column. Special columns are orig.ident for sample, seurat_clusters for cluster and Phase cell cycle phase. Must be used together with subset_group (default: none).
subset_group: Levels to subset before testing. Multiple levels to be analysed can be given separate by semicolons. An empty string means that all levels will be analysed. Levels can be concatenated with the + character. Must be used together with subset_column (default: none).
assay: Assay or dimensionality reduction to test on (default: RNA).
slot: When an assay is selected, which slot to use:
- raw counts: Default when test is one of negbinom, poisson or DESeq2
- normalised data: Default for all other tests
- normalised and scaled scale.data: Should not be used
padj: Maximum adjusted p-value (default: 0.05)
log2FC: Minimum absolute log2 fold change (default: 0)
min_pct: Minimum percentage of cells that need to express a gene in each group (default: 0.1)
test: Test to use. Can be:
- wilcox: Wilcoxon Rank Sum test (default)
- bimod: Likelihood-ratio test for single cell gene expression, (McDavid et al., Bioinformatics, 2013)
- roc: Identifies markers of gene expression using ROC analysis.
- t: t-test
- negbinom: Tests based on a negative binomial generalized linear model
- poisson: Tests based on a poisson generalized linear model (only UMI datasets)
- LR: Tests based on a logistic regression model
- MAST: Uses a hurdle model tailored to scRNA-seq data implemented in the the MAST package
- DESeq2: Tests using the DESeq2 package
- For more information on the tests, please see the documentation on the FindMarkers function of the Seurat R package
downsample_cells_n: Downsample each group to at most n cells to speed up tests
latent_vars: Additional variables to account for when testing (e.g. batches). Can be one or more cell metadata columns, can contain multiple column names concatenated by semicolons.

Here are some examples for a better understanding:

Compare cluster 1 versus cluster 2:
- condition_column: seurat_clusters
- condition_group1: 1
- condition_group2: 2
Compare cluster 1 versus cluster 2 and cluster 3 combined :
- condition_column: seurat_clusters
- condition_group1: 1
- condition_group2: 2+3
Compare cluster 1 versus rest:
- condition_column: seurat_clusters
- condition_group1: 1
- condition_group2:
Compare all G1 cells versus all S cells:
- condition_column: Phase
- condition_group1: G1
- condition_group2: S
Compare G1 cells versus S cells for cluster 1:
- condition_column: Phase
- condition_group1: G1
- condition_group2: S
- subset_column: seurat_clusters
- subset_group: 1
Compare G1 cells versus S cells, separately for cluster 1, 2, and 3:
- condition_column: Phase
- condition_group1: G1
- condition_group2: S
- subset_column: seurat_clusters
- subset_group: 1;2;3
- this will result in 3 tests
Compare G1 cells versus S cells, combined for cluster 1+2+3:
- condition_column: Phase
- condition_group1: G1
- condition_group2: S
- subset_column: seurat_clusters
- subset_group: 1+2+3
- this will result in 1 test
Compare G1 cells versus S cells, combined for cluster 1+2+3, and separately for cluster 4:
- condition_column: Phase
- condition_group1: G1
- condition_group2: S
- subset_column: seurat_clusters
- subset_group: 1+2+3;4
- this will result in 2 tests

`enrichr_padj`

P-value threshold for functional enrichment tests by Enrichr

`enrichr_dbs`

Enrichr databases for functional enrichment tests. Please see the corresponding table in the HTML output.

General

`col`

Main colour used for continuous data

`col_palette_samples`

Colour palette used for samples. Ideally, this should be one of ggsci palettes (https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html), but it can also be another palette, as long as it is a function that provides a colour for a sample number. Make sure that the palette provides enough colours.

`col_palette_clusters`

Colour palette used for clusters. Ideally, this should be one of ggsci palettes (https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html), but it can also be another palette, as long as it is a function that provides a colour for a cluster number. Make sure that the palette provides enough colours.

`path_to_git`

Path to the local git repository

`default_debugging`

Debugging mode: "default_debugging" for default, "terminal_debugger" for debugging without X11, "print_traceback" for non-interactive sessions

Files

usage_workflow.md

Latest commit

History

usage_workflow.md

File metadata and controls

Workflow: Single-cell RNA-seq analysis

Running the script

Arguments

Input

project_id

path_data

downsample_cells_n

path_out

file_known_markers

mart_dataset

annot_version

annot_main

mart_attributes

biomart_mirror

file_annot

mt

Filtering

cell_filter

feature_filter

samples_to_drop

samples_min_cells

Normalisation and integration, clustering and dimensionality reduction

norm

cc_remove

cc_remove_all

cc_rescore_after_merge

vars_to_regress

integrate_samples

pc_n

cluster_resolution