The main workflow is currently run from within Rstudio. Project-specific parameters are adapted in the project_parameters
code chunk.
ID or name of the project
R data.frame
with information on the counts datasets. Each row corresponds to one dataset, and each column to one of the following information:
name
: Uunique identifier for a datasettype
:10x
for sparse matrix-like datasets as generated by the 10x pipeline Cell Ranger, orsmartseq2
for non-sparse counts tablespath
: Path to the respective counts directory or filestats
: Path to a mapping statistics file, e.g. "metrics_summary.csv" generated by Cell Ranger, but can be any table orNA
if not available
For dataset type 10x
, there should be a path to a directory with the following files:
- features.tsv.gz: Tab-separated file with feature id, feature name and type
- barcodes.tsv.gz: File with the cell names
- matrix.mtx.gz: Sparse matrix file
- metadata.tsv.gz: Tab-separated cell metadata file with the cell barcode in the first column (optional)
For dataset type smartseq2
, there should be:
- Gzipped tab-separated counts file with the first column containing the gene id and the remaining columns containing the counts for the cells
- Additional cell metadata in a tab-separated metadata.tsv.gz file that contains the cell barcode in the first column and that must be in the same directory (optional)
For dataset type smartseq2
, cell names can contain plate information in the following format <sample>_<plate_number>_<row><column>
where sample
is a string, plate_number
a two digit number, row
a capital letter and column
a number. In this case, plate information will be parsed and cells will be grouped into multiple samples based on the sample
string in their cell name.
When providing multiple datasets, cell names are expected to be unique. If not, they will be made unique by adding a numerical suffix. Furthermore, it is highly recommended to use Ensembl ids as gene ids. If not, then it should be made sure that the gene ids are unique and only contain only letters, digits, hyphens and dots.
Downsample data to n
cells for each sample (mainly for tests); set to NULL to deactivate
Path to an output directory (will be created if it does not exist)
Path to an Excel file with marker genes based on literature. There should be one list per column with the first row as header and the gene ids listed below. The expression of known marker genes will be visualised in the report.
When using Ensembl ids as gene ids, the name of the Ensembl dataset for further information on the genes . For human, this will be hsapiens_gene_ensembl
, for mouse mmusculus_gene_ensembl
and for zebrafish drerio_gene_ensembl
. For other species, there will be an analogous name.
When using Ensembl ids as gene ids, the Ensembl version to use. This should match the version that was used for obtaining the raw counts.
When using Ensembl ids as gene ids, the argument is an R named list that specifies which Ensembl attributes will be used as gene id, as gene symbol and as entrez accession, respectively.
Gene annotation attributes to include when fetching gene annotation from Ensembl
biomaRt
mirror to query Ensembl (www
, useast
, uswest
or asia
; if NULL
, the mirror will be automatically selected)
Instead of fetching the gene annotation from Ensembl servers, read it from an external file. Note that by default the gene annotation will be fetched only once and is then saved in a separate file. Use this argument only if there is no access to Ensembl at all.
Prefix of mitochondrial genes, usually ^MT-
for human, ^Mt-
for mouse and ^mt-
for zebrafish. The caret indicates the prefix.
R named list with filters for cells. Filters can be specified based on numerical and categorical column names in the cell metadata. Numerical filters must have two values indicating the minimum and maximum. If there is no minimum or maximum, the respective value can be set to NA
. Categorial filters consist of a character vector of allowed values. The following cell metadata columns are computed by default:
nCount_RNA
: Total number of counts calculated based on the gene expression datanFeature_RNA
: Total number of detected genes calculated based on the gene expression datapercent_mt
: Percent mitochondrial content calculated based on the gene expression datanCount_ERCC
: Total number of counts calculated based on the ERCC spike-in data (if available)nFeature_ERCC
: Total number of ERCCs calculated based on the ERCC spike-in data (if available)percent_ercc
: Percent ERCC content calculated based on the ERCC and gene expression data (if available)orig.ident
: Name of the sample
By default, top-level filters will be applied to all samples. To apply sample-specific filters, the R named list can contain sublists with the sample names and the respective sample-specific filters, which will overwrite the top-level filters. The names of the samples correspond to the dataset names or alternatively, if a dataset contains multiple samples (SmartSeq-2) to the dataset and sample name separated by a dot.
Here are some examples. The filters will keep:
list(nFeature_RNA=c(200, NA), percent_mt=c(NA, 20))
: cells with at least 200 genes expressed and at most a mitochondrial content of 20%list(nFeature_RNA=c(200, NA), orig.ident=c("sampleA", "sampleB""))
: cells of sample A and B with at least 200 genes expressedlist(sampleA=list(nFeature_RNA=c(200, NA)), sampleB=list(nFeature_RNA=c(2000, NA)))
: cells of sample A with at least 200 genes and cells of sample B with at least 2000 geneslist(nFeature_RNA=c(200, NA), sampleB=list(nFeature_RNA=c(2000, NA)))
: cells of sample B with at least 2000 genes and cells of all other samples with at least 200 genes
Filtering can also be done based on user-provided cell metadata. This can be useful for example for subclustering. In a first pass, the big dataset is clustered into general clusters. In a second pass, the cluster information is provided as additional metadata to analyse only cells of a specific cluster.
R named list to filters for genes in the gene expression assay. The filter min_counts
specifies the minimum number of counts a gene must have to be considered expressed. The filter min_cells
specifies the minimum number of cells that must express a gene. Similarly, to the cell_filter
, there can be sample-specific filtering, i.e. the R named list can contain sublists with the sample names and the respective sample-specific filters.
Here are some examples. The filters will keep:
list(min_counts=1, min_cells=3)
: genes with at least one count expressed in at least three cellslist(sampleA=list(min_counts=1, min_cells=3), sampleB=list(min_counts=1, min_cells=10))
: genes with at least one count and expressed in at least three cells for sampleA and at least 10 cells for sampleBlist(sampleB=list(min_cells=10), min_counts=1, min_cells=3)
: genes with at least one count and expressed in at least 10 cells for sampleB and at least three cells for all other samples
Names of samples to drop after initial QC. This can be useful to remove technical controls (e.g. NC/negative control). The names of the samples correspond to the dataset names or, if a dataset contains multiple samples (SmartSeq-2) to the dataset and sample name separated by a dot.
Minimum number of cells a sample must have. This can be useful to remove samples which are part of a dataset but to few cells for analysis.
Normalisation method to use for the analysis. Can be:
RNA
: Counts are scaled to 10,000 and then natural log-transformed (default method by Seurat). Works well for all kinds of single cell data.SCT
: SCTransform (Hafemeister 2019). Uses regularized negative binomial regression to account for varying gene expression levels. Expects molecule counts.
The SCT
method is recommended for analyses where all datasets are derived from UMI-based technologies that generate molecule counts. If at least one dataset is derived from read-based methods such as SmartSeq-2, it is better to use the RNA
method. Also, at the moment, it is not possible to use different, sample-specific normalisation methods.
Whether to remove the effect of cell cycle genes prior to dimensionality reduction, visualisation and clustering. Gene expression in cycling cells can be dominated by a set of cell cycle-related genes which will mask the true relationship between cells. By default, a cell cycle score will be calculated for each cell based on a defined list of known cell cycle marker genes and the cell will be classified into G1, G2M or S phase. If TRUE
, this information will then be used to regress out (that is remove) potential cell cycle-related contributions for each gene prior to further analysis. Note that this will likely work only for human and mouse.
If TRUE
, cell cycle-related contributions for a gene will be assessed on the assigned cell cycle phases of the cells and all differences that likely originate from being in different cell cycle phases will be removed. However, when proliferation is an import aspect of the dataset, for example when analysing a developmental trajectory, this removal may mask important structures in the dataset. In this case, a more sensitive approach is to remove only the difference between proliferation cells in G2M and S phase (set to FALSE
). Please read https://satijalab.org/seurat/v3.1/cell_cycle_vignette.html for more explanations.
When there are multiple datasets, should cell cycle-effects be scored on the individual datasets (FALSE
) or on the combined dataset (TRUE
). In general, cell cycle-scores are more reliable when calculated with more cells. However, there have been cases where calculation based on individual datasets worked better.
Additional variables that need to be regressed out prior to dimensionality reduction, visualisation and clustering. This can be useful when different individuals introduce an unwanted batch effect. Continuous variables such as the sequencing depth can be used as well. Variables must be names of the cell metadata.
When there are multiple datasets, how to integrate and combine them. R named list with:
method
:single
: Default, if there is only one dataset after filtering and no integration is neededmerge
: Datasets are merged (concatenated), and then processed as a whole. Since this is the least "invasive" method, it is recommended to start with this method and then to check whether there are unwanted sample-specific effects.standard
: This is the standard method for integrating samples as implemented by Seurat. Anchors are computed for all pairs of datasets based on a set of variable (hence informative) genes. These anchors are then used to sensibly group cells of different samples and to harmonize the gene expression between the samples based on the correct grouping.reference
: One dataset is used as reference and anchors are computed for all other datasets. This method is less accurate but computationally faster (not yet implemented).reciprocal
: Anchors are computed in PCA space instead based on the actual data. This method is even less accurate but computationally faster especially for very big datasets (not yet implemented).
reference_dataset
: When using the methodreference
, which dataset is the reference? Can be numeric or name of the dataset.dimensions
: Number of dimensions to consider for integration. More dimensions will cause integration to be more pronounced on details.
Note that the integration methods standard
, reference
and reciprocal
are only done on a set of variable (hence informative) genes, which will later be used for dimensionality reduction, visualisation and clustering. Finding marker genes or identifying genes with differential expression will still be done on the full non-integrated dataset but will include the sample information as additonal factor to consider. The integrated data will be stored in a separate assay.
The number of principle components (PCs) used for dimensionaly reduction. After normalisation, scaling and integration (if requested), a principle component analysis (PCA) is done. To denoise the dataset and to reduce the complexity of the analysis, only the first n
components are kept for further analysis such as clustering and visualisation. This parameter can be adjusted based on the Elbow plot in the report, which shows the percentage of variation explained by the PCs. A good number of PCs can be estimated based on where the plot reaches a plateau phase such that further PCs contribute only marginally to the variation.
The resolution of the clustering algorithm. This controls the granularity of the clustering, i.e. lower values will lead to fewer clusters and higher values will lead to more clusters. Usually between 0.2 and 2 (but can be higher).
Adjusted p-value for defining a marker gene
Minimum absolute log2 fold change for defining a marker gene
Minimum fraction of cells expressing a marker gene
Confounding variables to adjust for in finding markers and differentially expressed genes
Contrasts for testing differential expression. Can be a table in form of an R data.frame
or an Excel file. The table needs to have the following columns:
condition_column
: The categorial column in the cell metadata to test. Special columns areorig.ident
for sample,seurat_clusters
for cluster andPhase
cell cycle phase.condition_group1
: The condition level(s) in group 1 (nominator). The group can contain multiple levels concatenated by the+
character. An empty string means that all levels not in group 2 will be used.condition_group2
: The condition level(s) in group 2 (denominator). The group can contain multiple levels concatenated by the+
character. An empty string means that all levels not in group 1 will be used.
These columns are optional:
subset_column
: Prior to testing, subset cells based on this cell metadata column. Special columns areorig.ident
for sample,seurat_clusters
for cluster andPhase
cell cycle phase. Must be used together withsubset_group
(default: none).subset_group
: Levels to subset before testing. Multiple levels to be analysed can be given separate by semicolons. An empty string means that all levels will be analysed. Levels can be concatenated with the+
character. Must be used together withsubset_column
(default: none).assay
: Assay or dimensionality reduction to test on (default:RNA
).slot
: When an assay is selected, which slot to use:- raw
counts
: Default whentest
is one ofnegbinom
,poisson
orDESeq2
- normalised
data
: Default for all other tests - normalised and scaled
scale.data
: Should not be used
- raw
padj
: Maximum adjusted p-value (default: 0.05)log2FC
: Minimum absolute log2 fold change (default: 0)min_pct
: Minimum percentage of cells that need to express a gene in each group (default: 0.1)test
: Test to use. Can be:wilcox
: Wilcoxon Rank Sum test (default)bimod
: Likelihood-ratio test for single cell gene expression, (McDavid et al., Bioinformatics, 2013)roc
: Identifies markers of gene expression using ROC analysis.t
: t-testnegbinom
: Tests based on a negative binomial generalized linear modelpoisson
: Tests based on a poisson generalized linear model (only UMI datasets)LR
: Tests based on a logistic regression modelMAST
: Uses a hurdle model tailored to scRNA-seq data implemented in the the MAST packageDESeq2
: Tests using the DESeq2 package- For more information on the tests, please see the documentation on the
FindMarkers
function of the Seurat R package
downsample_cells_n
: Downsample each group to at most n cells to speed up testslatent_vars
: Additional variables to account for when testing (e.g. batches). Can be one or more cell metadata columns, can contain multiple column names concatenated by semicolons.
Here are some examples for a better understanding:
- Compare cluster 1 versus cluster 2:
condition_column
:seurat_clusters
condition_group1
:1
condition_group2
:2
- Compare cluster 1 versus cluster 2 and cluster 3 combined :
condition_column
:seurat_clusters
condition_group1
:1
condition_group2
:2+3
- Compare cluster 1 versus rest:
condition_column
:seurat_clusters
condition_group1
:1
condition_group2
:
- Compare all G1 cells versus all S cells:
condition_column
:Phase
condition_group1
:G1
condition_group2
:S
- Compare G1 cells versus S cells for cluster 1:
condition_column
:Phase
condition_group1
:G1
condition_group2
:S
subset_column
:seurat_clusters
subset_group
:1
- Compare G1 cells versus S cells, separately for cluster 1, 2, and 3:
condition_column
:Phase
condition_group1
:G1
condition_group2
:S
subset_column
:seurat_clusters
subset_group
:1;2;3
- this will result in 3 tests
- Compare G1 cells versus S cells, combined for cluster 1+2+3:
condition_column
:Phase
condition_group1
:G1
condition_group2
:S
subset_column
:seurat_clusters
subset_group
:1+2+3
- this will result in 1 test
- Compare G1 cells versus S cells, combined for cluster 1+2+3, and separately for cluster 4:
condition_column
:Phase
condition_group1
:G1
condition_group2
:S
subset_column
:seurat_clusters
subset_group
:1+2+3;4
- this will result in 2 tests
P-value threshold for functional enrichment tests by Enrichr
Enrichr databases for functional enrichment tests. Please see the corresponding table in the HTML output.
Main colour used for continuous data
Colour palette used for samples. Ideally, this should be one of ggsci palettes (https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html), but it can also be another palette, as long as it is a function that provides a colour for a sample number. Make sure that the palette provides enough colours.
Colour palette used for clusters. Ideally, this should be one of ggsci palettes (https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html), but it can also be another palette, as long as it is a function that provides a colour for a cluster number. Make sure that the palette provides enough colours.
Path to the local git repository
Debugging mode: "default_debugging" for default, "terminal_debugger" for debugging without X11, "print_traceback" for non-interactive sessions