A Nextflow wrapper for the digital pathology quality control tool HistoQC.
Developed for the Multi-Consortia Coordinating (MC2) Center administrative supplement "Assuring AI/ML-readiness of digital pathology in diverse existing and emerging multi-omic datasets through quality control workflows" (3U24CA274494-02S2).
The project will improve the AI/ML readiness of existing and emerging NIH-supported digital pathology public datasets, and research programs supported by the MC2 Center, by automatically evaluating and reporting artifacts and batch effects using open-source NIH-funded tools. These enriched datasets will enable researchers to exclude artifacts from their training and validation sets in a reproducible manner, providing greater trust in cross-investigator dataset reuse while enhancing AI/ML model performance and robustness. To quantitatively demonstrate the provided value-add of cleaned AI/ML-ready data in downstream tasks, a prototypical deep learning use case is planned.
nextflow run mc2-center/nf-histoqc \
--input <path-to-samplesheet> \
--outDir <path-to-output-directory> \
--config <HistoQC config to use> \
-profile local
To test on CMU-1-Small-Region.svs (included in repo) and output to ./outputs:
nextflow run mc2-center/nf-histoqc -profile test
nf-histoqc takes a CSV samplesheet containing the following columns:
image: [string] Path or URI to the image to be processed.
Other columns may be provided but are not used by the pipeline.
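As an illustration of the samplesheet format described above, the following is a minimal sketch that builds a CSV with a single `image` column from a directory of slides. The function name `write_samplesheet` and the list of file extensions are assumptions made for this example; they are not part of the pipeline.

```python
import csv
from pathlib import Path

# Assumption: an illustrative set of slide extensions; adjust to your data.
SLIDE_EXTENSIONS = {".svs", ".tif", ".tiff", ".ndpi"}


def write_samplesheet(image_dir, samplesheet_path):
    """Write a CSV with one `image` column and one row per slide found.

    Returns the number of image rows written.
    """
    images = sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in SLIDE_EXTENSIONS
    )
    with open(samplesheet_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["image"])  # the only column the pipeline reads
        for image in images:
            writer.writerow([str(image.resolve())])
    return len(images)
```

Calling `write_samplesheet("slides", "samplesheet.csv")` would then produce a file that can be passed to `--input`. Absolute paths are written here so the sheet does not depend on the working directory; URIs could be written instead.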
nf-histoqc outputs the following directory structure into the specified output directory (outDir):
├── <outDir>
│ ├── results.tsv
│ ├── <baseName for first row of samplesheet>
│ │ ├── *.png <masks and images generated by HistoQC>
│ │ ├── ...
│ ├── <baseName for n'th row of samplesheet>
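Given the layout above, a quick check that every image produced output might look like the sketch below. It relies only on the documented structure (one subdirectory per samplesheet row, containing PNG files); the helper name `summarize_outputs` is hypothetical.

```python
from pathlib import Path


def summarize_outputs(out_dir):
    """Map each per-image subdirectory of out_dir to its number of PNG files.

    Files directly inside out_dir (such as results.tsv) are ignored.
    """
    summary = {}
    for entry in sorted(Path(out_dir).iterdir()):
        if entry.is_dir():
            summary[entry.name] = len(list(entry.glob("*.png")))
    return summary
```

A subdirectory that maps to `0` would suggest that HistoQC produced no masks or images for that row and is worth inspecting.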
input: Path to a CSV sample sheet. This parameter is required.
outDir: Specifies the directory where the output data should be saved. Default is outputs.
config (string): Name of a built-in configuration used by HistoQC. Must be one of default, ihc, clinical, first, light, or v2.1. Defaults to default.
custom_config (path): Path to a HistoQC-compatible configuration file. Must have a .ini extension. Overrides config.
convert (bool): If provided, vips is used to create an OpenSlide-compatible TIFF file. Uses mc2-center/histoqc-openslide-converter.
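The interaction between config and custom_config can be sketched as follows. This is an illustration of the rules stated above, not the pipeline's own validation code; the function name `resolve_config` and the returned tuple shape are assumptions made for the example.

```python
from pathlib import Path

# Built-in configuration names listed in the parameter description.
BUILT_IN_CONFIGS = ("default", "ihc", "clinical", "first", "light", "v2.1")


def resolve_config(config="default", custom_config=None):
    """Return ("custom", path) or ("built-in", name).

    custom_config, when given, overrides config and must end in .ini.
    """
    if custom_config is not None:
        path = Path(custom_config)
        if path.suffix != ".ini":
            raise ValueError(f"custom_config must have a .ini extension: {path}")
        return ("custom", str(path))
    if config not in BUILT_IN_CONFIGS:
        allowed = ", ".join(BUILT_IN_CONFIGS)
        raise ValueError(f"config must be one of {allowed}; got {config!r}")
    return ("built-in", config)
```

For instance, `resolve_config("ihc", "my.ini")` selects the custom file, because custom_config takes precedence over the built-in name.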
test: Runs the test samplesheet in test_data/test_samplesheet.csv.
sage: Optimized configuration for Sage's Nextflow Tower instance.
local: Low resources, suitable for runs on laptops etc.
tower: Minimal configuration for Nextflow Tower.
A Docker container is provided for reproducibility and hosted on ghcr.io. The image is rebuilt in GitHub Actions whenever the Dockerfile or the build and deploy actions are modified.
The Dockerfile is based on that provided in the HistoQC repo, with the addition of procps and modification of some container settings to allow use in Nextflow Tower.
The container is automatically pulled by Nextflow, but if local use is required you can use:
docker pull ghcr.io/mc2-center/nf-histoqc:latest
A Nextflow pipeline is implicitly modelled by a directed acyclic graph (DAG). The vertices in the graph represent the pipeline’s processes and operators, while the edges represent the data connections (i.e. channels) between them.
flowchart TB
subgraph " "
v0["Channel.fromPath"]
v3["Channel.fromPath"]
v6["config_string"]
end
subgraph NF_HISTOQC
subgraph RUN
v5([CONVERT])
v7([HISTOQC])
v1(( ))
v4(( ))
v9(( ))
v13(( ))
end
subgraph COLLECT
v10([RESULTS])
v11([TIDY])
v14([LOGS])
end
end
subgraph " "
v8["output"]
v12[" "]
v15[" "]
end
v0 --> v1
v3 --> v4
v1 --> v5
v5 --> v7
v6 --> v7
v4 --> v7
v7 --> v8
v7 --> v9
v7 --> v13
v9 --> v10
v10 --> v11
v11 --> v12
v13 --> v14
v14 --> v15
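To make the DAG reading of the flowchart concrete, the sketch below copies its edges into Python and derives one valid execution order of the named processes with the standard-library `graphlib` module (Python 3.9+). This is purely illustrative: Nextflow schedules processes itself, and the helper name `process_order` is an assumption made for this example.

```python
from graphlib import TopologicalSorter

# Edges copied from the flowchart, as (upstream, downstream) pairs.
EDGES = [
    ("v0", "v1"), ("v3", "v4"), ("v1", "v5"), ("v5", "v7"),
    ("v6", "v7"), ("v4", "v7"), ("v7", "v8"), ("v7", "v9"),
    ("v7", "v13"), ("v9", "v10"), ("v10", "v11"), ("v11", "v12"),
    ("v13", "v14"), ("v14", "v15"),
]

# Named process nodes from the flowchart; other nodes are channels/operators.
PROCESS_LABELS = {
    "v5": "CONVERT",
    "v7": "HISTOQC",
    "v10": "RESULTS",
    "v11": "TIDY",
    "v14": "LOGS",
}


def process_order(edges=EDGES, labels=PROCESS_LABELS):
    """Return the named processes in one valid topological order."""
    sorter = TopologicalSorter()
    for upstream, downstream in edges:
        sorter.add(downstream, upstream)  # downstream depends on upstream
    # static_order() raises CycleError if the graph is not acyclic.
    return [labels[node] for node in sorter.static_order() if node in labels]
```

Any order returned places CONVERT before HISTOQC, and HISTOQC before RESULTS, TIDY, and LOGS; the relative position of the RESULTS/TIDY branch and the LOGS branch is not fixed, since the two are independent.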