Immune Heterogeneity (iHet) in NSCLC

This repository holds the code for all analyses related to


All analyses are integrated into a Nextflow pipeline, and all dependencies are packaged as Singularity containers. The pipeline consists of the following subworkflows:

  • MOFA. Run unsupervised multi-omics factor analysis (MOFA) of bulk RNA-seq data from the Jia, Sharma, and TRACERx datasets and from pan-cancer TCGA datasets.
  • single-cell. Create a custom subset of the Lung Cancer Atlas, re-annotate myeloid subtypes and perform transcription factor and pathway analysis.

Launching the workflow


Obtaining data

Before launching the workflow, you need to obtain the input data and Singularity containers from Zenodo. First, clone this repository:

git clone
cd iHet

Then, within the repository, download the data archives and extract them to the corresponding directories:

# singularity containers
curl -L "" | tar xvJ

# input data
curl -L "" | tar xvJ

Additionally, you can obtain the pre-computed results without running the workflow using

curl -L "" | tar xvJ

By default, the workflow is configured to use the pre-computed single-cell annotations, since graph-based clustering is not guaranteed to be 100% reproducible on other systems.

Note that some results depend on the TRACERx data (EGAS00001003458, EGAD00001003206), which are not publicly available. By default, the workflow is configured to run without these data.

Briefly, the input data contains

  • Gene expression data for each dataset
  • Tumor mutational burden data for each dataset
  • Single-cell data (LuCA version 2022.05.10)

Run nextflow

# newer versions of nextflow are incompatible with the workflow. By setting this variable
# the correct version will be used automatically.
export NXF_VER=22.04.5

nextflow run --outdir data/results

Structure of this repository

  • analyses: Place for e.g. jupyter/rmarkdown notebooks, grouped by their respective (sub-)workflows.
  • containers: place for singularity image files. Not part of the git repo and gets created by the download command.
  • data: place for input data and results in different subfolders. Gets populated by the download commands and by running the workflows.
  • lib: custom libraries and helper functions
  • modules: nextflow DSL2.0 modules
  • subworkflows: nextflow subworkflows
  • tables: contains static content that should be under version control (e.g. manually created tables)

Output documentation

The analysis pipeline generates the following directory structure:


In this section, we describe the directories and their contents in more detail.


Results of the easier package: cell-type fraction, pathway, and transcription factor estimates. Contains one .rds file per dataset group, each holding a list of lists of data frames, with one data frame per dataset and modality. Contained modalities:

"count"    "tpm"      "response" "pathway"  "tf"       "cellfrac" "immresp"


Result of an Rmarkdown notebook that converts the input data into a format compatible with MOFA. Splits the data into the individual datasets and creates bootstrap datasets for each of them. A rendered version of the notebook is in the main directory; all generated files are in the artifacts directory:

  • data_all_tidy.rds: All modalities and datasets from 11_easier merged into a single, "tidy" data frame. Some features are renamed, some are removed and some are merged.
  • mofa_*.rds: The tidy data from above split up by dataset, with some additional scaling.
  • mofa_boot_*.rds: Same structure as mofa_*.rds, but with datasets resampled by bootstrapping. Each file represents a different bootstrap.
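The resampling scheme used by the notebook is not spelled out here. As a minimal sketch, assuming each bootstrap dataset is built by drawing samples with replacement until the original sample count is reached, the idea looks like this. The function name `bootstrap_samples` and the sample identifiers are hypothetical and not taken from the pipeline.

```python
import random


def bootstrap_samples(sample_ids, n_boot, seed=0):
    """Return n_boot resampled lists of sample IDs (hypothetical helper).

    Each list has the same length as sample_ids and is drawn with
    replacement, so some samples repeat and others are left out.
    """
    rng = random.Random(seed)  # fixed seed keeps the resampling reproducible
    n = len(sample_ids)
    return [[rng.choice(sample_ids) for _ in range(n)] for _ in range(n_boot)]


if __name__ == "__main__":
    # Illustrative sample names only
    for i, boot in enumerate(bootstrap_samples(["s1", "s2", "s3", "s4"], n_boot=3)):
        print(f"bootstrap {i}: {boot}")
```

Each resampled ID list would then be used to select the corresponding columns of every modality, so that all modalities in one bootstrap dataset refer to the same samples.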


Results of applying MOFA to the datasets generated in the previous step. Contains HDF5 files for each dataset, holding the MOFA models.


Results of an Rmarkdown notebook that analyzes the MOFA results. A rendered version of the notebook is in the main directory; all generated files are in the artifacts directory:

  • median_factors.rds: Median factors across all bootstraps for each dataset. Contains a list of data frames (one for each dataset).
  • median_weights.rds: Median weights (for each feature) across all bootstraps for each dataset. Contains a list of data frames (one for each dataset).
  • *_factor_correlations{,_pvalues}.tsv: Pearson correlations between F1 and F1-F3 between the different datasets and the associated p-values.
  • plots/: Various plots
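To illustrate the two summary steps named above (a median across bootstraps and a Pearson correlation), here is a minimal sketch in plain Python. It assumes the bootstrap models have already been aligned so that a given factor means the same thing in every bootstrap. The helper names and the dict-based layout are hypothetical and do not reflect the notebook's actual data structures.

```python
from math import sqrt
from statistics import median


def median_across_bootstraps(boot_weights):
    """Per-feature median over bootstraps.

    boot_weights: list of dicts (one per bootstrap) mapping a feature name
    to its weight on a single factor. This layout is an assumption.
    """
    features = boot_weights[0].keys()
    return {f: median(b[f] for b in boot_weights) for f in features}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Made-up weights for two features across three bootstraps
    boots = [{"f1": 0.9, "f2": -0.1}, {"f1": 1.1, "f2": 0.0}, {"f1": 1.0, "f2": 0.2}]
    print(median_across_bootstraps(boots))
    print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.5]))
```

The median is used here because it is less sensitive than the mean to an occasional bootstrap run that converges to a different solution.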


Results of an Rmarkdown notebook that analyzes the iHet predictions. A rendered version of the notebook is in the main directory; all generated files are in the artifacts directory:

  • *_bootstrap_iHet_scores.rds: Bootstrapped iHet scores for the four ICB cohorts: 1. Non-small cell lung cancer (NSCLC, Jung cohort), 2. Melanoma (SKCM, combined Gide and Auslander cohorts), 3. Bladder urothelial carcinoma (BLCA, Mariathasan cohort) and 4. Stomach adenocarcinoma (STAD, Kim cohort).


Subset the full LuCA atlas to only contain primary tumor samples from either LUAD or LUSC. Creates two h5ad files:

  • adata_m.h5ad: Subset of myeloid cell-types
  • adata_nsclc.h5ad: The full custom subset of the atlas.
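The subsetting rule described above can be sketched as a simple filter over per-cell metadata. This is only an illustration in plain Python. The workflow itself operates on h5ad files, and the metadata keys `origin` and `condition` as well as the value `tumor_primary` are assumptions about how the atlas labels its cells, not names confirmed by this README.

```python
def subset_primary_nsclc(cells):
    """Keep cells from primary tumor samples annotated as LUAD or LUSC.

    cells: list of dicts with per-cell metadata. The keys "origin" and
    "condition" and the label "tumor_primary" are hypothetical.
    """
    keep_conditions = {"LUAD", "LUSC"}
    return [
        c for c in cells
        if c["origin"] == "tumor_primary" and c["condition"] in keep_conditions
    ]


if __name__ == "__main__":
    # Made-up metadata records for illustration
    cells = [
        {"cell_id": "c1", "origin": "tumor_primary", "condition": "LUAD"},
        {"cell_id": "c2", "origin": "normal", "condition": "LUAD"},
        {"cell_id": "c3", "origin": "tumor_primary", "condition": "LUSC"},
        {"cell_id": "c4", "origin": "tumor_primary", "condition": "COPD"},
    ]
    print([c["cell_id"] for c in subset_primary_nsclc(cells)])
```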


Re-annotate myeloid cell types at a finer resolution than LuCA. Generates updated h5ad files:

  • adata_nsclc_reannotated.h5ad
  • adata_myeloid_reannotated.h5ad


Run DoRothEA and PROGENy on the single-cell data. Generates heatmaps and summary tables.


Generate single-cell related figures for the manuscript.


For reproducibility issues or any other requests regarding single-cell data analysis, please use the issue tracker. For anything else, you can reach out to the corresponding author(s) as indicated in the manuscript.