Initial commit (8d4efa9), committed by melanibe on Feb 16, 2024.
Showing 61 changed files with 35,359 additions and 0 deletions.
30 changes: 30 additions & 0 deletions .github/workflows/ci.yaml
name: CI

on:
  push:
    branches: main
  pull_request:
    branches: main

jobs:
  build-linux:
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 5
    defaults:
      run:
        shell: bash -el {0}
    steps:
      - uses: actions/checkout@v3
      - name: Set up Conda
        uses: conda-incubator/setup-miniconda@v2
        with:
          environment-file: environmentCI.yml
          python-version: 3.11.0
          auto-activate-base: true
      - name: Lint with flake8
        run: |
          python -V
          conda info
          # stop the build if there are Python syntax errors or undefined names
          flake8 . --count --extend-ignore=E203 --show-source --statistics --max-line-length=119
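The lint step above stops the build on Python syntax errors and undefined names. The syntax-error part of that check can be approximated with Python's built-in `compile` — a minimal sketch for illustration only; flake8 itself does far more, including the undefined-name analysis:

```python
def has_syntax_error(source: str) -> bool:
    """Return True if the source fails to parse, i.e. roughly what
    flake8's E9xx syntax checks would flag as a build-stopping error."""
    try:
        compile(source, "<snippet>", "exec")
        return False
    except SyntaxError:
        return True
```

For example, `has_syntax_error("def f(:")` returns `True`, while a well-formed module returns `False`.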
156 changes: 156 additions & 0 deletions .gitignore
joined_simple.csv
*.sh
*.png
*.pdf
*.eps
embed_cf/

outputs
znew_scripts
padchest_cf_images_v0
cf_beta1balanced_scanner
cf_beta2balanced_scanner


# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# custom gitignores
**/outputs/
**/test_outputs/
**/.vscode/
**/wandb/
cifar-10-batches-py
*.tar.gz
causal-contrastive.code-workspace
**/dscm_checkpoints/
/data/
MNIST/
playground.py
60 changes: 60 additions & 0 deletions README.md
# CF-SimCLR: counterfactual contrastive learning

This repository contains the code for the paper "Counterfactual contrastive learning: domain-aligned features for improved robustness to acquisition shift".

![alt text](figure1.png)

## Overview
The repository is divided into four main parts:
* The [causal_models/](causal_models/) folder contains all code related to counterfactual inference model training. It has its own README with all the commands needed to train a DSCM on EMBED and PadChest.
* The [classification/](classification/) folder contains all the code related to self-supervised training as well as finetuning for evaluation (see below).
* The [data_handling/](data_handling/) folder contains everything needed to define the dataset classes, in particular all the boilerplate for CF-SimCLR-specific data loading.
* The [evaluation/](evaluation/) folder contains all the code related to test inference and results plotting for reproducing the plots from the paper.
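The core of the CF-SimCLR-specific data loading is the pairing logic: the second contrastive view can be a pre-generated domain counterfactual of the image instead of a second augmentation. A hypothetical sketch (class and function names are illustrative, not the repository's actual code):

```python
import random

def augment(x):
    # placeholder augmentation; a real pipeline would crop/flip/jitter
    return x

class CFPairDataset:
    """Sketch of a contrastive pair dataset. With use_counterfactuals=True,
    the second view is a randomly chosen domain counterfactual of the image
    (when one exists) rather than another augmentation of the same image."""

    def __init__(self, images, counterfactuals, use_counterfactuals=True):
        self.images = images                    # list of real images
        self.counterfactuals = counterfactuals  # counterfactuals[i]: list of CFs of images[i]
        self.use_counterfactuals = use_counterfactuals

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        view1 = augment(self.images[i])
        if self.use_counterfactuals and self.counterfactuals[i]:
            view2 = augment(random.choice(self.counterfactuals[i]))
        else:
            view2 = augment(self.images[i])
        return view1, view2
```

Images without counterfactuals fall back to the standard SimCLR pairing, so the same loader covers both settings.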


## Prerequisites

### Code dependencies
The code is written in PyTorch, with PyTorch Lightning.
You can install all our dependencies using our conda environment file `environment_gpu.yml`.

### Datasets
You will need to download the relevant datasets to run our code.
You can find the datasets at XXX, XXXX, XXX.
Once you have downloaded the datasets, please update the corresponding paths at the top of the `mammo.py` and `xray.py` files.
Additionally, for EMBED you will need to preprocess the original dataframes with our script `data_handling/csv_generation_code/generate_embed_csv.ipynb`. Similarly, for RSNA, please first run `data_handling/csv_generation_code/rsna_generate_full_csv.py`.


## Full workflow example for training and evaluating CF-SimCLR
Here we'll run through an example of training and evaluating CF-SimCLR on EMBED.

1. Train a counterfactual image generation model with
```
python causal_models/main.py --hps embed
```

2. Generate and save all domain counterfactuals for every image in the training set with
```
python causal_models/save_embed_scanner_cf.py
```

3. Train the CF-SimCLR model
```
python classification/train.py experiment=simclr_embed data.use_counterfactuals=True counterfactual_contrastive=True
```
Alternatively, to train a SimCLR baseline, run
```
python classification/train.py experiment=simclr_embed
```
Or, to run the baseline with counterfactuals added to the training set but without the counterfactual contrastive objective, run
```
python classification/train.py experiment=simclr_embed data.use_counterfactuals=True counterfactual_contrastive=False
```
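All three variants optimize a standard contrastive objective; what changes is which pairs count as positives. A minimal numpy sketch of the NT-Xent (SimCLR) loss, for illustration only (not the repository's implementation): in the counterfactual contrastive setting, the positive pair `(z1[i], z2[i])` comes from an image and its domain counterfactual instead of two augmentations of the same image.

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent loss over embeddings z1, z2 of shape (N, d), where
    (z1[i], z2[i]) are the positive pairs."""
    z = np.concatenate([z1, z2], axis=0).astype(float)   # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # cosine similarity
    sim = z @ z.T / temperature
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)                       # exclude self-pairs
    # each sample's positive sits N positions away: i <-> i + N
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()
```

Identical (perfectly aligned) positive pairs yield a lower loss than random pairings, which is what the encoder is trained to achieve.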

4. Train a classifier with linear probing (frozen encoder) or full finetuning
```
python classification/train.py experiment=base_density trainer.finetune_path=PATH_TO_ENCODER seed=33 trainer.freeze_encoder=True
```
You can choose the proportion of labelled data used for finetuning with the flag `data.prop_train=1.0`.

5. Run the notebook `evaluation/embed_density.ipynb` to run and save inference results on the test set.
Empty file added: __init__.py
14 changes: 14 additions & 0 deletions causal_models/README.md
# Code for counterfactual image generation

The code in this folder is adapted from the official code associated with the
'High Fidelity Image Counterfactuals with Probabilistic Causal Models' paper. Original code: [https://github.com/biomedia-mira/causal-gen](https://github.com/biomedia-mira/causal-gen).

## Train the counterfactual inference model

To train the counterfactual inference models from the paper, simply run
`python causal_models/main.py --hps embed` (replace `embed` with `padchest` to train on chest x-rays). All associated hyperparameters are stored in `causal_models/hps.py`.

This assumes you have already set up your data folders as per the main repository.

## Generating and saving counterfactuals for contrastive training
To generate all possible domain counterfactuals given a trained model, you can use our predefined scripts `save_embed_scanner_cf.py` and `save_padchest_scanner_cf.py`. Simply pass your checkpoint path and your target saving directory as command-line arguments.
Empty file added: causal_models/__init__.py