
Minor_Dec24ProjUpdates #153

Open: wants to merge 22 commits into base: main
Binary file added docs/images/MortalityHypergraohs.png
Binary file added docs/images/hypergraphs.png
Binary file added docs/images/medcat.png
Binary file added docs/images/medcat2.png
Binary file added docs/images/mmfair.png
Binary file added docs/images/nhssynthmodules.png
Binary file added docs/images/pets.png
Binary file added docs/images/pm2.png
Binary file added docs/images/spaghetti.png
27 changes: 27 additions & 0 deletions docs/our_work/p34_hypergraphs.md
@@ -0,0 +1,27 @@
---
layout: base
title: Transforming Healthcare Data With Graph-Based Techniques
permalink: p34_hypergraphs.html
summary: Application of directed hypergraphs to the SAIL Databank to investigate disease progression
tags: ['RESEARCH', 'TIME SERIES', 'MODELLING', 'PRIMARY CARE']
---

<figure markdown>
![](../images/hypergraphs.png)
</figure>
<figcaption>Figure 1 from https://www.medrxiv.org/content/10.1101/2023.08.31.23294903v1.full.pdf: Four different types of unweighted, fully connected graph models with 3 nodes. (A) undirected graph, (B) undirected hypergraph, (C) directed graph, (D) directed hypergraph of B-hyperarcs, the only type of hyperarc considered in this work. Dotted lines here represent nodes that form the tail set of each hyperarc. Edges have been colour coded to help identify children and their parents: looking top-down, we can observe how the directed edges (children) are generated from their corresponding undirected edges (parents). Note also the existence of self-edges in the directed representations.</figcaption>

In this project we explored (directed) hypergraphs as a novel tool for assessing the temporal relationships between coincident diseases, addressing the need for a more accurate representation of multimorbidity. Directed hypergraphs offer a higher-order analytical framework that goes beyond the limitations of directed graphs in representing complex relationships. After exploring novel weighting schemes which can capture different aspects of the underlying data, we then turned our attention to the power of these higher-order models, using PageRank centrality to detect and classify the temporal nature of conditions.
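As a rough intuition for how a centrality score can separate predecessor-like conditions from successor-like ones, a plain power-iteration PageRank on a small directed graph can be sketched in a few lines. This is an illustration only, not the hypergraph PageRank formulation used in the work; the node labels and toy edges are invented for the example:

```python
def pagerank(edges, nodes, damping=0.85, iters=100):
    """Plain power-iteration PageRank on a small directed graph.

    edges: list of (src, dst) pairs; nodes: list of node labels.
    """
    n = len(nodes)
    out_deg = {v: 0 for v in nodes}
    for src, _ in edges:
        out_deg[src] += 1
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for src, dst in edges:
            new[dst] += damping * rank[src] / out_deg[src]
        # Dangling nodes (no outgoing edges) redistribute their rank uniformly.
        dangling = sum(rank[v] for v in nodes if out_deg[v] == 0)
        for v in nodes:
            new[v] += damping * dangling / n
        rank = new
    return rank

# Toy progression graph: condition A tends to precede B and C, and B precedes C.
ranks = pagerank([("A", "B"), ("A", "C"), ("B", "C")], ["A", "B", "C"])
```

On this toy graph, the mostly-successor condition C accumulates the highest rank and the pure-predecessor condition A the lowest, which is the kind of predecessor/successor signal the project extracts (with transition probabilities playing the role of hyperarc weights in the real model).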

## Results
See the associated publication and report for detailed learning from applying these techniques to Charlson-indexed data to explore disease progression. This work then seeded two further PhD internships exploring the addition of temporal information and alternative graph representations.

| Output | Link |
| ---- | ---- |
| Open Source Code & Documentation | [GitHub](https://github.com/nhsx/hypergraph-mm) |
| Case Study | Awaiting Sign-Off |
| Technical report | [Report](https://github.com/nhsx/hypergraph-mm/blob/main/reports/Hypergraph_mm_report_JB.pdf) |
| Publication | [Representing Multimorbid Disease Progressions Using Directed Hypergraphs](https://www.medrxiv.org/content/10.1101/2023.08.31.23294903v1) |

[comment]: <> (The below header stops the title from being rendered (as mkdocs adds it to the page from the "title" attribute) - this way we can add it in the main.html, along with the summary.)
#
29 changes: 29 additions & 0 deletions docs/our_work/p41_nhssynth.md
@@ -0,0 +1,29 @@
---
layout: base
title: NHSSynth
permalink: p41_nhssynth.html
summary: Continuation of the two previous SynthVAE projects, with the aim of creating a full experiment pipeline for production
tags: ['SYNTHETIC', 'STRUCTURED', 'MACHINE LEARNING', 'GENERATION' ,'RESEARCH']
---

<figure markdown>
![](../images/nhssynthmodules.png)
</figure>
<figcaption>Structure of the workflow incorporating user configuration, data preprocessing, model selection, evaluation, and visualisation</figcaption>

This project sought to take the learning from our previous work on variational autoencoders with differential privacy for single-table tabular data generation (see [SynthVAE](https://nhsengland.github.io/datascience/our_work/p12_synthvae/)) and turn the code into a pipeline where experiments could be rigorously undertaken, including comparison with other architectures (e.g. GANs), application to other datasets with comparable metrics, and experiments around constraining the directed acyclic graph to deal with biases in the data.

## Results

The pipeline is contained within the open code and can be driven either by config files or by a simpler command line interface. Models can be switched in and out with a moderate amount of effort, allowing for consistent comparisons and taking our synthetic generation work from a single investigation of one model to exploring how the latest models compare to our current workflows.
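The "switch models in and out" idea can be sketched with a simple registry pattern. All names and options below are hypothetical and do not reflect the actual NHSSynth interface, which is documented in the repository:

```python
# Hypothetical model registry; the real NHSSynth interface differs.
MODEL_REGISTRY = {}

def register(name):
    """Class decorator adding a model class to the registry under a name."""
    def wrap(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap

@register("vae")
class VAE:
    def __init__(self, latent_dim=8):
        self.latent_dim = latent_dim

@register("gan")
class GAN:
    def __init__(self, latent_dim=16):
        self.latent_dim = latent_dim

def build_model(config):
    """Instantiate whichever model the config names, passing its options through."""
    opts = dict(config)
    return MODEL_REGISTRY[opts.pop("model")](**opts)

# A config file or CLI flag reduces to a dict like this; swap "gan" for "vae" to compare.
model = build_model({"model": "gan", "latent_dim": 32})
```

Because every model is built from the same config dict, evaluation and visualisation code downstream never needs to know which architecture produced the synthetic data, which is what makes consistent comparisons cheap.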

Further work is needed to fix a bug when applying constraints and to enforce that the Gaussian mixture model includes higher modes.

| Output | Link |
| ---- | ---- |
| Open Source Code & Documentation | [GitHub](https://github.com/nhsengland/NHSSynth) |
| Case Study | Awaiting Sign-Off |
| Technical report | [io page documentation](https://nhsengland.github.io/NHSSynth/) |

[comment]: <> (The below header stops the title from being rendered (as mkdocs adds it to the page from the "title" attribute) - this way we can add it in the main.html, along with the summary.)
#
29 changes: 29 additions & 0 deletions docs/our_work/p42_mortalityhypergraphs.md
@@ -0,0 +1,29 @@
---
layout: base
title: Including Mortality in Hypergraphs for Multi-morbidity
permalink: p42_mortalityhypergraphs.html
summary: Building on previous hypergraphs work (P34) that can extract the impact of predecessor and successor diseases on disease progression pathways, this work looked to include an implicit relationship to demographics and consider the impact of mortality.
tags: ['RESEARCH', 'TIME SERIES', 'MODELLING', 'PRIMARY CARE']
---

<figure markdown>
![](../images/MortalityHypergraohs.png)
</figure>
<figcaption>View of the different Charlson conditions, highlighting whether each is more commonly a predecessor or a successor. When mortality is included this skews the results unless a correction is implemented.</figcaption>

This project investigated an approach to analysing disease set patterns using hypergraphs and multimorbidity data. Hypergraphs provide a powerful framework for modelling complex relationships among diseases, and their integration with multimorbidity data offers a comprehensive understanding of the co-occurrence of multiple diseases within patient populations. Additionally, this work extends the previous work by incorporating mortality information into the hypergraphs and exploring the concept of temporality as hyperarc weights. The inclusion of mortality data enhances the analysis by considering the impact of diseases on patient outcomes, whilst temporality enables the inclusion of irregular time intervals, capturing the dynamic nature of multimorbidity patterns over time.

## Results
To facilitate the understanding of hypergraphs and their applications in the multimorbidity domain, an interactive applet has been developed. This serves as an educational tool and visualisation device, teaching users about undirected and directed hypergraphs and demonstrating their usefulness in analysing complex disease relationships. We hope that our applet and the code bases we have created will promote the dissemination of knowledge about hypergraphs and their applications, empowering individuals to explore and comprehend complex healthcare data in the multimorbidity domain.

See the full report for detailed results around the addition of mortality into this work.

| Output | Link |
| ---- | ---- |
| Open Source Code & Documentation | [GitHub](https://github.com/nhsx/hypergraph-mm) |
| Case Study | Awaiting Sign-Off |
| Technical report | [Report](https://github.com/nhsx/hypergraph-mm/blob/zh-hypergraph-mm-mort/reports/Hypergraph_mm_mort_report_ZH.pdf) |
| Demonstration | [Streamlit](https://nhsx-hypergraphical-streamlit-hypergraphs-hklixt.streamlit.app/) |

[comment]: <> (The below header stops the title from being rendered (as mkdocs adds it to the page from the "title" attribute) - this way we can add it in the main.html, along with the summary.)
#
34 changes: 34 additions & 0 deletions docs/our_work/p43_medcat.md
@@ -0,0 +1,34 @@
---
layout: base
title: Enriching Clinical Coding for Neurology Pathways using MedCAT
permalink: p43_medcat.html
summary: In collaboration with Lancaster Teaching Hospital and the University of Lancaster, we aim to apply MedCAT (an automated named entity recognition and linking algorithm) to neurology letters to identify related SNOMED CT coding.
tags: ['NATURAL LANGUAGE PROCESSING', 'UNSTRUCTURED', 'RESEARCH']
---

Neurology and other clinical specialities are awash with clinical data. However, these are generally not structured and lack the characteristics to allow straightforward automatic extraction of clinically relevant concepts. Software tools do exist that can recognise clinical terms in unstructured clinical data (e.g. clinic letters) and link them to other concepts. These are called ‘named entity recognition and linking’ (NER+L) tools. But many such tools require prior ‘labelling’ by a domain expert (i.e. person with specialty knowledge) of the relevant clinical concepts. MedCAT is a NER+L tool that can work without this prior labelling as it contains an algorithm that is aligned with a customisable knowledge database (ontology). This works in two stages: 1) linking unambiguous portions of texts (entities) to unique terms in the ontology then 2) linking ambiguous entities to terms in the ontology with the most similar contexts.
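The two-stage linking described above can be sketched with a toy example. This is an illustration of the idea only, not MedCAT's implementation: the ontology, concept IDs, and the use of bag-of-words overlap in place of MedCAT's learned context embeddings are all invented for the sketch:

```python
# Toy ontology mapping surface forms to candidate concept IDs (invented for illustration).
ONTOLOGY = {
    "migraine": {"C_MIGRAINE"},                        # unambiguous surface form
    "ms": {"C_MULT_SCLEROSIS", "C_MITRAL_STENOSIS"},   # ambiguous abbreviation
}
# Words previously seen alongside each concept; stands in for learned context embeddings.
CONTEXT = {
    "C_MULT_SCLEROSIS": {"neurology", "lesion", "relapse"},
    "C_MITRAL_STENOSIS": {"valve", "cardiac", "murmur"},
}

def link(tokens):
    results = {}
    # Stage 1: link entities whose surface form maps to exactly one concept.
    for tok in tokens:
        candidates = ONTOLOGY.get(tok, set())
        if len(candidates) == 1:
            results[tok] = next(iter(candidates))
    # Stage 2: resolve ambiguous entities by similarity of context
    # (here, simple word overlap with the rest of the document).
    context = set(tokens)
    for tok in tokens:
        candidates = ONTOLOGY.get(tok, set())
        if len(candidates) > 1:
            results[tok] = max(candidates, key=lambda c: len(CONTEXT[c] & context))
    return results

out = link(["neurology", "letter", "mentions", "migraine", "and", "ms", "relapse"])
```

Here "migraine" links immediately in stage 1, while "ms" is resolved to the multiple-sclerosis concept in stage 2 because the surrounding words overlap with that concept's stored context.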

However, evaluation of the MedCAT models which inform the NER+L process has only been performed on labelled data, and the learned numerical representations of concepts (embeddings) have not been assessed before. The contributions of this project were:

1. evaluation of three separate MedCAT models,
2. comparison of three different clustering techniques as evaluation methods in the absence of labelled data,
3. evaluation of MedCAT’s learned concept embeddings,
4. comparison of intrinsic and extrinsic evaluation metrics and
5. comparison of qualitative and quantitative evaluation approaches.

<figure markdown>
![](../images/medcat2.png)
</figure>
<figcaption>Schematic representation of the MedCAT workflow</figcaption>

## Results
We found that all three models produced NER+L results which are not consistent with clinical understanding. Clustering can enable deeper examination of learned embeddings, but further work needs to be done on finding the best input data and clustering approach. Intrinsic evaluation metrics are only meaningful in the presence of extrinsic measures and further research needs to be done to identify the most informative set of metrics. Quantitative assessment must be supplemented by qualitative inspection. The work performed here forms the first phase in evaluation of MedCAT models’ performance. Once optimal evaluation strategies have been identified, the next phase can be focused on improving MedCAT models. This will ultimately enable extraction of clinical terms that can be used for multiple downstream tasks such as automated clinical coding, research, monitoring of interventions, audits as well as service improvements.

| Output | Link |
| ---- | ---- |
| Open Source Code & Documentation | [GitHub](https://github.com/nhsengland) |
| Case Study | Awaiting Sign-Off |
| Technical report | [Report](https://github.com/nhsengland/P43_LTHMedCat/blob/main/report/MedCAT_Neurology_Report.pdf) |

[comment]: <> (The below header stops the title from being rendered (as mkdocs adds it to the page from the "title" attribute) - this way we can add it in the main.html, along with the summary.)
#
18 changes: 18 additions & 0 deletions docs/our_work/p51_privconcerns.md
@@ -0,0 +1,18 @@
---
layout: base
title: Investigating Privacy Risks and Mitigations in Healthcare Language Models
permalink: p51_privconcerns.html
summary: An initial exploration of privacy risks in healthcare language models, including privacy-preserving techniques applied before or after model training, and evaluating their effectiveness with privacy attacks.
tags: ['LARGE LANGUAGE MODELS (LLM)', 'PATIENT IDENTIFIABLE', 'RESEARCH']
---

See our blog on this work [here](https://nhsengland.github.io/datascience/articles/2024/04/11/privLM/)

| Output | Link |
| ---- | ---- |
| Open Source Code & Documentation | [Code](https://github.com/nhsengland/priv-lm-health) |
| Case Study | Awaiting Sign-Off |
| Technical report | [Report](https://github.com/nhsengland/priv-lm-health/blob/main/reports/Healthcare_LLM_Privacy_VS_v1.0.pdf) |

[comment]: <> (The below header stops the title from being rendered (as mkdocs adds it to the page from the "title" attribute) - this way we can add it in the main.html, along with the summary.)
#
38 changes: 38 additions & 0 deletions docs/our_work/p52_processmining.md
@@ -0,0 +1,38 @@
---
layout: base
title: Process Mining with East Midlands Ambulance Service
permalink: p52_processmining.html
summary: In collaboration with East Midlands Ambulance Service, this work explored using Process Mining techniques to better understand the processes within the service.
tags: ['EMERGENCY CARE', 'SIMULATION', 'MODELLING']
---

Process Mining is the term given to a family of techniques that have been developed to analyse process flows through the recording of an event log. Process Mining has three main application areas: discovery, conformance checking, and enhancement. In **process discovery**, the output is a fact-based process model based on the recorded events in the database. Discovery is the most common process mining investigation, where the discovered process model is often not expected by the supervisor of the process. In **conformance checking** we measure how well the recorded processes fit a pre-defined process model. Conformance-checking techniques can therefore be used to check whether certain rules and policies are being abided by within an organisation. Finally, **enhancement** is focused on improving the existing process models by analysing how additional attributes affect the throughput times and frequencies of activities in the process.

For this work we converted ambulance data into an event log and applied process mining techniques using the [PM4Py library](https://pypi.org/project/pm4py/). We used the PM2 methodology to conduct the process mining project.
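The core discovery step, mining the directly-follows relation from an event log, can be sketched without any library. This is a minimal illustration with an invented toy log; the project used PM4Py's discovery algorithms on the real event log:

```python
from collections import Counter

def directly_follows(event_log):
    """Count directly-follows activity pairs per case.

    event_log: iterable of (case_id, timestamp, activity) tuples.
    """
    traces = {}
    # Group events into per-case traces, ordered by timestamp.
    for case_id, _ts, activity in sorted(event_log):
        traces.setdefault(case_id, []).append(activity)
    dfg = Counter()
    # Each adjacent pair within a trace is one directly-follows observation.
    for trace in traces.values():
        dfg.update(zip(trace, trace[1:]))
    return dfg

# Toy ambulance event log: two jobs sharing the call -> dispatch step.
log = [
    ("job1", 1, "call"), ("job1", 2, "dispatch"), ("job1", 3, "on_scene"),
    ("job2", 1, "call"), ("job2", 2, "dispatch"), ("job2", 3, "stand_down"),
]
dfg = directly_follows(log)
```

The resulting counter is exactly the edge-weight data behind a Directly Follows Graph: each key is an edge and each count its frequency, which is what gets rendered in the diagrams below.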

<figure markdown>
![](../images/pm2.png)
</figure>
<figcaption>PM2 Framework</figcaption>

## Results

Most of the project time was spent designing appropriate business rules to be applied to the data in order to create the event log. Directly Follows Graphs (DFGs) were then created to represent the highest-occurring processes. Conformance techniques were used to measure how well the mined process models fitted the event log data. However, the variation in the treatment activities also caused the mined models to allow events, including the call for an ambulance, to occur in sequences counterintuitive to real ambulance job cycles.

<figure markdown>
![](../images/spaghetti.png)
</figure>
<figcaption>A simple spaghetti diagram showing the process flow through activities captured in the event log</figcaption>

A succession of further process mining techniques were then applied to the data. Most of these involved enriching the data: an iterative process of adding data to the event log to find out more about it. This is especially important with event logs where two processes account for such a high percentage of cases (99% in our case). Enrichment here meant attaching IMD information to postcodes or areas in the data, as well as retrieving additional data about each patient (for example age, sex, and ethnicity) and adding that back into the event log. Enrichment also meant feature engineering: transforming variables into binary form, or time of day into categories such as morning, afternoon, and night.
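The time-of-day transformation mentioned above reduces to a simple binning function. The exact cut-off hours below are assumptions for illustration, not the ones used in the project:

```python
def time_of_day(hour):
    """Bin a 24-hour clock hour into the coarse categories used for enrichment.

    Cut-offs are illustrative assumptions: 06-11 morning, 12-17 afternoon,
    everything else (evening and overnight) grouped as night.
    """
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    return "night"

categories = [time_of_day(h) for h in (8, 14, 23)]
```

Categorical features like this keep the event log compact while still letting downstream models pick up daily patterns in ambulance demand.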

Lastly, machine learning was applied to see if outcomes could be predicted. Please see the report or case study for full details.

| Output | Link |
| ---- | ---- |
| Open Source Code & Documentation | [GitHub](https://github.com/nhsengland/ProcessMining) |
| Case Study | Awaiting Sign-Off |
| Technical report | [Report](https://github.com/nhsengland/ProcessMining/blob/main/Process_Mining_to_Generate_Healthcare_Pathways.pdf) |

[comment]: <> (The below header stops the title from being rendered (as mkdocs adds it to the page from the "title" attribute) - this way we can add it in the main.html, along with the summary.)
#
49 changes: 49 additions & 0 deletions docs/our_work/p61_mmfair.md
@@ -0,0 +1,49 @@
---
layout: base
title: Understanding Fairness and Explainability in Multimodal Approaches within Healthcare
permalink: p61_mmfair.html
summary: Pipeline to compare the impact on fairness of using a fusion model versus a single modality model
tags: ['MULTI-MODAL', 'RESEARCH', 'OPEN DATA']
---

Explainability, fairness, and bias identification and mitigation are all essential components for the
integration of artificial intelligence (AI) solutions in high-stake decision making processes such as in
healthcare. Whilst there have been developments in strategies to generate explanations and various
fairness criteria across models, there is a need to better understand how multimodal methods impact
these behaviours. Multimodal AI (MMAI) provides opportunities to improve performance and gain
insights by modelling correlations and representations of data of different types.

These approaches are incredibly powerful for the analysis of healthcare data, where the integration of data sources is key
for gaining a holistic view of individual patients (personalised medicine) or evaluating models across
different patient profiles to ensure safe and ethical use (population health). However, MMAI presents
unique challenges when deciding how best to incorporate and fuse information, maintaining an
understanding of how data is processed (explainability), and ensuring bias is not amplified as a result.

Here, we explore a case study, Multimodal Fusion of Electronic Health Data for Length-of-Stay
Prediction, with a focus on treating time-series and static electronic health data as distinct modalities.
We evaluate and compare different methods for fusing data in terms of predictive performance and
various fairness metrics. Additionally, we apply SHAP to highlight the influence of specific features
and explore how such explanations can be used to reveal or confirm bias in the underlying data. Our
results showcase the importance of modelling time-series data, and an overall robustness to bias
compared to unimodal approaches across various fairness metrics. We also describe exploratory
analysis which can be conducted and developed further to mitigate bias post-hoc, or gain further
insights into the relative importance of specific modalities from multimodal models.

<figure markdown>
![](../images/mmfair.png)
</figure>
<figcaption>Overview of the models: unimodal, multimodal with two fusion methods: concatenation and a multiadaptation gate (MAG)</figcaption>
## Results

We found a reassuring consistency in feature importance across different fusion methods. Both fusion methods produced fairer models with respect to insurance type (the most unfair variable in the unimodal case) with concatenation providing the lowest equalised odds.
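The equalised-odds comparison can be sketched for the two-group binary case. This is a simplified illustration with invented toy data; the project's full metric suite is in the repository:

```python
def rates(y_true, y_pred, groups, group):
    """True-positive and false-positive rates for one group."""
    tp = fn = fp = tn = 0
    for t, p, g in zip(y_true, y_pred, groups):
        if g != group:
            continue
        if t == 1:
            tp, fn = tp + (p == 1), fn + (p == 0)
        else:
            fp, tn = fp + (p == 1), tn + (p == 0)
    return tp / (tp + fn), fp / (fp + tn)

def equalized_odds_gap(y_true, y_pred, groups):
    """Largest TPR or FPR difference between two groups (0 = equalised odds holds)."""
    a, b = sorted(set(groups))
    tpr_a, fpr_a = rates(y_true, y_pred, groups, a)
    tpr_b, fpr_b = rates(y_true, y_pred, groups, b)
    return max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b))

# Toy data: the model is perfect on group "b" but coin-flip accurate on group "a".
gap = equalized_odds_gap(
    y_true=[1, 1, 0, 0, 1, 1, 0, 0],
    y_pred=[1, 0, 1, 0, 1, 1, 0, 0],
    groups=["a", "a", "a", "a", "b", "b", "b", "b"],
)
```

Computing this gap per sensitive attribute (insurance type, ethnicity, and so on) for each fusion method is what allows the "lowest equalised odds" comparison reported above.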

Further work is planned to expand this pipeline to other modalities and datasets to solidify our understanding of the interplay of model choices on fairness.

| Output | Link |
| ---- | ---- |
| Open Source Code & Documentation | [Github](https://github.com/nhsengland/mm-healthfair) |
| Case Study | Awaiting Sign-Off |
| Technical report | [Report](https://github.com/nhsengland/mm-healthfair/blob/main/report/NHSE_Internship_Project_Multimodal_XAI.pdf) |

[comment]: <> (The below header stops the title from being rendered (as mkdocs adds it to the page from the "title" attribute) - this way we can add it in the main.html, along with the summary.)
#