This repository contains the code and documentation for a research project on generating synthetic smartwatch health data while preserving the privacy of the original data owners. The project uses a combination of differential privacy and generative adversarial networks (DP-GANs) to create synthetic data that closely resembles the original data in terms of statistical properties and data distributions.
Smartwatch health data is increasingly being used in health applications and patient monitoring, including stress detection. This data contains a high level of sensitive personal information about individuals and patients. At the same time, it is burdensome and expensive to collect this data from many individuals. In this paper, we present an approach to generate privacy-preserving smartwatch data. We test the performance of synthetic sequence data generated by different state-of-the-art GANs: TimeGAN, cGAN, and DoppelGANger.
We train and hyperparameter tune these GANs for our use case. Using these GANs, we generate synthetic datasets that learn the statistical distribution of WESAD, a wearable stress detection dataset. To evaluate these synthetic datasets, we present an evaluation method consisting of three evaluation facets: diversity, fidelity, and usefulness.
We use a PCA and t-SNE analysis to evaluate the diversity of the dataset, and use different machine learning classifiers to test the indistinguishability of the synthetic and the original data.
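The PCA part of this comparison could look roughly like the sketch below. It is not the repository's evaluation code: the function name `pca_project`, the window shape, and the random stand-in data are assumptions made for illustration, and the t-SNE step is not shown. The idea is to fit the projection on the real windows only and then map both datasets into the same 2-D space, so that their point clouds can be compared visually.

```python
import numpy as np


def pca_project(real, synthetic, n_components=2):
    """Fit PCA on the real windows and project both datasets onto it."""
    # Flatten each window (time steps x channels) into one feature vector.
    real_flat = real.reshape(len(real), -1)
    syn_flat = synthetic.reshape(len(synthetic), -1)

    # Center with the mean of the real data so both sets share one origin.
    mean = real_flat.mean(axis=0)

    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(real_flat - mean, full_matrices=False)
    components = vt[:n_components]

    real_proj = (real_flat - mean) @ components.T
    syn_proj = (syn_flat - mean) @ components.T
    return real_proj, syn_proj


# Random stand-in data with a hypothetical shape:
# 200 real and 150 synthetic windows, 60 time steps, 6 channels.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 60, 6))
synthetic = rng.normal(size=(150, 60, 6))

real_2d, syn_2d = pca_project(real, synthetic)
print(real_2d.shape, syn_2d.shape)  # (200, 2) (150, 2)
```

Plotting `real_2d` and `syn_2d` in one scatter plot then shows whether the synthetic samples cover the same regions as the real ones.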
To evaluate usefulness, we train convolutional neural network (CNN) stress detection models on the synthetic data in two different ways. First, we augment the original dataset with the synthetic dataset and evaluate the impact of the new synthetic samples on the training of the stress detection models. Second, we perform the Train on Synthetic Test on Real (TSTR) evaluation procedure, where we train the model exclusively on the synthetic data and test it on the real data.
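The two usefulness protocols can be illustrated with a small, self-contained sketch. It makes several simplifying assumptions that are not taken from this repository: a nearest-centroid classifier stands in for the CNN, toy Gaussian data stands in for both WESAD and the GAN output, and all function names are hypothetical. Only the structure of the comparison (baseline, augmentation, TSTR) mirrors the description above.

```python
import numpy as np


def fit_centroids(x, y):
    """Stand-in classifier: store the mean feature vector of each class."""
    classes = np.unique(y)
    centroids = np.stack([x[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(model, x):
    """Assign each sample to the class with the nearest centroid."""
    classes, centroids = model
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]


def accuracy(model, x, y):
    return float((predict(model, x) == y).mean())


def make_data(rng, n, shift):
    """Toy two-class data; `shift` separates class 1 from class 0."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, 8)) + shift * y[:, None]
    return x, y


rng = np.random.default_rng(0)
x_real_train, y_real_train = make_data(rng, 300, shift=3.0)
x_real_test, y_real_test = make_data(rng, 200, shift=3.0)
x_syn, y_syn = make_data(rng, 300, shift=3.0)  # pretend this came from a GAN

# Baseline: train on real data, test on real data.
baseline_acc = accuracy(
    fit_centroids(x_real_train, y_real_train), x_real_test, y_real_test
)

# Augmentation: train on real plus synthetic data, test on real data.
x_aug = np.concatenate([x_real_train, x_syn])
y_aug = np.concatenate([y_real_train, y_syn])
augmented_acc = accuracy(fit_centroids(x_aug, y_aug), x_real_test, y_real_test)

# TSTR: train on synthetic data only, test on real data.
tstr_acc = accuracy(fit_centroids(x_syn, y_syn), x_real_test, y_real_test)

print(baseline_acc, augmented_acc, tstr_acc)
```

In all three cases the test set contains real data only, so the scores are directly comparable.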
By augmenting the original dataset with synthetic data generated by the cGAN architecture, we achieve a stress detection accuracy of 0.9115 and can thus improve the baseline approach without synthetic data by 0.0275. When training only on synthetic data generated by the cGAN, the stress detection model achieves an accuracy of 0.910 on the original WESAD dataset, improving the baseline by 0.02.
To make the synthetic datasets generated by the GANs privacy-preserving, we apply Differential Privacy (DP) to the cGAN that achieves the best results. We evaluate the results of this DP-cGAN analogously to the GAN evaluation method for
This work demonstrates that GANs, and more specifically DP-GANs, can be used to generate synthetic health data. The synthetic data mimics the statistical distribution and physiological stress response characteristics of the WESAD dataset. It also ensures privacy guarantees with a performance trade-off to improve stress detection results.
Smartwatch health data has become an increasingly popular source of information for healthcare research and personalized medicine. However, the use of such data raises concerns about privacy, as the data often contains sensitive information about individuals' health and fitness. In this project, we aim to address these privacy concerns by generating synthetic health data that can be used in research and analysis while protecting the privacy of the original data owners.
Our approach uses a combination of differential privacy and generative adversarial networks (GANs).
The following are required to run the code in this repository:
- WESAD Dataset (2.1 GB)
- Python 3.8
Download the WESAD dataset here and save the WESAD directory inside the data directory.
To install the required dependencies, run the following command:
pip install -r requirements.txt
The repository consists of multiple notebooks that represent the workflow of this work. Each notebook covers one step of the workflow: data preprocessing, model training, synthesizing the new dataset, and evaluating it with a newly trained stress detection model.
The data is loaded from the original WESAD dataset, preprocessed, and saved to a new file named wesad_preprocessed_1hz.csv. Alternatively, you can skip downloading the 2.1 GB WESAD dataset and the preprocessing step and work with the already preprocessed WESAD dataset, which consists of two NumPy arrays: wesad_windows.npy and wesad_labels.npy.
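Loading the two preprocessed arrays might look like the sketch below. The helper name `load_preprocessed`, the directory passed to it, and the array shapes are assumptions; only the two file names come from this README. To stay runnable without the real files, the example first writes small stand-in arrays to a temporary directory.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_preprocessed(data_dir):
    """Load the preprocessed windows and labels from `data_dir`."""
    data_dir = Path(data_dir)
    windows = np.load(data_dir / "wesad_windows.npy")
    labels = np.load(data_dir / "wesad_labels.npy")
    # Each window needs exactly one label.
    if len(windows) != len(labels):
        raise ValueError(
            f"{len(windows)} windows but {len(labels)} labels in {data_dir}"
        )
    return windows, labels


# Stand-in files with a hypothetical shape: 10 windows, 60 time steps, 6 channels.
with tempfile.TemporaryDirectory() as tmp:
    np.save(Path(tmp) / "wesad_windows.npy", np.zeros((10, 60, 6)))
    np.save(Path(tmp) / "wesad_labels.npy", np.zeros(10, dtype=int))
    windows, labels = load_preprocessed(tmp)

print(windows.shape, labels.shape)  # (10, 60, 6) (10,)
```

With the real files, you would pass the directory that contains them instead of the temporary one.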
This notebook focuses on training the cGAN model. It loads the preprocessed data from the previous 01-Data notebook and runs the training for the cGAN model.
This notebook focuses on training the TimeGAN model. It loads the preprocessed data from the previous 01-Data notebook and runs the training for the TimeGAN model.
This notebook focuses on training the DGAN model. It loads the preprocessed data from the previous 01-Data notebook and runs the training for the DGAN model.
The generator notebook is responsible for synthesizing a new dataset based on the trained GAN model. The generated data is saved separately in the data/syn folder.
In the evaluation notebook, we assess the quality of the synthetically generated dataset using visual and statistical metrics. The usefulness evaluation takes place in the 05-Stress_Detection notebook.
This notebook focuses on training a CNN model to perform stress detection on the synthetic dataset, simulating a real-world use case.
We have also developed a frontend for the generator using Streamlit, which provides a user-friendly interface to interact with the trained GAN model. You can specify different parameters, generate synthetic data, and visualize the results.
To run the Streamlit app, open a terminal in the repository root (the path in the command below is relative to it) and run the following command:
streamlit run streamlit_app/About.py
This will start the Streamlit server and open the app in your default web browser.
The research artifacts resulting from this work are available in a condensed format in this repository.
Some of the synthetic datasets generated by the different GAN architectures are located in the data/syn path.
The trained models can be found in the model directory. These can be used in the Generator frontend to generate new synthetic data.
I would like to extend my sincere thanks to Maximilian Ehrhart and Bernd Resch for sharing their code related to their paper titled "A Conditional GAN for Generating Time Series Data for Stress Detection in Wearable Physiological Sensor Data". Their work on implementing the cGAN architecture and their insights on training it have been important to the success of our project.