02-synthpop.Rmd

# Generating data using synthpop methods {#synthpop}

In this chapter we will look at how to generate synthetic data on the server side using DataSHIELD functions, based on **synthpop** functionality. In this scenario the synthetic data is generated on the server side and returned to the client.

The `syn` function in the **synthpop** package has many detailed options for optimising the generation of synthetic data. A limited number of these options are available in the **dsSynthetic** package. Details of how to use these options can be found in the vignettes for **synthpop**.

## Getting set up

First we need to build a login object for the server that holds the data. Note that the **dsSynthetic** functions have been written to work with a connection to a single server:

```{r log in object}
builder <- DSI::newDSLoginBuilder()
# hide credentials
builder$append(server="server1", url="https://opal-sandbox.mrc-epid.cam.ac.uk",
               user="dsuser", password="P@ssw0rd", 
               table = "DASIM.DASIM1")
logindata <- builder$build()
```

And then we establish a connection to the server:

```{r log in, message=FALSE}
library(DSOpal)
if(exists("connections")){
  datashield.logout(conns = connections)
}
connections <- datashield.login(logins=logindata, assign = TRUE)
```
## Generate synthetic data with synthpop

The recommended way to generate a synthetic dataset is by using an implementation of the **synthpop** package on the server side. **synthpop** requires some thought on the part of the user: if you have a data set with a large number of columns it may take a large amount of time to generate the synthetic data. Assuming we have a data set with a small number of columns (i.e. around 10) we can simply execute the following command:

```{r synthpop data gen}
library(dsSyntheticClient)
library(dsBaseClient)
# N.B. you may need to replace `server1` if you have named your connection differently
synth_data = ds.syn(data = "D", method = "cart", m = 1, seed = 123)$server1$Data$syn
```
We then have the synthetic data on the client side and can view and manipulate it as required:
```{r synthpop data check}
head(synth_data)
```
If you have a dataset with a larger number of columns, you could generate a synthetic dataset for a subset of the variables that you need to generate a particular part of your code development. For example if we needed to generate a diabetes variable based on blood triglycerides, HDL and glucose we could just generate a dataset for those variables:
```{r warning=FALSE, "synthpop data subset"}
ds.subset(x = "D", subset = "D2", cols = c("LAB_HDL", "LAB_TRIG", "LAB_GLUC_FASTING"))
# N.B. you may need to replace `server1` if you have named your connection differently
synth_data_sub = ds.syn(data = "D2", method = "cart", m = 1, seed = 123)$server1$Data$syn
head(synth_data_sub)
```
Lastly we save our data for later chapters:
```{r synthpop write}
write.csv(x = synth_data, file = "data/synth_data.csv")
```
## Brief comments on the validity of the data

In the example above we chose to synthetically generate the variable `PM_BMI_CATEGORICAL`. This variable is actually derived from the continuous variable `PM_BMI_CONTINUOUS`, with (say) BMI <25 being category 1, 25<= BMI < 30 being category 2 etc. Because of the probabilistic way in which the data are generated, this categorisation is not enforced in the data synthesis. It might be better to generate the continuous variable only and add the categorical variable afterwards.