---
title: "Synthetic Data Case Studies"
date: today
format:
  html:
    fig-cap-location: top
    number_sections: false
    embed-resources: true
    code-fold: true
    toc: true
    css: ../www/web_report.css
editor_options:
  chunk_output_type: console
execute:
  warning: false
  message: false
bibliography: references.bib
---

```{=html}
<style>
@import url('https://fonts.googleapis.com/css?family=Lato&display=swap');
</style>
```
```{r setup}
#| label: setup
#| echo: false
options(scipen = 999)
library(tidyverse)
library(gt)
library(palmerpenguins)
library(urbnthemes)
library(here)
#set_urbn_defaults()
```

## Case Studies

### Fully Synthetic PUF for IRS Non-Filers [@bowen2020synthetic]

* **Data:** A 2012 file of "non-filers" created by the IRS Statistics of Income Division.
* **Motivation:** Non-filer information is important for modeling certain tax reforms, and this file was a proof of concept for a more complex file.
* **Methods:** Sequential CART models, with additional noise added based on the sparsity of nearby observations in the confidential distribution.
* **Important metrics:**
    * General utility: Proportions of non-zero values, first four moments, correlation fit
    * Specific utility: Tax microsimulation, regression confidence interval overlap
    * Disclosure: Node heterogeneity in the CART model, rates of recreating observations
* **Lessons learned:**
    * Synthetic data can work well for tax microsimulation.
    * It is difficult to match certain utility metrics for sparse variables.
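
The regression confidence interval overlap listed under specific utility can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the common definition that averages the length of the intersection as a share of each interval's length. The helper `ci_overlap()`, the variables `age` and `income`, and the simulated data are all hypothetical.

```r
# Hypothetical helper: overlap between two confidence intervals, averaged
# over the share of each interval covered by their intersection.
# Values near 1 mean the intervals nearly coincide; negative values mean
# the intervals do not overlap at all.
ci_overlap <- function(lower_conf, upper_conf, lower_syn, upper_syn) {
  overlap_length <- pmin(upper_conf, upper_syn) - pmax(lower_conf, lower_syn)
  0.5 * (
    overlap_length / (upper_conf - lower_conf) +
      overlap_length / (upper_syn - lower_syn)
  )
}

set.seed(20200101)

# Simulated stand-ins for a confidential file and a synthetic file
n <- 500
conf <- data.frame(age = runif(n, 18, 80))
conf$income <- 1000 + 50 * conf$age + rnorm(n, sd = 500)
syn <- data.frame(age = runif(n, 18, 80))
syn$income <- 1000 + 50 * syn$age + rnorm(n, sd = 500)

# Fit the same regression to both files and compare the 95% intervals
ci_conf <- confint(lm(income ~ age, data = conf), level = 0.95)
ci_syn <- confint(lm(income ~ age, data = syn), level = 0.95)

# One overlap value per regression coefficient
overlaps <- ci_overlap(
  ci_conf[, 1], ci_conf[, 2],
  ci_syn[, 1], ci_syn[, 2]
)
overlaps
```

An overlap close to 1 for every coefficient suggests that an analyst would draw similar inferences from either file for this particular model; the metric says nothing about models that were not checked.
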

### Fully Synthetic SIPP data [@benedetto2018creation]

* **Data:** Survey of Income and Program Participation linked to administrative longitudinal earnings and benefits data from IRS and SSA.
* **Motivation:** To expand access to detailed economic data that is highly restricted unless heavy disclosure control is applied.
* **Methods:** Sequential regression multiple imputation (SRMI) with OLS regression, logistic regression, and Bayesian bootstrap. They released four implicates of the synthetic data.
* **Important metrics:**
    * General utility: pMSE
    * Specific utility: None
    * Disclosure: Distance-based re-identification, RMSE of the closest record to measure attribute disclosure
* **Lessons learned:**
    * One of the first major synthetic files in the US.
    * The file includes complex relationships between family members that are synthesized.
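
The pMSE (propensity score mean squared error) listed under general utility measures how easily a model can distinguish synthetic records from confidential ones. The sketch below is only illustrative: it assumes a main-effects logistic regression as the propensity model, and the helper `pmse()`, the variable names, and the simulated data are hypothetical rather than taken from the SIPP evaluation.

```r
# Hypothetical helper: propensity score mean squared error (pMSE).
# Stack both files, model the probability that a record is synthetic,
# and measure how far the fitted probabilities fall from the share of
# synthetic records. Values near 0 mean the files are hard to tell apart.
pmse <- function(conf, syn) {
  combined <- rbind(conf, syn)
  combined$is_synthetic <- rep(c(0, 1), times = c(nrow(conf), nrow(syn)))

  # Assumption: a main-effects logistic regression as the propensity model
  model <- glm(is_synthetic ~ ., data = combined, family = binomial())
  propensity <- fitted(model)

  synthetic_share <- nrow(syn) / nrow(combined)
  mean((propensity - synthetic_share)^2)
}

set.seed(20180101)
n <- 1000

# Simulated stand-ins: one faithful synthesis and one with shifted earnings
conf <- data.frame(log_earnings = rnorm(n, 10, 1), age = runif(n, 18, 80))
good_syn <- data.frame(log_earnings = rnorm(n, 10, 1), age = runif(n, 18, 80))
poor_syn <- data.frame(log_earnings = rnorm(n, 10.5, 1), age = runif(n, 18, 80))

pmse_good <- pmse(conf, good_syn)
pmse_poor <- pmse(conf, poor_syn)
c(good = pmse_good, poor = pmse_poor)
```

The result depends on the propensity model: a richer model (interactions, trees) can detect differences that a main-effects model misses, so pMSE values are only comparable when the same model is used.
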

### Partially Synthetic Geocodes [@drechsler2021synthesizing]

* **Data:** Integrated Employment Biographies (German administrative data) with linked geocodes (latitude and longitude).
* **Motivation:** Rich geographic information can be used to answer many important labor market research questions. These data would otherwise be too sensitive to release, because an individual could be identified from the combination of their location and other attributes.
* **Methods:** CART with categorical geocodes. Also evaluated CART with continuous geocodes and a Bayesian latent class model.
* **Important metrics:**
    * General utility: Relative frequencies of cross tabulations
    * Specific utility: Zip Code comparisons of tabulated variables, Ripley's K- and L-functions
    * Disclosure: Probabilities of re-identification (Reiter and Mitra, 2009), comparing the expected match risk with the true match rate
* **Lessons learned:**
    * The synthetic data with geocodes had more measured disclosure risk than the original data.
    * Synthesizing more variables made a huge difference in the measured disclosure risks.
    * Adjusting CART hyperparameters was not an effective way to manage the risk-utility tradeoff.
    * They stratified the data before synthesis for computational reasons.
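
Ripley's K- and L-functions, listed under specific utility, summarize how clustered a set of points is at different distances, so comparing them across files shows whether synthetic geocodes preserve spatial clustering. The sketch below is a simplified illustration and not the authors' procedure: it assumes planar coordinates on a unit square and uses a naive estimator with no edge correction. The helpers `ripley_k()` and `ripley_l()` and the simulated points are hypothetical; real latitude and longitude would first need to be projected, and dedicated spatial packages handle edge effects.

```r
# Hypothetical helper: naive Ripley's K estimate without edge correction.
# points is a two-column matrix of planar coordinates, area is the area
# of the study region, and radii is a vector of distances to evaluate.
ripley_k <- function(points, area, radii) {
  n <- nrow(points)
  pair_dist <- as.vector(dist(points)) # each unordered pair appears once
  vapply(
    radii,
    # Multiply by 2 so that every ordered pair (i, j), i != j, is counted
    function(r) area * 2 * sum(pair_dist <= r) / (n * (n - 1)),
    numeric(1)
  )
}

# L-function: a rescaling of K; L(r) is close to r when points are
# scattered completely at random
ripley_l <- function(points, area, radii) {
  sqrt(ripley_k(points, area, radii) / pi)
}

set.seed(20210101)
n <- 300
radii <- c(0.02, 0.05, 0.10)

# Simulated stand-ins for confidential and synthetic locations
conf_points <- cbind(runif(n), runif(n))
syn_points <- cbind(runif(n), runif(n))

l_conf <- ripley_l(conf_points, area = 1, radii = radii)
l_syn <- ripley_l(syn_points, area = 1, radii = radii)

# Differences near zero suggest similar spatial clustering
data.frame(radius = radii, l_conf, l_syn, difference = l_syn - l_conf)
```
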
<br><br><br>

## Suggested Reading

Snoke, Joshua, Gillian M Raab, Beata Nowok, Chris Dibben, and Aleksandra Slavkovic. 2018. “General and Specific Utility Measures for Synthetic Data.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 181 (3): 663–88.

Bowen, Claire McKay, Victoria Bryant, Leonard Burman, Surachai Khitatrakun, Robert McClelland, Philip Stallworth, Kyle Ueyama, and Aaron R Williams. 2020. “A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications.” In International Conference on Privacy in Statistical Databases, 257–70. Springer.