---
title: "Formally Private Definitions, Fundamental Mechanisms, and Algorithms"
date: today
format:
html:
fig-cap-location: top
number_sections: false
embed-resources: true
code-fold: false
toc: true
css: ../www/web_report.css
editor_options:
chunk_output_type: console
execute:
warning: false
message: false
bibliography: references.bib
---
```{=html}
<style>
@import url('https://fonts.googleapis.com/css?family=Lato&display=swap');
</style>
```
```{r setup}
options(scipen = 999)
library(tidyverse)
library(urbnthemes)
library(smoothmest)
set_urbn_defaults()
```
## Recap
### Formal privacy definitions
In general, formally private methods have the following features [@bowen2021philosophy]:
- Ability to quantify and adjust the privacy-utility trade-off, typically through parameters.
- Ability to rigorously and mathematically prove the maximum privacy loss that can result from the release of
information.
- Formal privacy definitions also allow one to *compose* multiple statistics. In other words, a data curator can compute
the total privacy loss from multiple individual information releases.
### Privacy loss budget
Differential privacy and other formal privacy definitions use the concept of a **privacy loss budget**, typically represented mathematically as $\epsilon$. The privacy loss budget bounds the privacy risk associated with releasing data or query results.
(*Note:* $\epsilon$ *is not the only privacy loss parameter, but we will use it here as a general representation of the privacy loss budget.*)
- A larger value of $\epsilon$ increases the maximum disclosure risk (the upper bound of the disclosure risk) associated with a given release of information.
- larger $\epsilon$ = less noise added to data = more accuracy, but less privacy
- smaller $\epsilon$ = more noise added to data = less accuracy, but more privacy
- Extreme cases (note that these cases are not realistic in the sense of real-world applications, but are presented to demonstrate the intuition):
- $\epsilon \to \infty$
- all privacy will be lost; data retains all utility, but no privacy
- $\epsilon \to 0$
- no privacy is lost; data is completely distorted and no utility remains
## Formal Privacy Features
Formal privacy is a relatively new set of definitions for quantifying the worst-case amount of information disclosed from statistics calculated on a private database. We provide conceptual and mathematical definitions below.
### $\epsilon$-Differential privacy
::: {.panel-tabset}
### Assumptions Underlying Privacy Guarantee
Formal privacy does not make assumptions about:
- how a data intruder will attack the data;
- the amount of external information or computing power an intruder has access to, now or in the future;
- which information in the data poses a higher disclosure risk [@near2020differential].
Instead, formal privacy assumes the worst-case scenario:
- the intruder has information on every observation except one;
- the intruder has unlimited computational power;
the missing observation is the most extreme possible observation (or an extreme outlier) that could alter the statistic.
### Mathematical Definition
We mathematically define several formally private definitions and key theorems. We use the following notation: $X\in\mathbb{R}^{n\times r}$ is the confidential data set representing $n$ data points and $r$ variables and $M:\mathbb{R}^{n\times r}\rightarrow\mathbb{R}^k$ denotes the statistical query, i.e., $M$ is a function mapping $X$ to $k$ real numbers. We denote a randomized or noisy version of $M$ using $\mathcal{M}:\mathbb{R}^{n\times r}\rightarrow\mathbb{R}^k$, which is a function that satisfies a formally private definition.
DP is the most widely known formal privacy definition. Privacy experts often refer to the original definition of DP as pure-DP or $\epsilon$-DP.
**Differential Privacy** [@dwork2006calibrating]: A sanitization algorithm, $\mathcal{M}$, satisfies $\epsilon$-DP if for all subsets $S\subseteq Range(\mathcal{M})$ and for all $X,X'$ such that $d(X,X')=1$, \begin{equation}\label{eqn:dp}
\frac{\Pr(\mathcal{M}( X) \in S)}{ \Pr(\mathcal{M}( X')\in S)}\le \exp(\epsilon)
\end{equation} where $\epsilon>0$ is the privacy loss budget and $d(X,X')=1$ represents the possible ways that $X'$ differs from $X$ by one record.
:::
::: {.callout-note}
## Features to note
- $\epsilon$ is on a logarithmic scale: the bound on the probability ratio is $\exp(\epsilon)$, so modest increases in $\epsilon$ translate into large increases in the bound.
- This is an inequality, not an equation; $\epsilon$ is up to us to define and represents an upper bound on the disclosure risk we are comfortable with for our particular data.
:::
### Global Sensitivity
In addition to the privacy loss budget, most formally private methods rely on a concept called global sensitivity, which describes how sensitive a statistic is to the presence of outliers [@bowen2021philosophy]. We can think of the global sensitivity as another value that helps determine how much noise is needed to protect the released data or statistic, because some statistics are more sensitive to outliers than others.
Imagine the data we want to protect contains socioeconomic information and the question we want answered is, "What is the median wealth?" Under formal privacy, we must consider the change from the most extreme possible record that could exist in any given data set with demographic and financial information. In our example, that person is Elon Musk, the wealthiest person in the world in 2023.[^2] Whether Musk is present or absent in the data, the median wealth should not change much. This means we can provide a more accurate answer by altering the median wealth statistic less, because it is less sensitive to the extreme outlier, Musk. Consider, however, the question, "What is the average wealth?" Unlike the previous statistic, the answer would change significantly if Musk were present or absent from the data. To protect the extreme case at a given level of privacy loss, a formally private algorithm would need to provide a significantly less accurate answer by altering the statistic more.
[^2]: At the time of this session, Elon Musk was the wealthiest person in the world.
### $l_1$-Global Sensitivity
::: {.panel-tabset}
#### Conceptual
**$l_1$-Global Sensitivity** [@dwork2006calibrating]: The $l_1$-global sensitivity calculates the maximum amount a statistic can change in absolute value terms with the addition or removal of the most extreme possible observation.
#### Technical
**$l_1$-Global Sensitivity** [@dwork2006calibrating]: For all $X,X'$ such that $d(X,X')=1$, the global sensitivity of a function $M$ is \begin{equation}\label{eqn:gs}
\Delta_1 (M)= \underset{d(X,X')=1}{\text{sup}} \|M(X)-M(X') \|_1
\end{equation}
For scalars, the $l_1$-Global Sensitivity is $|M(X) - M(X')|$.
:::
### $l_2$-Global Sensitivity
::: {.panel-tabset}
#### Conceptual
**$l_2$-Global Sensitivity** [@dwork2006calibrating]: The $l_2$-global sensitivity calculates the maximum amount a statistic can change, measured in Euclidean ($l_2$) distance (square the differences, sum them, and take the square root), with the addition or removal of the most extreme possible observation.
#### Technical
**$l_2$-Global Sensitivity** [@dwork2006calibrating]:
For all $X,X'$ such that $d(X,X')=1$, the global sensitivity of a function $M$ is
$$\Delta_2 (M)= \underset{d(X,X')=1}{\text{sup}} \|M(X)-M(X') \|_2$$
For scalars, the $l_2$-Global Sensitivity is $\sqrt{(M(X) - M(X'))^2}$.
:::
::: {.callout-note}
## Important note
The concept of global sensitivity is straightforward, but calculating it for some statistics can be very difficult. For instance, we cannot calculate a finite global sensitivity for a sample mean if we do not bound the variable.
:::
## Exercise 1
::: {.panel-tabset}
### <font color="#55b748">**Question**</font>
Suppose we are interested in counting the number of sole proprietorships in Washington, DC. What are the $l_1$ and $l_2$ global sensitivities of this statistic?
### <font color="#55b748">**Hint**</font>
In other words, what is the maximum difference between $M(X)$ and $M(X')$ when $d(X,X')=1$?
### <font color="#55b748">**Solution**</font>
The answer is one. The most a count can change by adding or subtracting one observation is one!
$\Delta_1 (M) = \Delta_2 (M) = 1$
:::
## Exercise 2
::: {.panel-tabset}
### <font color="#55b748">**Question**</font>
Suppose we are interested in calculating the total income of sole proprietorships in Washington, DC. What are the $l_1$ and $l_2$ global sensitivities of this statistic?
### <font color="#55b748">**Hint**</font>
In other words, what is the maximum difference between $M(X)$ and $M(X')$ when $d(X,X')=1$?
### <font color="#55b748">**Solution**</font>
The answer is $\infty$. A total can theoretically change by any amount with the addition or deletion of one observation.
:::
## Statistics
### Counts
Counts are the best explored statistics in differential privacy. With unbounded differential privacy, the global sensitivity of a count is always 1.
For example, assume we are counting the number of billionaires in the United States. The most the count can change with the addition or removal of Elon Musk is one.
### Sums
Calculating the global sensitivity is more difficult for statistics other than counts. The global sensitivity of a sum is unbounded because the addition or removal of one row can theoretically change the sum by any amount.
One approach is to **clip** or **truncate** values. If we assume that all observations are between 6 and 10, inclusive, then we can treat the global sensitivity as $10 - 6 = 4$.
* Differential privacy does not hold if we look at the data to determine the bounds.
* Bounds that truncate actual values bias statistics.
* This assumption is often problematic with economic data where distributions can be highly skewed.
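As a minimal sketch of the clipping approach described above (the bounds of 6 and 10 come from the example, while the vector `x` is a hypothetical assumption for illustration):
```{r}
# minimal sketch: clip (truncate) values to assumed bounds before summing
lower <- 6
upper <- 10
x <- c(3, 7, 8, 12, 9) # hypothetical unclipped values

x_clipped <- pmin(pmax(x, lower), upper) # becomes 6, 7, 8, 10, 9

# following the text above, treat upper - lower as the global sensitivity of the sum
sensitivity <- upper - lower

sum(x_clipped)
```
Note how the clipped values are pulled toward the bounds (3 becomes 6 and 12 becomes 10), which is the bias mentioned in the bullets above.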
### Means
Means can be rewritten as two queries: a total divided by a count (see the sketch below).
1. Release a noisy sum and a noisy count, then divide: the sum query uses GS(sum), the count query uses GS(count), and the privacy loss budget is split across the two queries.
Sometimes the number of observations is assumed to be known. In this case, the global sensitivity is smaller.
2. GS(sum) / n if we assume n is known, so only the sum query needs noise.
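A minimal sketch of both approaches follows; the data `x`, the upper bound of 20, and the even budget split are hypothetical assumptions, and `smoothmest::rdoublex()` stands in for the Laplace sanitizer defined in the next section.
```{r}
# minimal sketch: two ways to release a differentially private mean
set.seed(1)
x <- pmin(rpois(n = 100, lambda = 10), 20) # hypothetical data bounded above at 20
epsilon <- 1

# approach 1: noisy sum / noisy count, splitting the privacy loss budget across the two queries
noisy_sum <- sum(x) + smoothmest::rdoublex(1, mu = 0, lambda = 20 / (epsilon / 2))
noisy_count <- length(x) + smoothmest::rdoublex(1, mu = 0, lambda = 1 / (epsilon / 2))
noisy_sum / noisy_count

# approach 2: n is known, so only the sum needs noise and the mean has global sensitivity 20 / n
mean(x) + smoothmest::rdoublex(1, mu = 0, lambda = (20 / length(x)) / epsilon)
```
When $n$ is known, the second approach typically adds far less noise.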
## DP Sanitizers
A sanitizer protects against disclosure risk. A differentially private sanitizer protects against disclosure risk and meets the definition of differential privacy. If we know the global sensitivity of a statistic, then we can often add noise in a way that the resulting sanitizer satisfies differential privacy. We review three fundamental formally private sanitizers. We call these sanitizers fundamental because they are the original formally private sanitizers that privacy researchers still often use as building blocks for more sophisticated formally private methods.
### Laplace sanitizer
@dwork2006calibrating first proposed protecting statistics by adding noise from a Laplace distribution, called the Laplace mechanism, but we can think of it as a Laplace sanitizer. The Laplace distribution is centered at zero, and its scale is the ratio of the $l_1$-global sensitivity of the statistic to the privacy loss budget, $\epsilon$. Since the distribution is centered at zero, there is a higher probability of adding very little or no noise to the statistic. For the noise variability, if $\epsilon$ is large or the sensitivity of the statistic is low, then there is a higher probability of adding very little noise to the confidential data statistic. If $\epsilon$ is small or the sensitivity of the statistic is high, then there is a higher probability of adding a lot of noise to the statistic.
::: {.panel-tabset}
#### Conceptual
The Laplace sanitizer satisfies $\epsilon$-DP by adding noise from a Laplace distribution to statistics. More sensitivity means more expected noise. More $\epsilon$ means less expected noise.
```{r echo = FALSE, fig.height = 3.5}
ggplot() +
geom_function(fun = function(x) smoothmest::ddoublex(x, mu = 0, lambda = 1),
xlim = c(-8, 8),
aes(color = "l1-sensitivity = 1")) +
geom_function(fun = function(x) smoothmest::ddoublex(x, mu = 0, lambda = 2),
xlim = c(-8, 8),
aes(color = "l1-sensitivity = 2")) +
geom_function(fun = function(x) smoothmest::ddoublex(x, mu = 0, lambda = 3),
xlim = c(-8, 8),
aes(color = "l1-sensitivity = 3")) +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
axis.title = element_blank(),
panel.grid.major = element_blank()) +
labs(title = "Laplace sanitizer with different sensitivities",
subtitle = "Epsilon = 1",
color = "Distribution")
```
```{r echo = FALSE, fig.height = 3.5}
ggplot() +
geom_function(fun = function(x) smoothmest::ddoublex(x, mu = 0, lambda = 10),
xlim = c(-8, 8),
aes(color = "Epsilon = 0.1")) +
geom_function(fun = function(x) smoothmest::ddoublex(x, mu = 0, lambda = 2),
xlim = c(-8, 8),
aes(color = "Epsilon = 0.5")) +
geom_function(fun = function(x) smoothmest::ddoublex(x, mu = 0, lambda = 1),
xlim = c(-8, 8),
aes(color = "Epsilon = 1")) +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
axis.title = element_blank(),
panel.grid.major = element_blank()) +
labs(title = "Laplace sanitizer with different epsilons",
subtitle = "l1-sensitivity = 1",
color = "Distribution")
```
#### Technical
**Laplace Mechanism** [@dwork2006calibrating]: Given any function $M:\mathbb{R}^{n\times r}\rightarrow\mathbb{R}^k$, the Laplace mechanism is defined as: \begin{equation}\label{eqn:lap}
\mathcal{M}(X)=M(X)+(\eta_1,\ldots,\eta_k).
\end{equation} where $(\eta_1,\ldots,\eta_k)$ are i.i.d. $Laplace(0, \frac{\Delta_1(M)}{\epsilon})$.
:::
### Laplace sanitizer Example
Let's consider a simple example with the Palmer Penguins data set. The data set contains 333 observations about Adelie, Chinstrap, and Gentoo penguins in Antarctica. Suppose we want to count how many penguins are Adelie penguins.
```{r}
penguins <- palmerpenguins::penguins |>
drop_na()
penguins
```
The global sensitivity of this count is $\Delta_1(M) = 1$, so the scale parameter of the noise distribution is $\frac{\Delta_1(M)}{\epsilon} = \frac{1}{\epsilon}$. This means our formally private statistic is one draw from a Laplace distribution centered at the confidential statistic with scale parameter equal to $\frac{1}{\epsilon}$.
Below is code for drawing values from a Laplace distribution, which we will call `laplace_sanitizer()`.
```{r}
# function to draw Laplace noise for one statistic
laplace_sanitizer <- function(sensitivity, epsilon, n = 1) {
# lambda (distribution width) is sensitivity/privacy loss budget
l <- sensitivity / epsilon
# draw from Laplace distribution
noise <- smoothmest::rdoublex(n = n, # number of draws (one per statistic)
mu = 0, # centered at 0
lambda = l) # scale based on l calculated above
return(noise)
}
```
Let's calculate the statistic without any noise.
```{r}
answer_conf <- sum(penguins$species == "Adelie")
answer_conf
```
Now, let's calculate the statistic with noise that satisfies the definition of $\epsilon$-differential privacy.
```{r}
set.seed(1)
answer_dp <- answer_conf + laplace_sanitizer(sensitivity = 1, epsilon = 0.1)
answer_dp
```
*Maybe we got a lucky or unlucky draw from the Laplace distribution.* Let's calculate this statistic 100 times to understand the distribution of noisy statistics. This is purely for demonstration to understand the expectation of the noisy statistic.
```{r echo = FALSE}
set.seed(20220427)
tibble(
adelie_penguins = answer_conf + map_dbl(.x = 1:100, ~laplace_sanitizer(sensitivity = 1, epsilon = 0.1))
) |>
ggplot(aes(adelie_penguins)) +
geom_histogram() +
geom_vline(xintercept = answer_conf, color = "#55b748", size = 2) +
geom_vline(xintercept = answer_dp, color = "#ec008b", size = 2) +
scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
labs(
title = "100 Iterations of Counting Adelie Penguins",
subtitle = "Epsilon = 0.1; Truth in Green; Single Draw in Magenta"
)
```
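To connect this example back to the definition of $\epsilon$-DP, here is a minimal sketch (not part of the original example) that compares the Laplace output densities for a hypothetical count of 100 on $X$ and 99 on a neighboring $X'$; the counts and the output grid are assumptions for illustration.
```{r}
# minimal sketch: for a count (sensitivity 1) protected with Laplace noise,
# the ratio of output densities under X and a neighboring X' never exceeds exp(epsilon)
epsilon <- 1
count_x <- 100 # hypothetical count on the confidential data X
count_xprime <- 99 # count on a neighboring data set X' (one record removed)

outputs <- seq(80, 120, by = 0.1)

density_ratio <- smoothmest::ddoublex(outputs, mu = count_x, lambda = 1 / epsilon) /
  smoothmest::ddoublex(outputs, mu = count_xprime, lambda = 1 / epsilon)

c(max_ratio = max(density_ratio), bound = exp(epsilon))
```
The maximum ratio sits at the bound of $\exp(\epsilon)$, which is exactly the guarantee the definition provides.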
### Gaussian Sanitizer
Similar to the Laplace mechanism, the Gaussian mechanism adds random noise from a Gaussian distribution. The Gaussian distribution is also centered at zero, and its standard deviation scales with the ratio of the $l_2$-global sensitivity of the statistic to the privacy loss budget, $\epsilon$ (the Gaussian mechanism is calibrated to the $l_2$-global sensitivity rather than the $l_1$-global sensitivity).
::: {.panel-tabset}
#### Conceptual
The Gaussian sanitizer satisfies $(\epsilon,\delta)$-DP by adding noise from a Gaussian distribution (also known as Normal distribution or bell curve) to statistics. More sensitivity means more expected noise. More $\epsilon$ means less expected noise. More $\delta$ means less expected noise.
```{r echo = FALSE, fig.height = 3.5}
gaussian_sd <- function(sensitivity, epsilon) {
(sensitivity * sqrt(2 * log(1.25 / 10^-7))) / epsilon
}
ggplot() +
geom_function(fun = function(x) dnorm(x, mean = 0, sd = gaussian_sd(0.1, 1)),
xlim = c(-8, 8),
aes(color = "l2-sensitivity = 0.1")) +
geom_function(fun = function(x) dnorm(x, mean = 0, sd = gaussian_sd(0.5, 1)),
xlim = c(-8, 8),
aes(color = "l2-sensitivity = 0.5")) +
geom_function(fun = function(x) dnorm(x, mean = 0, sd = gaussian_sd(1, 1)),
xlim = c(-8, 8),
aes(color = "l2-sensitivity = 1")) +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
axis.title = element_blank(),
panel.grid.major = element_blank()) +
labs(title = "Gaussian sanitizer with different sensitivities",
subtitle = "Epsilon = 1, delta = 10^-7",
color = "Distribution")
```
```{r echo = FALSE, fig.height = 3.5}
ggplot() +
geom_function(fun = function(x) dnorm(x, mean = 0, sd = gaussian_sd(1, 0.5)),
xlim = c(-8, 8),
aes(color = "Epsilon = 0.5")) +
geom_function(fun = function(x) dnorm(x, mean = 0, sd = gaussian_sd(1, 1)),
xlim = c(-8, 8),
aes(color = "Epsilon = 1")) +
geom_function(fun = function(x) dnorm(x, mean = 0, sd = gaussian_sd(1, 2)),
xlim = c(-8, 8),
aes(color = "Epsilon = 2")) +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
axis.title = element_blank(),
panel.grid.major = element_blank()) +
labs(title = "Gaussian sanitizer with different epsilons",
subtitle = "l2-Global Sensitivity = 1, delta = 10^-7",
color = "Distribution")
```
#### Technical
**Gaussian Mechanism** [@dwork2014algorithmic]: The Gaussian Mechanism satisfies $(\epsilon,\delta)$-DP by adding Gaussian noise with zero mean and variance, $\sigma^2$, such that
$$\mathcal{M}(X)=M(X)+(\eta_1,\ldots,\eta_k)$$
where $\eta_1,\ldots,\eta_k$ are independently drawn from $N(0, \sigma^2)$ with $\sigma=\frac{\Delta_2(M)\sqrt{2 \log(1.25/\delta)}}{\epsilon}$.
This sanitizer includes two parameters: $\epsilon$ and $\delta$. We can think of $\delta$ as a small probability that the bound created by $\epsilon$ does not hold.
The Gaussian sanitizer uses $l_2$-Global Sensitivity.
:::
### Gaussian Sanitizer Example
We repeat the last example, except this time using the Gaussian sanitizer.
```{r}
# function to draw Gaussian noise for one statistic
gaussian_sanitizer <- function(sensitivity, epsilon, delta) {
# sigma (noise standard deviation) depends on the sensitivity, delta, and the privacy loss budget
sigma <- (sensitivity * sqrt(2 * log(1.25 / delta))) / epsilon
# draw from Gaussian distribution
noise <- rnorm(n = 1, # draw one observation (adding noise to one statistic)
mean = 0,
sd = sigma) # scale based on sigma calculated above
return(noise)
}
```
Let's calculate the statistic without any noise.
```{r}
answer_conf <- sum(penguins$species == "Adelie")
answer_conf
```
Now, let's calculate the statistic with noise that satisfies the definition of $(\epsilon, \delta)$-differential privacy.
```{r}
set.seed(1)
answer_dp <- answer_conf + gaussian_sanitizer(sensitivity = 1, epsilon = 0.1, delta = 10^-7)
answer_dp
```
*Maybe we got a lucky or unlucky draw from the Normal distribution.* Let's calculate this statistic 100 times to understand the distribution of noisy statistics. This is purely for demonstration to understand the expectation of the noisy statistic.
```{r echo = FALSE}
set.seed(20220427)
tibble(
adelie_penguins = answer_conf + map_dbl(.x = 1:100, ~gaussian_sanitizer(sensitivity = 1, epsilon = 0.1, delta = 10^-7))
) |>
ggplot(aes(adelie_penguins)) +
geom_histogram() +
geom_vline(xintercept = answer_conf, color = "#55b748", size = 2) +
geom_vline(xintercept = answer_dp, color = "#ec008b", size = 2) +
scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
labs(
title = "100 Iterations of Counting Adelie Penguins",
subtitle = "Epsilon = 0.1; Truth in Green; Single Draw in Magenta"
)
```
The Gaussian sanitizer adds more noise than the Laplace sanitizer here! So why do we even need a Gaussian sanitizer?
The Gaussian sanitizer composes better across multiple queries. This is because the sum of two normally distributed random variables is normally distributed, while the sum of two Laplace-distributed random variables is not Laplace-distributed.
## Important Theorems
As mentioned earlier, formal privacy requires methods to compose or account for the total privacy loss from each public data release or statistic. For example, composition or accounting allows the data curator to track the total privacy loss from multiple summary tables or multiple statistics requests from several data users. This is the main advantage of formal privacy compared to traditional SDC methods, which cannot quantify the total privacy loss. There are two main composition theorems: sequential and parallel. We also cover another important theorem (post-processing) that is essential in developing formally private methods.
### Sequential Composition Theorem
The sequential composition theorem allows the data users to calculate the privacy loss budget from multiple noisy statistics on the same part of the confidential data [@bun2016concentrated; @mcsherry2009privacy].
![An Example of Sequential Composition](www/images/lesson05-sequential.png){#fig-seq-comp width=600}
To help explain this concept, suppose we have an establishment-level economic data set that reports the state of operation, the number of employees, and the average income for each establishment. We want to conduct three different analyses that cost $\epsilon_1=1$, $\epsilon_2=0.5$, and $\epsilon_3=0.5$, respectively. Since we are applying the three analyses to the entire data set, sequential composition requires us to add up the individual privacy loss budgets for the total, i.e., $\epsilon_{total}=\epsilon_1+\epsilon_2+\epsilon_3=2$. @fig-seq-comp shows the application of sequential composition to our fictitious economic data set.
Mathematically, suppose each mechanism $\mathcal{M}_j$ provides $\epsilon_j$-DP. Then the sequence of $\mathcal{M}_j(X)$ applied to the same $X$ provides $\left(\sum_{j=1}^J\epsilon_j\right)$-DP.
### Sequential Composition Theorem Example
Let's return to the penguins example from above. Suppose $\epsilon = 1$ and we want to count the number of "Adelie" penguins and the number of "Chinstrap" penguins.
```{r}
epsilon <- 1
set.seed(20220505)
sum(penguins$species == "Adelie") +
laplace_sanitizer(sensitivity = 1, epsilon = epsilon / 2)
sum(penguins$species == "Chinstrap") +
laplace_sanitizer(sensitivity = 1, epsilon = epsilon / 2)
```
For reference, let's look at the truth.
```{r}
sum(penguins$species == "Adelie")
sum(penguins$species == "Chinstrap")
```
### Parallel Composition Theorem
The parallel composition theorem allows data users to calculate the privacy loss budget from multiple noisy statistics on different or disjoint parts of the confidential data [@bun2016concentrated; @mcsherry2009privacy].
![An Example of Parallel Composition](www/images/lesson05-parallel.png){#fig-para-comp width=600}
Using the same example as before in @fig-seq-comp, suppose we apply three analyses to partitions of the data (i.e., the three different states) that cost $\epsilon_1=1$, $\epsilon_2=0.5$, and $\epsilon_3=0.5$, respectively. Since we are applying the three analyses on disjoint subsets of the data, parallel composition states that the total privacy loss budget is the maximum privacy loss budget of the three analyses, i.e., $\epsilon_{total}=\max(\epsilon_1,\epsilon_2,\epsilon_3)=1$. @fig-para-comp shows the application of parallel composition to our fictitious economic data set.
Mathematically, let $D_j$ be disjoint subsets of the input domain $D$ and let each $\mathcal{M}_j$ provide $\epsilon_j$-DP. Then the sequence of $\mathcal{M}_j(X\cap D_j)$ provides $\left(\max_{j \in \{1,\ldots,J\}} \epsilon_j\right)$-DP.
### Parallel Composition Theorem Example
Let's consider a larger data set with 53,940 observations about diamonds. Suppose we want to calculate a differentially private histogram of diamond sizes with bins [0, 1], (1, 2], (2, 3], (3, 4], (4, 5], and (5, 6] with $\epsilon = 0.1$.
```{r}
diamonds_conf <- count(diamonds, carat = ceiling(carat))
diamonds_conf
```
One approach is to use $\frac{\epsilon}{6} = \frac{0.1}{6}$ for each of the six counting queries. This is based on sequential composition.
```{r}
epsilon <- 0.1
set.seed(10)
diamonds_conf <- bind_cols(
diamonds_conf,
sequential = diamonds_conf$n +
laplace_sanitizer(sensitivity = 1, epsilon = epsilon / 6, n = 6)
)
diamonds_conf
```
The bins for `carat` partition the data set and each bin is a disjoint subset of the data. Therefore, we can use parallel composition and get more accurate differentially private counts!
```{r}
set.seed(11)
diamonds_conf <- bind_cols(
diamonds_conf,
parallel = diamonds_conf$n +
laplace_sanitizer(sensitivity = 1, epsilon = epsilon, n = 6)
)
diamonds_conf
```
### Post-Processing Theorem
Another important theorem is the post-processing theorem that allows the continued use of formally private information without losing the privacy guarantee [@bun2016concentrated; @dwork2006calibrating; @nissim2007smooth]. In other words, if someone modifies a formally private data set or statistic without using additional information from the confidential data, then that data set or statistic is still formally private.
For example, if a formally private method said there are 3.7 employees, then we could round that value to 4 without leaking more information. Simply put, the post-processing theorem makes the data usable after formally private noise is added.
Mathematically, if $\mathcal{M}$ is a sanitizer that satisfies $\epsilon$-DP and $g$ is any function, then $g\left(\mathcal{M}(X)\right)$ also satisfies $\epsilon$-DP.
Post-processing also provides the opportunity to improve utility. Data stewards can use available public or expert knowledge to reduce the amount of noise without accruing additional privacy loss. The public information can come from data released without formal privacy or from individuals who are comfortable sharing their information without noise. Rounding and eliminating impossible values like negative counts are common types of post-processing. There are also types of post-processing that can improve accuracy by leveraging information calculated from the same data set.
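As a minimal sketch (the noisy values below are hypothetical outputs from a formally private sanitizer, not computed from the data sets above):
```{r}
# minimal sketch: post-processing noisy counts without touching the confidential data
noisy_counts <- c(3.7, -1.2, 12.4) # hypothetical formally private outputs

# rounding and clamping impossible negative counts to zero preserves the privacy guarantee
pmax(round(noisy_counts), 0)
```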
::: {.callout-note}
## A key takeaway
Formal privacy is transparent and allows users to account for the noise introduced into statistics. Post-processing can give up some of this transparency and make it more difficult to account for the noise added to statistics.
:::
## Exercise 3
Consider a simulated data set with information about small businesses (0-20 employees) in Texas and Vermont.
```{r}
set.seed(20220509)
small_businesses <- bind_rows(
Texas = tibble(
employees = rpois(n = 100010, lambda = 10),
income = rlnorm(n = 100010, meanlog = 10, sdlog = 2)
),
Vermont = tibble(
employees = rpois(n = 403, lambda = 10),
income = rlnorm(n = 403, meanlog = 10, sdlog = 2)
),
.id = "state"
) |>
mutate(employees = if_else(employees > 20, 20L, employees))
```
::: {.panel-tabset}
## <font color="#55b748">**Question**</font>
Using the Laplace sanitizer, calculate the number of small businesses in Texas and Vermont (count) with the overall $\epsilon = 0.1$. Use the parallel composition theorem.
```{r eval = FALSE}
ex3_conf <- count(small_businesses, state)
ex3_conf
set.seed(46)
bind_cols(
ex3_conf,
ex3_conf$n + laplace_sanitizer(
sensitivity = ### ______,
epsilon = ### ______,
n = 2
)
)
```
- Which state has more absolute error introduced into its count?
- Which state has more relative error introduced into its count?
## <font color="#55b748">**Solution**</font>
The observations from Texas and Vermont are disjoint, so we can use the full $\epsilon = 0.1$ for each statistic instead of splitting it across the statistics.
```{r}
ex3_conf <- count(small_businesses, state)
ex3_conf
set.seed(46)
bind_cols(
ex3_conf,
n_dp = ex3_conf$n + laplace_sanitizer(
sensitivity = 1,
epsilon = 0.1,
n = 2
)
)
```
The absolute error is larger for Texas, but the relative error is much bigger for Vermont.
:::
## Exercise 4
::: {.panel-tabset}
### <font color="#55b748">**Question**</font>
Using the Laplace sanitizer, calculate the number of employees in the entire data set (sum) with the overall $\epsilon = 0.1$. We know from auxiliary information that the number of employees varies from 0 to 20 because they are small businesses.
```{r eval = FALSE}
ex4_conf <- small_businesses |>
summarize(employees = sum(employees))
set.seed(47)
bind_cols(
ex4_conf,
employees_dp = ex4_conf$employees + laplace_sanitizer(
sensitivity = ### ______,
epsilon = ### ______,
n = 1
)
)
```
### <font color="#55b748">**Solution**</font>
```{r}
ex4_conf <- small_businesses |>
summarize(employees = sum(employees))
set.seed(47)
bind_cols(
ex4_conf,
employees_dp = ex4_conf$employees + laplace_sanitizer(
sensitivity = 20,
epsilon = 0.1,
n = 1
)
)
```
:::
## Other Formal Privacy Definitions
### Approximate Differential Privacy
Approximate Differential Privacy, also known as $(\epsilon, \delta)$-Differential Privacy, is a relaxation of $\epsilon$-Differential Privacy. We saw this definition above with the Gaussian sanitizer.
**$(\epsilon, \delta)$-Differential Privacy** [@dwork2006our]: A sanitization algorithm, $\mathcal{M}$, satisfies $(\epsilon, \delta)$-DP if for all subsets $S\subseteq Range(\mathcal{M})$ and for all $X, X'$ such that $d(X,X')=1$,
$$\Pr(\mathcal{M}( X) \in S)\le \exp(\epsilon) \Pr(\mathcal{M}( X')\in S) + \delta$$
where $\delta\in [0,1]$.
We can think of $\delta$ as a small probability that the bound created by $\epsilon$ does not hold. $\epsilon$-DP is a special case of $(\epsilon, \delta)$-DP when $\delta=0$.
### Zero-Concentrated Differential Privacy
Zero-Concentrated Differential Privacy is another relaxation of $\epsilon$-Differential Privacy. This definition is used by the Census Bureau for the 2020 Decennial Census.
**Zero-Concentrated Differential Privacy** [@bun2016concentrated]: A sanitization algorithm, $\mathcal{M}$, satisfies $(\xi, \rho)$-zero-concentrated differential privacy if for all $X, X'$ that are $d(X,X')=1$ and $\alpha\in (1, \infty)$,
$$D_\alpha(\mathcal{M}(X)||\mathcal{M}(X'))\leq\xi+\rho\alpha$$
where $D_\alpha(\mathcal{M}(X)||\mathcal{M}(X'))$ is the $\alpha$-Rényi divergence between the distribution of $\mathcal{M}(X)$ and the distribution of $\mathcal{M}(X')$, and $\xi$ and $\rho$ are positive constants.
Zero-Concentrated Differential Privacy is closely related to Rényi Differential Privacy; both are defined in terms of the Rényi divergence between the output distributions on neighboring data sets.
## Unpacking $\epsilon$
Differential privacy states that, for any output, the log of the ratio between the probability of producing that output when any individual observation is in the data and the probability of producing it when that observation is not in the data is bounded by $\epsilon$.
$$\frac{\Pr(\mathcal{M}( X) \in S)}{ \Pr(\mathcal{M}( X')\in S)}\le \exp(\epsilon)$$
The bound is in exponential units, so modest increases in $\epsilon$ correspond with large increases in the ratio of the probabilities.
Early differential privacy researchers thought $\epsilon = 1$ or $\epsilon = 2$ were upper bounds on $\epsilon$. Today, much higher values of $\epsilon$ are used. The April 2021 Decennial Census demonstration data used $\epsilon = 4.0$ and $\epsilon = 10.3$ for the person-level file. The Decennial Census ended up using $\epsilon = 17.14$ for the person-level file and $\epsilon = 2.47$ for the housing unit data, with $\delta = 10^{-10}$ for each.
Let's consider the ratios of the probabilities for different values of $\epsilon$:
```{r echo = FALSE}
tibble(
epsilon = c(0.25, 0.5, 0.75, 1, 2, 4, 6, 8, 10.3, 17.14)
) |>
mutate(ratio = round(exp(epsilon)))
```
*It is tough to reason about what a ratio of __`r round(exp(17.14))`__ even means.*
## Key Takeaways
* Differential privacy places a bound on the amount of information released under extreme assumptions about the knowledge of an attacker and their computing power.
* Global sensitivity measures how much a statistic can change with the addition or removal of the most extreme possible value.
* Sanitizers, like the Laplace sanitizer, satisfy differential privacy by adding a specific amount of random noise to statistics.
* Higher values of $\epsilon$ mean more information is potentially released.
* Sanitizers applied to statistics with higher global sensitivity require more noise to satisfy a definition of differential privacy than sanitizers applied to statistics with lower global sensitivity.
## Bonus Exercises
### Exercise 5
::: {.panel-tabset}
#### <font color="#55b748">**Question**</font>
The Laplace sanitizer uses l2-global sensitivity.
- True
- False
#### <font color="#55b748">**Solution**</font>
The Laplace sanitizer uses l2-global sensitivity.
- True
- **False**
:::
### Exercise 6
::: {.panel-tabset}
#### <font color="#55b748">**Question**</font>
This theorem improves accuracy when data can be broken into disjoint subsets.
- Sequential composition theorem
- Postprocessing theorem
- Parallel composition theorem
#### <font color="#55b748">**Solution**</font>
This theorem improves accuracy when data can be broken into disjoint subsets.
- Sequential composition theorem
- Postprocessing theorem
- **Parallel composition theorem**
:::