# [Code design]{.red} {#sec-code-design}
{{< include macros.qmd >}}
{{< include macros_exchangeability.qmd >}}
{{< include macros_opm.qmd >}}
Before starting, let's agree on some terminology, to avoid confusion in the discussion below.
- We shall call [*task*]{.blue} a repetitive inference problem with a specified set of units and variates. For instance, a task could be the consecutive prediction of the urgency of incoming patients, given their means of transportation. We assume that the details of the variates, such as their domain, are well specified. A set of data about other patients may also be available; we call these "training data".
- We shall call [*application*]{.blue} or [*instance*]{.blue} of the task a single inference about a specific new unit, for example a new incoming patient.
## Range of use of the code {#sec-code-range}
The concrete formulae discussed in the previous [chapter @sec-dirichlet-mix] can be put into code, for use in different tasks involving only nominal variates. Software of this kind can in principle be written to allow for some or all of the versatility discussed in §§ [-@sec-categ-probtheory]--[-@sec-underlying-distribution], for example the possibility of taking care (in a first-principles way!) of partially missing training data. But the more versatile we make the software, the more memory, processing power, and computation time it will require.
Roughly speaking, more versatility corresponds to calculations of the joint probability
::::{.column-page-right}
:::{.callout-note}
##
$$
\P(
\blue
Z_{L}\mo z_{L}
\and
\dotsb \and
Z_{1}\mo z_1
\black
\| \yD
)
=
\frac{1}{\amax-\amin+1}
\sum_{\ya=\amin}^{\amax}
\frac{
\prod_{\bz} \bigl(\frac{2^{\ya}}{M} + \#\bz - 1\bigr)!
}{
\bigl(2^{\ya} + L -1 \bigr)!
}
\cdot
\frac{
\bigl(2^{\ya} -1 \bigr)!
}{
{\bigl(\frac{2^{\ya}}{M} - 1\bigr)!}^M
}
\quad
$$ {#eq-main-joint}
:::
::::
for more values of the quantities $\blue Z_1, Z_2, \dotsc$. For instance, if data about unit #4 are missing, then we need to calculate the joint probability above for several (possibly all) values of $\blue Z_4$. If data about two units are missing, then we need to do an analogous calculation for all possible *combinations* of values; and so on.
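For example, with the datum of unit #4 missing, the joint probability of the remaining data is obtained by marginalization, summing (@eq-main-joint) over all values of $\blue Z_4$:

$$
\P(
\blue
Z_{L}\mo z_{L}
\and \dotsb \and
Z_{5}\mo z_{5}
\and
Z_{3}\mo z_{3}
\and \dotsb \and
Z_{1}\mo z_1
\black
\| \yD)
=
\sum_{\blue z_4}
\P(
\blue
Z_{L}\mo z_{L}
\and
\dotsb \and
Z_{1}\mo z_1
\black
\| \yD)
$$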
For our prototype, let's forgo versatility about units used as training data. From now on we abbreviate the set of training data as
:::{.column-margin}
Recall that $\bZ$ denotes all (nominal) variates of the population
:::
$$
\data \defd
(\green
Z_{N}\mo z_{N} \land \dotsb \land
Z_{2}\mo z_2 \land
Z_{1}\mo z_{1}
\black)
$$
where $\green z_N, \dotsc, z_2, z_1$ are specific values, stored in some training dataset. No values are missing.
Since the training $\data$ are given and fixed in a task, we omit the suffix "${}_{N+1}$" that we have often used to indicate a "new" unit. So "$\blue Z\mo z$" simply refers to the variate $\bZ$ in a new application of the task.
We allow for full versatility in every new instance. This means that we can specify, *on the spot at each new instance*, which variates are predictands and which (if any) are predictors. For example, if the population has three variates $\bZ=(\bA \and \bB \and \bC)$, our prototype can calculate, at each new application, inferences such as
- $P(\bB\mo\dotso\|\data \and \yD)$: any one predictand variate, no predictors
- $P(\bA\mo\dotso \and \bC\mo\dotso\|\data \and \yD)$: any two predictand variates, no predictors
- $P(\bA\mo\dotso \and \bB\mo\dotso \and \bC\mo\dotso\|\data \and \yD)$: all three variates
- $P(\bB\mo\dotso\|\bA\mo\dotso \and \data \and \yD)$: any one predictand variate, any other one predictor
- $P(\bB\mo\dotso\| \bA\mo\dotso \and \bC\mo\dotso \and\data \and \yD)$: any one predictand variate, any other two predictors
- $P(\bA\mo\dotso \and \bC\mo\dotso\|\bB\mo\dotso \and \data \and \yD)$: any two predictand variates, any other one predictor
## Code design and computations needed {#sec-code-computations}
To enjoy the versatility discussed above, the code needs to compute
::::{.column-page-right}
:::{.callout-note}
##
$$
\P(
\blue Z \mo z
\and
\green\data
\black \| \yD)
=
\frac{1}{\amax-\amin+1}
\sum_{\ya=\amin}^{\amax}
\Biggl(\frac{2^{\ya}}{M} + {\green\#}z\Biggr)
\cdot
\frac{
\prod_{\bz} \bigl(\frac{2^{\ya}}{M} + {\green\#}\bz - 1\bigr)!
}{
\bigl(2^{\ya} + N \bigr)!
}
\cdot
\frac{
\bigl(2^{\ya} -1 \bigr)!
}{
{\bigl(\frac{2^{\ya}}{M} - 1\bigr)!}^M
}
$$ {#eq-objectP}
for all possible values $\blue z$, where ${\green\#}\bz$ is the number of times the value $\bz$ appears **in the training [data]{.green}**, and $N = \sum_{\bz}{\green\#}\bz$ is the number of training data.
:::
::::
This formula is just a rewriting of formula (@eq-main-joint) for $L=N+1$, simplified by using the property of the factorial
$$(a+1)! = (a+1) \cdot a!$$
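Explicitly: with the new unit included we have $L = N+1$, and the value ${\blue z}$ occurs ${\green\#}z + 1$ times among the $L$ units, so its term in the numerator of (@eq-main-joint) splits as

$$
\Bigl(\frac{2^{\ya}}{M} + {\green\#}z\Bigr)!
=
\Bigl(\frac{2^{\ya}}{M} + {\green\#}z\Bigr)
\cdot
\Bigl(\frac{2^{\ya}}{M} + {\green\#}z - 1\Bigr)!
$$

while the denominator $\bigl(2^{\ya} + L - 1\bigr)!$ becomes $\bigl(2^{\ya} + N\bigr)!$.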
But the computation of formula (@eq-objectP) (for all values of $\bz$) must be done *only once* for a given task. For a new application we only need to combine these already-computed probabilities via sums and fractions. For example, in the three-variate case above, if in a new application we need to forecast $\red A\mo a$ given $\yellow C\mo c$, then we calculate
::::{.column-page-inset-right}
:::{.callout-note}
##
(example with $Z \defd ({\red A}, {\blue B}, {\yellow C})$)
$$
P(\red A\mo a \black \|\yellow C\mo c \black \and \data \and \yD)
=
\frac{
\sum_{\blue b}
P(\red A\mo a \black \and \blue B\mo b \black \and \yellow C\mo c \black \and \data \| \yD)
}{
\sum_{\purple \alpha}\sum_{\blue b}
P(\red A\mo {\purple \alpha} \black \and \blue B\mo b \black \and \yellow C\mo c \black \and \data \| \yD)
}
\quad
$$ {#eq-forecast}
:::
::::
where all $P(\red A\mo\dotso \black \and \blue B\mo\dotso \black \and \yellow C\mo\dotso \black \and \data \| \yD)$ are already computed.
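As a minimal illustration in R: suppose the joint probabilities from (@eq-objectP) are stored in a three-dimensional array (the array and its values below are made up for illustration; our actual code will store them in an `agent`-class object, described next). The forecast (@eq-forecast) is then just a sum and a normalization:

```r
## hypothetical array of precomputed joint probabilities
## P(A=a & B=b & C=c & data | D), indexed by the values of A, B, C
jointP <- array(c(0.10, 0.05, 0.20, 0.05, 0.10, 0.10,
                  0.15, 0.05, 0.10, 0.02, 0.03, 0.05),
                dim = c(2, 3, 2),
                dimnames = list(A = c('a1', 'a2'),
                                B = c('b1', 'b2', 'b3'),
                                C = c('c1', 'c2')))

## P(A=a | C='c2' & data & D), as in @eq-forecast:
## sum over the values b of B, then normalize over the values a of A
numer <- rowSums(jointP[, , 'c2'])  # numerator, for each value a
numer / sum(numer)                  # divide by the denominator
```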
\
Our prototype software must therefore include two main functions, which we can call as follows:
- `buildagent()` ([see code](https://github.com/pglpm/ADA511/blob/master/code/OPM-nominal/buildagent.R))
: computes $\green\#\bz$ for all values $\bz$, as well as the multiplicative factors
$$
\frac{
\bigl(2^{\ya} -1 \bigr)!
}{
\bigl(2^{\ya} + N \bigr)!
\cdot
{\bigl(\frac{2^{\ya}}{M} - 1\bigr)!}^M
}
$$
for all $k$, in (@eq-objectP). This computation is done once and for all in a given task, using the training $\data$ and the metadata $\yD$ provided. The result can be stored in an array or similar object, which we shall call an `agent`-class object.
- `infer()` ([see code](https://github.com/pglpm/ADA511/blob/master/code/OPM-nominal/infer.R))
: computes probabilities such as (@eq-forecast) at each new instance, using the stored `agent`-class object as well as the predictor variates and values provided with that instance, and the predictand variates requested at that instance.
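A hypothetical usage sketch (the file names are placeholders and the argument names are assumptions on our part; see the linked code for the exact interface):

```r
source('buildagent.R')   # assumed to be in the working directory
source('infer.R')

## once per task: build the agent from metadata and training data
## ('meta.csv' and 'train.csv' are placeholder file names)
agent <- buildagent(metadata = 'meta.csv', data = 'train.csv')

## at each new instance: e.g. predictand B, given predictor A = 'a1'
## (argument names are assumptions, matching the description above)
probs <- infer(agent, predictand = 'B', predictor = list(A = 'a1'))
```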
\
We shall also include five additional functions for convenience:
- `guessmetadata()`
: builds a preliminary metadata file, encoding the background information $\yD$, from some dataset.
- `decide()`
: makes a decision according to expected-utility maximization ([chapter @sec-basic-decisions]), using probabilities calculated with `infer()` together with a set of utilities; a generic sketch of this computation is given after this list.
- `rF()`
: draws one or more possible full-population frequency distributions $\vf$, according to the updated degree of belief $\p(F\mo\vf \| \data \and \yD)$.
- `plotFsamples1D()`
: plots, as a generalized scatter plot, the possible full-population marginal frequency distributions for a single (not joint) predictand variate. If required, it also plots the final probability obtained with `infer()`.
- `mutualinfo()`
: calculates the mutual information ([§@sec-entropy-mutualinfo]) between any two sets of variates.
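Here is a generic sketch of the computation behind `decide()`, with made-up probabilities and utilities (the actual interface of `decide()` may differ; see its code):

```r
## probabilities for the predictand values, e.g. as output by infer()
probs <- c(urgent = 0.7, nonurgent = 0.3)

## utility matrix: one row per decision, one column per true value
## (utility values here are made up for illustration)
utilities <- rbind(dispatch = c(urgent =  10, nonurgent = -2),
                   wait     = c(urgent = -50, nonurgent =  1))

expected <- utilities %*% probs          # expected utility of each decision
rownames(expected)[which.max(expected)]  # decision maximizing expected utility
```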
::::{.column-body-outset-right}
:::{.callout-caution}
## {{< fa user-edit >}} Exercise
Using the `and`-rule, prove (pay attention to the conditional "$\|$" bar):
$$
\frac{
\sum_{\blue b}
P(\red A\mo a \black \and \blue B\mo b \black \and \yellow C\mo c \black \and \data \| \yD)
}{
\sum_{\purple \alpha}\sum_{\blue b}
P(\red A\mo {\purple \alpha} \black \and \blue B\mo b \black \and \yellow C\mo c \black \and \data \| \yD)
}
=
\frac{
\sum_{\blue b}
P(\red A\mo a \black \and \blue B\mo b \black \and \yellow C\mo c \black \| \data \and \yD)
}{
\sum_{\purple \alpha}\sum_{\blue b}
P(\red A\mo {\purple \alpha} \black \and \blue B\mo b \black \and \yellow C\mo c \black \| \data \and \yD)
}
$$
\
This exercise shows that instead of
$$\P(\blue Z \mo z \black \and \green\data \black \| \yD)$$
we could calculate
$$
\P(
\blue Z \mo z
\black \|
\green\data
\black \and \yD)
$$
once for all possible values $\bz$, and use that. Mathematically and logically the two ways are completely equivalent. Numerically they can be different as regards precision or possible overflow errors. Using $\P( \blue Z \mo z \black \| \green\data \black \and \yD)$ would be convenient if our basic formula (@eq-main-joint) didn't contain the sum $\sum_k$ over the $k$ index. Our code shall instead use $\P(\blue Z \mo z \black \and \green\data \black \| \yD)$ because it leads to slightly more precision and speed in some tasks.
:::
::::
## Code optimization {#sec-code-optim}
The formulae of [chapter @sec-dirichlet-mix], if used as written, easily lead to two kinds of computational problem. First, they generate overflows and `NaN`s, owing to the factorials and their ratios. Second, the products over variates may involve so many terms as to require a long computation time. In the end we would have to wait a long time just to receive a string of `NaN`s.
The first problem is dealt with by rewriting the formulae in terms of logarithms, and renormalizing numerators and denominators of fractions. See for example the lines defining `auxalphas` in the [`buildagent()`](https://github.com/pglpm/ADA511/blob/master/code/OPM-nominal/buildagent.R) function, and the line that redefines `counts` one last time in the [`infer()`](https://github.com/pglpm/ADA511/blob/master/code/OPM-nominal/infer.R) function.
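For illustration, here is a sketch of that log-space strategy applied to the $k$-dependent factors of (@eq-objectP), with made-up values of $M$, $N$, and the counts (the actual code is organized differently); it uses the identity $\log x! = \operatorname{lgamma}(x+1)$:

```r
## made-up example: M categories with given training counts
M <- 4; counts <- c(40, 30, 20, 10); N <- sum(counts)

## logarithm of the k-dependent factors in @eq-objectP
logterm <- function(k) {
  alpha <- 2^k
  sum(lgamma(alpha / M + counts)) -  # log prod_z (alpha/M + #z - 1)!
    lgamma(alpha + N + 1) +          # - log (alpha + N)!
    lgamma(alpha) -                  # + log (alpha - 1)!
    M * lgamma(alpha / M)            # - log [(alpha/M - 1)!]^M
}
logterms <- sapply(0:20, logterm)

## renormalize in log-space before exponentiating, avoiding overflow
weights <- exp(logterms - max(logterms))
weights / sum(weights)
```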
The second problem is dealt with by reorganizing the sums as multiples of identical summands; see the lines working with `freqscounts` in the [`buildagent()`](https://github.com/pglpm/ADA511/blob/master/code/OPM-nominal/buildagent.R) function.
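The idea behind that reorganization, in a minimal sketch with made-up counts (the actual implementation differs in its details): when many values $\bz$ share the same count ${\green\#}\bz$, the log-product $\sum_{\bz}\operatorname{lgamma}(2^{\ya}/M + {\green\#}\bz)$ needs only one `lgamma` evaluation per *distinct* count:

```r
## made-up example: 1000 categories, almost all with count 0
counts <- c(rep(0, 990), rep(2, 8), 5, 7)
alphaM <- 2^10 / 1000                # illustrative value of 2^k / M

## naive: one lgamma evaluation per category
naive <- sum(lgamma(alphaM + counts))

## grouped: one lgamma evaluation per distinct count, weighted by frequency
tab <- table(counts)
grouped <- sum(as.numeric(tab) * lgamma(alphaM + as.numeric(names(tab))))

all.equal(naive, grouped)            # TRUE: same result, far fewer evaluations
```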
::::{.column-margin}
::: {.callout-tip}
## {{< fa rocket >}} For the extra curious
§6.1 in [*Numerical Recipes*](https://hvl.instructure.com/courses/28605/modules)
:::
::::