\chapter{Expectations}
Up to this point, our exploration of linear models relied only on least squares
and projections. We now begin discussing the statistical properties of our
estimators, starting by defining expected values. We assume that the reader
has a basic background in univariate mathematical statistics.
\section{Expected values}
\href{https://www.youtube.com/watch?v=6WKTzqZQgJE&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y&index=38}{Watch this video before beginning.}
If $X$ is a random variable having density function $f$,
the $k^{th}$ moment is defined as
$$
E[X^k] = \int_{-\infty}^{\infty} x^k f(x) dx.
$$
In the multivariate case, where $\bX$ is a random vector,
the $k^{th}$ moment of element $i$ of the vector is
given by
$$
E[X_i^k] = \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} x_i^k f(x_1, \ldots, x_n) dx_1 \cdots dx_n.
$$
It is worth asking whether this definition is consistent with all
of the subdistributions defined by the subvectors of $\bX$.
If $i_1, \ldots, i_p$ is any subset of the indices $1,\ldots, n$ and
$i_{p+1}, \ldots, i_{n}$ are the remaining indices, then the
joint density of $(X_{i_1},\ldots, X_{i_p})^t$ is
$$
g(x_{i_1}, \ldots, x_{i_p}) = \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty}
f(x_1, \ldots, x_n) dx_{i_{p+1}} \cdots dx_{i_{n}}.
$$
The $k^{th}$ moment of $X_{i_j}$ for $j \in \{1,\ldots, p\}$ is equivalently:
\begin{eqnarray*}
E[X_{i_j}^k] & = & \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty}
x_{i_j}^k g(x_{i_1}, \ldots, x_{i_p}) dx_{i_1} \cdots dx_{i_p} \\
& = & \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty}
x_{i_j}^k f(x_1, \ldots, x_n) dx_1 \cdots dx_n.
\end{eqnarray*}
(HW, prove this.) Thus, whether we use the marginal distribution
of $X_{i_j}$ or any joint distribution that includes it, the expected value is the same.
If $\bX$ is any random vector or matrix, then $E[\bX]$ is
simply the elementwise expected value defined above. Often
we will write $E[\bX] = \bmu$, or some other Greek letter,
adopting the convention that population parameters are Greek.
Standard notation is hindered somewhat in that uppercase letters
are typically used for random variables, but are also used
for matrices. We hope that the context will eliminate confusion.
\href{https://www.youtube.com/watch?v=GgNUixhQ6oI&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y&index=39}{Watch this video before beginning.}
Expected value rules translate well to the multivariate setting.
If $\bA$, $\bB$ and $\bC$ are constant matrices (or vectors) of conformable dimensions, then
$$
E[\bA \bX + \bB \bY + \bC] = \bA E[\bX] + \bB E[\bY] + \bC.
$$
Further, expected values commute with transposes and traces
$$
E[\bX^t] = E[\bX]^t
$$
and
$$
E[\mbox{tr}(\bX)] = \mbox{tr}(E[\bX]).
$$
\section{Variance}
\href{https://www.youtube.com/watch?v=Z5L0dU6Chmc&index=40&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y}{Watch this video before beginning.}
The multivariate variance of random vector $\bX$ is defined as
$$
\Var(\bX) = \bSigma = E[(\bX - \bmu)(\bX - \bmu)^t].
$$
Direct use of our matrix rules for expected values gives us the
analog of the univariate shortcut formula
$$
\bSigma = E[\bX\bX^t] - \bmu \bmu^t.
$$
Variances satisfy the property
$$
\Var(\bA \bX + \bB) = \bA \Var(\bX) \bA^t.
$$
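Both the shortcut formula and the variance rule are easy to check by simulation. Below is a small numerical sketch in NumPy (the particular mean, covariance, and matrices are arbitrary choices for illustration, not part of the theory):

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary mean vector and covariance matrix for illustration
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(mu, Sigma, size=200_000)  # rows are draws of X

# Shortcut formula: Sigma = E[X X^t] - mu mu^t, estimated from the sample
EXXt = X.T @ X / X.shape[0]
xbar = X.mean(axis=0)
Sigma_hat = EXXt - np.outer(xbar, xbar)   # close to Sigma

# Variance rule: Var(A X + b) = A Var(X) A^t
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
b = np.array([3.0, 3.0])
Z = X @ A.T + b                           # each row is A x + b
empirical = np.cov(Z.T)                   # close to A @ Sigma @ A.T
```

Note that `np.cov` expects variables in rows, hence the transpose of the draws matrix.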
\section{Multivariate covariances}
\href{https://www.youtube.com/watch?v=mddVO0zW64U&index=41&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y}{Watch this video before beginning.}
The multivariate covariance is given by
$$
\Cov(\bX, \bY) = E[(\bX - \bmu_x)(\bY - \bmu_y)^t]
= E[\bX \bY^t] - \bmu_x \bmu_y^t.
$$
This definition applies even if $\bX$ and $\bY$ are of different lengths.
Notice that the multivariate covariance is not symmetric in its arguments.
Moreover,
$$
\Cov(\bX, \bX) = \Var(\bX).
$$
Covariances satisfy some useful rules in that
$$
\Cov(\bA \bX, \bB \bY) = \bA \Cov(\bX, \bY) \bB^t
$$
and
$$
\Cov(\bX + \bY, \bZ) = \Cov(\bX, \bZ) + \Cov(\bY, \bZ).
$$
Multivariate covariances are useful for calculating the variance of sums of random vectors:
$$
\Var(\bX + \bY) = \Var(\bX) + \Var(\bY) + \Cov(\bX, \bY) + \Cov(\bY, \bX).
$$
A nifty consequence of these rules is that the covariance of $\bA \bX$ and
$\bB \bX$ is $\bA \bSigma \bB^t$. Thus $\bA \bX$ and $\bB \bX$
are uncorrelated if and only if $\bA \bSigma \bB^t = \bzero.$
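A small NumPy simulation makes the variance-of-a-sum identity concrete (the joint covariance below is an arbitrary positive definite choice; `cov` is a helper implementing the covariance definition above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary 4x4 positive definite joint covariance; X is the first two
# coordinates and Y the last two, so X and Y are dependent.
joint_Sigma = np.array([[1.0, 0.4, 0.2, 0.0],
                        [0.4, 1.5, 0.0, 0.3],
                        [0.2, 0.0, 1.0, 0.5],
                        [0.0, 0.3, 0.5, 2.0]])
W = rng.multivariate_normal(np.zeros(4), joint_Sigma, size=100_000)
X, Y = W[:, :2], W[:, 2:]

def cov(U, V):
    """Empirical multivariate covariance: E[(U - mu_u)(V - mu_v)^t]."""
    Uc = U - U.mean(axis=0)
    Vc = V - V.mean(axis=0)
    return Uc.T @ Vc / U.shape[0]

lhs = cov(X + Y, X + Y)  # Var(X + Y)
rhs = cov(X, X) + cov(Y, Y) + cov(X, Y) + cov(Y, X)
# The identity holds exactly here (not just approximately), because the
# empirical covariance is bilinear in its arguments, just like Cov.
```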
\section{Quadratic form moments}
\label{sec:qfm}
\href{https://www.youtube.com/watch?v=gdyG8FSxlqc&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y&index=42}{Watch this video before beginning.}
Let $\bX$ be from a distribution with mean $\bmu$ and variance
$\bSigma$. Then
$$
E[\bX^t \bA \bX] = \bmu^t \bA \bmu + \mbox{tr}(\bA \bSigma).
$$
Proof
\begin{eqnarray*}
E[\bX^t \bA \bX] & = & E[\mbox{tr}(\bX^t \bA \bX)]\\
& = & E[\mbox{tr}(\bA \bX \bX^t )]\\
& = & \mbox{tr}(E[\bA \bX \bX^t])\\
& = & \mbox{tr}(\bA E[\bX \bX^t])\\
& = & \mbox{tr}\{\bA [\Var(\bX) + \bmu \bmu^t]\}\\
& = & \mbox{tr}\{\bA\bSigma + \bA \bmu \bmu^t\}\\
& = & \mbox{tr}(\bA\bSigma) + \mbox{tr}(\bA \bmu \bmu^t)\\
& = & \mbox{tr}(\bA\bSigma) + \mbox{tr}(\bmu^t \bA \bmu )\\
& = & \mbox{tr}(\bA\bSigma) + \bmu^t \bA \bmu \\
\end{eqnarray*}
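The identity is straightforward to confirm by Monte Carlo. In the sketch below, the mean, covariance, and the (deliberately non-symmetric) matrix $\bA$ are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary illustrative mean, covariance, and (non-symmetric) A
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
A = np.array([[2.0, 1.0],
              [0.0, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=500_000)

# Monte Carlo estimate of E[X^t A X]: one quadratic form per draw
qf = np.einsum('ni,ij,nj->n', X, A, X)
mc = qf.mean()

# Closed form: mu^t A mu + tr(A Sigma)
theory = mu @ A @ mu + np.trace(A @ Sigma)  # 8 + 4.3 = 12.3 here
```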
\section{BLUE}
\href{https://www.youtube.com/watch?v=oeN8IzLFHls&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y&index=43}{Watch this video before beginning.}
Now that we have moments, we can discuss mean and variance properties of the
least squares estimators. In particular, note that if $\bY$ satisfies
$E[\bY] = \bX \bbeta$ and $\Var(\bY) = \sigma^2 \bI$, then $\hat \bbeta$
satisfies:
$$E[\hat \bbeta] = \xtxinv \bX^t E[\bY] = \xtxinv \xtx \bbeta = \bbeta.$$
Thus, under these conditions $\hat \bbeta$ is unbiased. In addition, we have
that
$$
\Var(\hat \bbeta) = \Var\{\xtxinv \bX^t \bY\} = \xtxinv \bX^t \Var(\bY) \bX \xtxinv
= \xtxinv \sigma^2.
$$
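Both moment results can be checked by repeatedly simulating data from the model and refitting. In the NumPy sketch below, the design matrix, $\bbeta$, and $\sigma$ are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)

# Arbitrary illustrative design, coefficients, and error standard deviation
n, sigma = 100, 2.0
X = np.column_stack([np.ones(n), rng.uniform(size=(n, 2))])
beta = np.array([1.0, -2.0, 0.5])
xtxinv = np.linalg.inv(X.T @ X)

# Simulate B datasets Y = X beta + e with Var(e) = sigma^2 I, refit each
B = 20_000
E = sigma * rng.standard_normal((B, n))
Y = X @ beta + E                   # row b is the b-th simulated response
betahats = Y @ X @ xtxinv          # row b is betahat^t for dataset b

mean_betahat = betahats.mean(axis=0)   # approx beta (unbiasedness)
var_betahat = np.cov(betahats.T)       # approx sigma^2 (X^t X)^{-1}
```

The row-wise fit works because $\hat \bbeta^t = \by^t \bX \xtxinv$, using the symmetry of $\xtxinv$.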
We can extend these results to linear contrasts of $\bbeta$
to say that $\bq^t \hat \bbeta$ is the {\it best}
estimator of $\bq^t \bbeta$ in the sense of minimizing the variance among
linear (in $\bY$) unbiased estimators. It is important to consider
unbiased estimators, since we could always minimize the variance
by defining an estimator to be constant (hence variance 0). If one
removes the restriction of unbiasedness, then minimum variance cannot
be the definition of ``best''. Often one then looks to mean squared
error, the squared bias plus the variance, instead. In what follows
we only consider linear unbiased estimators.
We give Best Linear Unbiased Estimators the acronym
BLUE. It is remarkably easy to prove the result.
Consider estimating $\bq^t \bbeta$.
Clearly, $\bq^t \hat \bbeta$ is both unbiased and linear in $\bY$.
Also note that $\Var(\bq^t \hat \bbeta) = \bq^t \xtxinv \bq \sigma^2$.
Let $\bk^t \bY$ be another linear unbiased estimator, so that
$E[\bk^t \bY] = \bq^t \bbeta$. But, $E[\bk^t \bY] = \bk^t \bX \bbeta$.
It follows that since $\bq^t \bbeta = \bk^t \bX \bbeta$ must hold for all possible $\bbeta$, we have that
$\bk^t \bX = \bq^t$. Finally note that
$$
\Cov(\bq^t \hat \bbeta, \bk^t \bY)
= \bq^t \xtxinv \bX^t \bk \sigma^2.
$$
Since $\bk^t \bX = \bq^t$, equivalently $\bX^t \bk = \bq$, we have that
$$
\Cov(\bq^t \hat \bbeta, \bk^t \bY) = \bq^t \xtxinv \bq \sigma^2 = \Var(\bq^t \hat \bbeta).
$$
Now we can execute the proof easily.
\begin{eqnarray*}
\Var(\bq^t \hat \bbeta - \bk^t \bY) & = &
\Var(\bq^t \hat \bbeta) + \Var(\bk^t \bY) - 2 \Cov(\bq^t \hat \bbeta, \bk^t \bY) \\
& = & \Var(\bk^t \bY) - \Var(\bq^t \hat \bbeta) \\
& \geq & 0.
\end{eqnarray*}
Here the final inequality holds because variances are non-negative. Thus
$\Var(\bk^t \bY) \geq \Var(\bq^t \hat \bbeta)$, proving the result.
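The proof can also be illustrated by direct construction: take the BLUE weight vector $\bk_0 = \bX \xtxinv \bq$, so that $\bq^t \hat \bbeta = \bk_0^t \bY$, and perturb it by any vector orthogonal to the columns of $\bX$. This preserves unbiasedness but can only increase the variance. A NumPy sketch (with an arbitrary illustrative design):

```python
import numpy as np

rng = np.random.default_rng(4)

# Arbitrary illustrative design: intercept plus two Gaussian covariates
n, sigma = 50, 1.0
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
q = np.array([0.0, 1.0, 0.0])      # contrast picking out the second coefficient
xtxinv = np.linalg.inv(X.T @ X)

# BLUE weights: q^t betahat = k0^t Y with k0 = X (X^t X)^{-1} q
k0 = X @ xtxinv @ q

# Another linear unbiased estimator: perturb k0 by a vector m with
# X^t m = 0 (project a random vector off the column space of X)
m = rng.standard_normal(n)
m = m - X @ (xtxinv @ (X.T @ m))
k = k0 + m                         # still satisfies k^t X = q^t

unbiased_gap = k @ X - q           # ~ 0: the unbiasedness constraint holds
var_blue = sigma**2 * (k0 @ k0)    # Var(q^t betahat)
var_other = sigma**2 * (k @ k)     # Var(k^t Y), never smaller than var_blue
```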
Notice that normality was not required at any point in the proof, only
restrictions on the first two moments. In what follows, we'll see the
consequences of assuming normality.