\chapter{Expectations}
Up to this point, our exploration of linear models relied only on least squares
and projections. We now begin discussing the statistical properties of our
estimators, starting by defining expected values. We assume that the reader
has a basic background in univariate mathematical statistics.
\section{Expected values}
\href{https://www.youtube.com/watch?v=6WKTzqZQgJE&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y&index=38}{Watch this video before beginning.}
If $X$ is a random variable having density function $f$,
the $k^{th}$ moment is defined as
$$
E[X^k] = \int_{-\infty}^{\infty} x^k f(x) dx.
$$
In the multivariate case, where $\bX$ is a random vector,
the $k^{th}$ moment of element $i$ of the vector is
given by
$$
E[X_i^k] = \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} x_i^k f(x_1, \ldots, x_n) dx_1 \cdots dx_n.
$$
It is worth asking whether this definition is consistent with all
of the subdistributions defined by the subvectors of $\bX$.
If $i_1, \ldots, i_p$ is any subset of the indices $1,\ldots, n$ and
$i_{p+1}, \ldots, i_{n}$ are the remaining indices, then the
joint density of $(X_{i_1},\ldots, X_{i_p})^t$ is
$$
g(x_{i_1}, \ldots, x_{i_p}) = \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty}
f(x_1, \ldots, x_n) dx_{i_{p+1}} \cdots dx_{i_{n}}.
$$
The $k^{th}$ moment of $X_{i_j}$ for $j \in \{1,\ldots, p\}$ is equivalently:
\begin{eqnarray*}
E[X_{i_j}^k] & = & \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty}
x_{i_j}^k g(x_{i_1}, \ldots, x_{i_p}) dx_{i_1} \cdots dx_{i_p} \\
& = & \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty}
x_{i_j}^k f(x_1, \ldots, x_n) dx_1 \cdots dx_n.
\end{eqnarray*}
(HW, prove this.) Thus, whether we use the marginal distribution
of $X_{i_j}$ or any joint distribution that includes it, the expected value is the same.
If $\bX$ is any random vector or matrix, then $E[\bX]$ is
simply the elementwise expected value defined above. Often
we will write $E[\bX] = \bmu$, or some other Greek letter,
adopting the convention that population parameters are Greek.
Standard notation is hindered somewhat in that uppercase letters
are typically used for random variables, but are also used
for matrices. We hope that the context will eliminate confusion.
\href{https://www.youtube.com/watch?v=GgNUixhQ6oI&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y&index=39}{Watch this video before beginning.}
Expected value rules translate well to the multivariate setting.
If $\bA$, $\bB$ and $\bC$ are constant matrices (or vectors) of conformable dimensions, then
$$
E[\bA \bX + \bB \bY + \bC] = \bA E[\bX] + \bB E[\bY] + \bC.
$$
Further, expected values commute with transposes and traces
$$
E[\bX^t] = E[\bX]^t
$$
and
$$
E[\mbox{tr}(\bX)] = \mbox{tr}(E[\bX]).
$$
\section{Variance}
\href{https://www.youtube.com/watch?v=Z5L0dU6Chmc&index=40&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y}{Watch this video before beginning.}
The multivariate variance of random vector $\bX$ is defined as
$$
\Var(\bX) = \bSigma = E[(\bX - \bmu)(\bX - \bmu)^t].
$$
Direct use of our matrix rules for expected values gives us the
analog of the univariate shortcut formula
$$
\bSigma = E[\bX\bX^t] - \bmu \bmu^t.
$$
Variances satisfy the property
$$
\Var(\bA \bX + \bB) = \bA \Var(\bX) \bA^t.
$$
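Both the shortcut formula and the variance rule are easy to check by simulation. Below is a small numerical sketch in NumPy (the particular mean, covariance, and matrices are arbitrary choices for illustration, not part of the theory):

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary mean vector and covariance matrix for illustration
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(mu, Sigma, size=200_000)  # rows are draws of X

# Shortcut formula: Sigma = E[X X^t] - mu mu^t, estimated from the sample
EXXt = X.T @ X / X.shape[0]
xbar = X.mean(axis=0)
Sigma_hat = EXXt - np.outer(xbar, xbar)   # close to Sigma

# Variance rule: Var(A X + b) = A Var(X) A^t
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
b = np.array([3.0, 3.0])
Z = X @ A.T + b                           # each row is A x + b
empirical = np.cov(Z.T)                   # close to A @ Sigma @ A.T
```

Note that `np.cov` expects variables in rows, hence the transpose of the draws matrix.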
\section{Multivariate covariances}
\href{https://www.youtube.com/watch?v=mddVO0zW64U&index=41&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y}{Watch this video before beginning.}
The multivariate covariance is given by
$$
\Cov(\bX, \bY) = E[(\bX - \bmu_x)(\bY - \bmu_y)^t]
= E[\bX \bY^t] - \bmu_x \bmu_y^t.
$$
This definition applies even if $\bX$ and $\bY$ are of different lengths.
Notice that the multivariate covariance is not symmetric in its arguments.
Moreover,
$$
\Cov(\bX, \bX) = \Var(\bX).
$$
Covariances satisfy some useful rules in that
$$
\Cov(\bA \bX, \bB \bY) = \bA \Cov(\bX, \bY) \bB^t
$$
and
$$
\Cov(\bX + \bY, \bZ) = \Cov(\bX, \bZ) + \Cov(\bY, \bZ).
$$
Multivariate covariances are useful for calculating the variance of sums of random vectors:
$$
\Var(\bX + \bY) = \Var(\bX) + \Var(\bY) + \Cov(\bX, \bY) + \Cov(\bY, \bX).
$$
A nifty consequence of these rules is that the covariance of $\bA \bX$ and
$\bB \bX$ is $\bA \bSigma \bB^t$. Thus $\bA \bX$ and $\bB \bX$
are uncorrelated if and only if $\bA \bSigma \bB^t = \bzero.$
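A small NumPy simulation makes the variance-of-a-sum identity concrete (the joint covariance below is an arbitrary positive definite choice; `cov` is a helper implementing the covariance definition above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary 4x4 positive definite joint covariance; X is the first two
# coordinates and Y the last two, so X and Y are dependent.
joint_Sigma = np.array([[1.0, 0.4, 0.2, 0.0],
                        [0.4, 1.5, 0.0, 0.3],
                        [0.2, 0.0, 1.0, 0.5],
                        [0.0, 0.3, 0.5, 2.0]])
W = rng.multivariate_normal(np.zeros(4), joint_Sigma, size=100_000)
X, Y = W[:, :2], W[:, 2:]

def cov(U, V):
    """Empirical multivariate covariance: E[(U - mu_u)(V - mu_v)^t]."""
    Uc = U - U.mean(axis=0)
    Vc = V - V.mean(axis=0)
    return Uc.T @ Vc / U.shape[0]

lhs = cov(X + Y, X + Y)  # Var(X + Y)
rhs = cov(X, X) + cov(Y, Y) + cov(X, Y) + cov(Y, X)
# The identity holds exactly here (not just approximately), because the
# empirical covariance is bilinear in its arguments, just like Cov.
```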
\section{Quadratic form moments}
\label{sec:qfm}
\href{https://www.youtube.com/watch?v=gdyG8FSxlqc&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y&index=42}{Watch this video before beginning.}
Let $\bX$ be from a distribution with mean $\bmu$ and variance
$\bSigma$. Then
$$
E[\bX^t \bA \bX] = \bmu^t \bA \bmu + \mbox{tr}(\bA \bSigma).
$$
Proof
\begin{eqnarray*}
E[\bX^t \bA \bX] & = & E[\mbox{tr}(\bX^t \bA \bX)]\\
& = & E[\mbox{tr}(\bA \bX \bX^t )]\\
& = & \mbox{tr}(E[\bA \bX \bX^t])\\
& = & \mbox{tr}(\bA E[\bX \bX^t])\\
& = & \mbox{tr}\{\bA [\Var(\bX) + \bmu \bmu^t]\}\\
& = & \mbox{tr}\{\bA\bSigma + \bA \bmu \bmu^t\}\\
& = & \mbox{tr}(\bA\bSigma) + \mbox{tr}(\bA \bmu \bmu^t)\\
& = & \mbox{tr}(\bA\bSigma) + \mbox{tr}(\bmu^t \bA \bmu )\\
& = & \mbox{tr}(\bA\bSigma) + \bmu^t \bA \bmu \\
\end{eqnarray*}
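The identity is straightforward to confirm by Monte Carlo. In the sketch below, the mean, covariance, and the (deliberately non-symmetric) matrix $\bA$ are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary illustrative mean, covariance, and (non-symmetric) A
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
A = np.array([[2.0, 1.0],
              [0.0, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=500_000)

# Monte Carlo estimate of E[X^t A X]: one quadratic form per draw
qf = np.einsum('ni,ij,nj->n', X, A, X)
mc = qf.mean()

# Closed form: mu^t A mu + tr(A Sigma)
theory = mu @ A @ mu + np.trace(A @ Sigma)  # 8 + 4.3 = 12.3 here
```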
\section{BLUE}
\href{https://www.youtube.com/watch?v=oeN8IzLFHls&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y&index=43}{Watch this video before beginning.}
Now that we have moments, we can discuss mean and variance properties of the
least squares estimators. In particular, note that if $\bY$ satisfies
$E[\bY] = \bX \bbeta$ and $\Var(\bY) = \sigma^2 \bI$, then $\hat \bbeta$
satisfies:
$$E[\hat \bbeta] = \xtxinv \bX^t E[\bY] = \xtxinv \xtx \bbeta = \bbeta.$$
Thus, under these conditions $\hat \bbeta$ is unbiased. In addition, we have
that
$$
\Var(\hat \bbeta) = \Var\{\xtxinv \bX^t \bY\} = \xtxinv \bX^t \Var(\bY) \bX \xtxinv
= \xtxinv \sigma^2.
$$
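Both moment results can be checked by repeatedly simulating data from the model and refitting. In the NumPy sketch below, the design matrix, $\bbeta$, and $\sigma$ are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)

# Arbitrary illustrative design, coefficients, and error standard deviation
n, sigma = 100, 2.0
X = np.column_stack([np.ones(n), rng.uniform(size=(n, 2))])
beta = np.array([1.0, -2.0, 0.5])
xtxinv = np.linalg.inv(X.T @ X)

# Simulate B datasets Y = X beta + e with Var(e) = sigma^2 I, refit each
B = 20_000
E = sigma * rng.standard_normal((B, n))
Y = X @ beta + E                   # row b is the b-th simulated response
betahats = Y @ X @ xtxinv          # row b is betahat^t for dataset b

mean_betahat = betahats.mean(axis=0)   # approx beta (unbiasedness)
var_betahat = np.cov(betahats.T)       # approx sigma^2 (X^t X)^{-1}
```

The row-wise fit works because $\hat \bbeta^t = \by^t \bX \xtxinv$, using the symmetry of $\xtxinv$.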
We can extend these results to linear contrasts of $\bbeta$
to say that $\bq^t \hat \bbeta$ is the {\it best}
estimator of $\bq^t \bbeta$ in the sense of minimizing the variance among
linear (in $\bY$) unbiased estimators. It is important to consider
unbiased estimators, since we could always minimize the variance
by defining an estimator to be constant (hence variance 0). If one
removes the restriction of unbiasedness, then minimum variance cannot
be the definition of ``best''. Often one then looks to mean squared
error, the squared bias plus the variance, instead. In what follows
we only consider linear unbiased estimators.
We give Best Linear Unbiased Estimators the acronym
BLUE. It is remarkably easy to prove the result.
Consider estimating $\bq^t \bbeta$.
Clearly, $\bq^t \hat \bbeta$ is both unbiased and linear in $\bY$.
Also note that $\Var(\bq^t \hat \bbeta) = \bq^t \xtxinv \bq \sigma^2$.
Let $\bk^t \bY$ be another linear unbiased estimator, so that
$E[\bk^t \bY] = \bq^t \bbeta$. But, $E[\bk^t \bY] = \bk^t \bX \bbeta$.
It follows that since $\bq^t \bbeta = \bk^t \bX \bbeta$ must hold for all possible $\bbeta$, we have that
$\bk^t \bX = \bq^t$. Finally note that
$$
\Cov(\bq^t \hat \bbeta, \bk^t \bY)
= \bq^t \xtxinv \bX^t \bk \sigma^2.
$$
Since $\bk^t \bX = \bq^t$, equivalently $\bX^t \bk = \bq$, we have that
$$
\Cov(\bq^t \hat \bbeta, \bk^t \bY) = \bq^t \xtxinv \bq \sigma^2 = \Var(\bq^t \hat \bbeta).
$$
Now we can execute the proof easily.
\begin{eqnarray*}
\Var(\bq^t \hat \bbeta - \bk^t \bY) & = &
\Var(\bq^t \hat \bbeta) + \Var(\bk^t \bY) - 2 \Cov(\bq^t \hat \bbeta, \bk^t \bY) \\
& = & \Var(\bk^t \bY) - \Var(\bq^t \hat \bbeta) \\
& \geq & 0.
\end{eqnarray*}
Here the final inequality holds because variances are non-negative. Thus
$\Var(\bk^t \bY) \geq \Var(\bq^t \hat \bbeta)$, proving the result.
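The proof can also be illustrated by direct construction: take the BLUE weight vector $\bk_0 = \bX \xtxinv \bq$, so that $\bq^t \hat \bbeta = \bk_0^t \bY$, and perturb it by any vector orthogonal to the columns of $\bX$. This preserves unbiasedness but can only increase the variance. A NumPy sketch (with an arbitrary illustrative design):

```python
import numpy as np

rng = np.random.default_rng(4)

# Arbitrary illustrative design: intercept plus two Gaussian covariates
n, sigma = 50, 1.0
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
q = np.array([0.0, 1.0, 0.0])      # contrast picking out the second coefficient
xtxinv = np.linalg.inv(X.T @ X)

# BLUE weights: q^t betahat = k0^t Y with k0 = X (X^t X)^{-1} q
k0 = X @ xtxinv @ q

# Another linear unbiased estimator: perturb k0 by a vector m with
# X^t m = 0 (project a random vector off the column space of X)
m = rng.standard_normal(n)
m = m - X @ (xtxinv @ (X.T @ m))
k = k0 + m                         # still satisfies k^t X = q^t

unbiased_gap = k @ X - q           # ~ 0: the unbiasedness constraint holds
var_blue = sigma**2 * (k0 @ k0)    # Var(q^t betahat)
var_other = sigma**2 * (k @ k)     # Var(k^t Y), never smaller than var_blue
```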
Notice that normality was not required at any point in the proof, only
restrictions on the first two moments. In what follows, we'll see the
consequences of assuming normality.