#LyX 2.0 created this file. For more info see http://www.lyx.org/
\lyxformat 413
\begin_document
\begin_header
\textclass scrbook
\begin_preamble
\setcounter{chapter}{1}
\usepackage{graphicx}
\usepackage{pict2e}
\usepackage{graphpap}
\usepackage{color}
\usepackage{bm}
\end_preamble
\use_default_options false
\maintain_unincluded_children false
\language english
\language_package default
\inputencoding auto
\fontencoding global
\font_roman default
\font_sans default
\font_typewriter default
\font_default_family default
\use_non_tex_fonts false
\font_sc false
\font_osf false
\font_sf_scale 100
\font_tt_scale 100

\graphics default
\default_output_format default
\output_sync 0
\bibtex_command default
\index_command default
\paperfontsize default
\spacing single
\use_hyperref false
\papersize default
\use_geometry false
\use_amsmath 1
\use_esint 1
\use_mhchem 1
\use_mathdots 1
\cite_engine basic
\use_bibtopic false
\use_indices false
\paperorientation portrait
\suppress_date false
\use_refstyle 0
\index Index
\shortcut idx
\color #008000
\end_index
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\paragraph_indentation default
\quotes_language english
\papercolumns 1
\papersides 1
\paperpagestyle default
\tracking_changes false
\output_changes false
\html_math_output 0
\html_css_as_file 0
\html_be_strict false
\end_header

\begin_body

\begin_layout Section
Derivatives in 
\begin_inset Formula $\mathbb{R}^{n}$
\end_inset


\end_layout

\begin_layout Standard
\begin_inset CommandInset label
LatexCommand label
name "sec:multivariate_derivatives"

\end_inset


\end_layout

\begin_layout Subsection
Partial Derivatives
\end_layout

\begin_layout Standard

\series bold
Example (Least Squares)
\end_layout

\begin_layout Standard
Consider the following problem: We are given a number of noisy measurements
 
\begin_inset Formula $y_{1},...,y_{m}\in\mathbb{R}$
\end_inset

 that we measured at corresponding base points 
\begin_inset Formula $x_{1},...,x_{m}\in\mathbb{R}$
\end_inset

.
 Now we want to model the 
\begin_inset Formula $y_{i}$
\end_inset

 as an affine function of 
\begin_inset Formula $x_{i}$
\end_inset

, i.e.
 we want to find a line 
\begin_inset Formula $g(x)=wx+b$
\end_inset

 such that 
\begin_inset Formula $y_{i}\approx g(x_{i})$
\end_inset

.
 Since our measurements are noisy, there is probably no line that contains
 all points 
\begin_inset Formula $(x_{i},y_{i})$
\end_inset

.
 Among all lines that match the points more or less well, we have
 to choose the one that matches our points 
\begin_inset Formula $(x_{i},y_{i})$
\end_inset

 best.
 The meaning of ''matching best'' is usually expressed by means of a loss
 function.
 The most common choice is the sum of the squared deviations from that line
\begin_inset Formula 
\begin{eqnarray}
\ell(w,b) & = & \sum_{i=1}^{m}\left(y_{i}-(wx_{i}+b)\right)^{2}.\label{eq:squared_loss}
\end{eqnarray}

\end_inset

This 
\emph on
squared loss
\begin_inset Index idx
status collapsed

\begin_layout Plain Layout
squared loss
\end_layout

\end_inset


\emph default
 is a function of the line parameters 
\begin_inset Formula $w$
\end_inset

 and 
\begin_inset Formula $b$
\end_inset

.
 The approach of finding a line by minimizing (
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:squared_loss"

\end_inset

) is known as 
\emph on
least squares
\emph default
 and goes back to Gauß, who became famous for using this approach to
 recover the lost dwarf planet Ceres.
\end_layout

\begin_layout Standard
\align right
\begin_inset Formula $\lhd$
\end_inset


\end_layout

\begin_layout Standard
Our loss is a function of two parameters, i.e.
 
\begin_inset Formula $\ell:\mathbb{R}^{2}\rightarrow\mathbb{R}$
\end_inset

.
 So far, we have only looked at finding the optima of functions 
\begin_inset Formula $f:\mathbb{R}\rightarrow\mathbb{R}$
\end_inset

.
 However, there are only a few changes in the rules for taking derivatives
 or finding optima when dealing with 
\emph on

\begin_inset Index idx
status collapsed

\begin_layout Plain Layout
multivariate functions
\end_layout

\end_inset

multivariate functions
\emph default
 
\begin_inset Formula $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$
\end_inset

.
\end_layout

\begin_layout Standard
The first change is obvious.
 Since our function has more than one variable, we must take the derivative
 with respect to more than one variable now.
 This kind of derivative is known as 
\emph on

\begin_inset Index idx
status collapsed

\begin_layout Plain Layout
partial derivative
\end_layout

\end_inset

partial derivative
\emph default
.
\end_layout

\begin_layout Subsubsection*
Definition (Partial Derivative):
\end_layout

\begin_layout Standard
Let 
\begin_inset Formula $f:\mathbb{R}^{n}\rightarrow\mathbb{R},\,(x_{1},...,x_{n})\mapsto f(x_{1},...,x_{n})$
\end_inset

 be a multivariate function.
 The partial derivative with respect to 
\begin_inset Formula $x_{i}$
\end_inset

 is the ordinary derivative of 
\begin_inset Formula $f$
\end_inset

 with respect to 
\begin_inset Formula $x_{i}$
\end_inset

 while treating all other 
\begin_inset Formula $x_{j}$
\end_inset

, 
\begin_inset Formula $j\not=i$
\end_inset

 as constants.
 The partial derivative with respect to 
\begin_inset Formula $x_{i}$
\end_inset

 at the point 
\begin_inset Formula $\mathbf{z}=(z_{1},...,z_{n})$
\end_inset

 is denoted by 
\begin_inset Formula $\frac{\partial f}{\partial x_{i}}(\mathbf{z})$
\end_inset

, 
\begin_inset Formula $\frac{\partial}{\partial x_{i}}f(\mathbf{z})$
\end_inset

 or sometimes 
\begin_inset Formula $\frac{\partial}{\partial x_{i}}|_{\mathbf{z}}f$
\end_inset

.
\end_layout

\begin_layout Standard
\align right
\begin_inset Formula $\Diamond$
\end_inset


\end_layout
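
\begin_layout Standard

\series bold
Example (Partial Derivatives)
\end_layout

\begin_layout Standard
As a small illustration of the definition, assume the function 
\begin_inset Formula $f(x_{1},x_{2})=x_{1}^{2}x_{2}+3x_{2}$
\end_inset

, which is chosen here only to show the mechanics.
 Treating 
\begin_inset Formula $x_{2}$
\end_inset

 as a constant gives the partial derivative with respect to 
\begin_inset Formula $x_{1}$
\end_inset

, and vice versa:
\begin_inset Formula 
\begin{eqnarray*}
\frac{\partial f}{\partial x_{1}}(x_{1},x_{2}) & = & 2x_{1}x_{2},\\
\frac{\partial f}{\partial x_{2}}(x_{1},x_{2}) & = & x_{1}^{2}+3.
\end{eqnarray*}

\end_inset

At the point 
\begin_inset Formula $\mathbf{z}=(1,2)$
\end_inset

 we obtain 
\begin_inset Formula $\frac{\partial f}{\partial x_{1}}(\mathbf{z})=2\cdot1\cdot2=4$
\end_inset

 and 
\begin_inset Formula $\frac{\partial f}{\partial x_{2}}(\mathbf{z})=1^{2}+3=4$
\end_inset

.
\end_layout

\begin_layout Standard
\align right
\begin_inset Formula $\lhd$
\end_inset


\end_layout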

\begin_layout Standard
When dealing with a function of 
\begin_inset Formula $n$
\end_inset

 variables, we can compute 
\begin_inset Formula $n$
\end_inset

 partial derivatives.
 Computing all partial derivatives and assembling them in a vector gives
 us the so-called 
\emph on

\begin_inset Index idx
status collapsed

\begin_layout Plain Layout
gradient
\end_layout

\end_inset

gradient.
\end_layout

\begin_layout Subsection
Gradient and Optima in 
\begin_inset Formula $\mathbb{R}^{n}$
\end_inset


\end_layout

\begin_layout Subsubsection*
Definition (Gradient):
\end_layout

\begin_layout Standard
The 
\begin_inset Formula $n$
\end_inset

-dimensional vector containing all partial derivatives of a function 
\begin_inset Formula $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$
\end_inset

 at the position 
\begin_inset Formula $\mathbf{z}$
\end_inset

 
\begin_inset Formula 
\begin{eqnarray*}
\nabla f(\mathbf{z}) & = & \left(\frac{\partial f}{\partial x_{1}}(\mathbf{z}),...,\frac{\partial f}{\partial x_{n}}(\mathbf{z})\right)
\end{eqnarray*}

\end_inset

 is called the gradient of 
\begin_inset Formula $f$
\end_inset

 at 
\begin_inset Formula $\mathbf{z}$
\end_inset

.
 It is denoted by 
\begin_inset Formula $\nabla f(\mathbf{z})$
\end_inset

 or sometimes 
\begin_inset Formula $\nabla|_{\mathbf{z}}f$
\end_inset

.
\end_layout

\begin_layout Standard
\align right
\begin_inset Formula $\Diamond$
\end_inset


\end_layout

\begin_layout Standard
Similarly to finding optima of 
\emph on

\begin_inset Index idx
status collapsed

\begin_layout Plain Layout
univariate functions
\end_layout

\end_inset

univariate functions
\emph default
 
\begin_inset Formula $f:\mathbb{R}\rightarrow\mathbb{R}$
\end_inset

, where 
\begin_inset Formula $f'(x)=0$
\end_inset

 was a necessary condition for 
\begin_inset Formula $x$
\end_inset

 being an optimum, a necessary condition for 
\begin_inset Formula $\mathbf{x}=(x_{1},...,x_{n})$
\end_inset

 being an optimum is that all partial derivatives vanish, i.e.
 
\begin_inset Formula $\nabla f(\mathbf{x})=(0,...,0)$
\end_inset

.
\end_layout
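
\begin_layout Standard

\series bold
Example (Vanishing Gradient)
\end_layout

\begin_layout Standard
To illustrate this condition, assume the two functions 
\begin_inset Formula $f(x_{1},x_{2})=x_{1}^{2}+3x_{2}^{2}$
\end_inset

 and 
\begin_inset Formula $h(x_{1},x_{2})=x_{1}^{2}-x_{2}^{2}$
\end_inset

, both chosen here only for illustration.
 Their gradients are 
\begin_inset Formula 
\begin{eqnarray*}
\nabla f(\mathbf{x}) & = & \left(2x_{1},6x_{2}\right),\\
\nabla h(\mathbf{x}) & = & \left(2x_{1},-2x_{2}\right).
\end{eqnarray*}

\end_inset

Both gradients vanish only at 
\begin_inset Formula $\mathbf{x}=(0,0)$
\end_inset

.
 For 
\begin_inset Formula $f$
\end_inset

 this point is indeed a minimum, since 
\begin_inset Formula $f(\mathbf{x})\geq0=f(0,0)$
\end_inset

.
 For 
\begin_inset Formula $h$
\end_inset

, however, it is neither a minimum nor a maximum, because 
\begin_inset Formula $h(t,0)=t^{2}>0$
\end_inset

 and 
\begin_inset Formula $h(0,t)=-t^{2}<0$
\end_inset

 for every 
\begin_inset Formula $t\neq0$
\end_inset

.
 A vanishing gradient is therefore necessary, but not sufficient, for an
 optimum.
\end_layout

\begin_layout Standard
\align right
\begin_inset Formula $\lhd$
\end_inset


\end_layout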

\begin_layout Standard

\series bold
Remark (Gradient Descent)
\end_layout

\begin_layout Standard
The gradient of a function 
\begin_inset Formula $f$
\end_inset

 at a point 
\begin_inset Formula $\mathbf{x}$
\end_inset

 has a remarkable property.
 It always points in the direction of steepest ascent of the function
 
\begin_inset Formula $f$
\end_inset

 at 
\begin_inset Formula $\mathbf{x}$
\end_inset

.
 This property is used in the so-called 
\emph on

\begin_inset Index idx
status collapsed

\begin_layout Plain Layout
gradient descent
\end_layout

\end_inset

gradient descent
\emph default
 algorithm.
 Starting at a random position 
\begin_inset Formula $\mathbf{x}^{(0)}$
\end_inset

, gradient descent computes the gradient 
\begin_inset Formula $\nabla f(\mathbf{x}^{(0)})$
\end_inset

 of 
\begin_inset Formula $f$
\end_inset

 at 
\begin_inset Formula $\mathbf{x}^{(0)}$
\end_inset

 and takes a small step in its direction when maximizing, or in the opposite
 direction when minimizing.
 Let us assume that we want to minimize a function (for maximizing, the
 gradient is added instead of subtracted).
 We start at a random position 
\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none

\begin_inset Formula $\mathbf{x}^{(0)}$
\end_inset

, compute the gradient 
\begin_inset Formula $\nabla f(\mathbf{x}^{(0)})$
\end_inset

 and obtain a new value for 
\begin_inset Formula $\mathbf{x}$
\end_inset

 via 
\begin_inset Formula $\mathbf{x}^{(1)}=\mathbf{x}^{(0)}-\alpha\nabla f(\mathbf{x}^{(0)})$
\end_inset

.
 At 
\begin_inset Formula $\mathbf{x}^{(1)}$
\end_inset

, we repeat this procedure.
 This yields the general update rule 
\begin_inset Formula 
\begin{eqnarray*}
\mathbf{x}^{(t+1)} & = & \mathbf{x}^{(t)}-\alpha\nabla f(\mathbf{x}^{(t)}).
\end{eqnarray*}

\end_inset

The difficulty lies in choosing the scaling constant 
\begin_inset Formula $\alpha$
\end_inset

.
 This is usually done by 
\family default
\series default
\shape default
\size default
\emph on
\bar default
\noun default
\color inherit

\begin_inset Index idx
status collapsed

\begin_layout Plain Layout
line search
\end_layout

\end_inset

line search
\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none
 algorithms, which we do not describe here.
 However, choosing a small constant value for 
\begin_inset Formula $\alpha$
\end_inset

 also works for simple examples.
 The procedure is repeated until the gradient becomes zero (in practice,
 until its norm falls below a small tolerance), i.e.
 
\begin_inset Formula $\nabla f(\mathbf{x}^{(t)})=(0,...,0)$
\end_inset

.
 Gradient descent is not guaranteed to yield a global minimum.
 It simply gives you the local minimum that you reach when running down
 the surface defined by the function 
\begin_inset Formula $f$
\end_inset

 starting at 
\begin_inset Formula $\mathbf{x}^{(0)}$
\end_inset

.
 Intuitively, you can imagine gradient descent as a small ball that rolls
 down a surface until it gets stuck in the next valley.
 
\end_layout

\begin_layout Standard
\align right
\begin_inset Formula $\lhd$
\end_inset


\end_layout
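
\begin_layout Standard

\series bold
Example (Gradient Descent on a Quadratic)
\end_layout

\begin_layout Standard
The update rule can be traced by hand on a simple case.
 As a minimal sketch, assume the function 
\begin_inset Formula $f(x_{1},x_{2})=x_{1}^{2}+x_{2}^{2}$
\end_inset

, the starting point 
\begin_inset Formula $\mathbf{x}^{(0)}=(4,-2)$
\end_inset

 and the constant step size 
\begin_inset Formula $\alpha=\frac{1}{4}$
\end_inset

; all three are chosen here only for illustration.
 Since 
\begin_inset Formula $\nabla f(\mathbf{x})=(2x_{1},2x_{2})=2\mathbf{x}$
\end_inset

, the update rule becomes 
\begin_inset Formula 
\begin{eqnarray*}
\mathbf{x}^{(t+1)} & = & \mathbf{x}^{(t)}-\alpha\cdot2\mathbf{x}^{(t)}\\
 & = & (1-2\alpha)\,\mathbf{x}^{(t)}=\frac{1}{2}\mathbf{x}^{(t)}.
\end{eqnarray*}

\end_inset

The first iterates are 
\begin_inset Formula $\mathbf{x}^{(1)}=(2,-1)$
\end_inset

 and 
\begin_inset Formula $\mathbf{x}^{(2)}=(1,-\frac{1}{2})$
\end_inset

, and in general 
\begin_inset Formula $\mathbf{x}^{(t)}=2^{-t}\mathbf{x}^{(0)}$
\end_inset

, which approaches the minimum at 
\begin_inset Formula $(0,0)$
\end_inset

.
 The same calculation shows why the choice of 
\begin_inset Formula $\alpha$
\end_inset

 matters: for 
\begin_inset Formula $\alpha=1$
\end_inset

 we get 
\begin_inset Formula $\mathbf{x}^{(t+1)}=-\mathbf{x}^{(t)}$
\end_inset

, so the iterates jump back and forth forever, and for 
\begin_inset Formula $\alpha>1$
\end_inset

 the factor satisfies 
\begin_inset Formula $|1-2\alpha|>1$
\end_inset

, so the iterates move away from the minimum.
\end_layout

\begin_layout Standard
\align right
\begin_inset Formula $\lhd$
\end_inset


\end_layout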

\begin_layout Standard

\series bold
Example (Least Squares cont'd)
\end_layout

\begin_layout Standard
Let us look again at our least squares problem.
 Since we want to find the minimum of 
\begin_inset Formula 
\begin{eqnarray*}
\ell(w,b) & = & \sum_{i=1}^{m}\left(y_{i}-(wx_{i}+b)\right)^{2},
\end{eqnarray*}

\end_inset

we need to compute the gradient 
\begin_inset Formula $\nabla\ell(w,b)$
\end_inset

, set it to zero and solve for 
\begin_inset Formula $w$
\end_inset

 and 
\begin_inset Formula $b$
\end_inset

.
\begin_inset Formula 
\begin{eqnarray*}
\frac{\partial\ell}{\partial w} & = & \frac{\partial}{\partial w}\sum_{i=1}^{m}\left(y_{i}-(wx_{i}+b)\right)^{2}\\
 & \stackrel{\mbox{Summation Rule}}{=} & \sum_{i=1}^{m}\frac{\partial}{\partial w}\left(y_{i}-(wx_{i}+b)\right)^{2}\\
 & \stackrel{\mbox{Chain Rule}}{=} & \sum_{i=1}^{m}2\left(y_{i}-(wx_{i}+b)\right)\cdot\frac{\partial}{\partial w}\left(y_{i}-(wx_{i}+b)\right)\\
 & = & \sum_{i=1}^{m}2\left(y_{i}-(wx_{i}+b)\right)\cdot(-x_{i})\\
 & = & 2\sum_{i=1}^{m}\left(wx_{i}^{2}+bx_{i}-x_{i}y_{i}\right)\\
\frac{\partial\ell}{\partial b} & = & \frac{\partial}{\partial b}\sum_{i=1}^{m}\left(y_{i}-(wx_{i}+b)\right)^{2}\\
 & = & \sum_{i=1}^{m}2\left(y_{i}-(wx_{i}+b)\right)\cdot\frac{\partial}{\partial b}\left(y_{i}-(wx_{i}+b)\right)\\
 & = & \sum_{i=1}^{m}2\left(y_{i}-(wx_{i}+b)\right)\cdot(-1)\\
 & = & \sum_{i=1}^{m}2\left(wx_{i}+b-y_{i}\right).
\end{eqnarray*}

\end_inset


\end_layout

\begin_layout Standard
If we denote the mean of 
\begin_inset Formula $x_{i}$
\end_inset

 with 
\begin_inset Formula $\mu_{x}=\frac{1}{m}\sum_{i=1}^{m}x_{i}$
\end_inset

, the mean over 
\begin_inset Formula $y_{i}$
\end_inset

 with 
\begin_inset Formula $\mu_{y}=\frac{1}{m}\sum_{i=1}^{m}y_{i}$
\end_inset

, and split the sums, we can simplify the equations to 
\begin_inset Formula 
\begin{eqnarray*}
\frac{\partial\ell}{\partial w} & = & 2\left(w\sum_{i=1}^{m}x_{i}^{2}+bm\mu_{x}-\sum_{i=1}^{m}x_{i}y_{i}\right)\\
\frac{\partial\ell}{\partial b} & = & 2\left(wm\mu_{x}+mb-m\mu_{y}\right)
\end{eqnarray*}

\end_inset

Setting 
\begin_inset Formula $\frac{\partial\ell}{\partial b}=0$
\end_inset

 and solving for 
\begin_inset Formula $b$
\end_inset

 then yields
\begin_inset Formula 
\begin{eqnarray*}
2\left(wm\mu_{x}+mb-m\mu_{y}\right) & = & 0\\
\Leftrightarrow b & = & \mu_{y}-w\mu_{x}.
\end{eqnarray*}

\end_inset

This equation has a simple interpretation.
 
\begin_inset Formula $b$
\end_inset

 is the mean deviation of 
\begin_inset Formula $y$
\end_inset

 from our line 
\begin_inset Formula $g(x)=wx$
\end_inset

 without offset 
\begin_inset Formula $b$
\end_inset

.
\end_layout

\begin_layout Standard
Plugging that into 
\begin_inset Formula $\frac{\partial\ell}{\partial w}$
\end_inset

 yields
\begin_inset Formula 
\begin{eqnarray*}
2\left(w\sum_{i=1}^{m}x_{i}^{2}+bm\mu_{x}-\sum_{i=1}^{m}x_{i}y_{i}\right) & = & 2\left(w\sum_{i=1}^{m}x_{i}^{2}+m\mu_{x}(\mu_{y}-w\mu_{x})-\sum_{i=1}^{m}x_{i}y_{i}\right)\\
 & = & 2\left(w\sum_{i=1}^{m}x_{i}^{2}+m\mu_{x}\mu_{y}-wm\mu_{x}^{2}-\sum_{i=1}^{m}x_{i}y_{i}\right).
\end{eqnarray*}

\end_inset

If we remember that the variance 
\begin_inset Formula $\sigma_{x}^{2}$
\end_inset

 of the 
\begin_inset Formula $x_{i}$
\end_inset

 can be computed via 
\begin_inset Formula $\sigma_{x}^{2}=\frac{1}{m}\sum_{i=1}^{m}x_{i}^{2}-\mu_{x}^{2}$
\end_inset

 and if we denote the 
\emph on
covariance
\emph default
 between 
\begin_inset Formula $x$
\end_inset

 and 
\begin_inset Formula $y$
\end_inset

 with 
\begin_inset Formula $\sigma{}_{xy}=\frac{1}{m}\sum_{i=1}^{m}x_{i}y_{i}-\mu_{x}\mu_{y}$
\end_inset

, we can simplify the equation to
\end_layout

\begin_layout Standard
\begin_inset Formula 
\begin{eqnarray*}
2\left(w\sum_{i=1}^{m}x_{i}^{2}+m\mu_{x}\mu_{y}-wm\mu_{x}^{2}-\sum_{i=1}^{m}x_{i}y_{i}\right) & = & 2\left(wm\sigma_{x}^{2}-m\sigma_{xy}\right).
\end{eqnarray*}

\end_inset

Setting this expression to zero and solving for 
\begin_inset Formula $w$
\end_inset

 yields
\begin_inset Formula 
\begin{eqnarray*}
2\left(wm\sigma_{x}^{2}-m\sigma_{xy}\right) & = & 0\\
\Leftrightarrow w & = & \frac{\sigma_{xy}}{\sigma_{x}^{2}}.
\end{eqnarray*}

\end_inset

The term 
\begin_inset Formula $\frac{\sigma_{xy}}{\sigma_{x}^{2}}$
\end_inset

 again has a simple interpretation: it is the correlation coefficient 
\begin_inset Formula $\rho=\frac{\sigma_{xy}}{\sqrt{\sigma_{x}^{2}}\sqrt{\sigma_{y}^{2}}}$
\end_inset

 times the standard deviation 
\begin_inset Formula $\sigma_{y}=\sqrt{\sigma_{y}^{2}}$
\end_inset

 of the 
\begin_inset Formula $y_{i}$
\end_inset

 divided by the standard deviation of the 
\begin_inset Formula $x_{i}$
\end_inset

, i.e.
 
\begin_inset Formula $w=\rho\frac{\sigma_{y}}{\sigma_{x}}$
\end_inset

.
 Plugging that into the equation for 
\begin_inset Formula $b$
\end_inset

 yields
\begin_inset Formula 
\begin{eqnarray*}
b & = & \mu_{y}-w\mu_{x}\\
 & = & \mu_{y}-\rho\frac{\sigma_{y}}{\sigma_{x}}\mu_{x}
\end{eqnarray*}

\end_inset

and we are left with the line equation 
\begin_inset Formula 
\begin{eqnarray*}
g(x) & = & wx+b\\
 & = & \rho\frac{\sigma_{y}}{\sigma_{x}}x+\mu_{y}-\rho\frac{\sigma_{y}}{\sigma_{x}}\mu_{x}\\
 & = & \rho\frac{\sigma_{y}}{\sigma_{x}}(x-\mu_{x})+\mu_{y}.
\end{eqnarray*}

\end_inset

If you think about it, this equation makes perfect sense.
 In order to obtain an estimate for the associated 
\begin_inset Formula $y$
\end_inset

 for a given 
\begin_inset Formula $x$
\end_inset

, we first subtract the mean from 
\begin_inset Formula $x$
\end_inset

 as estimated by our sample, normalize its variance to one by dividing by
 the standard deviation, multiply the result with the correlation coefficient
 
\begin_inset Formula $\rho$
\end_inset

, multiply the result with the standard deviation of 
\begin_inset Formula $y$
\end_inset

 to get the scale right and finally add the mean 
\begin_inset Formula $\mu_{y}$
\end_inset

 of 
\begin_inset Formula $y$
\end_inset

.
 
\end_layout

\begin_layout Standard
\align right
\begin_inset Formula $\lhd$
\end_inset


\end_layout
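
\begin_layout Standard

\series bold
Example (Least Squares, Numerical Check)
\end_layout

\begin_layout Standard
The closed-form solution can be checked on a tiny data set.
 As a sketch, assume the three points 
\begin_inset Formula $(x_{i},y_{i})$
\end_inset

 given by 
\begin_inset Formula $(0,1)$
\end_inset

, 
\begin_inset Formula $(1,2)$
\end_inset

 and 
\begin_inset Formula $(2,4)$
\end_inset

, which are made up for this illustration.
 With 
\begin_inset Formula $m=3$
\end_inset

 we get 
\begin_inset Formula 
\begin{eqnarray*}
\mu_{x} & = & \frac{1}{3}(0+1+2)=1,\qquad\mu_{y}=\frac{1}{3}(1+2+4)=\frac{7}{3},\\
\sigma_{x}^{2} & = & \frac{1}{3}(0+1+4)-1^{2}=\frac{2}{3},\\
\sigma_{xy} & = & \frac{1}{3}(0+2+8)-1\cdot\frac{7}{3}=1,\\
w & = & \frac{\sigma_{xy}}{\sigma_{x}^{2}}=\frac{3}{2},\qquad b=\mu_{y}-w\mu_{x}=\frac{7}{3}-\frac{3}{2}=\frac{5}{6}.
\end{eqnarray*}

\end_inset

The resulting line is 
\begin_inset Formula $g(x)=\frac{3}{2}x+\frac{5}{6}$
\end_inset

.
 The deviations 
\begin_inset Formula $y_{i}-g(x_{i})$
\end_inset

 are 
\begin_inset Formula $\frac{1}{6}$
\end_inset

, 
\begin_inset Formula $-\frac{1}{3}$
\end_inset

 and 
\begin_inset Formula $\frac{1}{6}$
\end_inset

.
 Their sum is zero, and so is their sum weighted by the 
\begin_inset Formula $x_{i}$
\end_inset

, namely 
\begin_inset Formula $0\cdot\frac{1}{6}+1\cdot(-\frac{1}{3})+2\cdot\frac{1}{6}=0$
\end_inset

, so both partial derivatives of 
\begin_inset Formula $\ell$
\end_inset

 vanish at these values of 
\begin_inset Formula $w$
\end_inset

 and 
\begin_inset Formula $b$
\end_inset

.
\end_layout

\begin_layout Standard
\align right
\begin_inset Formula $\lhd$
\end_inset


\end_layout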

\begin_layout Standard
Unfortunately, we cannot yet check whether the values for 
\begin_inset Formula $w$
\end_inset

 and 
\begin_inset Formula $b$
\end_inset

 really correspond to a minimum.
 In order to do so we would need the concept of matrices and eigenvalues.
 Just for the sake of completeness, we quickly mention how to generalize
 the sufficient conditions for a minimum and a maximum to more than one
 dimension.
 
\end_layout

\begin_layout Standard
The second derivative of a function 
\begin_inset Formula $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$
\end_inset

 is a 
\begin_inset Formula $\mathbb{R}^{n\times n}$
\end_inset

 matrix.
 This is not surprising, since the first derivative with respect to all
 variables was a vector (the gradient), or a function from 
\begin_inset Formula $\mathbb{R}^{n}$
\end_inset

 into 
\begin_inset Formula $\mathbb{R}^{n}$
\end_inset

.
 Taking the derivatives with respect to all parameters again yields a matrix.
 This matrix is called the 
\emph on

\begin_inset Index idx
status collapsed

\begin_layout Plain Layout
Hessian matrix
\end_layout

\end_inset

Hessian matrix
\emph default
 or simply the 
\emph on

\begin_inset Index idx
status collapsed

\begin_layout Plain Layout
Hessian
\end_layout

\end_inset

Hessian
\emph default
.
\end_layout

\begin_layout Subsubsection*
Definition (Hessian):
\end_layout

\begin_layout Standard
The 
\begin_inset Formula $\mathbb{R}^{n\times n}$
\end_inset

 matrix 
\begin_inset Formula $H$
\end_inset

 containing all the second derivatives of a function 
\begin_inset Formula $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$
\end_inset

 at a position 
\begin_inset Formula $\mathbf{z}$
\end_inset

 
\begin_inset Formula 
\begin{eqnarray*}
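H(\mathbf{z}) & = & \left(\begin{array}{cccc}
\frac{\partial^{2}f}{\partial x_{1}^{2}}(\mathbf{z}) & \frac{\partial^{2}f}{\partial x_{1}\partial x_{2}}(\mathbf{z}) & \cdots & \frac{\partial^{2}f}{\partial x_{1}\partial x_{n}}(\mathbf{z})\\
\frac{\partial^{2}f}{\partial x_{2}\partial x_{1}}(\mathbf{z}) & \frac{\partial^{2}f}{\partial x_{2}^{2}}(\mathbf{z}) & \cdots & \frac{\partial^{2}f}{\partial x_{2}\partial x_{n}}(\mathbf{z})\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial^{2}f}{\partial x_{n}\partial x_{1}}(\mathbf{z}) & \frac{\partial^{2}f}{\partial x_{n}\partial x_{2}}(\mathbf{z}) & \cdots & \frac{\partial^{2}f}{\partial x_{n}^{2}}(\mathbf{z})
\end{array}\right)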
\end{eqnarray*}

\end_inset

is called the 
\emph on

\begin_inset Index idx
status collapsed

\begin_layout Plain Layout
Hessian
\end_layout

\end_inset

Hessian.
\end_layout

\begin_layout Standard
\align right
\begin_inset Formula $\Diamond$
\end_inset


\end_layout

\begin_layout Standard
Provided that the gradient vanishes at 
\begin_inset Formula $\mathbf{x}$
\end_inset

, a sufficient condition for a function 
\begin_inset Formula $f$
\end_inset

 to have a local minimum or maximum at 
\begin_inset Formula $\mathbf{x}$
\end_inset

 is that 
\begin_inset Formula $H(\mathbf{x})$
\end_inset

 is 
\emph on
positive definite
\begin_inset Index idx
status collapsed

\begin_layout Plain Layout
positive definite
\end_layout

\end_inset


\emph default
 or 
\emph on

\begin_inset Index idx
status collapsed

\begin_layout Plain Layout
negative definite
\end_layout

\end_inset

negative definite
\emph default
, respectively.
 This means that 
\begin_inset Formula $H(\mathbf{x})$
\end_inset

 has only 
\emph on
positive
\emph default
 or 
\emph on
negative
\emph default
 eigenvalues, respectively.
 You will see what that means in the following part about linear algebra.
 
\end_layout
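
\begin_layout Standard

\series bold
Example (Least Squares, Hessian)
\end_layout

\begin_layout Standard
For the squared loss 
\begin_inset Formula $\ell(w,b)$
\end_inset

 from the least squares example, the Hessian can be written down by differentiat
ing the two partial derivatives once more:
\begin_inset Formula 
\begin{eqnarray*}
H(w,b) & = & \left(\begin{array}{cc}
\frac{\partial^{2}\ell}{\partial w^{2}} & \frac{\partial^{2}\ell}{\partial w\partial b}\\
\frac{\partial^{2}\ell}{\partial b\partial w} & \frac{\partial^{2}\ell}{\partial b^{2}}
\end{array}\right)=2\left(\begin{array}{cc}
\sum_{i=1}^{m}x_{i}^{2} & m\mu_{x}\\
m\mu_{x} & m
\end{array}\right).
\end{eqnarray*}

\end_inset

It does not depend on 
\begin_inset Formula $w$
\end_inset

 and 
\begin_inset Formula $b$
\end_inset

.
 As a sketch of the check, we use a fact from linear algebra that is assumed
 here without proof: for a symmetric 
\begin_inset Formula $2\times2$
\end_inset

 matrix, the determinant is the product and the trace is the sum of the
 two eigenvalues.
 Here we have 
\begin_inset Formula 
\begin{eqnarray*}
\det H & = & 4\left(m\sum_{i=1}^{m}x_{i}^{2}-m^{2}\mu_{x}^{2}\right)=4m^{2}\sigma_{x}^{2},\\
\mathrm{tr}\,H & = & 2\sum_{i=1}^{m}x_{i}^{2}+2m>0.
\end{eqnarray*}

\end_inset

Hence both eigenvalues are positive whenever 
\begin_inset Formula $\sigma_{x}^{2}>0$
\end_inset

, i.e.
 whenever the 
\begin_inset Formula $x_{i}$
\end_inset

 are not all identical.
 In that case 
\begin_inset Formula $H$
\end_inset

 is positive definite, and the values 
\begin_inset Formula $w=\frac{\sigma_{xy}}{\sigma_{x}^{2}}$
\end_inset

 and 
\begin_inset Formula $b=\mu_{y}-w\mu_{x}$
\end_inset

 indeed correspond to a minimum.
\end_layout

\begin_layout Standard
\align right
\begin_inset Formula $\lhd$
\end_inset


\end_layout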

\end_body
\end_document