<!DOCTYPE html>
<html lang="" xml:lang="">
<head>
<title>Cross-validation: What does it estimate and how well does it do it?</title>
<meta charset="utf-8" />
<meta name="author" content="Gül İnan " />
<script src="libs/header-attrs-2.11/header-attrs.js"></script>
<link href="libs/remark-css-0.0.1/default.css" rel="stylesheet" />
<script src="libs/fabric-4.3.1/fabric.min.js"></script>
<link href="libs/xaringanExtra-scribble-0.0.1/scribble.css" rel="stylesheet" />
<script src="libs/xaringanExtra-scribble-0.0.1/scribble.js"></script>
<script>document.addEventListener('DOMContentLoaded', function() { window.xeScribble = new Scribble({"pen_color":["#1A292C"],"pen_size":3,"eraser_size":30,"palette":[]}) })</script>
<link href="libs/panelset-0.2.6/panelset.css" rel="stylesheet" />
<script src="libs/panelset-0.2.6/panelset.js"></script>
<link rel="stylesheet" href="css/metropolis.css" type="text/css" />
<link rel="stylesheet" href="css/cal.css" type="text/css" />
</head>
<body>
<textarea id="source">
class: center, middle, inverse, title-slide
# <b>Cross-validation: What does it estimate and how well does it do it?</b>
## by Stephen Bates, Trevor Hastie, and Robert Tibshirani <br>
### Gül İnan <br>
### <br>Istanbul Technical University
### <br><i>Machine Learning in Montpellier, Theory & Practice, <br> January 6th, 2022 </i>
---
# Motivation
<br>
- In machine learning applications, when deploying a <b><span style="color:#067373;">predictive model</span></b>, the main interest is in the <b><span style="color:#067373;">prediction accuracy</span></b> of the model on future test points.
- The standard measure of accuracy for a predictive model is the <b><span style="color:#067373;">prediction error</span></b>, i.e., the <b><span style="color:#067373;">expected loss on future test points</span></b>.
- For inference, we need both a <b><span style="color:#067373;">good point estimate</span></b> of and an <b><span style="color:#067373;">accurate confidence interval</span></b> for the <b><span style="color:#067373;">prediction error</span></b>.
---
# Motivation continued
<br>
- From a practical point of view, <b><span style="color:#067373;">cross-validation (CV)</span></b> is one of the widely-used resampling-based approaches to estimate a <b><span style="color:#067373;">point estimate</span></b> and to build a <b><span style="color:#067373;">confidence interval</span></b> for prediction error.
- However, <b><span style="color:#067373;">Bates, Hastie, and Tibshirani (2021)</span></b> argue that the (statistical) properties of the CV estimator of prediction error are **not well understood**, despite the apparent simplicity of CV.
---
# Aim
<br>
- <b><span style="color:#067373;">Bates, Hastie, and Tibshirani (2021)</span></b> first show that the <b><span style="color:#067373;">CV estimator of prediction error</span></b>:
- only **weakly tracks the accuracy of the specific model fit** and, instead,
- estimates **the average prediction error of models fit across many (hypothetical) data sets from the same population**.
---
# Aim continued
<br>
- <b><span style="color:#067373;">Bates, Hastie, and Tibshirani (2021)</span></b> then show that <b><span style="color:#067373;">naive confidence intervals</span></b> based on the CV estimate of prediction error give **poor coverage**: the variance estimate used to set the interval width ignores the **correlation between error estimates** in different folds, a correlation that arises because each data point is used both in training and in testing.
- <b><span style="color:#067373;">Bates, Hastie, and Tibshirani (2021)</span></b> propose the <b><span style="color:#067373;">nested cross-validation (NCV)</span></b> approach, which provides confidence intervals with **coverage close to the nominal level**.
- <b><span style="color:#067373;">Bates, Hastie, and Tibshirani (2021)</span></b> validate their work through deep theory and extensive numerical experiments (both simulation studies and real data examples).
---
# Setting and notation
<br>
- Consider a <b><span style="color:#067373;">supervised learning </span></b> setting.
- We have data `\((X, Y)\)`, where `\(X=(X_1, \ldots, X_n) \in \mathcal{X}^{n \times p }\)` is the <b><span style="color:#067373;">feature matrix</span></b>
and `\(Y=(Y_1, \ldots, Y_n) \in \mathcal{Y}^{n}\)` is the <b><span style="color:#067373;">response vector</span></b>.
- We assume that each <b><span style="color:#067373;">data point </span></b> `\((X_i,Y_i)\)`, `\(i=1,\ldots,n\)`,
is i.i.d. from a <b><span style="color:#067373;"> distribution</span></b> `\(P\)`.
---
# Setting and notation continued
<br>
- Consider a <b><span style="color:#067373;">class of models </span></b> parameterized by vector `\(\theta\)`.
- We assume that `\(\widehat f(x, \theta)\)` is the <b><span style="color:#067373;">function that predicts</span></b> `\(y\)` from `\(x \in \mathbb{R}^{p}\)` using the model with parameter `\(\theta\)`, where `\(\theta\)` takes values in some space `\(\Theta\)`.
- We let `\(\mathcal{A}\)` be a <b><span style="color:#067373;">model-fitting algorithm</span></b> that returns the <b><span style="color:#067373;">fitted value</span></b> of the parameter vector, `\(\widehat \theta =\mathcal{A}(X,Y) \in \Theta\)`, based on the observed data `\((X, Y)\)`.
---
# Prediction error
<br>
- In <b><span style="color:#067373;">measuring the accuracy of a model</span></b>, we are interested in <b><span style="color:#067373;"> prediction error (out-of-sample error)</span></b> which is defined as the <b><span style="color:#067373;">expected loss on future data points</span></b> `\((X_{n+1},Y_{n+1})\)`:
`\begin{equation}
Err_{XY}: = \mathbb{E}\bigg[ \ell \big(\hat{f}(X_{n+1},\hat \theta ),Y_{n+1}\big) \vert (X,Y) \bigg] , \nonumber
\end{equation}`
- where `\((X_{n+1},Y_{n+1})\)` is an <b><span style="color:#067373;">independent test point</span></b> from the same distribution `\(P\)`.
- The expression `\(\hat{f}(X_{n+1},\hat \theta)\)` is the <b><span style="color:#067373;">predicted value</span></b> of `\(Y_{n+1}\)` at the future point `\(X_{n+1}\)` and
- `\(\hat \theta\)` is the fitted value of the parameter estimated through algorithm `\(\mathcal{A}\)` based on the training data `\((X,Y)\)`.
- The expression `\(\ell \big(\hat{f}(X_{n+1},\hat \theta ),Y_{n+1}\big)\)` is the <b><span style="color:#067373;">loss</span></b> between the predicted value of `\(Y_{n+1}\)` and `\(Y_{n+1}\)` itself.
- Here, the <b><span style="color:#067373;">loss function</span></b> `\(\ell(\cdot)\)` could be squared-error loss, misclassification error, or deviance (cross-entropy).
- Note that `\(Err_{XY}\)` is a <b><span style="color:#067373;">random quantity</span></b>, since it depends on the training data `\((X,Y)\)`.
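---
- Since `\(Err_{XY}\)` is an expectation over a single future point, in a simulation (where `\(P\)` is known) it can be approximated by averaging the loss over a large independent test set.
- A minimal R sketch with OLS and squared-error loss; all names and parameter values below are illustrative:

```r
set.seed(1)
n <- 100; p <- 20
theta <- rep(1, p)                           # true coefficients (illustrative)
X <- matrix(rnorm(n * p), n, p)              # training features
y <- drop(X %*% theta) + rnorm(n)            # training responses
theta_hat <- lm.fit(X, y)$coefficients       # theta_hat = A(X, Y), here OLS

# Approximate Err_XY by the average loss on a large test set from the same P
n_test <- 1e5
X_new <- matrix(rnorm(n_test * p), n_test, p)
y_new <- drop(X_new %*% theta) + rnorm(n_test)
mean((y_new - drop(X_new %*% theta_hat))^2)  # roughly 1 + p/(n - p - 1) on average
```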
---
# Expected prediction error
<br>
- On the other hand, when designing and comparing <b><span style="color:#067373;">algorithms</span></b>, we may instead be interested in a learning algorithm's <b><span style="color:#067373;">average performance in predicting future test points</span></b>.
- This quantity of interest can be formally defined as
the <b><span style="color:#067373;">expected value of prediction error</span></b> across possible training data sets of size n drawn from the same data distribution `\(P\)`:
`\begin{equation}
Err :=\mathbb{E}\big[Err_{XY}\big].\nonumber
\end{equation}`
- In short: it is the expectation of the prediction error across possible training sets (from the same distribution).
---
- Note that estimates of the quantities `\(Err_{XY}\)` and `\(Err\)` **cannot be computed** when the **data distribution** `\(P\)` is **unknown**.
- Then, <b><span style="color:#067373;">resampling-based methods</span></b> such as cross-validation, bootstrap, and the jackknife, or <b><span style="color:#067373;">analytical methods</span></b> such as AIC, BIC, Mallows' `\(C_p\)`, and covariance penalties, can be used to **estimate the quantities** `\(Err_{XY}\)` and `\(Err\)`.
---
# K-fold cross-validation
<br>
- In <b><span style="color:#067373;">K-fold cross-validation</span></b>, we randomly partition the data `\((X,Y) = \mathcal{I}\)` into `\(K\)` equally sized <b><span style="color:#067373;">disjoint
folds (subsets)</span></b> `\(\mathcal{I}_k\)` `\((k=1,\ldots,K)\)`.
- Here the fold size is `\(m=n/K\)` and the whole data is `\(\mathcal{I} = \cup_{k=1}^{K} \mathcal{I}_k\)`.
- When the data point `\((x_i,y_i) \in \mathcal{I}_k\)`, we will also write `\(i \in \mathcal{I}_k\)` `\((k=1,\ldots,K)\)`.
<div class="figure" style="text-align: center">
<img src="imgs/K_CV_single.png" alt="&lt;i&gt;Partition of the data into K-folds.&lt;/i&gt;" width="13%" />
<p class="caption"><i>Partition of the data into K-folds.</i></p>
</div>
<div style="text-align:right;"><font size="4px"><a href="https://www.researchgate.net/profile/Mingchao-Li/publication/331209203/figure/fig2/AS:728070977748994@1550597056956/K-fold-cross-validation-method.png">Image Source</a></font></div>
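---
- Such a partition can be sketched in R as follows (illustrative; assumes `\(n\)` is divisible by `\(K\)`):

```r
n <- 100; K <- 5
# Randomly assign the indices 1..n to K equally sized disjoint folds I_1, ..., I_K
folds <- split(sample(n), rep(1:K, each = n / K))
str(folds)  # a list of K index vectors, each of size m = n/K
```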
---
- Consider the first fold `\(\mathcal{I}_{1}\)` and hold it out as a future data set (or <b><span style="color:#067373;">test set</span></b>).
- The **remaining data points** `\((x_i,y_i) \in \mathcal{I} \backslash \mathcal{I}_{1}\)`, which are not in the first fold, form the <b><span style="color:#067373;">training set</span></b>.
- Let `\(\hat \theta^{(-1)} = \mathcal{A}\big((X_i,Y_i)_{i \in \mathcal{I} \backslash \mathcal{I}_{1}}\big)\)` be the **parameter estimate based on training data set**, then
we can calculate the **prediction error** for future data set `\(\mathcal{I}_{1}\)`, of size `\(m\)`, as follows:
`\begin{equation}
\frac{1}{m} \sum_{{i} \in \mathcal{I}_1}\ell \big(\hat{f}(x_{i},\hat \theta^{(-1) } ),y_i\big).
\end{equation}`
---
- In <b><span style="color:#067373;">K-fold cross-validation</span></b>, we **iteratively repeat** this process for each fold `\((k=1,\ldots,K)\)`.
<div class="figure" style="text-align: center">
<img src="imgs/K_CV.png" alt="&lt;i&gt;K-fold cross-validation.&lt;/i&gt;" width="60%" />
<p class="caption"><i>K-fold cross-validation.</i></p>
</div>
<div style="text-align:right;"><font size="4px"><a href="https://www.researchgate.net/profile/Mingchao-Li/publication/331209203/figure/fig2/AS:728070977748994@1550597056956/K-fold-cross-validation-method.png">Image Source</a></font></div>
---
# CV estimate of prediction error
<br>
- The **average of prediction errors over K folds** is given as follows:
`\begin{equation}
\widehat {Err} ^{(CV)} := \frac{1} {K} \sum_{k=1}^{K} \frac{1} {m}\sum_{{i} \in \mathcal{I}_k} \ell \big(\hat{f}(x_{i},\hat \theta^{(-k)}),y_i\big).
\end{equation}`
- This is usually called the <b><span style="color:#067373;">CV estimate of prediction error</span></b>.
- **Relationship between `\(\widehat {Err} ^{(CV)}\)`, `\(Err_{XY}\)`, and `\(Err\)`**: Intuitively, the inner sum is an estimate for `\(Err_{XY}\)` for a fixed fold, and the double sum estimates `\(Err\)` with `\(\mathcal{I}_k\)` `\(( k=1,\ldots,K)\)` being different samples from the same distribution <b><span style="color:#067373;">(De Benito Delgado, 2021)</span></b>.
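---
- A minimal R sketch of `\(\widehat {Err}^{(CV)}\)` with OLS as the fitting algorithm and squared-error loss (all names are illustrative):

```r
set.seed(1)
n <- 100; p <- 20; K <- 5; m <- n / K
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% rep(1, p)) + rnorm(n)

folds <- split(sample(n), rep(1:K, each = m))
fold_err <- sapply(folds, function(idx) {
  # theta^(-k): fit on all data outside fold k
  theta_hat <- lm.fit(X[-idx, , drop = FALSE], y[-idx])$coefficients
  # average loss on the held-out fold k
  mean((y[idx] - drop(X[idx, , drop = FALSE] %*% theta_hat))^2)
})
err_cv <- mean(fold_err)  # the CV estimate of prediction error
```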
---
# What prediction error are we estimating?
<br>
- `\(Err_{XY}\)` is the <b><span style="color:#067373;">prediction error of the model which is fit on the training data set</span></b>.
- `\(Err\)` is the <b><span style="color:#067373;">average prediction error of the fitting algorithm over same-sized data sets drawn
from the same distribution `\(P\)`.</span></b>
- The **former quantity** is of the most interest to a practitioner **deploying a specific model**, whereas the **latter** may be of interest to a researcher **comparing different fitting algorithms**.
---
- Some earlier studies such as Zhang (1995), Hastie et al. (2009), and Yousef (2020)
have observed that the **cross-validation estimate provides little information** about `\(Err_{XY}\)`; this is known as the _weak correlation_ problem in the literature.
- For the special case of the linear model, <b><span style="color:#067373;">Bates, Hastie, and Tibshirani (2021)</span></b> claim that CV estimate should be **considered as an estimate** of `\(Err\)` rather than `\(Err_{XY}\)`.
---
# `\(Err_{X}\)`: A different target of inference
<br>
- Assume a <b><span style="color:#067373;">homoskedastic Gaussian linear model</span></b> as follows:
`\begin{equation}
y_i = x_{i}^{T} \theta + \epsilon_i, \quad \epsilon_i \overset{i.i.d.}{\sim} N(0,\sigma^2), \quad i=1,\ldots,n.
\end{equation}`
- Define a new, intermediate key quantity as follows:
`\begin{equation}
Err_{X} := \mathbb{E}[Err_{XY} | X],
\end{equation}`
- which **falls between** `\(Err\)` and `\(Err_{XY}\)` as visualized below:
<br>
<div class="figure" style="text-align: center">
<img src="imgs/paper_fig2.png" alt="&lt;i&gt;Possible targets of inference for cross-validation.&lt;/i&gt;" width="60%" />
<p class="caption"><i>Possible targets of inference for cross-validation.</i></p>
</div>
---
$$ \newcommand\independent{\protect\mathpalette{\protect\independenT}{\perp}}
\def\independenT#1#2{\mathrel{\rlap{$#1#2$}\mkern2mu{#1#2}}}$$
- **Lemma 1:** When ordinary least squares (OLS) is used as the fitting
algorithm along with a squared-error loss function, the CV estimate of prediction error,
`\(\widehat {Err}^{(CV)}\)`, is linearly invariant.
--
- << Under this setting, the residuals turn out to be the same for both the **original data** `\((x_{1}, y_{1}),...,(x_{n}, y_{n})\)` and the **shifted data** `\((x_{1}, y_{1}^{'}),...,(x_{n}, y_{n}^{'})\)`, where `\(y_{i}^{'} = y_{i} + x_{i}^{T} \kappa\)`. Since the CV estimate of prediction error is the mean of the squared residuals, it is also **the same** for both data sets. >>
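---
- A quick numerical check of Lemma 1 in R (an illustrative sketch, not the authors' code): the CV estimate is unchanged when `\(y\)` is replaced by `\(y + X\kappa\)`.

```r
set.seed(1)
n <- 100; p <- 20; K <- 5
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% rep(1, p)) + rnorm(n)
y_shift <- y + drop(X %*% rnorm(p))  # shifted responses y' = y + X kappa

cv_ols <- function(X, y, K) {        # K-fold CV with OLS and squared loss
  n <- nrow(X)
  folds <- split(sample(n), rep(1:K, each = n / K))
  mean(sapply(folds, function(idx) {
    th <- lm.fit(X[-idx, , drop = FALSE], y[-idx])$coefficients
    mean((y[idx] - drop(X[idx, , drop = FALSE] %*% th))^2)
  }))
}

set.seed(2); e1 <- cv_ols(X, y, K)        # same seed -> same folds
set.seed(2); e2 <- cv_ols(X, y_shift, K)
all.equal(e1, e2)                          # TRUE: linear invariance
```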
---
- **Theorem 1:** Assume the homoskedastic Gaussian linear model holds and that we use the
squared-error loss function. Let `\(\widehat {Err}\)` be a **linearly invariant estimate of prediction
error** (such as `\(\widehat {Err}^{(CV)}\)` using OLS as the fitting algorithm). Then,
`\begin{equation}
Err_{XY} \perp \widehat {Err} \quad | \quad X.
\end{equation}`
--
- << Recall from classical linear regression theory that when using OLS, the **estimated coefficient** vector `\(\widehat {\theta}\)` is independent of the **residuals** `\((Y-X\widehat {\theta})\)`:
`\begin{equation}
\widehat {\theta} \perp (Y-X\widehat {\theta}) \quad | \quad X.
\end{equation}`
- Since `\(Err_{XY}\)` is a function only of `\(\widehat {\theta}\)`, the OLS estimate of `\(\theta\)`, and any linearly invariant estimate of prediction error `\(\widehat{Err}\)` is a function only of the residuals `\(Y-X\widehat {\theta}\)`, it follows that `\(Err_{XY} \perp \widehat{Err} \quad | \quad X\)`. >>
---
- **Corollary 1:** Under the conditions of Theorem 1, we get the following decomposition:
`\begin{equation}
\mathbb{E}\Big[\big(\widehat{Err}-Err_{XY}\big)^2\Big] = \mathbb{E}\Big[\big(\widehat{Err}-Err_{X}\big)^2\Big] + \mathbb{E}\Big[Var\big(Err_{XY}|X\big)\Big].
\end{equation}`
--
- << Any linearly invariant estimator (such as cross-validation) has **larger mean squared error** (MSE) as an estimate of `\(Err_{XY}\)` than as an estimate of `\(Err_{X}\)`. >>
- This implies that `\(\widehat{Err}^{(CV)}\)` is a better estimate of the intermediate quantity `\(Err_{X}\)` than of `\(Err_{XY}\)`.
<br>
<div class="figure" style="text-align: center">
<img src="imgs/paper_fig2.png" alt="&lt;i&gt;Possible targets of inference for cross-validation.&lt;/i&gt;" width="60%" />
<p class="caption"><i>Possible targets of inference for cross-validation.</i></p>
</div>
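---
- **Why the decomposition holds:** conditioning on `\(X\)` and expanding around `\(Err_X = \mathbb{E}[Err_{XY}|X]\)`,
`\begin{align}
\mathbb{E}\big[(\widehat{Err}-Err_{XY})^2 \,|\, X\big]
&= \mathbb{E}\big[(\widehat{Err}-Err_{X})^2 \,|\, X\big] + Var\big(Err_{XY}\,|\,X\big) \nonumber \\
&\quad + 2\,\mathbb{E}\big[\widehat{Err}-Err_{X} \,|\, X\big]\;\mathbb{E}\big[Err_{X}-Err_{XY} \,|\, X\big]. \nonumber
\end{align}`
- The cross term factorizes by Theorem 1 ( `\(Err_{XY} \perp \widehat{Err} \,|\, X\)` ) and vanishes because `\(\mathbb{E}[Err_{X}-Err_{XY}\,|\,X]=0\)`; taking the expectation over `\(X\)` gives Corollary 1.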
---
# Example
<br>
- Consider an experiment with a simple linear model with `\(n=100\)` observations
and `\(p=20\)` features `\(\overset{i.i.d.}{\sim} N(0,1)\)`, replicated `\(2000\)` times.
- The **MSE** of `\(\widehat {Err} ^{(CV)}\)` is evaluated relative to **three estimands**: `\(Err\)`, `\(Err_{X}\)`, and `\(Err_{XY}\)`.
<img src="imgs/paper_fig3a.png" width="30%" style="display: block; margin: auto;" />
- **Side note:** Each pair of points connected by a line represents the 2000 replicates with the same feature matrix `\(X\)`.
---
- We see that `\(\widehat {Err}^{(CV)}\)` has **lower MSE** for `\(Err_{X}\)` than for `\(Err_{XY}\)`.
- These results suggest that `\(Err_{X}\)` is a more **natural target of inference** (estimand) than `\(Err_{XY}\)` for `\(\widehat {Err} ^{(CV)}\)`.
<br>
<img src="imgs/paper_fig3a.png" width="30%" style="display: block; margin: auto;" />
---
# Relationship between `\(Err\)` and `\(Err_{X}\)`
<br>
- <b><span style="color:#067373;">Bates, Hastie, and Tibshirani (2021)</span></b> further investigate the **relationship** between `\(Err\)` and `\(Err_{X}\)` through asymptotic analysis.
<br>
<div class="figure" style="text-align: center">
<img src="imgs/paper_fig2.png" alt="&lt;i&gt;Possible targets of inference for cross-validation.&lt;/i&gt;" width="60%" />
<p class="caption"><i>Possible targets of inference for cross-validation.</i></p>
</div>
- <b><span style="color:#067373;">Bates, Hastie, and Tibshirani (2021)</span></b> show that the variance of `\(Err_{X}\)` (which has mean `\(Err\)`) is **small** compared with the variance of `\(Err_{XY}\)` (which also has mean `\(Err\)`), implying that `\(Err_{X}\)` is **close** to `\(Err\)`.
---
- Note that in the example, MSE of `\(\widehat {Err} ^{(CV)}\)` is **similar** when estimating either `\(Err\)` or `\(Err_{X}\)`, but **significantly different** when estimating `\(Err_{XY}\)`.
<img src="imgs/paper_fig3a.png" width="30%" style="display: block; margin: auto;" />
---
- <b><span style="color:#067373;">Bates, Hastie, and Tibshirani (2021)</span></b> show that `\( \widehat {Err} ^{(CV)}\)`
is **closer** to `\(Err\)` and `\(Err_{X}\)` than to `\(Err_{XY}\)` in the proportional asymptotic limit (for `\(n > p\)`, as `\(n, p \rightarrow \infty\)` with `\(n/p \rightarrow \lambda >1\)`).
- Combined with the earlier results, this implies that `\( \widehat {Err} ^{(CV)}\)` is a **better estimator** for `\(Err\)` than for `\(Err_{XY}\)`.
- <b><span style="color:#067373;">Bates, Hastie, and Tibshirani (2021)</span></b> also show that `\( \widehat{Err}^{(CV)}\)` is **asymptotically uncorrelated** with `\(Err_{XY}\)`.
---
# Dependence structure of CV errors
<br>
- Let
`\(e_i =\ell \big(\hat{f}(x_{i},\hat \theta^{(-k)}),y_i\big)\)` be the **error** for each `\(i \in \mathcal{I}_k\)` ( `\(k=1,\ldots,K\)`), resulting in `\(m\)` different `\(e_i\)`'s for each fold `\(\mathcal{I}_k\)`.
- Then, we can **re-define** <b><span style="color:#067373;">CV point estimate of prediction error</span></b> as the **average of errors** as follows:
`\begin{equation}
\widehat {Err } ^{(CV)}:= \frac{1} {K} \sum_{k=1}^{K} \frac{1} {m}\sum_{{i} \in \mathcal{I}_k} \ell \big(\hat{f}(x_{i},\hat \theta^{(-k)}),y_i\big)
= \frac{1} {n} \sum_{i=1}^{n}e_i
= \bar e,
\end{equation}`
- where `\(n = K \times m\)`.
---
- If the `\(e_i\)`'s were i.i.d., the <b><span style="color:#067373;">estimate of the standard error of the CV point estimate of prediction error</span></b> would be:
`\begin{equation}
\widehat{se}^{(CV)}:=\frac{1} {\sqrt n} \times \sqrt {\frac{1} {n-1} \sum_{i=1}^{n}(e_i-\bar e)^2}, \nonumber
\end{equation}`
- where the second term in the multiplication refers to the <b><span style="color:#067373;">empirical standard deviation</span></b> of the `\(e_i\)`.
---
# A `\(100(1-\alpha)\%\)` confidence interval for prediction error
<br>
- A `\(100(1-\alpha)\%\)` <b><span style="color:#067373;">confidence interval for prediction error</span></b> can be constructed as follows:
`\begin{equation}
\Big(\widehat {Err}^{(CV)} - z_{1-(\frac{\alpha}{2})} \times \widehat{se}^{(CV)} \quad, \quad \widehat {Err}^{(CV)} + z_{1-(\frac{\alpha}{2})} \times \widehat{se}^{(CV)} \Big),
\end{equation}`
- where `\(0 < \alpha < 1\)` and `\(z_{1- (\frac{\alpha}{2}) }\)` is the `\(1- (\frac{\alpha}{2})\)` quantile of the standard normal distribution.
- These intervals are called <b><span style="color:#067373;">naive cross-validation intervals</span></b>.
- However, since every data point is used both in training and in testing, we **cannot assume** that the `\(e_i\)`'s are **independent** of each other.
- Any **confidence interval** built on this assumption would have **poor coverage**.
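---
- A minimal R sketch of the naive interval (illustrative); note that it treats the `\(e_i\)` as if they were independent:

```r
set.seed(1)
n <- 100; p <- 20; K <- 5; alpha <- 0.1
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% rep(1, p)) + rnorm(n)

e <- numeric(n)  # pointwise CV errors e_i
folds <- split(sample(n), rep(1:K, each = n / K))
for (idx in folds) {
  th <- lm.fit(X[-idx, , drop = FALSE], y[-idx])$coefficients
  e[idx] <- (y[idx] - drop(X[idx, , drop = FALSE] %*% th))^2
}

err_cv <- mean(e)
se_naive <- sd(e) / sqrt(n)  # ignores the within- and between-fold correlations
z <- qnorm(1 - alpha / 2)
c(lower = err_cv - z * se_naive, upper = err_cv + z * se_naive)
```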
---
# Example re-visited
<br>
- The <b><span style="color:#067373;">naive cross-validation intervals</span></b> for **three estimands**: `\(Err\)`, `\(Err_{X}\)`, and `\(Err_{XY}\)` are built and miscoverage rates are estimated.
<img src="imgs/paper_fig3b.png" width="30%" style="display: block; margin: auto;" />
- The naive CV intervals **fail** due to large miscoverage rates.
- **Side note:** The nominal miscoverage rate is `\(10\%\)`.
---
<style type="text/css">
.pull-left {
float: left;
width: 50%;
}
.pull-right {
float: right;
width: 50%;
}
</style>
.pull-left[
- The foundational paper of <b><span style="color:#067373;">Bengio and Grandvalet (2004)</span></b> characterizes the structure of the covariance matrix of the `\(e_i\)`'s, which yields:
`\begin{equation}
Var(\widehat {Err}^{(CV)}) = \frac{1}{n}a_1 + \frac{n/K-1}{n}a_2 + \frac{n-n/K}{n}a_3,
\end{equation}`
- where `\(a_1=Var(e_i)\)` is the **common variance** (the diagonal elements),
- `\(a_2 = Cov(e_i, e_j)\)` is the **covariance of the off-diagonal elements within the same fold** (in-block covariance of errors, due to a common training set), and
- `\(a_3 = Cov(e_i, e_j)\)` is the **covariance between blocks**, due to the dependence between training sets `\(\mathcal{I}_k\)` `\((k=1,\ldots,K)\)`.
]
.pull-right[
<div class="figure" style="text-align: center">
<img src="imgs/covariance_structure.png" alt="&lt;i&gt;Structure of the covariance matrix of errors.&lt;/i&gt;" width="60%" />
<p class="caption"><i>Structure of the covariance matrix of errors.</i></p>
</div>
<div style="text-align:right;"><font size="4px"><a href="https://www.jmlr.org/papers/volume5/grandvalet04a/grandvalet04a.pdf">Image Source</a></font></div>
]
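---
- A quick numerical check of this identity in R, building the block covariance matrix above from assumed (illustrative) values of `\(a_1, a_2, a_3\)`:

```r
n <- 20; K <- 4; m <- n / K
a1 <- 1.0; a2 <- 0.3; a3 <- 0.1               # assumed (illustrative) values
fold_id <- rep(1:K, each = m)
Sigma <- matrix(a3, n, n)                     # between-fold covariance a3
Sigma[outer(fold_id, fold_id, "==")] <- a2    # within-fold covariance a2
diag(Sigma) <- a1                             # variances a1 on the diagonal
var_mean <- sum(Sigma) / n^2                  # Var(e_bar) = 1' Sigma 1 / n^2
all.equal(var_mean,
          a1 / n + (n / K - 1) / n * a2 + (n - n / K) / n * a3)  # TRUE
```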
---
- The constants `\(a_2\)` and `\(a_3\)` will typically be positive, in which case:
`\begin{equation}
Var(\widehat {Err}^{(CV)}) > \frac{1}{n}a_1,
\end{equation}`
- The naive cross-validation intervals **implicitly assume** `\(a_2=0\)` and `\(a_3=0\)`.
- Hence, estimating the variance of `\(\widehat{Err}^{(CV)}\)` by `\(\widehat{se}^2\)` yields an estimate that is too **small** and, in turn, **poor coverage**.
---
# Target of inference: Confidence intervals
<br>
- <b><span style="color:#067373;">Bates, Hastie, and Tibshirani (2021)</span></b>
develop an **estimator** that **empirically estimates the variance** of `\(\widehat{Err}^{(CV)}\)` across many subsamples.
- **Definition 2:** For a sample of size n split into K folds, the cross-validation MSE is:
`\begin{equation}
MSE_{K,n} := \mathbb{E}\big[ (\widehat {Err}^{(CV)} - Err_{XY})^2 \big].
\end{equation}`
- The MSE contains both a **bias term** and a **variance term**, but the **bias** is typically
**small** for cross-validation (Efron, 1983; Efron and Gong, 1983; Efron and Tibshirani, 1997).
- **MSE** can be viewed as a slightly conservative version of the **variance** of the cross-validation estimator.
- The estimate of the MSE can then be used to construct confidence intervals for `\(Err_{XY}\)`, the quantity that
typically matters for practical applications.
---
- **Lemma 2:** For a single split, randomly partition the data into a training set consisting of `\(K-1\)` folds, denoted by
`\(\mathcal{I}_{(train)} = \cup_{k=1}^{K-1} \mathcal{I}_k=(\tilde X, \tilde Y)\)`, and call the remaining fold
`\(\mathcal{I}_{(out)}\)`. Using only `\((\tilde X, \tilde Y)\)`, define the prediction error `\(Err_{\tilde X, \tilde Y}\)`
and an estimator `\(\widehat {Err}_{\tilde X, \tilde Y}\)` (such as cross-validation) as usual. For the hold-out data set, calculate the errors `\(\{e^{(out)}_i\}_{i \in \mathcal{I}_{(out)}}\)` and their average
`\(\bar e^{(out)}\)`.
- Then the MSE satisfies the following identity, whose right-hand-side terms can be estimated from the data:
`\begin{equation}
\mathbb{E}\big[ (\widehat {Err}_{\tilde X, \tilde Y} - Err_{\tilde X, \tilde Y})^2 \big] = \mathbb{E}\big[ (\widehat {Err}_{\tilde X, \tilde Y} -\bar e^{(out)})^2 \big] - \mathbb{E}\big[ (\bar e^{(out)} - Err_{\tilde X, \tilde Y})^2 \big].
\end{equation}`
<img src="imgs/paper_fig8.png" width="60%" style="display: block; margin: auto;" />
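---
- **Why Lemma 2 holds:** `\(\widehat{Err}_{\tilde X, \tilde Y}\)` depends only on the training data, and the hold-out points are fresh draws from `\(P\)`, so `\(\mathbb{E}\big[\bar e^{(out)} \,|\, \tilde X, \tilde Y\big] = Err_{\tilde X, \tilde Y}\)`. Hence the cross term in
`\begin{align}
\mathbb{E}\big[ (\widehat {Err}_{\tilde X, \tilde Y} -\bar e^{(out)})^2 \big]
= \mathbb{E}\big[ (\widehat {Err}_{\tilde X, \tilde Y} - Err_{\tilde X, \tilde Y})^2 \big] + \mathbb{E}\big[ (Err_{\tilde X, \tilde Y} - \bar e^{(out)})^2 \big] \nonumber
\end{align}`
vanishes, and rearranging gives Lemma 2; both expectations on its right-hand side can be estimated from the data, which is what the nested CV algorithm on the next slide does.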
---
# Nested cross-validation (NCV) estimate of MSE
#### <b><span style="color:#067373;">Nested CV algorithm</span></b>
- **Repeatedly** split the data into `\(K\)` folds, with `\(K-1\)` folds forming `\(\mathcal{I}_{(train)}\)` and the remaining
one being `\(\mathcal{I}_{(out)}\)`.
- For each split `\(j\)`:
- Compute `\(\epsilon_j := \widehat {Err}_{\tilde X \tilde Y}\)` by `\((K-1)\)`-fold
cross-validation over the `\(K-1\)` folds in `\(\mathcal{I}_{(train)}\)` (each inner model is trained on `\(K-2\)` folds).
- Train the model on `\(\mathcal{I}_{(train)}\)` and compute the errors `\(e_{i}\)` for all data points `\((x_i,y_i) \in \mathcal{I}_{(out)}\)`.
- Compute `\(\bar e^{(out)} := \text{mean of } \{e_i\}_{i \in \mathcal{I}_{(out)}}\)`.
- Set `\(a_j := (\widehat {Err}_{\tilde X \tilde Y} - \bar e^{(out)})^2\)` (estimate of the first term on the RHS of Lemma 2).
- Set `\(b_j := (\text{empirical variance of } \{e_i\}_{i \in \mathcal{I}_{(out)}})/m\)` (estimate of the second term, the variance of `\(\bar e^{(out)}\)`).
- Output `\(\widehat{MSE} := \text{mean}(a_j)-\text{mean}(b_j)\)`.
- Output `\(\widehat {Err}^{(NCV)} := \text{mean}(\epsilon_j)\)`.
<img src="imgs/paper_fig8.png" width="40%" style="display: block; margin: auto;" />
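---
- A minimal R sketch of the algorithm above (illustrative, not the authors' implementation: OLS with squared-error loss, one hold-out fold per split, `\(n\)` divisible by `\(K\)`; `\(b_j\)` is divided by the hold-out size so that it estimates the variance of `\(\bar e^{(out)}\)`):

```r
set.seed(1)
n <- 100; p <- 20; K <- 5; n_splits <- 200
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% rep(1, p)) + rnorm(n)

sq_err <- function(tr, te) {   # pointwise squared errors on te, model fit on tr
  th <- lm.fit(X[tr, , drop = FALSE], y[tr])$coefficients
  (y[te] - drop(X[te, , drop = FALSE] %*% th))^2
}

a <- b <- eps <- numeric(n_splits)
for (j in 1:n_splits) {
  folds <- split(sample(n), rep(1:K, each = n / K))
  out <- folds[[K]]                      # I_out: the held-out fold
  train <- setdiff(1:n, out)             # I_train: the other K-1 folds

  # eps_j: CV estimate over the K-1 training folds (inner fits use K-2 folds)
  eps[j] <- mean(unlist(lapply(folds[1:(K - 1)], function(idx)
    sq_err(setdiff(train, idx), idx))))

  e_out <- sq_err(train, out)            # errors on the hold-out fold
  a[j] <- (eps[j] - mean(e_out))^2       # estimates 1st RHS term of Lemma 2
  b[j] <- var(e_out) / length(e_out)     # estimates 2nd RHS term (Var of e_bar)
}
mse_hat <- mean(a) - mean(b)             # MSE_hat
err_ncv <- mean(eps)                     # Err_hat^(NCV)
```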
---
- The nested CV algorithm provides both a **point estimate** of prediction error, denoted `\(\widehat {Err}^{(NCV)}\)`, and an **estimate of the MSE**, denoted `\(\widehat{MSE}\)`.
- **Theorem 2:** For nested CV with a sample of size `\(n\)`:
`\begin{equation}
\mathbb{E}\big[ \widehat{MSE} \big] = MSE_{K-1,\,(K-1)n/K},
\end{equation}`
- where `\(n/K\)` is the fold size.
- Since the estimation is done over `\(K-1\)` folds, `\(\widehat{MSE}\)` estimates the actual quantity of interest `\(MSE_{K,n}\)` with some **bias**.
- `\(\widehat{MSE}\)` is **rescaled** to obtain an unbiased estimate of `\(MSE_{K,n}\)`.
- Similarly, `\(\widehat {Err}^{(NCV)}\)` is also adjusted (**de-biased**).
---
# A `\(100(1-\alpha)\%\)` confidence interval
<br>
- Finally, a `\(100(1-\alpha)\%\)` confidence interval is obtained as follows:
`\begin{equation}
\Big(\widehat {Err}^{(NCV)} - \widehat {bias} - z_{1-(\frac{\alpha}{2})} \times \widehat{se}^{(NCV)} \quad, \quad \widehat {Err}^{(NCV)} - \widehat {bias} + z_{1-(\frac{\alpha}{2})} \times \widehat{se}^{(NCV)} \Big),
\end{equation}`
- where
`\begin{equation}
\widehat {bias} := \Big(1 + \big(\frac{K-2}{K}\big)\Big) \big(\widehat {Err}^{(NCV)} - \widehat {Err}^{(CV)}\big) \quad \text{and} \quad
\widehat{se}^{(NCV)} :=\sqrt{\frac{K-1}{K}}\sqrt{\widehat {MSE}}.
\end{equation}`
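---
- Putting the pieces together in R (a sketch continuing the one above; `err_ncv`, `err_cv`, and `mse_hat` are assumed to have been computed already, and the numbers in the usage line are hypothetical):

```r
ncv_interval <- function(err_ncv, err_cv, mse_hat, K, alpha = 0.1) {
  bias_hat <- (1 + (K - 2) / K) * (err_ncv - err_cv)   # bias estimate
  se_ncv <- sqrt((K - 1) / K) * sqrt(mse_hat)          # rescaled standard error
  z <- qnorm(1 - alpha / 2)
  c(lower = err_ncv - bias_hat - z * se_ncv,
    upper = err_ncv - bias_hat + z * se_ncv)
}
ncv_interval(err_ncv = 1.30, err_cv = 1.25, mse_hat = 0.004, K = 5)
```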
---
# Simulation experiments: Data generation scenario
<br>
- The **coverage** of the nested CV intervals is investigated for **classification** and **regression** problems on synthetic data sets (and real data sets).
- Consider a **sparse logistic data generating model**:
`\begin{align}
Pr(Y_i = 1 | X_i = x_i) = \frac{1}{1 + \exp(-x_{i}^{T}\theta)}, \quad i=1,\ldots,n,
\end{align}`
- where `\(n\)` is the number of observations, `\(p\)` is the number of features,
- `\(X_i\)` is the feature vector, consisting of i.i.d. standard Gaussian entries,
- the coefficient vector is
`\(\theta = c \times (1,1,1,1,0,0,...)^{T} \in \mathbb{R}^{p}\)`, and
- `\(c\)` is chosen such that the <b><span style="color:#067373;">Bayes error</span></b>, the **optimal lower bound** for `\(Err\)`, is either `\(33.2\%\)` or `\(22.5\%\)`.
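---
- Generating one synthetic data set from this model in R (a sketch; here the constant `c0` is set directly rather than calibrated to a target Bayes error):

```r
set.seed(1)
n <- 100; p <- 20; c0 <- 1
theta <- c0 * c(rep(1, 4), rep(0, p - 4))     # sparse coefficient vector
X <- matrix(rnorm(n * p), n, p)               # i.i.d. standard Gaussian features
prob <- 1 / (1 + exp(-drop(X %*% theta)))     # Pr(Y_i = 1 | X_i = x_i)
y <- rbinom(n, size = 1, prob = prob)         # binary responses
```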
---
# Simulation experiments: Performance metrics
<br>
- In each case:
- The **miscoverage** of the naive CV (CV) and nested CV (NCV) intervals is reported, where
the **nominal miscoverage** rate is `\(10\%\)`.
- The **width** of the NCV intervals is expressed relative to the width of the CV intervals.
- 10-fold CV and 10-fold NCV with `\(200\)` random splits are used.
- R scripts reproducing the experiments are available at: https://github.com/stephenbates19/nestedcv_experiments.
---
# Simulation results: Low-dimensional setting
<br>
- `\(n=100\)`, `\(p=20\)`, and (un-regularized) logistic regression is used as the fitting algorithm.
- Nested CV gives **coverage closer** to the nominal level.
<img src="imgs/result1.png" width="100%" style="display: block; margin: auto;" />
- A "Hi" miscoverage is one where the **confidence interval is too large** and the point estimate falls below the interval; conversely for a "Lo" miscoverage.
---
# Simulation results: High-dimensional setting
<br>
- `\(n=\{90,200\}\)`, `\(p=100\)`, and `\(\ell_1\)`-penalized logistic regression with a fixed
penalty level is used as the fitting algorithm.
- Nested CV gives **coverage closer** to the nominal level.
<img src="imgs/result2.png" width="100%" style="display: block; margin: auto;" />
---
# Discussion & Future work
<br>
- Nested CV is more **computationally intensive** than standard CV, but parallel computation can be used to speed it up.
- It is well known that CV is also commonly used for selecting good values of a learning algorithm's hyperparameters (tuning).
- <b><span style="color:#067373;">Bates, Hastie, and Tibshirani (2021)</span></b> expect that NCV would be of use for hyperparameter selection since it yields more accurate confidence intervals for prediction error.
---
class: center, middle
<br> Thank you! <br>
</textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
<script>var slideshow = remark.create({
"highlightStyle": "atelier-forest-light",
"highlightLines": true,
"highlightSpans": true,
"countIncrementalSlides": true,
"ratio": "16:9"
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
window.dispatchEvent(new Event('resize'));
});
(function(d) {
var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
if (!r) return;
s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
d.head.appendChild(s);
})(document);
(function(d) {
var el = d.getElementsByClassName("remark-slides-area");
if (!el) return;
var slide, slides = slideshow.getSlides(), els = el[0].children;
for (var i = 1; i < slides.length; i++) {
slide = slides[i];
if (slide.properties.continued === "true" || slide.properties.count === "false") {
els[i - 1].className += ' has-continuation';
}
}
var s = d.createElement("style");
s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
d.head.appendChild(s);
})(document);
// delete the temporary CSS (for displaying all slides initially) when the user
// starts to view slides
(function() {
var deleted = false;
slideshow.on('beforeShowSlide', function(slide) {
if (deleted) return;
var sheets = document.styleSheets, node;
for (var i = 0; i < sheets.length; i++) {
node = sheets[i].ownerNode;
if (node.dataset["target"] !== "print-only") continue;
node.parentNode.removeChild(node);
}
deleted = true;
});
})();
(function() {
"use strict"
// Replace <script> tags in slides area to make them executable
var scripts = document.querySelectorAll(
'.remark-slides-area .remark-slide-container script'
);
if (!scripts.length) return;
for (var i = 0; i < scripts.length; i++) {
var s = document.createElement('script');
var code = document.createTextNode(scripts[i].textContent);
s.appendChild(code);
var scriptAttrs = scripts[i].attributes;
for (var j = 0; j < scriptAttrs.length; j++) {
s.setAttribute(scriptAttrs[j].name, scriptAttrs[j].value);
}
scripts[i].parentElement.replaceChild(s, scripts[i]);
}
})();
(function() {
var links = document.getElementsByTagName('a');
for (var i = 0; i < links.length; i++) {
if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
links[i].target = '_blank';
}
}
})();
// adds .remark-code-has-line-highlighted class to <pre> parent elements
// of code chunks containing highlighted lines with class .remark-code-line-highlighted
(function(d) {
const hlines = d.querySelectorAll('.remark-code-line-highlighted');
const preParents = [];
const findPreParent = function(line, p = 0) {
if (p > 1) return null; // traverse up no further than grandparent
const el = line.parentElement;
return el.tagName === "PRE" ? el : findPreParent(el, ++p);
};
for (let line of hlines) {
let pre = findPreParent(line);
if (pre && !preParents.includes(pre)) preParents.push(pre);
}
preParents.forEach(p => p.classList.add("remark-code-has-line-highlighted"));
})(document);</script>
<script>
slideshow._releaseMath = function(el) {
var i, text, code, codes = el.getElementsByTagName('code');
for (i = 0; i < codes.length;) {
code = codes[i];
if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
text = code.textContent;
if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
/^\$\$(.|\s)+\$\$$/.test(text) ||
/^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
code.outerHTML = code.innerHTML; // remove <code></code>
continue;
}
}
i++;
}
};
slideshow._releaseMath(document);
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement('script');
script.type = 'text/javascript';
script.src = 'https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML';
if (location.protocol !== 'file:' && /^https?:/.test(script.src))
script.src = script.src.replace(/^https?:/, '');
document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
</body>
</html>