02-gaussian-copula.Rmd_

# Gaussian copula {#gaussian-copula}

```{r}
library(extraDistr)
library(ggplot2)
library(ggtext)
library(ggrepel)
library(gt)
library(dplyr)
library(tidyr)
library(rlang)
library(purrr)
library(quantreg)
library(lpSolve)
```

여러 개의 연속형 변수로 이루어진 데이터로부터 다변량 분포를 추정한 뒤, 해당 분포로부터 랜덤 샘플을 생성하고자 할 때가 있다. 이는 해당 변수가 영향을 미치는 결과에 대한 불확실성(uncertainty) 추정 등에 유용하다. 예를 들어, 날씨와 재생 에너지 생산량간의 관계를 알고 있다면, 내일 날씨의 분포를 추정하고 그로부터 여러 개의 날씨 시나리오를 생성하여, 내일 재생 에너지 생산량에 대한 분포를 추정할 수 있을 것이다.  다변량 분포의 함수 형태를 알고 있다면(예를 들어 다변량 정규분포 등) maximum likelihood estimation 등의 방법으로 분포를 추정할 수 있겠으나, 분포의 형태를 모른다고 가정할 경우에는 다른 접근 방법이 필요하다. 그 중 하나의 방법으로 고려할 수 있는 것이 Gaussian copula이다.

우선, 단일변수(univariate)에 대해 고려해보고, 이후 다변량 분포에 대해 확장해보자.

## Univariate

연속형 변수 $Y$에 대한 분포 $F_Y(y)$로부터 얻어진 $N$개의 관측치 $y_{(1)}, y_{(2)}, \ldots, y_{(N)}$가 존재한다고 하자 ($y_{(1)} \leq y_{(2)} \leq \cdots \leq y_{(N)}$). 이 때, 함수 $F_Y$의 형태는 모른다고 가정하자.

### Empirical distribution

함수의 형태를 모를 때, 관측된 표본만을 이용하여 함수를 아래와 같이 step function으로 추정할 수 있다.

\[
\hat{F}_Y(y) = \frac{1}{N} \sum_{i = 1}^{N} \mathbb{I}(y_{(i)} \leq y)
\]

위 추정방법을 이용하여 아래 두 가지 분포를 추정해보자.

\begin{eqnarray*}
F_1(y) &=& N(0, 1^2)\\
F_2(y) &=& Gumbel(-\gamma, 1)
\end{eqnarray*}

위 각 분포에 대해 각각 관측치 $N = 10$인 경우와 $N = 100$인 경우의 추정결과를 통해, 관측치가 증가할수록 원 분포에 가까운 추정결과를 얻을 수 있음을 확인해보자.

```{r}
set.seed(2)
euler <- - digamma(1)

empirical_df <- tibble(
  N = c(10L, 100L)
) %>% mutate(
    F1 = map(N, ~ rnorm(.x, 0, 1)),
    F2 = map(N, ~ rgumbel(.x, - euler, 1))
  ) %>%
  pivot_longer(starts_with("F"), names_to = "distribution", values_to = "y") %>%
  mutate(empirical = map(y, ecdf))

Fn_list <- list(
  F1 = function(q) pnorm(q, 0, 1),
  F2 = function(q) pgumbel(q, - euler, 1)
)
  
empirical_estimate_df <- empirical_df %>%
  select(N, distribution, empirical) %>%
  mutate(
    y = list(seq(-3, 3, by = 0.01)),
    Fy = map2(distribution, y, ~ exec(Fn_list[[.x]], .y)),
    Fy_hat = map2(empirical, y, ~ exec(.x, .y))
  ) %>%
  select(-empirical) %>%
  unnest(c(y, Fy, Fy_hat))
```


```{r, echo=FALSE}
empirical_estimate_df %>%
  ggplot(aes(x = y, y = Fy_hat)) +
  geom_line(aes(y = Fy), data = . %>% filter(N == 10), color = "black") +
  geom_line(aes(color = factor(N), group = N)) +
  scale_color_manual(values = c(`10` = "steelblue", `100` = "firebrick")) +
  labs(x = NULL, y = NULL,
       title = "Estimated empirical distribution: True vs. <span style = 'color:steelblue'>N = 10</span> vs. <span style = 'color:firebrick'>N = 100</span>") +
  theme(
    plot.title = element_markdown(),
    plot.title.position = "plot",
    legend.position = "none"
  ) +
  facet_wrap(~ distribution)
```


### Interpolation and extrapolation

위 step function $\hat{F}_Y(y)$들은 $Y$값이 $N$개의 관측치와 일치할 때만 0보다 큰 값을 나타내는 discrete probability distribution을 나타낸다. 원 확률분포 $F(y)$를 연속형 변수 $Y$에 대한 분포라고 가정하였므로, 연속형 변수 $Y$에 대한 랜덤 샘플을 생성하기 위해서는 위 step function을 continuous function으로 적절히 변환해주어야 한다.

이러한 변환 방법에는 여러가지 방법이 있겠으나, 본 장에서는 $Y \in (-\infty, \infty)$라 가정하고 간단하게 아래의 두 가지 방법을 섞어서 사용해보기로 하자.

- $\hat{F}_Y(y) \in [0.1, 0.9]$인 경우, 즉, $y$값이 0.1-quantile과 $0.9$-quantile 사이로 추정될 경우, linear interpolation으로 확률분포함수를 추정해보다.
- $\hat{F}_Y(y) \notin [0.1, 0.9]$인 경우, 즉, $y$값이 매우 작거나 매우 큰 경우, exponential extrapolation으로 확률분포함수를 추정해보자.

```{block2, type = 'note'}
**NOTE**: 위 방법은 간단한 예를 보이기 위함이며, 알고 있는 변수의 특성이나 주어진 데이터에 따라 적절한 방법을 적용할 필요가 있다.
```

#### Linear interpolation

\[
\tilde{F}_Y(y) = \frac{(y - y_{(i)})\hat{F}_Y(y_{(i + 1)}) + (y_{(i + 1)} - y)\hat{F}_Y(y_{(i)})}{y_{(i + 1)} - y_{(i)}}, \; i \in [1, N - 1], y \in [y_{(i)}, y_{(i + 1)}) 
\]

위 함수 $\tilde{F}_Y(y)$는 step function $\hat{F}_Y(y)$의 knot $y_{(1)}, \ldots, y_{(N)}$의 각 구간에서 linear function으로 정의되는 piecewise linear 함수이다.


```{r}
interpolation_df <- empirical_df %>%
  mutate(interpolation = map2(empirical, y, ~ approxfun(x = .y, y = exec(.x, .y))))
```


```{r}
interpolation_estimate_df <- interpolation_df %>%
  select(N, distribution, interpolation) %>%
  inner_join(empirical_estimate_df, by = c("N", "distribution")) %>%
  mutate(Fy_tilde = map2_dbl(interpolation, y, ~ exec(.x, .y))) %>%
  select(-interpolation)
```


```{r, echo=FALSE}
interpolation_estimate_df %>%
  ggplot(aes(x = y, y = Fy_tilde)) +
  geom_hline(aes(yintercept = 0.1), color = "gray", linetype = "dashed") +
  geom_hline(aes(yintercept = 0.9), color = "gray", linetype = "dashed") +
  geom_line(aes(y = Fy), data = . %>% filter(N == 10), color = "black") +
  geom_line(aes(color = factor(N), group = N)) +
  scale_color_manual(values = c(`10` = "steelblue", `100` = "firebrick")) +
  labs(x = NULL, y = NULL,
       title = "Linear interpolation: True vs. <span style = 'color:steelblue'>N = 10</span> vs. <span style = 'color:firebrick'>N = 100</span>") +
  theme(
    plot.title = element_markdown(),
    plot.title.position = "plot",
    legend.position = "none"
  ) +
  facet_wrap( ~ distribution)
  
```

위 그래프에서, 추정된 함수는 step function이 아니라 각 $y$값이 증가함에 따라 함수 추정값도 증가하는 strictly increasing function이다. $N = 100$일 때는 앞에서 살펴본 step function과 근사한 값을 지니나, $N = 10$인 경우에는 step function과 큰 차이를 보인다.

위 piecewise linear function의 역함수를 구하여 아래와 같이 새로운 열에 저장하자.

```{r}
interpolation_df2 <- interpolation_df %>%
  mutate(interpolation_inv = map2(empirical, y, ~ approxfun(x = exec(.x, .y), y = .y)))
```


#### Exponential extrapolation

함수가 모든 $y \in (-\infty, \infty)$에 대해 정의되도록 하기 위해, 아래와 같이 $y \leq y_{low}$ 및 $y \geq y_{high}$에 대하여 함수를 정의해보자. 이 때, $y_{(1)} \leq y_{low} < y_{high} < y_{(N)}$이라 하고, $\tau_{low} = \tilde{F}_Y(y_{low}) < 0.5$, $\tau_{high} = \tilde{F}_Y(y_{high}) > 0.5$라 하자.

```{r}
extrapolate_cdf <- function(x, cdf, cdf_inv, tau_low, tau_high) {
  x_low <- cdf_inv(tau_low)
  x_high <- cdf_inv(tau_high)
  
  lambda_left <-
    log(1 / 2) / (cdf_inv(tau_low) - cdf_inv(2 * tau_low))
  lambda_right <-
    -log(1 / 2) / (cdf_inv(tau_high) - cdf_inv(2 * tau_high - 1))
  
  if (x < x_low) {
    # handle small values
    rtn <- tau_low * exp(lambda_left * (x - x_low))
  } else if (x >= x_high) {
    # handle large values
    rtn <- 1 - (1 - tau_high) * exp(-lambda_right * (x - x_high))
  } else {
    rtn <- cdf(x)
  }
  
  return(rtn)
}

```


```{r}
extrapolation_df <- interpolation_df2 %>%
  mutate(
    extrapolation =
      map2(interpolation, interpolation_inv,
           function(cdf, cdf_inv, tau_low, tau_high) {
             function(x)
               extrapolate_cdf(x, cdf, cdf_inv, tau_low, tau_high)
           },
           tau_low = 0.1,
           tau_high = 0.9)
  )
```


```{r}
extrapolation_estimate_df <- extrapolation_df %>%
  select(N, distribution, extrapolation) %>%
  inner_join(empirical_estimate_df, by = c("N", "distribution")) %>%
  mutate(Fy_tilde = map2_dbl(extrapolation, y, ~ exec(.x, .y))) %>%
  select(-extrapolation)
```

```{r, echo=FALSE}
extrapolation_estimate_df %>%
  ggplot(aes(x = y, y = Fy_tilde)) +
  geom_hline(aes(yintercept = 0.1), color = "gray", linetype = "dashed") +
  geom_hline(aes(yintercept = 0.9), color = "gray", linetype = "dashed") +
  geom_line(aes(y = Fy), data = . %>% filter(N == 10), color = "black") +
  geom_line(aes(color = factor(N), group = N)) +
  scale_color_manual(values = c(`10` = "steelblue", `100` = "firebrick")) +
  labs(x = NULL, y = NULL,
       title = "Linear interpolation + Exponential extrapolation: True vs. <span style = 'color:steelblue'>N = 10</span> vs. <span style = 'color:firebrick'>N = 100</span>") +
  theme(
    plot.title = element_markdown(),
    plot.title.position = "plot",
    legend.position = "none"
  ) +
  facet_wrap( ~ distribution)
  
```

위 extrapolation function의 역함수를 구하여 아래와 같이 새로운 열에 저장하자.

```{r}
extrapolate_cdf_inv <- function(p, cdf_inv, tau_low, tau_high) {
  lambda_left <-
    log(1 / 2) / (cdf_inv(tau_low) - cdf_inv(2 * tau_low))
  lambda_right <-
    -log(1 / 2) / (cdf_inv(tau_high) - cdf_inv(2 * tau_high - 1))
  
  if (p < tau_low) {
    # handle small values
    rtn <-
      log(p / (2 * tau_low)) / lambda_left + cdf_inv(2 * tau_low)
  } else if (p > tau_high) {
    # handle large values
    rtn <-
      -log((1 - p) / (2 * tau_high - 1)) / lambda_right +
      cdf_inv(2 * tau_high - 1)
  } else {
    rtn <- cdf_inv(p)
  }
  
  return(rtn)
}
```


```{r}
extrapolation_df2 <- extrapolation_df %>%
  mutate(extrapolation_inv =
           map(interpolation_inv, function(cdf_inv, tau_low, tau_high) {
             function(p)
               extrapolate_cdf_inv(p, cdf_inv, tau_low, tau_high)
           },
           tau_low = 0.1,
           tau_high = 0.9))
```


```{r}

```


### Probability integral transform


## Multivariate