09-BalancedandSpreaded.Rmd

# Balanced and well-spread sampling {#BalancedSpreaded}

In this chapter two related but fundamentally different sampling designs are described and illustrated. The similarity and difference are shortly outlined below, but hopefully will become clearer in following sections.

Roughly speaking, for a balanced sample the sample means of covariates are equal to the population means of these covariates. When the covariates are linearly related to the study variable, this may yield a more precise estimate of the population mean or total of the study variable.

A well-spread sample\index{Well-spread sample} is a sample with a large range of values for the covariates, from small to large values, but also including intermediate values. In more technical terms: the sampling units are well-spread along the axes spanned by the covariates. If the spatial coordinates are used as covariates (spreading variables), this results in samples that are well-spread in geographical space. Such samples are commonly referred to as spatially balanced samples, which is somewhat confusing, as the geographical spreading is not implemented through balancing on the geographical coordinates. On the other hand, the averages of the spatial coordinates of a sample well-spread in geographical space will be close to the population means of the coordinates. Therefore, the sample will be approximately balanced on the spatial coordinates [@Grafstrom2014]. The reverse is not true: with balanced sampling\index{Balanced sampling}, the spreading of the sampling units in the space spanned by the balancing variables can be poor. A sample with all values of a covariate used in balancing near the population mean of that variable has a poor spreading along the covariate axis, but can still be perfectly balanced.  

## Balanced sampling {#Balanced}

Balanced sampling is a sampling method that exploits one or more quantitative covariates that are related to the study variable. The idea behind balanced sampling is that, if we know the mean of the covariates, then the sampling efficiency can be increased by selecting a sample whose averages of the covariates must be equal to the population means of the covariates.

Let me illustrate balanced sampling with a small simulation study. The simulated population shown in Figure \@ref(fig:simpleexample) shows a linear trend from West to East and besides a trend from South to North. Due to the West-East trend, the simulated study variable $z$ is correlated with the covariate Easting and, due to the South-North trend, also with the covariate Northing. To estimate the population mean of the simulated study variable, intuitively  it is attractive to select a sample with an average of the Easting coordinate that is equal to the population mean of Easting (which is 10). Figure \@ref(fig:simpleexample) (subfigure on the left) shows such a sample of size four; we say that the sample is `balanced' on the covariate Easting. The sample in the subfigure on the right is balanced on Easting as well as on Northing.  

```{r simpleexample, echo = FALSE, out.width = "100%", fig.cap = "Sample balanced on Easting (E) and on Easting and Northing (E and N)."}
#define residual variogram for simulation
vgmodel <- vgm(model = "Exp", psill = 10, range = 4, nugget = 0)

#define discretisation grid
x <- 1:20 - 0.5
y <- x
grid <- expand.grid(x, y)
names(grid) <- c("x1", "x2")
distx <- outer(grid$x1, grid$x1, FUN = "-")
disty <- outer(grid$x2, grid$x2, FUN = "-")
dist <- sqrt(distx^2 + disty^2)

#compute matrix with covariances
C <- variogramLine(vgmodel, dist_vector = dist, covariance = TRUE)

#now simulate values for grid by Cholesky decomposition
Upper <- chol(C)

set.seed(31415)
G <- rnorm(n = nrow(grid), 0, 1) #simulate random numbers from standard normal distribution

#trend coefficient in x-direction
b1 <- 2
b2 <- 1
grid$z <- crossprod(Upper, G) + b1 * grid$x1 + b2 * grid$x2

#compute population size
N <- nrow(grid)

#set sample size)
n <- 4

#define matrix with covariate for balancing; first column of matrix must be filled with ones
X <- cbind(rep(1, times = N), grid$x1)

#compute inclusion probabilities; use equal probabilities
pi <- rep(n / N, times = N)

nsam <- 100
mx_pop <- mean(grid$x1)
set.seed(31415)
repeat {
    sample_ind <- samplecube(X = X, pik = pi, comment = FALSE, method = 1)
    mysample <- grid[sample_ind == 1, ]
    mx_sample <- mean(mysample$x1)
    if (mx_sample == mx_pop) {
      break
      }
}

#now select a sample balanced on Easting and Northing
X <- cbind(rep(1, times = N), grid$x1, grid$x2)

mx1_pop <- mean(grid$x1)
mx2_pop <- mean(grid$x2)
set.seed(314)
repeat {
  sample_ind <- samplecube(X = X, pik = pi, comment = FALSE, method = 1)
  mysample2 <- grid[sample_ind == 1, ]
  mx1_sample <- mean(mysample2$x1)
  mx2_sample <- mean(mysample2$x2)
  if (mx1_sample == mx1_pop & mx2_sample == mx2_pop) {
    break
    }
}

mysamples <- rbind(mysample, mysample2)
mysamples$samid <- rep(c("E", "E and N"), each = 4)

ggplot(data = grid) +
  geom_tile(mapping = aes(x = x1, y = x2, fill = z)) +
  geom_tile(data = mysamples, mapping = aes(x = x1, y = x2), colour = "white", size = 0.8, width = 1, height = 1, fill = NA) +
  scale_fill_viridis_c(name = "z") +
  scale_x_continuous(name = "Easting") +
  scale_y_continuous(name = "Northing") +
  facet_wrap(~ samid, ncol = 2, nrow = 1) +
  coord_fixed()
```

### Balanced sample vs. balanced sampling design

We must distinguish a balanced *sample* from a balanced sampling *design*. A sampling design is balanced on a covariate $x$ when *all possible* samples that can be generated by the design are balanced on $x$. 

```{block2, type = 'rmdnote'}
Simple random sampling is not a balanced sampling design, because for many simple random samples the sample mean of the balancing variable $x$ is not equal to the population mean of $x$. Only the *expectation* of the sample mean of $x$, i.e., the mean of the sample means obtained by selecting an infinite number of simple random samples, equals the population mean of $x$.
```

Figure \@ref(fig:scatterplotsqerror) shows for 1,000 simple random samples the squared error of the estimated population mean of the study variable $z$ against the difference between the sample mean of balancing variable Easting and the population mean of Easting. Clearly, the larger the absolute value of the difference, the larger on average the squared error. So, to obtain a precise and accurate estimate of the population mean of $z$, we better select samples with a difference close to 0.

(ref:scatterplotsqerrorlabel) Squared error in the estimated population mean of $z$ against the difference between the sample mean and the population mean of Easting, for 1,000 simple random samples of size four selected from the population shown in Figure \@ref(fig:simpleexample).

```{r scatterplotsqerror, echo = FALSE, fig.width = 5, fig.cap = "(ref:scatterplotsqerrorlabel)"}
nsam <- 1000
n <- 4
mz_sample <- mx_sample <- numeric(length = nsam)
set.seed(31415)
for (i in 1:nsam) {
  #select simple random sample from the data
  sample_ind <- sample(nrow(grid), size = n)
  mz_sample[i] <- mean(grid$z[sample_ind])
  mx_sample[i] <- mean(grid$x1[sample_ind])
}

sqerror <- (mz_sample - mean(grid$z))^2
devx <- mx_sample - mean(grid$x1)
df <- data.frame(sqerror, devx)
ggplot(data = df) +
  geom_point(mapping = aes(y = sqerror, x = devx), size = 2) +
  scale_x_continuous(name = "Sample mean Easting - Population mean Easting", limits = c(-9, 9)) +
  scale_y_continuous(name = "Squared error")
```

Using only Easting as a balancing variable reduces the sampling variance of the estimator of the mean substantially. Using Easting and Northing as balancing variables further reduces the sampling variance. See Table \@ref(tab:tablebalanced).  

```{r tablebalanced, echo = FALSE}
tbl <- data.frame(x = c("SI", "Balanced", "Balanced"), y = c("-", "Easting", "Easting and Northing"), z = c(39.7, 14.4, 9.77))

knitr::kable(
  tbl, caption = "Sampling variance of the $\\pi$ estimator of the mean for simple random sampling (SI) and balanced sampling of four units.",
  col.names = c("Sampling design", "Balancing variables", "Sampling variance"),
  booktabs = TRUE,
  linesep = ""
) %>%
  kable_classic()
```

### Unequal inclusion probabilities

Until now we have assumed that the inclusion probabilities of the population units are equal, but this is not a requirement for balanced sampling designs. A more general definition of a balanced sampling design is as follows. A sampling design is balanced on variable $x$ when for all samples generated by the design the $\pi$ estimate of the population total of $x$ equals the population total of $x$:  

\begin{equation}
\sum_{k \in \mathcal{S}} \frac{x_k}{\pi_k}= \sum_{k=1}^{N} x_k \;.
(\#eq:generaldefinitionbalanced)
\end{equation}

Similar to the regression estimator\index{Regression estimator} explained in the next chapter (Subsection \@ref(RegressionEstimator)), balanced sampling exploits the linear relation between the study variable and one or more covariates. When using the regression estimator, this is done at the estimation stage. Balanced sampling does so at the sampling stage. For a single covariate the regression estimator of the population total equals (see Equation \@ref(eq:SimpleRegressionEstimatorSI))

\begin{equation}
\hat{t}_{\mathrm{regr}}(z) = \hat{t}_{\pi}(z) + \hat{b}\left(t(x) - \hat{t}_{\pi}(x)\right) \;,
(\#eq:RegressionEstimatoranydesign)
\end{equation}

with $\hat{t}_{\pi}(z)$ and $\hat{t}_{\pi}(x)$ the $\pi$ estimators of the population total of the study variable $z$ and the covariate $x$, respectively, $t(x)$ the population total of the covariate, and $\hat{b}$ the estimated slope parameter (see hereafter). With a perfectly balanced sample the second term in the regression estimator, which adjusts the $\pi$ estimator, equals zero.

Balanced samples can be selected with the cube algorithm\index{Cube algorithm for balanced sampling} of @Deville2004. The population total and mean can be estimated by the $\pi$ estimator. The approximated variance of the $\pi$ estimator of the population mean can be estimated by (@Deville2005, @Grafstrom2013) 

\begin{equation}
\widehat{V}(\hat{\bar{z}}) = \frac{1}{N^2}\frac{n}{n-p} \sum_{k \in \mathcal{S}} c_k \left(\frac{e_k}{\pi_k}\right)^2 \;,
(\#eq:approxvarianceBalanced)
\end{equation}

with $p$ the number of balancing variables, $c_k$ a weight for unit $k$ (see hereafter), and $e_k$ the residual of unit $k$ given by

\begin{equation}
e_k = z_k - \mathbf{x}_k^{\text{T}}\hat{\mathbf{b}} \;,
(\#eq:residualsBalanced)
\end{equation}

with $\mathbf{x}_k$ a vector of length $p$ with the balancing variables for unit $k$, and $\hat{\mathbf{b}}$ the estimated population regression coefficients\index{Population regression coefficient}, given by

\begin{equation}
\hat{\mathbf{b}} = \left(\sum_{k \in \mathcal{S}} c_k \frac{\mathbf{x}_k}{\pi_k} \frac{\mathbf{x}_k}{\pi_k}^{\text{T}} \right)^{-1} \sum_{k \in \mathcal{S}} c_k \frac{\mathbf{x}_k}{\pi_k} \frac{z_k}{\pi_k} \;.
(\#eq:betasbalanced)
\end{equation}

Working this out for balanced sampling without replacement with equal inclusion probabilities, $\pi_k = n/N,\; k = 1, \dots , N$, yields

\begin{equation}
\widehat{V}(\hat{\bar{z}}) = \frac{1}{n(n-p)} \sum_{k \in \mathcal{S}} c_k e_k^2 \;.
(\#eq:approxvarianceBalancedSI)
\end{equation}

@Deville2005 give several formulas for computing the weights $c_k$, one of which is $c_k = (1-\pi_k)$.

Balanced sampling is now illustrated with aboveground biomass (AGB) data of Eastern Amazonia, see Figure \@ref(fig:mapsAmazonia). Log-transformed short-wave infrared radiation (lnSWIR2) is used as a balancing variable. The `samplecube` function of the **sampling** package [@Tille2016] implements the cube algorithm. Argument `X` of this function specifies the matrix of ancillary variables on which the sample must be balanced. The first column of this matrix is filled with ones, so that the sample size is fixed. To speed up the computations, a 5 km $\times$ 5 km subgrid of `grdAmazonia` is used.

```{block2, type='rmdnote'}
Recall that a sample is balanced on a covariate $x$ if the $\pi$ estimate of the population total of $x$ is equal to the known true population total of $x$ (Equation \@ref(eq:generaldefinitionbalanced)). If we know the total number of units in a population, $N$, we can balance the sample on this known total using a constant with value 1 as a balancing variable. Only for samples of size $n$ the $\pi$ estimate of the total number of population units equals $N$: $\sum_{k \in \mathcal{S}} 1/\pi_k = N$ for $|\mathcal{S}| = n$.
```

Equal inclusion probabilities are used, i.e., for all population units the inclusion probability equals $n/N$.

```{r, echo = FALSE}
grdAmazonia <- read_rds(file = "results/grdAmazonia_5km.rds")
```

 
```{r}
grdAmazonia <- grdAmazonia %>%
  mutate(lnSWIR2 = log(SWIR2))
library(sampling)
N <- nrow(grdAmazonia)
n <- 100
X <- cbind(rep(1, times = N), grdAmazonia$lnSWIR2)
pi <- rep(n / N, times = N)
sample_ind <- samplecube(X = X, pik = pi, comment = FALSE, method = 1)
eps <- 1e-6
units <- which(sample_ind > (1 - eps))
mysample <- grdAmazonia[units, ]
```

The population mean can be estimated by the sample mean.

```{r}
mz <- mean(mysample$AGB)
```

To estimate the variance, a function is defined for estimating the population regression coefficients.

```{r}
estimate_b <- function(z, X, c) {
  cXX <- matrix(nrow = ncol(X), ncol = ncol(X), data = 0)
  cXz <- matrix(nrow = 1, ncol = ncol(X), data = 0)
  for (i in seq_len(length(z))) {
    x <- X[i, ]
    cXX_i <- c[i] * (x %*% t(x))
    cXX <- cXX + cXX_i
    cXz_i <- c[i] * t(x) * z[i]
    cXz <- cXz + cXz_i
  }
  b <- solve(cXX, t(cXz))
  return(b)
}
```

The next code chunk shows how the estimated variance of the $\pi$ estimator of the population mean can be computed.

```{r}
pi <- rep(n / N, n)
c <- (1 - pi)
b <- estimate_b(z = mysample$AGB / pi, X = X[units, ] / pi, c = c)
zpred <- X %*% b
e <- mysample$AGB - zpred[units]
v_tz <- n / (n - ncol(X)) * sum(c * (e / pi)^2)
v_mz <- v_tz / N^2
```

Figure \@ref(fig:BalancedSampleAmazonia) shows the selected balanced sample. Note the spatial clustering of some units. The estimated population mean (as estimated by the sample mean) of AGB equals `r formatC(mz, 1, format = "f")` 10^9^ kg ha^-1^. The population mean of AGB equals `r formatC(mean(grdAmazonia$AGB), 1, format = "f")` 10^9^ kg ha^-1^. The standard error of the estimated mean equals `r formatC(sqrt(v_mz), 1, format = "f")` 10^9^ kg ha^-1^.

```{r BalancedSampleAmazonia, echo = FALSE, out.width = "100%", fig.cap = "Sample balanced on lnSWIR2 from Eastern Amazonia."}
ggplot(data = grdAmazonia) +
  geom_raster(mapping = aes(x = x1 / 1000, y = x2 / 1000, fill = lnSWIR2)) +
  geom_point(data = mysample, mapping = aes(x = x1 / 1000, y = x2 / 1000), size = 2) +
  scale_fill_viridis_c(name = "lnSWIR2") +
  scale_x_continuous(name = "Easting (km)") +
  scale_y_continuous(name = "Northing (km)") +
  coord_fixed()
```

Figure \@ref(fig:SamplingDistributionBalanced) shows the approximated sampling distribution of the $\pi$ estimator of the mean AGB with balanced sampling and simple random sampling, obtained by repeating the random sampling with both designs and estimation 1,000 times.

```{r, echo = FALSE, eval = FALSE}
mz <- v_mz <- mx_sample <- mz_SI <- numeric(length = 1000)

pi <- rep(n / N, n)
c <- (1 - pi)

set.seed(314)
for (i in 1:1000) {
  sample_ind <- samplecube(X = X, pik = rep(n / N, N), comment = FALSE, method = 1)
  units <- which(sample_ind > (1 - eps))
  mysample <- grdAmazonia[units, ]

  mz[i] <- mean(mysample$AGB)

  b <- estimate_b(z = mysample$AGB / pi, X = X[units, ] / pi, c = c)
  zpred <- X %*% b
  e <- mysample$AGB - zpred[units]
  v_tz <- n / (n - ncol(X)) * sum(c * (e / pi)^2)
  v_mz[i] <- v_tz / N^2

  mx_sample[i] <- mean(mysample$lnSWIR2)

  units <- sample(nrow(grdAmazonia), size = n, replace = FALSE)
  mz_SI[i] <- mean(grdAmazonia$AGB[units])
}
save(mz, mz_SI, v_mz, mx_sample, file = "results/Balanced_Amazonia.rda")
```

(ref:SamplingDistributionBalancedlabel) Approximated sampling distribution of the $\pi$ estimator of the mean AGB (10^9^ kg ha^-1^) in Eastern Amazonia, with balanced sampling, balanced on lnSWIR2 and using equal inclusion probabilities, and simple random sampling without replacement (SI) of 100 units.

```{r SamplingDistributionBalanced, echo = FALSE, fig.width = 5, fig.asp = .8, fig.cap = "(ref:SamplingDistributionBalancedlabel)"}
load(file = "results/Balanced_Amazonia.rda")
estimates <- data.frame(mz, mz_SI, v_mz, mx_sample)
names(estimates)[c(1, 2)] <- c("Balanced", "SI")

df <- estimates %>% pivot_longer(cols = c("Balanced", "SI"))
ggplot(data = df) +
    geom_boxplot(aes(y = value, x = name)) +
    geom_hline(yintercept = mean(grdAmazonia$AGB), colour = "red") +
    scale_x_discrete(name = "Sampling design") +
    scale_y_continuous(name = "Estimated mean AGB")
```

The variance of the 1,000 estimates of the population mean of the study variable AGB equals `r formatC(var(estimates$Bal), 1, format = "f")` (10^9^ kg ha^-1^)^2^. The gain in precision compared to simple random sampling equals `r formatC(var(estimates$SI)/var(estimates$Bal), 3, format = "f")` (design effect is `r formatC(var(estimates$Bal)/var(estimates$SI), 3, format = "f")`), so with simple random sampling about three times more sampling units are needed to estimate the population mean with the same precision. The mean of the 1,000 estimated variances equals `r formatC(mean(estimates$v_mz), 1, format = "f")` (10^9^ kg ha^-1^)^2^, indicating that the approximated variance estimator somewhat underestimates the true variance in this case. The population mean of the balancing variable lnSWIR2 equals `r formatC(mean(grdAmazonia$lnSWIR2), 3, format = "f")`. The sample mean of lnSWIR2 varies a bit among the samples. Figure \@ref(fig:histSampleMeanSWIR) shows the approximated sampling distribution of the sample mean of lnSWIR2. In other words, many samples are not perfectly balanced on lnSWIR2. This is not exceptional; in most cases perfect balance is impossible.

```{r histSampleMeanSWIR, echo = FALSE, fig.width = 5, fig.cap = "Approximated sampling distribution of the sample mean of balancing variable lnSWIR2 in Eastern Amazonia with balanced sampling of size 100 and equal inclusion probabilities."}
ggplot(data = estimates) +
  geom_histogram(aes(x = mx_sample, y = ..density..), binwidth = 0.0025, fill = "black", alpha = 0.5, color = "black") +
  geom_density(aes(x = mx_sample, y = ..density..), lwd = 1, adjust = 2) +
  scale_x_continuous(name = "Sample mean lnSWIR2") +
  scale_y_continuous(name = "Density")
```

#### Exercises {-}

1. Select a sample of size 100 balanced on lnSWIR2 and Terra_PP from Eastern Amazonia, using equal inclusion probabilities for all units.   
    + First, select a subgrid of 5 km $\times$ 5 km using function `spsample`, see Chapter \@ref(SY).
    + Estimate the population mean.
    + Estimate the standard error of the $\pi$ estimator. First, estimate the regression coefficients, using function `estimate_b` defined above, then compute the residuals, and finally compute the variance.    
2. Spatial clustering of sampling units is not avoided in balanced sampling. What effect do you expect of this spatial clustering on the precision of the estimated mean? Can you think of a situation where this effect does not occur?   

### Stratified random sampling {#StratifiedsamplingasBalancedsampling}

Much in the same way as we controlled in the previous subsection the sample size $n$ by balancing the sample on the known total number of population units $N$, we can balance a sample on the known total number of units in subpopulations.  A sample balanced on the sizes of subpopulations is a stratified random sample. Figure \@ref(fig:BalancedSampleCategorical) shows four subpopulations or strata. These four strata can be used in balanced sampling by constructing the following design matrix\index{Design matrix} $\mathbf{X}$ with as many columns as there are strata and as many rows as there are population units:

\begin{equation}
\mathbf{X}=
\begin{bmatrix}
1 &0 &0 &0 \\
1 &0 &0 &0 \\
1 &0 &0 &0 \\
1 &0 &0 &0 \\
0 &1 &0 &0 \\
0 &1 &0 &0 \\
0 &0 &1 &0 \\
\vdots &\vdots &\vdots &\vdots\\
0 & 0 & 0 & 1 \\
\end{bmatrix}  \;.
\end{equation}

The first four rows refer to the four leftmost bottom row population units in Figure \@ref(fig:BalancedSampleCategorical). These units belong to class A, which explains that the first column for these units contain ones. The other three columns for these rows contain all zeroes. The fifth and sixth unit belong to stratum B, so that the second column for these rows contain ones, and so on. The final row is the upperright sampling unit in stratum D, so the first three columns contain zeroes, and the fourth column is filled with a one. The sum of the indicators in the columns is the total number of population units in the strata.

```{r BalancedSampleCategorical, echo = FALSE, fig.width = 5, fig.cap = "Sample balanced on a categorical variable with four classes."}
s1 <- s2 <- 1:20 - 0.5
mypop <- expand.grid(s1, s2)
names(mypop) <- c("s1", "s2")
mypop$stratum <- as.factor(findInterval(mypop$s1, c(4, 6, 14)))
levels(mypop$stratum) <- LETTERS[1:4]

N_h <- tapply(mypop$s1, INDEX = mypop$stratum, FUN = length)
n_h <- rep(5, times = 4)
labels <- sort(unique(mypop$stratum))
lut <- data.frame(stratum = labels, pi = n_h / N_h)
mypop <- merge(x = mypop, y = lut)

X <- model.matrix(~ stratum - 1, mypop)

set.seed(314)
sample_ind <- samplecube(X = X, pik = mypop$pi, comment = FALSE, method = 1)
mysample <- mypop[sample_ind == 1, ]

ggplot(data = mypop) +
  geom_tile(mapping = aes(x = s1, y = s2, fill = factor(stratum)), width = 1, height = 1, size = 0.5, colour = "white") +
  geom_tile(data = mysample, mapping = aes(x = s1, y = s2), fill = NA, width = 1, height = 1, size = 0.7, colour = "black") +
  scale_fill_viridis_d(name = "Stratum") +
  scale_x_continuous(name = "Easting") +
  scale_y_continuous(name = "Northing") +
  coord_fixed()
```

In the next code chunk, the inclusion probabilities are computed by $\pi_{hk}=n_h/N_h,\; k=1, \dots , N_h$, with $n_h=5$ for all four strata. The stratum sample sizes are equal, but the number of population units of a stratum differ among the strata, so the inclusion probabilities also differ among the strata.

```{r, echo=FALSE}
mypop <- mypop[, -4]
```

```{r}
N_h <- tapply(mypop$s1, INDEX = mypop$stratum, FUN = length)
n_h <- rep(5, times = 4)
n <- sum(n_h)
print(pi_h <- n_h / N_h)
```

The inclusion probabilities are added to tibble `mypop` with the 400 population units, using a look-up table `lut` and function `left_join`. The ten leftmost units on the bottom row of Figure \@ref(fig:BalancedSampleCategorical) are shown below. Variables s1 and s2 are the spatial coordinates of the centres of the units. 

```{r, echo = FALSE, eval = FALSE}
lut <- data.frame(stratum = sort(unique(mypop$stratum)), pi = n_h / N_h)
mypop <- merge(x = mypop, y = lut, by = "stratum")
ord <- order(mypop$s2, mypop$s1)
head(mypop[ord, ], 7)
```
```{r}
lut <- data.frame(stratum = sort(unique(mypop$stratum)), pi = n_h / N_h)
mypop <- mypop %>%
  as.tibble(mypop) %>%
  left_join(lut, by = "stratum") %>%
  arrange(s2, s1)
mypop
```

Next, the design matrix $\mathbf{X}$ is computed with function `model.matrix`, expanding the factor `stratum` to a set of dummy variables, see code chunk below. By adding `- 1` to the formula, the usual first column with ones in the design matrix is omitted. The design matrix has four columns with dummy variables (indicators), indicating to which stratum a unit belongs. 

The columns in the design matrix with dummy variables are multiplied by the vector with inclusion probabilities, using function `sweep `, resulting in the following design matrix:

\begin{equation}
\mathbf{X}=
\begin{bmatrix}
0.0625 &0 &0 &0 \\
0.0625 &0 &0 &0 \\
0.0625 &0 &0 &0 \\
0.0625 &0 &0 &0 \\
0 &0.125 &0 &0 \\
0 &0.125 &0 &0 \\
0 &0 &0.03125 &0 \\
\vdots &\vdots &\vdots &\vdots\\
0 & 0 & 0 &0.04167 \\
\end{bmatrix}  \;.
\end{equation}

The multiplication by the inclusion probabilities is not strictly needed. Using the design matrix with dummy variables implies that we balance the sample on the known total number of population units in the strata, $N_h$. For samples with stratum sample sizes equal to $n_h$, the sample sums of the dummy variables used in balancing, divided by the inclusion probability, are equal to $N_h$.

Multiplication of the dummy variables by the vector with inclusion probabilities implies that we balance the sample on the population totals of the inclusion probabilities, which are equal to the targeted stratum sample sizes. For samples with stratum sample sizes $n_h$ equal to these targeted sample sizes, the sample sums of the balancing variables, divided by the inclusion probability (having value $\pi_{hk}/\pi_{hk}=1$ or 0), are equal to the targeted sample sizes.

```{r }
X <- model.matrix(~ stratum - 1, data = mypop)
X <- sweep(X, MARGIN = 1, mypop$pi, `*`)
set.seed(314)
sample_ind <- samplecube(X = X, pik = mypop$pi, comment = FALSE, method = 1)
mysample <- mypop[sample_ind > (1 - eps), ]
```

In the above example all units in a stratum have the same inclusion probability, yielding a stratified simple random sample. We may also use variable inclusion probabilities, for instance proportional to a size measure of the units, yielding a stratified ppswor random sample (Section \@ref(ppswor)).

The advantage of selecting a stratified random sample by balancing the sample on a categorical variable becomes clear in case we have multiple classifications that we would like to use in stratification, and we cannot afford to use all cross-classifications as strata. This is the topic of the next subsection.

### Multiway stratification {#Multiwaystratification}

@Falorsi2008 describe how a multiway stratified sample\index{Multiway stratification} can be selected as a balanced sample. Multiway stratification is of interest when one has multiple stratification variables, each stratification variable leading to several strata, so that the total number of cross-classification strata\index{Cross-classification stratum} becomes so large that the stratum sample sizes are strongly disproportional to their size or even exceed the total sample size. For instance, suppose we have three maps with 4, 3, and 5 map units. Further, suppose that all combinations of map units are non-empty, so that we have 4 $\times$ 3 $\times$ 5 = 60 combinations. We may not like to use all combinations (cross-classifications) as strata. The alternative is then to use the $4+3+5=12$ map units as strata.

The sample sizes of the marginal strata can be controlled using a design matrix with as many columns as there are strata. The units of an individual map used for stratification are referred to as marginal strata\index{Marginal stratum}. Each row $k = 1, \dots, N$ in the design matrix  $\mathbf{X}$ has as many non-zero values as we have maps, in entries corresponding to the cross-classification map unit of population unit $k$, and zeroes in the remaining entries. The non-zero value is the inclusion probability of that unit. Each column of the design matrix has non-zero values in entries corresponding to the population units in that marginal stratum and zeroes in all other entries.

Two-way stratified random sampling is illustrated with a simulated population of 400 units (Figure \@ref(fig:TwowaystratifiedPopulation)). Figure \@ref(fig:Twowaystratifiedsample) shows two classifications of the population units. Classification A consists of four classes (map units), classification B of three classes. Instead of using $4 \times 3 = 12$ cross-classifications as strata in random sampling, only $4+3=7$ marginal strata are used in two-way stratified random sampling.

```{r TwowaystratifiedPopulation, echo = FALSE, fig.width = 5, fig.cap = "Simulated population used for illustration of two-way stratified random sampling."}
s1 <- s2 <- 1:20 - 0.5
mypop <- expand.grid(s1, s2)
names(mypop) <- c("s1", "s2")

vgmodel <- vgm(model = "Exp", psill = 4, range = 4, nugget = 0)

dists1 <- outer(mypop$s1, mypop$s1, FUN = "-")
dists2 <- outer(mypop$s2, mypop$s2, FUN = "-")
dist <- sqrt(dists1^2 + dists2^2)
C <- variogramLine(vgmodel, dist_vector = dist, covariance = TRUE)

Upper <- chol(C)
set.seed(314)
N <- nrow(mypop)
G <- rnorm(N, 0, 1) #simulate random numbers from standard normal distribution
mypop$z <- crossprod(Upper, G) + 10

mypop$A <- as.factor(findInterval(mypop$s1, c(5, 8, 12)))
mypop$B <- as.factor(findInterval(mypop$s2, c(8, 15)))
levels(mypop$A) <- c("A1", "A2", "A3", "A4")
levels(mypop$B) <- c("B1", "B2", "B3")

ggplot(data = mypop) +
  geom_tile(mapping = aes(x = s1, y = s2, fill = z), width = 1, height = 1, size = 0.5) +
  scale_fill_viridis_c() +
  scale_x_continuous(name = "Easting") +
  scale_y_continuous(name = "Northing") +
  coord_fixed()
```

As a first step, the inclusion probabilities are added to `data.frame` `mypop` with the spatial coordinates and simulated values. To keep it simple, I computed inclusion probabilities equal to 2 divided by the number of population units in a cross-classification stratum. Note that this does not imply that a sample is selected with two units per cross-classification stratum. As we will see later, it is possible that in some  cross-classification strata no units are selected at all, while in other cross-classification strata more than two units are selected. In multiway stratified sampling, the marginal stratum sample sizes are controlled. The inclusion probabilities  should result in six selected units for all four units of map A and eight selected units for all three units of map B.   

```{r}
mypop <- mypop %>%
    group_by(A, B) %>%
    summarise(N_h = n(), .groups = "drop") %>%
    mutate(pih = rep(2, 12) / N_h) %>%
    right_join(mypop, by = c("A", "B"))
```

The next step is to create the design matrix. Two submatrices are computed, one per stratification. The two submatrices are joined columnwise, using function `cbind`.  The columns are multiplied by the vector with inclusion probabilities.

```{r}
XA <- model.matrix(~ A - 1, mypop)
XB <- model.matrix(~ B - 1, mypop)
X <- cbind(XA, XB)
X <- sweep(X, MARGIN = 1, mypop$pih, `*`)
```

Matrix $\mathbf{X}$ can be reduced by one column if in the first column the inclusion probabilities of *all* population units are inserted. This first column contains no zeroes. Balancing on this variable implies that the total sample size is controlled. Now there is no need anymore to control the sample sizes of all marginal strata. It is sufficient to control the sample sizes of three marginal strata of map A (A2, A3, and A4) and two marginal strata of map B (B2 and B3). Given the total sample size, the sample sizes of map units A1 and B1 then cannot be chosen freely anymore.

```{r}
X <- model.matrix(~ A + B, mypop)
colnames(X)
X <- sweep(X, MARGIN = 1, mypop$pih, `*`)
```

This reduced design matrix is not strictly needed for selecting a multiway stratified sample, but it must be used in estimation. If in estimation, as many balancing variables are used as we have marginal strata, the matrix with the sum of squares of the balancing variables (first sum in Equation \@ref(eq:betasbalanced)) cannot be inverted (the matrix is singular), and as a consequence the population regression coefficients cannot be estimated.

Finally, the two-way stratified random sample is selected with function `samplecube` of package **sampling**. 

```{r}
sample_ind <- samplecube(X = X, pik = mypop$pih, method = 1, comment = FALSE)
units <- which(sample_ind > (1 - eps))
mysample <- mypop[units, ]
```

Figure \@ref(fig:Twowaystratifiedsample) shows the selected sample.

```{r, Twowaystratifiedsample, echo = FALSE, out.width = '100%', fig.cap = "Two-way stratified sample."}
plt1 <- ggplot(data = mypop) +
  geom_tile(mapping = aes(x = s1, y = s2, fill = factor(A)), width = 1, height = 1, size = 0.5, colour = "white") +
  geom_tile(data = mysample, mapping = aes(x = s1, y = s2), fill = NA, width = 1, height = 1, size = 0.7, colour = "black") +
  scale_fill_viridis_d(name = "Stratum") +
  scale_x_continuous(name = "") +
  scale_y_continuous(name = "") +
  coord_fixed()

plt2 <- ggplot(data = mypop) +
  geom_tile(mapping = aes(x = s1, y = s2, fill = factor(B)), width = 1, height = 1, size = 0.5, colour = "white") +
  geom_tile(data = mysample, mapping = aes(x = s1, y = s2), fill = NA, width = 1, height = 1, size = 0.7, colour = "black") +
  scale_fill_viridis_d(name = "Stratum") +
  scale_y_continuous(name = "") +
  scale_x_continuous(name = "") +
  coord_fixed()

grid.arrange(plt1, plt2, nrow = 1)
```

All marginal stratum sample sizes of map A are six and all marginal stratum sample sizes of map B are eight, as expected. The sample sizes of the cross-classification strata vary from zero to four.

```{r}
addmargins(table(mysample$A, mysample$B))
```

The population mean can be estimated by the $\pi$ estimator.

```{r}
N <- nrow(mypop)
print(mean <- sum(mysample$z / mysample$pih) / N)
```

The variance is estimated as before (Equation \@ref(eq:approxvarianceBalanced)).

```{r}
c <- (1 - mysample$pih)
b <- estimate_b(
  z = mysample$z / mysample$pih, X = X[units, ] / mysample$pih, c = c)
zpred <- X %*% b
e <- mysample$z - zpred[units]
n <- nrow(mysample)
v_tz <- n / (n - ncol(X)) * sum(c * (e / mysample$pih)^2)
print(v_mz <- v_tz / N^2)
```

## Well-spread sampling {#Spreaded}

With balanced sampling, the spreading of the sampling units in the space spanned by the balancing variables can be poor. For instance, in Figure \@ref(fig:simpleexample) the Easting coordinates of all units of a sample balanced on Easting can be equal or close to the population mean of 10. So, in this example, balancing does not guarantee a good geographical spreading. Stated more generally, a balanced sample can be selected that shows strong clustering in the space spanned by the balancing variables. This clustering may inflate the standard error of the estimated population total and mean. The clustering in geographical or covariate space can be avoided by the local pivotal method [@Grafstrom2012] and the spatially correlated Poisson sampling method\index{Spatially correlated Poisson sampling} [@Grafstrom2012b]. 

```{block2, type = 'rmdnote'}
For spreading in *geographical* space various other designs are available. A simple design is stratified random sampling from compact geographical strata, see Section \@ref(geostrata). Alternative designs are generalised random-tessellation stratified sampling [@stevens2004], see Subsection \@ref(GRTS), and balanced acceptance sampling\index{Balanced acceptance sampling} [@Robertson2013].
```

### Local pivotal method {#LPM}

The local pivotal method (LPM) is a modification of the pivotal method explained in Subsection \@ref(pivotalmethod)\index{Local pivotal method}. The only difference with the pivotal method is the selection of the pairs of units. In the pivotal method, at each step two units are selected, for instance, the first two units in the vector with inclusion probabilities after randomising the order of the units. In the local pivotal method, the first unit is selected fully randomly, and the nearest neighbour of this unit is used as its counterpart. Recall that when one unit of a pair is included in the sample, the inclusion probability of its counterpart is decreased. This leads to a better spreading of the sampling units in the space spanned by the spreading variables.

LPM can be used for arbitrary inclusion probabilities. The inclusion probabilities can be equal, but as with the pivotal method these probabilities may also differ among the population units.   

Selecting samples with LPM can be done with functions `lpm`, `lpm1`, or `lpm2` of package **BalancedSampling** [@Grafstrom2016]. Functions `lpm1` and `lpm2` only differ in the selection of neighbours that are allowed to compete, for details see @Grafstrom2012. For most populations, the two algorithms perform similarly. The algorithm implemented in function `lpm` is only recommended when the population size is too large for `lpm1` or `lpm2`. It only uses a subset of the population in search for nearest neighbours and is thus not as good. Another function ` lpm2_kdtree` of package  **SamplingBigData** [@samplingBigData] is developed for big data sets.

Inclusion probabilities are computed with function `inclusionprobabilities` of package **sampling**.  A matrix $\mathbf{X}$ must be defined with the values of the spreading variables of the population units. Figure \@ref(fig:LPMKandahar) shows a ppswor sample of 40 units selected from the sampling frame of Kandahar, using the spatial coordinates of the population units as spreading variables. Inclusion probabilities are proportional to the agricultural area within the population units. The geographical spreading is improved compared with the sample shown in Figure \@ref(fig:ppswrKandahar). 

```{r}
library(BalancedSampling)
library(sampling)
n <- 40
pi <- inclusionprobabilities(grdKandahar$agri, n)
X <- cbind(grdKandahar$s1, grdKandahar$s2)
set.seed(314)
units <- lpm1(pi, X)
myLPMsample <- grdKandahar[units, ]
```

```{r LPMKandahar, echo = FALSE, out.width = '100%', fig.cap = "Spatial ppswor sample of size 40 from Kandahar, selected by the local pivotal method, using agricultural area as a size variable."}
ggplot(data = grdKandahar) +
  geom_raster(mapping = aes(x = s1 / 1000, y = s2 / 1000, fill = agri)) +
  geom_tile(data = myLPMsample, mapping = aes(x = s1 / 1000, y = s2 / 1000), colour = "white", size = 0.7, width = 5, height = 5, fill = NA) +
  scale_fill_viridis_c(name = "Agriarea") +
  scale_x_continuous(name = "Easting (km)") +
  scale_y_continuous(name = "Northing (km)") +
  coord_fixed()
```

The total poppy area can be estimated with the $\pi$ estimator (Equation \@ref(eq:HTTotal)).

```{r}
myLPMsample$pi <- pi[units]
tz_HT <- sum(myLPMsample$poppy / myLPMsample$pi)
```

The estimated total poppy area equals `r formatC(tz_HT, 0, format = "f", big.mark = ",")` ha. The sampling variance of the estimator of the population total with the local pivotal method can be estimated by [@Grafstrom2014]

\begin{equation}
\widehat{V}(\hat{t}(z)) = \frac{1}{2} \sum_{k \in \mathcal{S}} \left( \frac{z_k}{\pi_k} - \frac{z_{k_j}}{\pi_{k_j}} \right)^2 \;,
(\#eq:VartotalLPM)
\end{equation}

with ${k_j}$ the nearest neighbour of unit $k$ in the sample. This variance estimator is for the case where we have only one nearest neighbour.

Function `vsb` of package **BalancedSampling** is an implementation of a more general variance estimator that accounts for more than one nearest neighbour (equation (6) in @Grafstrom2014). We expect a somewhat smaller variance compared to pps sampling, so we may use the variance of the pwr estimator (Equation \@ref(eq:VarHHTotalppswr)) as a conservative variance estimator\index{Conservative variance estimator}.

```{r vartotLPM}
Xsample <- X[units, ]
se_tz_HT <- sqrt(vsb(pi[units], myLPMsample$poppy, Xsample))
pk <- myLPMsample$pi / n
se_tz_pwr <- sqrt(var(myLPMsample$poppy / pk) / n)
```

The standard error obtained with function `vsb` equals `r formatC(se_tz_HT, 0, format = "f", big.mark = ",")` ha, the standard error of the pwr estimator equals `r formatC(se_tz_pwr, 0, format = "f", big.mark = ",")` ha.

As explained above, the LPM design can also be used to select a probability sample well-spread in the space spanned by one or more quantitative covariates. Matrix $\mathbf{X}$ then should contain the values of the *scaled* (standardised) covariates instead of the spatial coordinates.

#### Exercises {-}

3. Geographical spreading of the sampling units can also be achieved by random sampling from compact geographical strata (geostrata) (Section \@ref(geostrata)). Can you think of one or more advantages of LPM sampling over random sampling from geostrata?  

### Generalised random-tessellation stratified sampling {#GRTS}

Generalised random-tessellation stratified (GRTS) sampling\index{Generalised random-tessellation stratified sampling}  is designed for sampling discrete objects scattered throughout space, think for instance of the lakes in Finland, segments of hedgerows in England, etc.  Each object is represented by a point in 2D-space. It is a complicated design, and for sampling points from a continuous universe or raster cells from a finite population, I recommend more simple designs such as the local pivotal method (Subsection \@ref(LPM)), balanced sampling with geographical spreading (Section \@ref(BalancedandSpreaded)), or sampling from compact geographical strata (Section \@ref(geostrata)).  Let me try to explain the GRTS design with a simple example of a finite population of point objects in a circular study area (Figure  \@ref(fig:GRTSNumbering)). For a more detailed description of this design, refer to @Hankin2019. As a first step, a square bounding box of the study area is constructed. This bounding box is recursively partitioned into square grid cells. First, 2 $\times$ 2 grid cells are constructed. These grid cells are numbered in a predefined order. In Figure  \@ref(fig:GRTSNumbering) this numbering is from lower left, lower right, upper left to upper right. Each grid cell is then subdivided into four subcells; the subcells are numbered using the same order. This is repeated until at most one population unit occurs in each subcell. For our population only two iterations were needed, leading to 4 $\times$ 4 subcells. Note that two subcells are empty. Each address of a subcell consists of two digits: the first digit refers to the grid cell, the second digit to the subcell. 

```{r GRTSNumbering, echo = FALSE, out.width = "100%", fig.cap = "Numbering of grid cells and subcells for GRTS sampling."}
s1 <- s2 <- 1:100 - 0.5
pxl <- expand.grid(s1, s2)
names(pxl) <- c("s1", "s2")

#construct strata level 1
s1bnd <- 50
s1f <- findInterval(pxl$s1, s1bnd)
s2f <- findInterval(pxl$s2, s1bnd)
pxl$partit1 <- factor(interaction(s1f, s2f), labels = c(1, 2, 3, 4))

#construct strata level 2
s1bnd <- seq(from = 25, to = 75, by = 25)
s1f <- findInterval(pxl$s1, s1bnd)
s2f <- findInterval(pxl$s2, s1bnd)
lbl <- c("00", "01", "10", "11", "02", "03", "12", "13", "20", "21", "30", "31", "22", "23", "32", "33")
pxl$partit2 <- factor(interaction(s1f, s2f), labels = lbl)

n_h <- rep(1, times = 16)
set.seed(314)
units <- sampling::strata(pxl, stratanames = "partit2",
                size = n_h, method = "srswor")
myfinpop <- getdata(pxl, units)
#remove some units, and shift fourth point into circle
myfinpop$s1[4] <- 78
units <- sample(nrow(myfinpop), 2)
myfinpop <- myfinpop[-units, ]
N <- nrow(myfinpop)

circle <- data.frame(s1 = 50, s2 = 50)
plt1 <- ggplot(data = myfinpop) +
  geom_raster(data = pxl, mapping = aes(x = s1, y = s2), fill = "lightgrey") +
  geom_circle(data = circle, aes(x0 = s1, y0 = s2, r = 50), fill = "grey") +
  geom_point(mapping = aes(x = s1, y = s2), size = 2) +
  scale_x_continuous(name = "Easting") +
  scale_y_continuous(name = "Northing") +
  scale_fill_discrete(guide = "none") +
  coord_fixed()

dat <- expand.grid(x = seq(from = 12.5, to = 87.5, by = 25), y = seq(from = 12.5, to = 87.5, by = 25))
dat$labels <- lbl

plt2 <- ggplot(data = myfinpop) +
  geom_raster(data = pxl, mapping = aes(x = s1, y = s2, fill = as.factor(partit1)), alpha = 0.5) +
  geom_circle(data = circle, mapping = aes(x0 = s1, y0 = s2, r = 50)) +
  geom_vline(xintercept = c(25, 50, 75)) +
  geom_hline(yintercept = c(25, 50, 75)) +
  geom_point(mapping = aes(x = s1, y = s2), size = 2) +
  geom_text(data = dat, mapping = aes(x = x, y = y, label = labels)) +
  scale_fill_viridis_d(guide = "none") +
  scale_x_continuous(name = "Easting") +
  scale_y_continuous(name = "Northing") +
  coord_fixed()

grid.arrange(plt1, plt2, nrow = 1)
```

The next step is to place the 16 subcells on a line in a random order. The randomisation is done hierarchically. First, the four grid cells at the highest level are randomised. In our example, the randomised order is 1, 2, 3, 0 (Figure \@ref(fig:GRTS)). Next, within each grid cell, the order of the subcells is randomised. This is done independently for the grid cells. In our example, for grid cell 1 the randomised order of the subcells is 2, 1, 3, 0 (Figure \@ref(fig:GRTS)). Note that the empty subcells (0,0) and (3,3) are removed from the line.

```{r}
set.seed(314)
ord <- sample(4, 4)
myfinpop_rand <- NULL
for (i in ord) {
  units <- which(myfinpop$partit1 == i)
  units_rand <- sample(units, size = length(units))
  myfinpop_rand <- rbind(myfinpop_rand, myfinpop[units_rand, ])
}
```

After the subcells have been placed on a line, a one-dimensional systematic random sample is selected (Figure \@ref(fig:GRTS)), see also Subsection \@ref(Systematicpps). This can be done either with equal or unequal inclusion probabilities. With equal inclusion probabilities, the length of the intervals representing the population units is constant. With unequal inclusion probabilities, the interval lengths are proportional to a size variable. For a sample size of $n$, the total line is divided into $n$ segments of equal length. A random point is selected in the first segment, and the other points of the systematic sample are determined. Finally, the population units corresponding to the selected systematic sample are identified. With equal probabilities, the five selected units are the units in subcells 11, 23, 22, 32, and 03 (Figure \@ref(fig:GRTS)). 

```{r sys1D}
size <- rep(1, N)
n <- 5
interval <- sum(size) / n
start <- round(runif(1) * interval, 2)
mysys <- c(start, 1:(n - 1) * interval + start)
```

```{r GRTS, echo = FALSE, fig.asp = .15, out.width = "100%", fig.cap = "Systematic random sample along a line with equal inclusion probabilities."}
cellbound <- c(0, cumsum(size))
#compute width of cells
w <- diff(cellbound)
#compute x coordinates halfway the cell boundaries
x <- cellbound[1:N] + w / 2

n1 <- table(myfinpop_rand$partit1)
address_1 <- rep(ord, n1[ord])
address <- myfinpop_rand$partit2

datfr <- data.frame(x, y = rep(1, N), address_1, address, w)
ggplot(data = datfr) +
  geom_tile(mapping = aes(x = x, y = y, fill = factor(address_1), width = w), alpha = 0.5) +
  geom_text(mapping = aes(x = x, y = y, label = address)) +
  geom_vline(xintercept = cellbound, linetype = 2) +
  scale_y_continuous("", breaks = c()) +
  scale_x_continuous("", breaks = mysys) +
  scale_fill_viridis_d(guide = "none") +
  theme(panel.background = element_blank(),
    panel.border = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    plot.background = element_blank())
```
Figure \@ref(fig:GRTSpps) shows a systematic random sample along a line with unequal inclusion probabilities. The inclusion probabilities are proportional to a size variable, with values 1, 2, 3, or 4. The selected population units are the units in subcells 10, 20, 31, 01, and 02.

```{r GRTSpps, echo = FALSE, fig.asp = .15, out.width = "100%", fig.cap = "Systematic random sample along a line with inclusion probabilities proportional to size."}
size <- sample(4, size = N, replace = TRUE)
n <- 5
interval <- sum(size) / n
start <- round(runif(1) * interval, 2)
mysyspps <- c(start, 1:(n - 1) * interval + start)

cellbound <- c(0, cumsum(size))
w <- diff(cellbound)
x <- cellbound[1:N] + w / 2
datfr <- data.frame(x, y = rep(1, N), address_1, address, w)
ggplot(data = datfr) +
  geom_tile(mapping = aes(x = x, y = y, fill = factor(address_1), width = w), alpha = 0.5) +
  geom_text(mapping = aes(x = x, y = y, label = address)) +
  geom_vline(xintercept = cellbound, linetype = 2) +
  scale_y_continuous("", breaks = c()) +
  scale_x_continuous("", breaks = mysyspps) +
  scale_fill_viridis_d(guide = "none") +
  theme(panel.background = element_blank(),
    panel.border = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    plot.background = element_blank())

```

GRTS samples can be selected with function `grts` of package **spsurvey** [@spsurvey]. The next code chunk shows the selection of a GRTS sample of 40 units from Kandahar. Tibble `grdKandahar` is first converted  to an `sf` object with function `st_as_sf` of package **sf** [@sf]. The data set is using a UTM projection (zone 41N) with WGS84 datum. This projection is passed to function `st_as_df` with argument `crs` (`crs = 32641`). Argument `sframe` of function `grts` specifies the sampling frame. The sample size is passed to function `grts` with argument `n_base`. Argument `seltype` is set to `proportional` to select units with probabilities proportional to an ancillary variable which is passed to function `grts` with argument `aux_var`. 

```{r GRTSppsKandahar}
library(spsurvey)
library(sf)
sframe_sf <- st_as_sf(grdKandahar, coords = c("s1", "s2"), crs = 32641)
set.seed(314)
res <- grts(
  sframe = sframe_sf, n_base = 40, seltype = "proportional", aux_var = "agri")
myGRTSsample <- res$sites_base
```

Function `cont_analysis` computes the ratio estimator of the population mean and its standard error. The estimated mean is multiplied by the total number of population units to obtain a ratio estimate of the population total.

```{r}
myGRTSsample$siteID <- paste0("siteID", seq_len(nrow(myGRTSsample)))
res <- cont_analysis(myGRTSsample,
       vars = "poppy", siteID = "siteID", weight = "wgt", statistics = "Mean")
tz_ratio <- res$Mean$Estimate * nrow(grdKandahar)
se_tz_ratio <- res$Mean$StdError * nrow(grdKandahar)
```

The estimated total poppy area is `r formatC(tz_ratio, 0, format = "f", big.mark = ",")` ha, and the estimated standard error is `r formatC(se_tz_ratio, 0, format = "f", big.mark = ",")`  ha.

The alternative is to estimate the total poppy area by the $\pi$ estimator. Function `vsb` of package **BalancedSampling** can be used to estimate the standard error of the $\pi$ estimator.

```{r}
tz_HT <- sum(myGRTSsample$poppy / myGRTSsample$ip)
Xsample <- st_coordinates(myGRTSsample)
se_tz_HT <- sqrt(vsb(myGRTSsample$ip, myGRTSsample$poppy, Xsample))
```

The estimated total is `r formatC(tz_HT, 0, format = "f", big.mark = ",")`  ha, and the estimated standard error is `r formatC(se_tz_HT, 0, format = "f", big.mark = ",")` ha. 

## Balanced sampling with spreading {#BalancedandSpreaded}

As mentioned in the introduction to this chapter, a sample balanced on a covariate still may have a poor spreading along the axis spanned by the covariate. @Grafstrom2013 presented a method for selecting balanced samples that are also well-spread in the space spanned by the covariates, which they refer to as doubly balanced sampling\index{Doubly balanced sampling}. If we take one or more covariates as balancing variables and, besides, Easting and Northing as spreading variables, this leads to balanced samples with good *geographical* spreading. When the residuals of the regression model show spatial structure (are spatially correlated), the estimated population mean of the study variable becomes more precise thanks to the improved geographical spreading. Balanced samples with spreading can be selected with function `lcube` of package **BalancedSampling**. This is illustrated with Eastern Amazonia, using as before lnSWIR2 for balancing the sample.  

```{r lcube}
library(BalancedSampling)
N <- nrow(grdAmazonia)
n <- 100
Xbal <- cbind(rep(1, times = N), grdAmazonia$lnSWIR2)
Xspread <- cbind(grdAmazonia$x1, grdAmazonia$x2)
pi <- rep(n / N, times = N)
set.seed(314)
units <- lcube(Xbal = Xbal, Xspread = Xspread, prob = pi)
mysample <- grdAmazonia[units, ]
```

```{r DoublyBalancedSampleAmazonia, echo = FALSE, out.width = '100%', fig.cap = "Sample balanced on lnSWIR2 with geographical spreading from Eastern Amazonia."}
ggplot(data = grdAmazonia) +
  geom_raster(mapping = aes(x = x1 / 1000, y = x2 / 1000, fill = lnSWIR2)) +
  geom_point(data = mysample, mapping = aes(x = x1 / 1000, y = x2 / 1000), size = 2) +
  scale_fill_viridis_c(name = "lnSWIR2") +
  scale_x_continuous(name = "Easting (km)") +
  scale_y_continuous(name = "Northing (km)") +
  coord_fixed()
```

The selected sample is shown in Figure \@ref(fig:DoublyBalancedSampleAmazonia). Comparing this sample with the balanced sample of Figure \@ref(fig:BalancedSampleAmazonia) shows that the geographical spreading of the sample is improved, although there still are some close points. I used equal inclusion probabilities, so the $\pi$ estimate of the mean is equal to the sample mean, which is equal to `r formatC(mean(mysample$AGB), 1, format = "f")` 10^9^ kg ha^-1^.

The variance of the $\pi$ estimator of the mean can be estimated by (equation (7) of @Grafstrom2013)

\begin{equation}
\widehat{V}(\hat{\bar{z}}) = \frac{n}{n-p}\frac{p}{p+1} \sum_{k \in \mathcal{S}}(1-\pi_k) \left(\frac{e_k}{\pi_k}-\bar{e}_k \right)^2 \;,
(\#eq:VarmeanDoublyBalanced)
\end{equation}

with $p$ the number of balancing variables, $e_k$ the regression model residual of unit $k$ (Equation \@ref(eq:residualsBalanced)), and $\bar{e}_k$ the local mean of the residuals of this unit, computed by

\begin{equation}
\bar{e}_k = \frac{\sum_{j=1}^{p+1}(1-\pi_j)\frac{e_j}{\pi_j}}{\sum_{j=1}^{p+1}(1-\pi_j)}\;.
(\#eq:localmeanresdual)
\end{equation}

The variance estimator can be computed with functions `localmean_weight` and `localmean_var` of package **spsurvey** [@spsurvey].

```{r}
library(spsurvey)
pi <- rep(n / N, n)
c <- (1 - pi)
b <- estimate_b(z = mysample$AGB / pi, X = Xbal[units, ] / pi, c = c)
zpred <- Xbal %*% b
e <- mysample$AGB - zpred[units]
weights <- localmean_weight(x = mysample$x1, y = mysample$x2, prb = pi, nbh = 3)
v_mz <- localmean_var(z = e / pi, weight_1st = weights) / N^2
```

```{r, echo = FALSE, eval = FALSE}
#Alternative for variance estimation, but nbh equals 4 and cannot be changed
mysample$e <- mysample$AGB - zpred[units]
mysample$weight <- 1 / pi
mysample$siteID <- paste0("siteID", 1:nrow(mysample))
res <- cont_analysis(mysample, vars = "e", siteID = "siteID", weight = "weight", xcoord = "x1", ycoord = "x2", statistics = "Mean")
v_mz <- res$Mean$StdError
```

The estimated standard error is `r formatC(sqrt(v_mz), 1, format = "f")` 10^9^ kg ha^-1^, which is considerably smaller than the estimated standard error of the balanced sample without geographical spreading.

```{r, echo = FALSE}
rm(list = ls())
```