-
Notifications
You must be signed in to change notification settings - Fork 10
/
Copy pathspatiotemporal-interpolation.Rmd
639 lines (540 loc) · 30 KB
/
spatiotemporal-interpolation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
# Spatiotemporal interpolation using Ensemble ML {#spatiotemporal-ml}
```{r, results = "asis", echo = FALSE}
status("drafting")
```
```{r, include=FALSE, message=FALSE, results='hide'}
ls <- c("rgdal", "raster", "plotKML", "ranger", "mlr", "forestError",
"xgboost", "glmnet", "matrixStats", "landmap", "yardstick", "Cubist",
"hexbin", "parallelMap", "Metrics", "fastSave", "devtools")
new.packages <- ls[!(ls %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos = "https://cloud.r-project.org")
lapply(ls, require, character.only = TRUE)
#load.pigz("eml_data.RData")
source("PSM_functions.R")
```
## Spatiotemporal interpolation of daily temperatures
In previous examples we have demonstrated effects of over-fitting and
how Ensemble ML helps decrease overfitting and extrapolation problems
using synthetic data. We can now look at some real-life cases for example
the daily temperatures measured for several years for Croatia described in @hengl2012spatio.
This data sets consist of two parts: (1) measurements of daily temperatures at
meteo stations, (2) list of gridded covariates (Fig. \@ref(fig:croatia-meteo)).
```{r croatia-meteo, echo = FALSE, out.width = "100%", fig.cap = "Temporal dynamics of mean-daily temperatures at sample meteorological stations. This shows seasonality effects (smoothed line) and daily oscillations."}
knitr::include_graphics("./img/Fig_stations_meteo_croatia.png")
```
We can load the point data by using:
```{r}
library(rgdal)
hrmeteo = readRDS("input/hrtemp2006_meteo.rds")
str(hrmeteo)
idsta.pnts = hrmeteo$stations
coordinates(idsta.pnts) = ~ X + Y
```
This is a typical format for spatiotemporal meteorological data with locations
of stations in one table, and measurements of daily temperatures (`MDTEMP`) in
other. The column `cday` here is the cumulative day since 1970, which allows us
to present time on a linear scale i.e. by using a numeric value instead of dates.
The gridded data includes: (a) static covariates (relief data), and
(b) dynamic time-series data (MODIS LST). To load the static covariates we use:
```{r}
hrgrid1km = readRDS("input/hrgrid1km.rds")
#plot(hrgrid1km[1])
proj4string(idsta.pnts) = proj4string(hrgrid1km)
str(hrgrid1km@data)
```
The dynamic time-series data is stored in a local folder (`input/LST2006HR`) as
individual files that we can list by using:
```{r}
LST.listday <- dir("input/LST2006HR", pattern=glob2rx("LST2006_**_**.LST_Day_1km.tif"), full.names = TRUE)
LST.listnight <- dir("input/LST2006HR", pattern=glob2rx("LST2006_**_**.LST_Night_1km.tif"), full.names = TRUE)
str(LST.listday)
```
Here we see there are 46 images for year 2006 with daytime and 46 images for
night time estimates of LST. We do not want to load all these rasters to R
because we might experience RAM problems. We can first overlay points and see
which variables can help with mapping daily temperatures.
For the static covariates we only have to run the overlay once:
```{r}
idsta.ov <- sp::over(idsta.pnts, hrgrid1km)
idsta.ov$IDSTA = idsta.pnts$IDSTA
str(idsta.ov)
```
For the spatiotemporal data (MODIS LST time-series) we need to run overlay as in
a spacetime cube. This means that we need to match points using `x,y,t`
coordinates with grids covering the same `x,y,t` fields. To speed up spacetime overlay
we use our custom function `extract_st`, which basically builds on top of the
`terra` package. First, we need to define begin, end times for each GeoTIFF:
```{r, cache=TRUE}
library(terra)
source("mlst_functions.R")
hrmeteo$meteo$x = plyr::join(hrmeteo$meteo, hrmeteo$stations, by="IDSTA")$X
hrmeteo$meteo$y = plyr::join(hrmeteo$meteo, hrmeteo$stations, by="IDSTA")$Y
## generate row ID:
hrmeteo$meteo$row.id = 1:nrow(hrmeteo$meteo)
hrmeteo$meteo$Date = as.Date(hrmeteo$meteo$DATE, format = "%Y-%m-%d")
## strip dates from filename:
begin.tif1.lst = as.Date(paste0("2006-", substr(basename(LST.listday), 9, 10),
"-", substr(basename(LST.listday), 12, 13)))-4
end.tif1.lst = as.Date(paste0("2006-", substr(basename(LST.listday), 9, 10),
"-", substr(basename(LST.listday), 12, 13)))+4
```
now that we know spacetime coordinates for both points and grids, we can run
overlay in parallel to speed up computing:
```{r, cache=TRUE}
mc.cores = parallel::detectCores()
ov.pnts <- parallel::mclapply(1:length(LST.listday), function(i){
extract_st(tif=LST.listday[i], hrmeteo$meteo, date="Date",
crs = proj4string(hrgrid1km),
date.tif.begin=begin.tif1.lst[i],
date.tif.end=end.tif1.lst[i],
coords=c("x","y"), variable.name="LST.day") },
mc.cores=mc.cores)
ov.pnts = ov.pnts[!sapply(ov.pnts, is.null)]
ov.tifs1 = plyr::join_all(ov.pnts, by="row.id", type="full")
str(ov.tifs1)
ov.tifs1$LST.day = ifelse(ov.tifs1$LST.day == 0, NA, ov.tifs1$LST.day)
```
In this case we also exclude the values of `LST.day` are equal to 0 as these
are basically missing values in the GeoTIFFs. We repeat the same overlay operation
for the night light images:
```{r, cache=TRUE}
begin.tif2.lst = as.Date(paste0("2006-", substr(basename(LST.listnight), 9, 10),
"-", substr(basename(LST.listnight), 12, 13)))-4
end.tif2.lst = as.Date(paste0("2006-", substr(basename(LST.listnight), 9, 10),
"-", substr(basename(LST.listnight), 12, 13)))+4
ov.pnts <- parallel::mclapply(1:length(LST.listnight), function(i){
extract_st(tif=LST.listnight[i], hrmeteo$meteo, date="Date",
crs = proj4string(hrgrid1km),
date.tif.begin=begin.tif2.lst[i],
date.tif.end=end.tif2.lst[i],
coords=c("x","y"), variable.name="LST.night") },
mc.cores=mc.cores)
ov.pnts = ov.pnts[!sapply(ov.pnts, is.null)]
ov.tifs2 = plyr::join_all(ov.pnts, by="row.id", type="full")
str(ov.tifs2)
ov.tifs2$LST.night = ifelse(ov.tifs2$LST.night == 0, NA, ov.tifs2$LST.night)
```
The result of spacetime overlay is a simple long table matching exactly the meteo-data table.
We next bind results of overlay using static and dynamic covariates:
```{r, cache=TRUE}
hrmeteo.rm = plyr::join_all(list(hrmeteo$meteo, ov.tifs1, ov.tifs2))
hrmeteo.rm = plyr::join(hrmeteo.rm, idsta.ov)
```
we also add the geometric component of temperature based on the sphere formulas:
```{r, cache=TRUE}
hrmeteo.rm$temp.mean <- temp.from.geom(fi=hrmeteo.rm$Lat,
as.numeric(strftime(hrmeteo.rm$Date, format = "%j")),
a=37.03043, b=-15.43029, elev=hrmeteo.rm$HRdem, t.grad=0.6)
```
We have now produced a **spatiotemporal regression matrix** that can be used to fit
a prediction model for daily temperature. The model is of form:
```{r}
fm.tmp <- MDTEMP ~ temp.mean + LST.day + LST.night + HRdsea
```
We next fit an Ensemble ML using the same process described in the previous sections:
```{r, cache=TRUE}
library(mlr)
lrn.rf = mlr::makeLearner("regr.ranger", num.trees=150, importance="impurity",
num.threads = parallel::detectCores())
lrns.st <- list(lrn.rf, mlr::makeLearner("regr.nnet"), mlr::makeLearner("regr.gamboost"))
sel = complete.cases(hrmeteo.rm[,all.vars(fm.tmp)])
hrmeteo.rm = hrmeteo.rm[sel,]
#summary(sel)
subs <- runif(nrow(hrmeteo.rm))<.2
tsk0.st <- mlr::makeRegrTask(data = hrmeteo.rm[subs,all.vars(fm.tmp)],
target = "MDTEMP", blocking = as.factor(hrmeteo.rm$IDSTA[subs]))
tsk0.st
```
Train model using a subset of points:
```{r, cache=TRUE, warning=FALSE, message=FALSE}
init.TMP <- mlr::makeStackedLearner(lrns.st, method="stack.cv", super.learner="regr.lm",
resampling=mlr::makeResampleDesc(method="CV", blocking.cv=TRUE))
parallelMap::parallelStartSocket(parallel::detectCores())
eml.TMP = train(init.TMP, tsk0.st)
parallelMap::parallelStop()
```
This shows that daily temperatures can be predicted with relatively high R-square,
although the residual values are still significant (ranging from -1.8 to 1.8 degrees):
```{r}
summary(eml.TMP$learner.model$super.model$learner.model)
```
The variable importance analysis shows that the most important variable for predicting
daily temperatures is, in fact, the night-time MODIS LST:
```{r var-imptemp, echo=TRUE, fig.width=5, out.width="90%", fig.cap="Variable importance for modeling spacetime daily temperatures."}
library(ggplot2)
xl <- as.data.frame(mlr::getFeatureImportance(eml.TMP[["learner.model"]][["base.models"]][[1]])$res)
xl$relative_importance = 100*xl$importance/sum(xl$importance)
xl = xl[order(xl$relative_importance, decreasing = TRUE),]
xl$variable = paste0(c(1:length(xl$variable)), ". ", xl$variable)
ggplot(data = xl[1:4,], aes(x = reorder(variable, relative_importance), y = relative_importance)) +
geom_bar(fill = "steelblue",
stat = "identity") +
coord_flip() +
labs(title = "Variable importance",
x = NULL,
y = NULL) +
theme_bw() + theme(text = element_text(size=15))
```
We can use the **fitted spacetime EML model** to generate predictions e.g. for
four consecutive days in August. First, we import MODIS LST for month of interest:
```{r, cache=TRUE}
hrpred1km = hrgrid1km
sel.tifs1 = LST.listday[grep("_08_", LST.listday)]
sel.tifs2 = LST.listnight[grep("_08_", LST.listnight)]
## read to R in parallel
x1 = as.data.frame( parallel::mclapply(sel.tifs1,
function(i){x <- readGDAL(i)$band1; x <- ifelse(x<1, NA, x); return(x)},
mc.cores = mc.cores))
x2 = as.data.frame( parallel::mclapply(sel.tifs2,
function(i){x <- readGDAL(i)$band1; x <- ifelse(x<1, NA, x); return(x)},
mc.cores = mc.cores))
names(x1) <- basename(sel.tifs1); names(x2) <- basename(sel.tifs2)
```
Second, we interpolate values between 8--day periods and fill gaps in EO data
using simple linear interpolation (MODIS images are available only every 8 days):
```{r, warning=FALSE, message=FALSE}
dates.lst = as.Date("2006-08-13")+1:8
in.dates = c("2006-08-05", "2006-08-13", "2006-08-21", "2006-08-29")
in.days = as.numeric(strftime(as.Date(c(in.dates)), format = "%j"))
## interpolate values for missing dates in spacetime
library(parallel)
cl <- makeCluster(detectCores())
clusterExport(cl, c("in.days", "dates.lst"))
t1s = parallel::parApply(cl, x1, 1,
function(y) { try( approx(in.days, as.vector(y), xout=as.numeric(strftime(dates.lst, format = "%j")))$y ) })
t2s = parallel::parApply(cl, x2, 1,
function(y) { try( approx(in.days, as.vector(y), xout=as.numeric(strftime(dates.lst, format = "%j")))$y ) })
stopCluster(cl)
## remove missing pixels
x.t1s = parallel::mclapply(t1s, length, mc.cores = mc.cores)
t1s[which(!x.t1s==8)] <- list(rep(NA, 8))
t1s = do.call(rbind.data.frame, t1s)
names(t1s) = paste0("LST.day_", dates.lst)
x.t2s = parallel::mclapply(t2s, length, mc.cores = mc.cores)
t2s[which(!x.t2s==8)] <- list(rep(NA, 8))
t2s = do.call(rbind.data.frame, t2s)
names(t2s) = paste0("LST.night_", dates.lst)
```
Now we can make predictions for the target days in August 2006 by using (we run
this operation in a loop to avoid RAM overload):
```{r, warning=FALSE, message=FALSE}
for(j in paste(dates.lst)){
out.tif = paste0("output/MDTEMP_", j, ".tif")
if(!file.exists(out.tif)){
hrpred1km@data[,"LST.day"] = t1s[,paste0("LST.day_", j)]
hrpred1km@data[,"LST.night"] = t2s[,paste0("LST.night_", j)]
hrpred1km$temp.mean = temp.from.geom(fi=hrpred1km$Lat,
as.numeric(strftime(as.Date(j), format = "%j")),
a=37.03043, b=-15.43029, elev=hrpred1km$HRdem, t.grad=0.6)
sel.pix = complete.cases(hrpred1km@data[,eml.TMP$features])
out = predict(eml.TMP, newdata=hrpred1km@data[sel.pix,eml.TMP$features])
hrpred1km@data[,paste0("MDTEMP_", j)] = NA
hrpred1km@data[sel.pix, make.names(paste0("MDTEMP_", j))] = out$data$response * 10
writeGDAL(hrpred1km[make.names(paste0("MDTEMP_", j))], out.tif, mvFlag = -32768,
type = "Int16", options = c("COMPRESS=DEFLATE"))
} else {
hrpred1km@data[,make.names(paste0("MDTEMP_", j))] = readGDAL(out.tif)$band1
}
}
```
To plot these predictions we can either put predictions in the `spacetime` package
class (see [gstat tutorial](https://cran.r-project.org/web/packages/gstat/vignettes/spatio-temporal-kriging.pdf)), or simply plot them using `sp` package:
```{r, st-plottemp, fig.width=8, out.width="100%", fig.cap="Predictions spacetime daily temperature for August 2006."}
st.pts = list("sp.points", idsta.pnts, pch = "+", col="black")
spplot(hrpred1km[make.names(paste0("MDTEMP_", dates.lst[c(1,4,8)]))],
col.regions=plotKML::R_pal[["rainbow_75"]][4:20],
at = seq(143, 239, length.out=17),
sp.layout = list(st.pts),
main="Prediction daily temperature")
#dev.off()
```
In summary, this example shows how to fit spatiotemporal EML with using
also seasonality component together with the EO data. It can hence
be considered a *complete framework* for spatiotemporal interpolation as both static,
dynamic covariates and latitude / elevation are used for model training.
## Spatiotemporal distribution of Fagus sylvatica
In the next example we show how to fit a spatiotemporal model
using biological data: occurrences of [*Fagus
sylvatica*](https://www.gbif.org/species/2882316) over Europe. This is
the domain of **Species Distribution modeling**, except in this case we
model distribution of target species also in spacetime. The training (point) data has
been compiled for the purpose of the OpenDataScience.eu project, then
cleaned and overlaid vs time-series of [Landsat GLAD](https://gitlab.com/geoharmonizer_inea/eumap) images and climatic
variables [@witjes2021spatiotemporal; @Bonannella2022].
For more details about the data refer also to the [eumap repository](https://gitlab.com/geoharmonizer_inea/eumap).
We can load a snapshot of data by using:
```{r, message=FALSE}
library(data.table)
library(dplyr)
library(mlr)
library(sp)
fs.rm = readRDS('input/fagus_sylvatica_st.rds')
occ.pnts = fs.rm[,c("Atlas_class","easting","northing")]
coordinates(occ.pnts) = ~ easting + northing
proj4string(occ.pnts) = "+init=epsg:3035"
occ.pnts = spTransform(occ.pnts, CRS("+init=epsg:4326"))
```
This is a subset of a [larger
dataset](https://doi.org/10.5281/zenodo.5818021) that
has been used to produce predictions of distribution of key forest tree
species for Europe (you can browse the data via <https://maps.opendatascience.eu>).
The first columns of this dataset show:
```{r}
head(fs.rm[,1:10])
```
The header columns are:
- `id`: is the unique ID of each point;
- `year`: is the year of obsevation;
- `postprocess`: column can have value yearly or spacetime to identify
if the temporal reference of an observation comes from the original
dataset or is the result of post-processing (yearly for originals,
spacetime for post-processed points);
- `Tile_ID`: is as extracted from the 30-km tiling system;
- `easting`: is the easting coordinate of the observation point;
- `northing`: is the northing coordinate of the observation point;
- `Atlas_class`: contains name of the tree species or NULL if it's an
absence point coming from LUCAS;
- `lc1`: contains original LUCAS land cover classes or NULL if it's a
presence point;
Other columns are the EO and ecological covariates that we use for
modeling distribution of [*Fagus sylvatica*](https://www.gbif.org/species/2882316). We can plot distribution of points over EU using:
```{r map-fs, echo=TRUE, fig.width=8, out.width="100%", fig.cap="Distribution of occurrence locations for Fagus sylvatica. Each training points is also referrenced in time and hence can be used to run spacetime overlay."}
library(rnaturalearth)
library(raster)
europe <- rnaturalearth::ne_countries(scale=10, continent = 'europe')
europe <- raster::crop(europe, extent(-24.8,35.2,31,68.5))
op = par(mar=c(0,0,0,0))
plot(europe, col="lightgrey", border="darkgrey", axes=FALSE)
points(occ.pnts[occ.pnts$Atlas_class==1,], pch="+", cex=.8)
par(op)
```
As in previous examples, we first define the target model formula. We
remove from model all other columns that are not used for prediction:
```{r}
covs = grep("id|year|postprocess|Tile_ID|easting|northing|Atlas_class|lc1",
colnames(fs.rm), value = TRUE, invert = TRUE)
fm.fs = stats::as.formula(paste("Atlas_class ~ ", paste(covs, collapse="+")))
fs.rm$Atlas_class = factor(fs.rm$Atlas_class)
all.vars(fm.fs)[1:5]
```
To speed-up fitting of the models we have prepared two functions that wrap all modeling steps. First step
consists of fine-tuning the base learners:
```{r, message=FALSE, warning=FALSE}
source("mlst_functions.R")
fs.rm0 = fs.rm %>% sample_n(5000)
tnd.ml = tune_learners(data = fs.rm0, formula = fm.fs, blocking = factor(fs.rm$Tile_ID))
```
This function runs hyperparameter optimization on binary classification models (`classif.ranger`, `classif.xgboost`, `classif.glmnet`) which will be later used to create an ensemble model.
The intermediate models (fine-tuned RF and XGboost) are written to local folder `output` as RDS files.
The output of the function is a RDS file including all the fine-tuned models and the formula.
The second step is the fitting of the ensemble model based on stacked regularization:
```{r, message=FALSE, warning=FALSE}
eml.fs = train_sp_eml(data = fs.rm0, tune_result = tnd.ml, blocking = as.factor(fs.rm$Tile_ID))
```
The function fits separately each of the previously used classification models: each algorithm
will classify independently the probability of any training point being classified `0`
(not-occurring) or `1` (occurring). The probability outputs of each model are then used as training dataset
for the meta-learner. We use a GLM model with binomial link function (logistic regression):
```{r}
#eml.fs = readRDS("output/eml_model.rds")
summary(eml.fs$learner.model$super.model$learner.model)
```
The variable importance analysis (RF component) shows that the most
important covariates for mapping distribution of Fagus sylvatica are
landsat images (which is expected considering the high spatial resolution)
and the spatial distribution of other tree species (tree species commonly found
together with Fagus sylvatica in the field):
```{r varimp-st, echo=TRUE, fig.width=8, out.width="90%", fig.cap="Variable importance for spatiotemporal model used to predict distribution of Fagus sylvatica."}
library(ggplot2)
xl <- as.data.frame(mlr::getFeatureImportance(eml.fs[["learner.model"]][["base.models"]][[1]])$res)
xl$relative_importance = 100*xl$importance/sum(xl$importance)
xl = xl[order(xl$relative_importance, decreasing = T),]
xl$variable = paste0(c(1:238), ". ", xl$variable)
ggplot(data = xl[1:20,], aes(x = reorder(variable, relative_importance), y = relative_importance)) +
geom_bar(fill = "steelblue",
stat = "identity") +
coord_flip() +
labs(title = "Variable importance",
x = NULL,
y = NULL) +
theme_bw() + theme(text = element_text(size=10))
```
To produce spacetime predictions for some tiles (30-m spatial
resolution) we can run:
```{r, eval=FALSE, message=FALSE}
m1 = predict_tiles(input = "9690.2012", model = eml.fs)
m2 = predict_tiles(input = "9690.2016", model = eml.fs)
m3 = predict_tiles(input = "9690.2020", model = eml.fs)
m1$Prob.2012 = m1$Prob
m1$Prob.2016 = m2$Prob
m1$Prob.2020 = m3$Prob
```
We can compare predictions of the probability of occurrence of the
target species for two years next to each other by using:
```{r predictions-fs, eval=FALSE, echo=TRUE, fig.width=8, out.width="100%", fig.cap="Predicted probability of occurrence for Fagus sylvatica for three periods."}
pts = list("sp.points", spTransform(occ.pnts[occ.pnts$Atlas_class==1,], CRS("+init=epsg:3035")),
pch = "+", col="black")
spplot(m1[,c("Prob.2012","Prob.2016","Prob.2020")], names.attr = c("Prob.2010-2014", "Prob.2014-2018", "Prob.2018-2020"), col.regions=rev(bpy.colors()),
sp.layout = list(pts),
main="Realized distribution of Fagus sylvatica")
```
In this case there seems to be no drastic changes in the distribution of
the target species through time, which is also expected because forest species distribution
change on a scale of 50 to 100 year and not on a scale of few years.
Some changes in distribution of the species, however, can be detected nevertheless.
These can be due to abrupt events such as pest-pandemics, fires, floods or landslides
or clear cutting of forests of course.
We can also plot the images using the `plotKML` package so that we can open and
visualize predictions also in Google Earth or similar:
```{r, eval=FALSE}
library(plotKML)
for(j in c(2012,2016,2020)){
kml(m1, colour=m1@data[,paste0("Prob.", j)], file.name=paste0("prob.", j, ".kml"),
raster_name = paste0("prob.", j, ".png"),
colour_scale = SAGA_pal[["SG_COLORS_YELLOW_BLUE"]],
z.lim=c(0,100),
TimeSpan.begin = as.Date(paste0(j, "-01-01")), TimeSpan.end = as.Date(paste0(j, "-12-31")))
}
```
In this case we attach to each prediction also the `TimeSpan.begin` and `TimeSpan.end`
which means that Google Earth will recognize the temporal reference of the predictions.
Opening predictions in Google Earth allows us to do some interpretation of produced maps and also
analyze how much are the changes in vegetation cover connected with relief,
proximity to urban areas, possible fire / flood events and similar.
```{r fagus-ge, echo = FALSE, out.width = "100%", fig.cap = "Spacetime predictions of distribution of Fagus Sylvatica visualized as time-series data in Google Earth."}
knitr::include_graphics("./img/Fig_google_earth_prob_ts.png")
```
We can check how the ensemble model performs in the classification task compared to the
component models by running a resampling operation for each learner and the meta-learner.
In this example we use a 5-fold spatial cross validation repeated 5 times. We first create new learners
and assign the hyperparameters obtained from the fine-tuning operation:
```{r lrns.acc.assessment}
learners = list(makeLearner("classif.ranger", predict.type = "prob", num.trees = 85),
makeLearner("classif.xgboost", predict.type = "prob"),
makeLearner("classif.glmnet", predict.type = "prob"))
learners[[1]] = mlr::setHyperPars(learners[[1]], par.vals = mlr::getHyperPars(tnd.ml[[1]]$learner))
learners[[2]] = mlr::setHyperPars(learners[[2]], par.vals = mlr::getHyperPars(tnd.ml[[2]]$learner))
learners[[3]] = mlr::setHyperPars(learners[[3]], par.vals = list(s = min(tnd.ml[[3]]$learner.model$lambda)))
```
Then we create the classification task (previously included in the functions), select which
metric we want to evaluate our models on and resampling strategy. We are using logarithmic loss as
performance metric for this example. After that, everything is set for the resampling operation, which
we run in parallel:
```{r mlr.framework, echo=FALSE}
tsk <- mlr::makeClassifTask(data = fs.rm0[,c(7,9:length(fs.rm0))],
target = all.vars(tnd.ml[[4]][[2]]),
blocking = factor(fs.rm0$Tile_ID),
positive = "1")
mser = list(logloss)
rdesc = makeResampleDesc("RepCV", reps = 5, folds = 5, blocking.cv = T)
set.seed(123, "L'Ecuyer")
parallelMap::parallelStartSocket(parallel::detectCores())
out.rf = "output/rf_resample.rds"
out.xg = "output/xgb_resample.rds"
out.glm = "output/glm_resample.rds"
rf.rsmp = mlr::resample(learners[[1]], tsk, rdesc, mser, keep.pred = TRUE, models = F)
saveRDS(rf.rsmp, out.rf)
xg.rsmp = mlr::resample(learners[[2]], tsk, rdesc, mser, keep.pred = TRUE, models = F)
saveRDS(xg.rsmp, out.xg)
glm.rsmp = mlr::resample(learners[[3]], tsk, rdesc, mser, keep.pred = TRUE, models = F)
saveRDS(glm.rsmp, out.glm)
```
For the meta-learner, we access the predicted probabilities of the component models stored inside
the ensemble model and use them as training data for the meta-learner in the resampling task. We
need to create an additional task:
```{r meta-learner resampling, echo=FALSE}
eml.task = mlr::makeClassifTask(data = eml.fs$learner.model$super.model$learner.model$data,
target = all.vars(tnd.ml[[4]][[2]]),
blocking = factor(fs.rm0$Tile_ID),
positive = "1")
out.eml = "output/eml_resample.rds"
eml.rsmp = mlr::resample(mlr::makeLearner("classif.logreg", predict.type = "prob"),
eml.task, rdesc, mser, keep.pred = T, models = F)
saveRDS(eml.rsmp, out.eml)
parallelMap::parallelStop()
```
Logloss is a robust metric when assessing performances of classification algorithms in the
probability space. However, it is less intuitive to represent. To better compare performances across
models, we standardize the logloss values according to the following equation:
$R^2_\text{logloss} = 1 -\frac{Logloss_\text{m}}{Logloss_\text{r}}$
Where $Logloss_\text{m}$ is the logloss value of the model and $Logloss_\text{r}$ is
the logloss value of a model that performs no better than a random guess. We use the value of
$Logloss_\text{r}$ as a baseline for predictive performances. Values of $R^2_\text{logloss}$ close to `1`
indicate high predictive performances, while values close to `0` indicate poor predictive performances:
```{r logloss-ratio, echo=FALSE}
abs = summary(factor(fs.rm0$Atlas_class))[1]
pr = summary(factor(fs.rm0$Atlas_class))[2]
p = pr/(pr+abs)
log_r = ((-(1*log(p)+(1-1)*log(1-p))*pr)+(-(0*log(p)+(1-0)*log(1-p))*abs))/(pr+abs)
log_rf = 1-rf.rsmp[["aggr"]][["logloss.test.mean"]]/log_r
log_xg = 1-xg.rsmp[["aggr"]][["logloss.test.mean"]]/log_r
log_glm = 1-glm.rsmp[["aggr"]][["logloss.test.mean"]]/log_r
log_eml = 1-eml.rsmp[["aggr"]][["logloss.test.mean"]]/log_r
log_values = c(log_rf, log_xg, log_glm, log_eml)
names(log_values) = c("RF", "XGB", "GLM", "EML")
log_values
```
From this example, we can see that the ensemble model outperforms, although slightly, all
the other models. It is also interesting to notice how Random forest is the best component
model if we look at the ensemble model coefficients, but in terms of performances Xgboost
is actually the closest to the ensemble.
## Summary notes
In this tutorial we have reviewed some aspects of spatial and
spatiotemporal data and demonstrated how to use ML, specifically
Ensemble ML, to train spatiotemporal models and produce time-series of
predictions. We have also shown, using some synthetic and real-life datasets, how
incorrectly setting up training and cross-validation can lead to
over-fitting problems. This was done to help users realize that Machine Learning,
however trivial it might seem, is not a click-a-button process and solid knowledge
and understanding of advanced statistics (regression, hypothesis testing,
sampling and resampling, probability theory) is still required.
For spatiotemporal models, we recommend combining covariates that
can represent both long-term or accumulated effects of climate, together with
covariates that can represent daily to monthly oscillation of variables such as soil
moisture, temperatures and similar. During the design of the modeling system,
we highly recommend **trying to understand ecology and processes behind
variable of interest** first, then designing the modeling system to best reflect expert
knowledge. The example with RF over-fitting data in @gasch2015spatio shows how in-depth
understanding of the problem can help design modeling framework and prevent
from over-fitting problems and similar. @witjes2021spatiotemporal and @Bonannella2022 describe a more comprehensive
framework for spatiotemporal ML which can even be run on large datasets.
Where time-series EO data exists, these can be also incorporated into the
mapping algorithm as shown with three case studies. For spacetime overlays we
recommend using Cloud-Optimized GeoTIFFs and the `terra` package [@hijmans2019spatial] which helps speed-up overlays. Other options for efficient overlay are the `stars` and `gdalcubes` package [@appel2019demand].
Spatiotemporal datasets can be at the order of magnitude larger, hence
it is important, when implementing analysis of spatiotemporal data, to
consider **computing optimization**, which typically implies:
- Running operations in parallel;
- Separating fine-tuning and parameter optimization (best to run on
subset and save computing time) from predictions,
- Using tiling systems to run overlay, predictions and visualizations,
Finally, we recommend following these generic steps to fit spatiotemporal models:
0. Define target of interest and design the modeling framework by understanding
ecology and processes behind variable of interest.
1. Prepare training (points) data and a data cube with all covariates
referenced in spacetime.
2. Overlay points in spacetime, create a spatiotemporal
regression- or classification-matrix.
3. Add seasonal components, fine-tune initial model, reduce complexity
as much as possible, and produce production-ready spatiotemporal prediction model
(usually using Ensemble Machine Learning).
4. Run mapping accuracy assessment and determine prediction uncertainty
including the per pixel uncertainty.
5. Generate predictions in spacetime — create time-series of
predictions.
6. (optional) Run change-detection / trend analysis and try to detect
main drivers of positive / negative trends [@witjes2021spatiotemporal].
7. Deploy predictions as Cloud-Optimized GeoTIFF and produce the final
report with mapping accuracy, variable importance.
Ensemble ML framework we used here clearly offers many benefits, but it also comes
at a cost of at the order of magnitude higher computational load. Also
interpretation of such models can be a cumbersome as there a multiple
learners plus a meta-learner, so it often becomes difficult to
track-back individual relationship between variables. To help increase
confidence in the produced models, we recommend studying the [Interpretable Machine
Learning](https://christophm.github.io/interpretable-ml-book/) methods
[@molnar2020interpretable], running additional model diagnostics, and
intensively plotting data in space and spacetime and feature space.
Note that the `mlr` package is discontinued, so some of the examples
above might become unstable with time. You are advised instead to use
the new [mlr3 package](https://mlr3.mlr-org.com/).