-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path3_2_PredictandUpdate_Hierarchy.Rmd
645 lines (506 loc) · 26.4 KB
/
3_2_PredictandUpdate_Hierarchy.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
---
title: "MachineLearningHigherOrders"
author: "Will MacKenzie & Kiri Daust"
date: "05/07/2021"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
require(tidyverse)
require(tidymodels)
require(data.table)
require(data.tree)
require(DataExplorer)
require(C50)
require(indicspecies)
require(doParallel)
require(DBI)
source("./_functions/_TabletoTree.R")
source("./_functions/_TreetoTable.R")
cloud_dir <- "F:/OneDrive - Personal/OneDrive/BEC_Classification_Paper/"
#cloud_dir <- "F:/OneDrive - Government of BC/CCISSv12/"
```
Steps to build classification hierarchy
1. A machine learning model of high-level hierarchical BEC Forest units
a. Create a ML model of Forested SubOrders using BECv11 site series that meet minimum plot requirements. (going to Orders only causes issues where Pl is a dominant secondary species - many provincial units get placed into a Pinucon Order when applied only to the order level even though they have a subdominant tree species which places them in a different order)
b. Use only tree species for first round of model build for the forested Orders.
c. Predict membership of new BECv13 units and assign to orders then rebuild ML model.
d. Review any mis predicted site series for reassignment.
e. Compare similarity of tree composition within units (noise clustering) to find outliers that may represent new Orders (test by leaving out known units)
f. Create vegetation summaries of Orders and compare vegetation (all species now). Identify climatic indicator species groups.
g. Create climatic and site attributes summary of Orders
h. Create maps of plot locations for each Order
i. Hierarchy tree graphic
2. Build Alliances. Focus should be on identifying species groups that reflect different site conditions.
a. Use some of the techniques of indicspecies package to create Alliances based on indicator group creation and analysis.
b. Try to build a machine learning model of Alliances
c. Vegetation and Environment summary of Alliances
3. Do pair-wise analysis of site series within Orders/Alliances using high constancy species to analyze site series and create Associations/SubAssociations.
a. Check for similarity between Associations of All orders
b. Build relationship graphic
4. Document hierarchy
5. Build model to assign new units and unassigned plots
Challenge here is to make a constant list of species between model and new units.
a. Predict membership of new site series (draft BECv13). Noise Clustering to test for novel units that do not fit any existing. Use machine learning to hierarchy
b. Predict Order membership of BECMaster plots that are unassigned to existing site series and add to an Order_unplaced site unit under each Order.
6. Build for non-forested units. Classes based on major site level differences rather than climate and using major species groups (e.g. Hydrophytic Carex spp )
May wish to assign a temporary notree pseudo species to all units that have no trees to help separate forested hierarchy from non-forested hierarchy (Hurdle model)
```{r import new data}
### Vegetation From BECMaster cleaning script Long form Plot|Species|Cover (across all layers)
### or From Vpro export
#BGCZone <- fread(paste0(cloud_dir,"All_BGCs_Info_v12_2.csv"), data.table=FALSE)
#vegDat2 <- fread("../BECMaster_Cleaning/clean_data/BECMaster_VegR_clean.csv", data.table = FALSE)
veg.dat <- readRDS("./clean_data/Analysis_BECMaster_Veg.rds")
sppmaster <- dbConnect(odbc::odbc(), .connection_string = "Driver={Microsoft Access Driver (*.mdb, *.accdb)}; DBQ=F:/OneDrive - Personal/OneDrive/BCSpeciesList/SpeciesTaxonomyMaster.accdb;")
# sppmaster <- dbConnect(odbc::odbc(), .connection_string = "Driver={Microsoft Access Driver (*.mdb, *.accdb)}; DBQ=C:/Users/whmacken/OneDrive/BCSpeciesList/SpeciesTaxonomyMaster.accdb;")
taxon.all <- dbReadTable(sppmaster, "USysAllSpecs")
dbDisconnect(sppmaster)
taxon.lifeform <- taxon.all %>% filter(Codetype == "U" |Codetype == "X") %>% dplyr::select(Code, ScientificName, EnglishName, Lifeform) %>% distinct
veg.dat2 <- veg.dat
master_su <- dbConnect(odbc::odbc(), .connection_string = "Driver={Microsoft Access Driver (*.mdb, *.accdb)};
DBQ=D:/BC_Correlation2_Vpro_2023/CoastGuide_Hierarchy.accdb;")
su <- dbReadTable(master_su, "Coast_Forest_2024v4_SU")
hier <- dbReadTable(master_su, "CoastForest_v2024_2_Hierarchy")
dbDisconnect(master_su)
#####################importing Vpro veg data########################################
############### Uses 3 column R export FORMAT FROM Vpro with Lifeform option selected
#load("./inputs/VegDat_Raw.RData")##includes type field for lifeform
#vegDat2 <- vegData %>% filter(!is.na(Type))
###SU table
# SUTab <- fread("./clean_tabs/BECv13Analysis_Forest_17Sept2021_SU.csv")
# Vpro.hier <- fread("./clean_tabs/BECv13_ForestHierarchy_v2_26Sept2021_Hierarchy.csv")
su$SiteUnit <- trimws(su$SiteUnit)
SS.names <- unique(su$SiteUnit)
Hier.matrix <- treeToTable(hier)
Hier.matrix <- Hier.matrix$table
# #####list of official current and future correlated units from BECdb
# current <- c('Current', 'Future')
# ecotypes <- c("Forest", "Deciduous")
# #### cleaned list of official site series
# BECdb_SS <- fread("./inputs/BECdb_SiteUnitv12.csv") %>% filter(Status %in% current) %>% filter(ArchivedinVersion == "") %>% filter(`Forest-NonForest` %in% ecotypes) %>%
# select(MergedBGC_SS, SS_NoSpace, SiteUnitLongName, SiteUnitScientificName) %>% rename (SiteUnit = 1, SSName = 3, SciName = 4) %>% distinct(SiteUnit, .keep_all = TRUE)
# #### compare SUTab to BECdb. Whan all BECdb units to have data and SUTab to include no unofficial units
# SS_missing <- full_join(SUTab, BECdb_SS) %>% filter(is.na(PlotNumber))
##Import Vpro hierarchy table widen and reformat
```
##roll up into site series summary data
```{r summarize site series, echo=FALSE}
#SUTab <- fread("./inputs/BECMster_V2020_2_SU.csv")
### Summarize by SU including mean cover and constancy percent
##roll up into site series summary data
### Filter out all species but tree species
vegDat <- as.data.table(veg.dat2)
vegDat[su, SiteUnit := i.SiteUnit, on = "PlotNumber"]
vegDat <- vegDat[!is.na(SiteUnit) & SiteUnit != "",]
vegDat <- unique(vegDat[!is.na(SiteUnit) & SiteUnit != "",])
vegDat3 <- vegDat[,if(.N > 1) .SD, by = .(SiteUnit,Species)]
vegDat3[,nPlots := length(unique(PlotNumber)), by = .(SiteUnit)]
veg.sum <- vegDat3[,.(MeanCov = sum(Cover, na.rm = TRUE)/nPlots[1], Constancy = (.N/nPlots[1])*100, nPlots = nPlots[1]), by = .(SiteUnit,Species)]
fwrite(vegSum, './clean_tabs/BECv13SiteUnitSummaryVegData.csv')
```
```{r reduce summary for analysis, echo=FALSE}
##limit life forms to tree species
trees <- c(1,2)
constCut <- 33 ##remove species less than cutoff
covercut <- 1
treespp <- setDT(taxon.all)[Lifeform %in% trees, ] %>% dplyr::select(Code)
treespp <- as.vector(treespp$Code)
vegDat_test <- veg.sum[Species %in% treespp,]### include only trees for hierarchy build
vegDat_test <- vegDat_test %>% dplyr::filter(MeanCov > covercut) %>% dplyr::filter(Constancy > covercut)
tree.sum <- as.data.table(vegDat_test)#vegDat <- as_tibble(vegDat)
tree.sum$SiteUnit <- as.factor(tree.sum$SiteUnit)
# vegDat3 <- vegDat %>% group_by(SiteUnit, Species) %>% filter(nPlots > 1)
# vegDat3 <- vegDat %>% group_by(SiteUnit, Species) %>% mutate(nPlots = n_distinct(PlotNumber))
# vegDat <- as.data.table(vegDat)
#-------------problem with the mean cover calculation
#
#vegSum <- vegDat[,.(MeanCov = sum(Cover)/nPlots[1], Constancy = (.N/nPlots[1])*100, nPlots = nPlots[1]), by = .(SiteUnit,Species)] %>% mutate(across(where(is.numeric), round, 3))
fwrite(tree.sum, './inputs/SiteUnitConiferSummary.csv')
##build grand summary table that then can be reduce to only species with high constancy in a least one unit. Probably not too useful for very large analyses
#### linkages to Site Association Analysis Units from other script will replace this.
```
```{r filter and prepare for analysis}
vegSum <- tree.sum
SS_good <- vegSum %>% filter(nPlots >3) %>% filter(Constancy >= 33) %>% distinct() #%>% rename(SiteUnit = SiteUnit))## Select only site series will enough plots
Hier.units <- Hier.matrix %>% dplyr::select(SiteUnit, Class, Order, Suborder) %>% distinct()
Hier.data <- left_join(Hier.units, SS_good) %>% filter(!is.na(nPlots)) %>% arrange(Species) %>% distinct()
#fwrite(Hier.data, './inputs/SiteUnitForested_w_HierarchyUnits.csv')
```
```{r filter and prepare for analysis}
#Hier.clean <- fread("./inputs/cleanedHierarchy.csv")
#vegSum <- fread('./inputs/SiteUnitForestedSummary.csv')
vegSum <- tree.sum
SS_good <- vegSum %>% filter(nPlots >=3) %>% filter(Constancy >= 33) %>% distinct() #%>% rename(SiteUnit = SiteUnit))## Select only site series will enough plots
class.dat <-
Hier.data %>% dplyr::select(SiteUnit, Suborder, Species, MeanCov, Constancy) %>%
pivot_wider(id_cols = c(SiteUnit, Suborder),
names_from = Species,
values_from = c(MeanCov, Constancy)) %>%
mutate(Suborder = ifelse(is.na(Suborder) | Suborder == "", "unplaced", Suborder)) %>% filter(!SiteUnit == "") %>% mutate_if(is.character, as.factor) %>%
replace(is.na(.),0) %>% distinct() %>% droplevels()
```
```{r fix new data to match species of model}
load("./rFmodels/BECv13_Suborders_rFmodel.Rdata")
BEC_new <- class.dat %>% filter(Suborder == "unplaced")
BEC_new2 <- BEC_new
BGC_new.var <- colnames(class.dat)
model_vars <- as.data.frame(BEC_ranger_model$fit$fit$fit$variable.importance) %>% tibble::rownames_to_column()
newcols <- model_vars[!model_vars$rowname %in% BGC_new.var,] %>% dplyr::select(1)
newcols <- newcols$rowname %>% as.character
#Run next line if species list does not match model
##BEC_new2 <- cbind(BEC_new, setNames( lapply(newcols, function(x) x=NA), newcols) )
BEC_new2 <- BEC_new2 %>% replace(is.na(.),0)
BEC_new2$Suborder <- NA %>% as.factor
BEC_new2$SiteUnit <- BEC_new2$SiteUnit %>% as.factor %>% droplevels()
#BEC_new2 <- BEC_new2 %>% dplyr::select(-SiteUnit)
```
```{r predict units and add info to hierarchy}
SU_names <- as.data.frame(BEC_new2$SiteUnit) %>% distinct() %>% rowid_to_column('.row') %>% dplyr::rename("SiteUnit" = 2)
BEC_recipe <-
recipe(Suborder ~ ., data = BEC_new2) %>%
update_role(SiteUnit, new_role = "id variable") %>%
prep()
summary(BEC_recipe)
BEC_fmodel <- rand_forest(mtry = 8, min_n = 4, trees = 501) %>%
set_mode("classification") %>%
set_engine("ranger", importance = "impurity") #or "permutations
BEC_workflow <- workflow() %>%
add_model(BEC_fmodel) %>%
add_recipe(BEC_recipe)
newBEC <- predict(BEC_ranger_model, BEC_new2) %>% bind_cols(BEC_new2 %>% dplyr::select(SiteUnit))# %>% rename(Name = SiteUnit) %>% distinct()
#newBEC$Name <- newBEC$Name %>% as.character
#####PROBLEM HERE
SUhier <- Vpro.hier
#SUhier <- fread("./outputs/UpdatedVPROHierarchyv12_updated.csv") %>% as.data.frame() ## import original tree for updating
SUhier.suborder <- SUhier %>% filter(Level ==4)
newBEC2 <- newBEC %>% rename(Name = .pred_class)
pred.ID <- newBEC2 %>% dplyr::select(Name) %>% distinct() %>% left_join(SUhier) %>% dplyr::select(Name, ID) ## Parent Code for new predictions
newBEC3 <- left_join(newBEC2, pred.ID, by = "Name") %>% dplyr::select(-Name) %>% rename(Parentnew = ID, Name = SiteUnit)
###updates units already in hierarchy
SUhier2 <- left_join(SUhier, newBEC3) %>% mutate(Parent = ifelse(!is.na(Parentnew), Parentnew, Parent)) %>%
dplyr::select(-Parentnew) %>% dplyr::select(ID, Name, Parent, Level)
SUhier.names <- SUhier2$Name %>% as.character
fwrite(SUhier2, "./outputs/UpdatedVPROHierarchyv13_updated_29Sept.csv")
```
```{r}
```
```{r append units not in hierarchy}
##append units not already in hierarchy
newunits <- newBEC3[!newBEC3$Name %in% SUhier.names,] %>% rename(Parent = Parentnew)
newunits$Level <- 11
#newunits$ID <- NA %>% as.integer
### return max value of SUhier2 ID then add running ID numbers to new units from max + 1
maxID <- max(SUhier2$ID) %>% as.integer
newnum <- nrow(newunits)
newunits$ID <- (maxID +1):(maxID + newnum)
newunits <- newunits %>% dplyr::select(ID, Name, Parent, Level)
SUhier3 <- rbind(SUhier2, newunits)
SUhier3$Parent <- ifelse(SUhier3$Parent == 1, "", SUhier3$Parent) %>% as.integer
###write out to Vpro format to review
fwrite(SUhier3, "./outputs/UpdatedVPROHierarchyv13_updated_28Sept.csv")
```
############not used below
```{r}
SUhier.matrix <- treeToTable(testReverse2)
Hier.clean2 <- SUhier2$table
Hier.update <- left_join(SUhier, newBEC) %>% mutate(suborder = ifelse(!is.na(.pred_class), .pred_class, suborder)) %>%
dplyr::select(-.pred_class) %>% filter(!suborder == "")
Hier.update <- as.data.table(Hier.update)
fwrite(Hier.update, "./clean_tabs/BEC13_HierarchyMatrixModelled_v2.csv")
```
```{r predict new unplaced}
newBEC <- predict(BEC_ranger_model, BEC_new) %>% bind_cols(BEC_new %>% dplyr::select(SiteUnit))
newBEC$.pred_class <- newBEC$.pred_class %>% as.character
Hier.update <- left_join(SUhier, newBEC) %>% mutate(hierunit = ifelse(!is.na(.pred_class), .pred_class, hierunit)) %>%
dplyr::select(-.pred_class) %>% filter(!hierunit == "")
Hier.update <- as.data.table(Hier.update)
fwrite(Hier.update, "./clean_tabs/BEC13_HierarchyMatrixModelled_v1.csv")
```
```{r create updated VPro hierarchy}
##convert matrix to tree, add in new units
Hier.update <- fread("./clean_tabs/BEC13_HierarchyMatrixModelled_v1.csv")
levelNames <- c("Formation", "Class", "Order", "Suborder", "Alliance", "Suball", "Assoc", "Subass", "Facies", "Working", "SiteUnit")
testReverse <- tableToTree(hierWide = copy(SUhier),levelNames) ## convert matrix to tree
newBEC2 <- newBEC %>% rename(Name = .pred_class)
pred.ID <- newBEC2 %>% select(Name) %>% distinct() %>% left_join(testReverse) %>% select(Name, ID) ## Parent Code for new predictions
newBEC3 <- left_join(newBEC2, pred.ID, by = "Name") %>% select(-Name) %>% rename(Parentnew = ID, Name = SiteUnit)
testReverse2 <- left_join(testReverse, newBEC3) %>% mutate(Parent, ifelse(!is.na(Parentnew), Parentnew, Parent)) %>%
select(-Parent, -Parentnew) %>% rename (Parent = 4) %>% select(ID, Name, Parent, Level)
testReverse2$Parent <- ifelse(testReverse2$Parent == 1, "", testReverse2$Parent)
fwrite(testReverse2, "./outputs/UpdatedVPROHierarchyv13.csv")
```
```{r some hierarchy stats}
classes <- unique(Hier.data$Class)
orders <- unique(Hier.data$Order)
suborders <- unique(Hier.data$Suborder)
### Choose hierarchical level for analysis
hierunit="Suborder"
class.dat <-
Hier.data %>% dplyr::select(SiteUnit, hierunit, Species, MeanCov) %>%
pivot_wider(id_cols = c(SiteUnit, hierunit),
names_from = Species,
values_from = MeanCov) %>%
mutate(hierunit = ifelse(hierunit == "", "unplaced", hierunit)) %>% filter(!SiteUnit == "") %>%
replace(is.na(.),0) %>%
mutate_if(is.character, as.factor) %>% distinct() %>% droplevels()
#DataExplorer::create_report(class.dat)
```
Data pre-processing includes the following steps:
```{r prep data, include = TRUE, echo = FALSE}
classID <- class.dat %>% dplyr::select(SiteUnit, hierunit)
#class.dat2 <- class.dat %>% select(-SiteUnit)
BEC_good <- class.dat %>% filter(!hierunit == "unplaced") %>% arrange(SiteUnit)
BEC_new <- class.dat %>% filter(hierunit == "unplaced") %>% arrange(SiteUnit)
SU_names <- as.data.frame(BEC_good$SiteUnit) %>% distinct() %>% rowid_to_column('.row') %>% dplyr::rename("SiteUnit" = 2)
# 1: split into training and test data (0.75/0.25) training and testing
# BEC_split <- initial_split(BEC_all,
# strata = hierunit,
# prop = 0.9)
#
# BEC_train <- training(BEC_split)
# BEC_test <- testing(BEC_split)
BEC_recipe <-
recipe(hierunit ~ ., data = BEC_good) %>%
update_role(SiteUnit, new_role = "id variable") %>%
#update_role(-SiteUnit, new_role = "predictor") %>%
#step_corr(all_numeric()) %>% # remove correlated covariates
#step_dummy(all_nominal(),-all_outcomes()) %>%
#step_zv(all_numeric()) %>% # remove values with no variance
# step_smote(Class) %>%
prep()
summary(BEC_recipe)
BEC_fmodel <- rand_forest(mtry = 5, min_n = 2, trees = 501) %>%
set_mode("classification") %>%
set_engine("ranger", importance = "impurity") #or "permutations
# note in RF as a tree based model it is not required to scale and normalize covariates and may have negative influence on the model performance
```
```{r build final model, include = FALSE}
# 2: set up cross validation for parameter tuning data sets # note vc is default to 10 fold
BEC_workflow <- workflow() %>%
add_model(BEC_fmodel) %>%
add_recipe(BEC_recipe)
BEC_ranger_model <- fit(BEC_workflow, BEC_good)
BEC_ranger_model$fit
save(BEC_ranger_model, file = "./rFmodels/BECv12_Orders_rFmodel.Rdata")
BEC.missed <- predict(BEC_ranger_model, BEC_good) %>% bind_cols(BEC_good %>% dplyr::select(SiteUnit, hierunit)) %>%
mutate(Compare = ifelse(.pred_class == hierunit, "Same", "Diff")) %>% filter(Compare == "Diff")
BEC.mis.list <- BEC.missed$SiteUnit %>% as.character
SS_misID_SU <- SUTab[SUTab$SiteUnit %in% BEC.mis.list,]
fwrite(SS_misID_SU, "./outputs/WrongOrderPredicted_SU.csv")
```
```{r merge predictions into hierarchy }
```
## Tuning run to identify appropriate hyper parameters. Only needs to be run a single time for a particular model development
```{r model tuning, include = FALSE, echo = FALSE}
# BEC_tunemodel <- rand_forest(mtry = tune(), min_n = tune(), trees = 501) %>%
# set_mode("classification") %>%
# set_engine("ranger", importance = "impurity") #or "permutations
#
# # randf_spec <- rand_forest(trees = 500) %>%
# # set_engine("randomForest") %>%
# # set_mode("classification")
#
# BEC_tuneworkflow <- workflow() %>%
# add_model(BEC_tunemodel) %>%
# add_recipe(BEC_recipe)
#
# ranger_tune_detail <-
# grid_regular(
# mtry(range = c(1, 20)),
# min_n(range = c(2, 10)),
# levels = 5)
#
# cv_metrics <- metric_set(accuracy, roc_auc, j_index)
#
# v = 10
# reps = 5
# set.seed(345)
# BEC_cvfold <- vfold_cv(BEC_all,
# v = v,
# repeats = reps,
# strata = Class)
# # re-run the tuning with the explicit parameter sets
# set.seed(4556)
# doParallel::registerDoParallel()
#
# ranger_regular_tune <-
# tune_grid(BEC_tuneworkflow,
# resamples = BEC_cvfold,
# metrics = cv_metrics,
# grid = ranger_tune_detail)
#
# autoplot(ranger_regular_tune) ##plots the results
#
# bestacc <- select_best(ranger_regular_tune, metric = "accuracy")
# bestroc <- select_best(ranger_regular_tune, metric = "roc_auc")
# bestj <- select_best(ranger_regular_tune, metric = "j_index")
```
##Import hierarchy matrix from 0_0_Plot_SU_Hierarchy script with original placements plus all unassigned units added to an unplaced Formation category
```{r import hierarchy data}
##Import wide matrix as training data
SUhier <- fread("./outputs/BECv12_Hierarchy_Matrix.csv")
#SUhier <- fread("./clean_tabs/BECv13_Hierarchy_Matrix.csv")
```
## Run cross validation model of all data using tuning from previous step. Can only be run with enough replicates
```{r define cv model with tuned parameters, echo = TRUE }
# define model with set parameters from tuning
split_data <- initial_split(model_data, prop = 0.6, strata = defense)
training_data <- training(split_data)
testing_data <- testing(split_data)
BEC_fmodel <- rand_forest(mtry = 5, min_n = 2, trees = 501) %>%
set_mode("classification") %>%
set_engine("ranger", importance = "impurity") #or "permutations
BEC_cv_workflow <- workflow() %>%
add_model(BEC_fmodel) %>%
add_recipe(BEC_recipe)
v = 10
reps = 5
set.seed(345)
BEC_cvfold <- vfold_cv(BEC_good,
v = v,
repeats = reps,
strata = Class)
cv_metrics <- metric_set(accuracy, roc_auc, j_index, sens, spec)
set.seed(130)
doParallel::registerDoParallel()
BEC_cv_fitted <-
BEC_cv_workflow %>%
fit_resamples(resamples = BEC_cvfold, metrics = cv_metrics, control = control_resamples(save_pred = TRUE))
```
## Collect accuracy metrics and predictions and improperly predicted site units
```{r collect metrics and predictions, echo = TRUE }
# collect accuracy metrics
cv_results <- BEC_cv_fitted %>% collect_metrics(summarize = FALSE)
cv_results_sum <- BEC_cv_fitted %>% collect_metrics()
BEC.pred <- BEC_cv_fitted %>% collect_predictions()# collect predictions
BEC.predx <- left_join( BEC.pred, SU_names)
### accuracy by siteunit
BEC.pred_acc <- BEC.pred %>%
group_by(Class) %>%
accuracy(Class, .pred_class)
#Summarize prediction
BEC.pred2 <- BEC_cv_fitted %>% collect_predictions(summarize = TRUE)# collect predictions
BEC.pred3 <- left_join(SU_names, BEC.pred2)
## Identify misplaced site units
MisID <- BEC.pred3 %>% select(SiteUnit, Class, .pred_class) %>% mutate(compare = if_else(Class == .pred_class, "Same", "Diff")) %>% filter(compare == "Diff")# %>% rename("SiteUnit" = SiteUnit)
fwrite(MisID, "Misplaced Site Series in Classes.csv")
##ID where there are different classifications predicted by fold **** this is not aligning SU_names properly
# BEC.pred_cv <- BEC_cv_fitted %>% collect_predictions(summarize = FALSE) %>% left_join(SU_names) %>% select(SiteUnit, Class, .pred_class) %>% group_by_all() %>% summarize(numcv = n())#distinct# dplyr::rename(BEC.pred = 3) %>%
# setDT(BEC.pred_cv, key = c("SiteUnit"))
# BEC.pred_cv[, N := .N, by = key(BEC.pred_cv)] # count rows per group
# class.diff <- BEC.pred_cv[N > 1]
```
## review and reassign site units to correct hierarchy units
```{r reassign units}
##Update class by MisID in BEC_all (to build final model) and in Hier.clean (to update the hierachy table)
UpdateClass <- MisID %>% select(SiteUnit, .pred_class)
Hier_update <- left_join(Hier.clean, UpdateClass, by = "SiteUnit") %>% mutate(newClass = coalesce(.pred_class, Class)) %>% select(-Class, -.pred_class) %>% rename(Class = newClass) %>% select(ID, Class, everything())
```
```{r reverse function to write hierarhcy back into Vpro format}
levelNames <- c("Region", "Class", "Order", "Suborder", "Alliance", "Suball",
"Assoc", "Subass", "Facies", "SiteUnit")
tableToTree <- function(hierWide, levelNames){
levelID <- data.table(Name = levelNames, Level = 1:length(levelNames))
levs <- melt(hierWide, id.vars = "ID")
levs[,ID := NULL]
levs <- unique(levs)
levs[levelID, Level := i.Level, on = c(variable = "Name")]
hierWide[,ID := NULL]
hierWide[,Root := "TempRoot"]
setcolorder(hierWide,c(ncol(hierWide),1:(ncol(hierWide)-1)))
paths <- tidyr::unite(hierWide, "pathString", na.rm = T, sep = "_")
tr <- as.Node(paths,pathDelimiter = "_")
tr$Set(Lab = tr$Get("name"))
tr$Set(name = 1:tr$totalCount)
dat <- ToDataFrameNetwork(tr,"Lab",direction = "climb")
dat <- as.data.table(dat)
setnames(dat, old = c("from","to","Lab"), new = c("Parent","ID","Name"))
dat[levs, Level := i.Level, on = c(Name = "value")]
setcolorder(dat,c("ID","Name","Parent","Level"))
return(dat)
}
testReverse <- tableToTree(hierWide = copy(Hier.clean),levelNames)
```
```{r C5 decision tree}
#For interpretation of hierarchy
#Cross-validation.
#BEC_C5_recipe <- recipe(Class ~ ., data = BEC_all)# %>%
#step_center(all_predictors()) %>%
# step_scale(all_predictors())
BEC_C5_model <- decision_tree(min_n = 2) %>% #, trees = NULL, min_n = NULL)# trees = 5, min_n = 2min_n = 10specify that the model is a random forest
#set_args(mtry = tune()) %>% specify that the `mtry` parameter needs to be tuned
set_engine("C5.0") %>% #, num.threads = (cores-1), importance = "impurity, ) %>% select the engine/package that underlies the model
set_mode(mode = "classification")
# set the workflow
BEC_C5_workflow <- workflow() %>%
add_recipe(BEC_recipe) %>%
add_model(BEC_C5_model)
#
BECmodel.C5 <- fit(BEC_C5_workflow, BEC_all)
out <- butcher(BECmodel.C5, verbose = TRUE)
C5model <- C5.0_train(class.dat, Class, minCases = 2)
C5model <- C5.0(x=class.dat[-1], y = class.dat$Class)
C5model
summary(C5model)
plot(C5model)
## build model for prediction
gc()
BECmodel.var <- pull_workflow_fit(BECmodel.tidy)$fit
BEC.pred <- as.data.frame(BECmodel.var$predicted)
MisID <- cbind(classID,BEC.pred)%>% dplyr::rename(BEC.pred = 3) %>% mutate(compare = if_else(Class == BEC.pred, "Same", "Diff"))
fwrite(MisID, "Misplaced Site Series in Classes.csv")
###Variable importance
varimp <- as.data.frame(BECmodel.var$importance)
covcount <- nrow(varimp)
#saveRDS(BGCmodel.tidy, file = paste("./BGC_models/WNAv12_Zone_", covcount, "_Var_tidyrf3.rds"))
#parsed <- parse_model(BGCmodel.var)
#write_yaml(parsed, "my_model.yml")
saveRDS(BECmodel.tidy, file= paste("./BEC_models/Forest_Classes_rf.rds"))
```
```{r}
BEC_split <- initial_split(class.dat, strata = Class, p = .8)
BEC_train <- training(BEC_split)
BEC_test <- testing(BEC_split)
BEC_model <- rand_forest(trees = 101, mtry = tune()) %>%# min_n = 10specify that the model is a random forest
#set_args(mtry = tune()) %>% specify that the `mtry` parameter needs to be tuned
set_engine("randomForest", num.threads = (cores-1)) %>% #, importance = "impurity, ) %>% select the engine/package that underlies the model
set_mode("classification")
# set the workflow
BEC_workflow <- workflow() %>%
add_recipe(BEC_recipe) %>%
add_model(BEC_model)
#
## build model for prediction
BECmodel.train <- BEC_workflow %>% fit(BEC_train)
BECmodel.var <- pull_workflow_fit(BECmodel.train)$fit
BECmodel.var
# trainIndex <- createDataPartition(class.dat$Class, p = .7,
# list = FALSE,
# times = 1)
#
#
# BEC_train <- XAll[ trainIndex,]
# BEC_train$BEC <- as.factor(BEC_train$BEC)
# BEC_test <- XAll[-trainIndex,]# %>% droplevels()
BECmodel_train <- ranger(BEC ~ ., data = BEC_train[-1],
num.trees = 501, seed = 12345,
splitrule = "extratrees", #""gini",
#always.split.variables = c("DD5","CMD.total", "PPT_JAS"), #,
#split.select.weights = var.weight.vec,
mtry = 5,
#max.depth = .5,
min.node.size = 5,
importance = "permutation", write.forest = TRUE, classification = TRUE)
BECmodel_train
test.pred <- predict(BECmodel.train, new_data = BEC_test[,-c(1)])
BEC.pred <- as.data.frame(test.pred) %>% tibble::rownames_to_column() %>% dplyr::rename("BEC.pred" = ".pred_class")
BEC.test <- BEC_test %>% select(Class) %>% cbind(BEC.pred) %>% select( -rowname) %>% mutate_if(is.character, as.factor)
levels(BEC.test$BEC.pred) <- levels(BEC.test$BEC)
BEC_accuracy <- BEC.test %>% # test set predictions
accuracy(truth = Class, BEC.pred)
table(BEC_accuracy)
```
```{r review misassigned SS}
```
```{r predict placement of new SS and plots}
```