3_2_PredictandUpdate_Hierarchy.Rmd

---
title: "MachineLearningHigherOrders"
author: "Will MacKenzie & Kiri Daust"
date: "05/07/2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)

require(tidyverse)
require(tidymodels)
require(data.table)
require(data.tree)
require(DataExplorer)
require(C50)
require(indicspecies)
require(doParallel)
require(DBI)
source("./_functions/_TabletoTree.R")
source("./_functions/_TreetoTable.R")
cloud_dir <- "F:/OneDrive - Personal/OneDrive/BEC_Classification_Paper/"
#cloud_dir <- "F:/OneDrive - Government of BC/CCISSv12/"
```

Steps to build classification hierarchy

1. A machine learning model of high-level hierarchical BEC Forest units
a. Create a ML model of Forested SubOrders using BECv11 site series that meet minimum plot requirements. (going to Orders only causes issues where Pl is a dominant secondary species - many provincial units get placed into a Pinucon Order when applied only to the order level even though they have a subdominant tree species which places them in a different order)
b. Use only tree species for first round of model build for the forested Orders.
c. Predict membership of new BECv13 units and assign to orders then rebuild ML model.
d. Review any mis predicted site series for reassignment.
e. Compare similarity of tree composition within units (noise clustering) to find outliers that may represent new Orders (test by leaving out known units)
f. Create vegetation summaries of Orders and compare vegetation (all species now). Identify climatic indicator species groups.
g. Create climatic and site attributes summary of Orders 
h. Create maps of plot locations for each Order
i. Hierarchy tree graphic

2. Build Alliances. Focus should be on identifying species groups that reflect different site conditions.

a. Use some of the techniques of indicspecies package to create Alliances based on indicator group creation and analysis.
b. Try to build a machine learning model of Alliances
c. Vegetation and Environment summary of Alliances

3. Do pair-wise analysis of site series within Orders/Alliances using high constancy species to analyze site series and create Associations/SubAssociations.
a. Check for similarity between Associations of All orders
b. Build relationship graphic


4. Document hierarchy

5. Build model to assign new units and unassigned plots
Challenge here is to make a constant list of species between model and new units.
a. Predict membership of new site series (draft BECv13). Noise Clustering to test for novel units that do not fit any existing. Use machine learning to hierarchy
b. Predict Order membership of BECMaster plots that are unassigned to existing site series and add to an Order_unplaced site unit under each Order.

6. Build for non-forested units. Classes based on major site level differences rather than climate and using major species groups (e.g. Hydrophytic Carex spp )
May wish to assign a temporary notree pseudo species to all units that have no trees to help separate forested hierarchy from non-forested hierarchy (Hurdle model)

```{r import new data}
### Vegetation From BECMaster cleaning script Long form Plot|Species|Cover (across all layers)
### or From Vpro export
#BGCZone <- fread(paste0(cloud_dir,"All_BGCs_Info_v12_2.csv"), data.table=FALSE)
#vegDat2 <- fread("../BECMaster_Cleaning/clean_data/BECMaster_VegR_clean.csv", data.table = FALSE)

veg.dat <- readRDS("./clean_data/Analysis_BECMaster_Veg.rds")

sppmaster <- dbConnect(odbc::odbc(), .connection_string = "Driver={Microsoft Access Driver (*.mdb, *.accdb)}; DBQ=F:/OneDrive - Personal/OneDrive/BCSpeciesList/SpeciesTaxonomyMaster.accdb;")
# sppmaster <- dbConnect(odbc::odbc(), .connection_string = "Driver={Microsoft Access Driver (*.mdb, *.accdb)}; DBQ=C:/Users/whmacken/OneDrive/BCSpeciesList/SpeciesTaxonomyMaster.accdb;")
taxon.all  <- dbReadTable(sppmaster, "USysAllSpecs")
dbDisconnect(sppmaster)
taxon.lifeform <- taxon.all %>% filter(Codetype == "U" |Codetype == "X") %>% dplyr::select(Code, ScientificName, EnglishName, Lifeform) %>% distinct
veg.dat2 <- veg.dat

master_su <- dbConnect(odbc::odbc(), .connection_string = "Driver={Microsoft Access Driver (*.mdb, *.accdb)};
DBQ=D:/BC_Correlation2_Vpro_2023/CoastGuide_Hierarchy.accdb;")
su <- dbReadTable(master_su, "Coast_Forest_2024v4_SU")
hier <- dbReadTable(master_su, "CoastForest_v2024_2_Hierarchy")
dbDisconnect(master_su)

#####################importing Vpro veg data########################################
############### Uses 3 column R export FORMAT FROM Vpro with Lifeform option selected
#load("./inputs/VegDat_Raw.RData")##includes type field for lifeform
#vegDat2 <- vegData %>% filter(!is.na(Type))
###SU table
# SUTab <- fread("./clean_tabs/BECv13Analysis_Forest_17Sept2021_SU.csv")
# Vpro.hier <- fread("./clean_tabs/BECv13_ForestHierarchy_v2_26Sept2021_Hierarchy.csv")
su$SiteUnit <-  trimws(su$SiteUnit)
SS.names <- unique(su$SiteUnit)
Hier.matrix <- treeToTable(hier)
Hier.matrix <- Hier.matrix$table
# #####list of official current and future correlated units from BECdb
# current <- c('Current', 'Future')
# ecotypes <- c("Forest", "Deciduous")
# #### cleaned list of official site series
# BECdb_SS <- fread("./inputs/BECdb_SiteUnitv12.csv") %>% filter(Status %in% current) %>% filter(ArchivedinVersion == "") %>% filter(`Forest-NonForest` %in% ecotypes) %>% 
#   select(MergedBGC_SS, SS_NoSpace, SiteUnitLongName, SiteUnitScientificName) %>% rename (SiteUnit = 1, SSName = 3, SciName = 4) %>% distinct(SiteUnit, .keep_all = TRUE)
# #### compare SUTab to BECdb.  Whan all BECdb units to have data and SUTab to include no unofficial units
# SS_missing <- full_join(SUTab, BECdb_SS) %>% filter(is.na(PlotNumber))
##Import Vpro hierarchy table widen and reformat
```


##roll up into site series summary data
```{r summarize site series, echo=FALSE}
#SUTab <- fread("./inputs/BECMster_V2020_2_SU.csv")
### Summarize by SU including mean cover and constancy percent
##roll up into site series summary data
### Filter out all species but tree species
vegDat <- as.data.table(veg.dat2)

vegDat[su, SiteUnit := i.SiteUnit, on = "PlotNumber"]
vegDat <- vegDat[!is.na(SiteUnit) & SiteUnit != "",]

vegDat <- unique(vegDat[!is.na(SiteUnit) & SiteUnit != "",])
vegDat3 <- vegDat[,if(.N > 1) .SD, by = .(SiteUnit,Species)]
vegDat3[,nPlots := length(unique(PlotNumber)), by = .(SiteUnit)]
veg.sum <- vegDat3[,.(MeanCov = sum(Cover, na.rm = TRUE)/nPlots[1], Constancy = (.N/nPlots[1])*100, nPlots = nPlots[1]), by = .(SiteUnit,Species)]
fwrite(vegSum, './clean_tabs/BECv13SiteUnitSummaryVegData.csv')
```


```{r reduce summary for analysis, echo=FALSE}
##limit life forms to tree species
trees <- c(1,2)
constCut <- 33 ##remove species less than cutoff
covercut <- 1
treespp <- setDT(taxon.all)[Lifeform %in% trees, ] %>% dplyr::select(Code)
treespp <- as.vector(treespp$Code)
vegDat_test <- veg.sum[Species %in% treespp,]### include only trees for hierarchy build
vegDat_test <- vegDat_test %>% dplyr::filter(MeanCov > covercut) %>% dplyr::filter(Constancy > covercut)


tree.sum <- as.data.table(vegDat_test)#vegDat <- as_tibble(vegDat)
tree.sum$SiteUnit <- as.factor(tree.sum$SiteUnit)
# vegDat3 <- vegDat %>% group_by(SiteUnit, Species) %>% filter(nPlots > 1)

# vegDat3 <- vegDat %>% group_by(SiteUnit, Species) %>% mutate(nPlots = n_distinct(PlotNumber))
# vegDat <- as.data.table(vegDat)

#-------------problem with the mean cover calculation
# 
#vegSum <- vegDat[,.(MeanCov = sum(Cover)/nPlots[1], Constancy = (.N/nPlots[1])*100, nPlots = nPlots[1]), by = .(SiteUnit,Species)] %>% mutate(across(where(is.numeric), round, 3))

fwrite(tree.sum, './inputs/SiteUnitConiferSummary.csv')

##build grand summary table that then can be reduce to only species with high constancy in a least one unit. Probably not too useful for very large analyses

#### linkages to Site Association Analysis Units from other script will replace this.

```

```{r filter and prepare for analysis}

vegSum <- tree.sum
SS_good <- vegSum %>% filter(nPlots >3) %>% filter(Constancy >= 33)  %>% distinct()  #%>% rename(SiteUnit = SiteUnit))## Select only site series will enough plots
Hier.units <- Hier.matrix %>% dplyr::select(SiteUnit, Class, Order, Suborder) %>% distinct()
Hier.data <- left_join(Hier.units, SS_good) %>% filter(!is.na(nPlots)) %>% arrange(Species) %>% distinct()
#fwrite(Hier.data, './inputs/SiteUnitForested_w_HierarchyUnits.csv')
```

```{r filter and prepare for analysis}
#Hier.clean <- fread("./inputs/cleanedHierarchy.csv")
#vegSum <- fread('./inputs/SiteUnitForestedSummary.csv')
vegSum <- tree.sum
SS_good <- vegSum %>% filter(nPlots >=3) %>% filter(Constancy >= 33)  %>% distinct()  #%>% rename(SiteUnit = SiteUnit))## Select only site series will enough plots

class.dat <-
  Hier.data %>% dplyr::select(SiteUnit, Suborder, Species, MeanCov, Constancy) %>% 
  pivot_wider(id_cols = c(SiteUnit, Suborder),
    names_from = Species,
    values_from = c(MeanCov, Constancy)) %>% 
  mutate(Suborder = ifelse(is.na(Suborder) | Suborder == "", "unplaced", Suborder)) %>% filter(!SiteUnit == "") %>%  mutate_if(is.character, as.factor) %>% 
   replace(is.na(.),0) %>% distinct() %>% droplevels()
```


```{r fix new data to match species of model}
load("./rFmodels/BECv13_Suborders_rFmodel.Rdata")
BEC_new <- class.dat %>% filter(Suborder == "unplaced")
BEC_new2 <- BEC_new
BGC_new.var <- colnames(class.dat)
model_vars <- as.data.frame(BEC_ranger_model$fit$fit$fit$variable.importance) %>% tibble::rownames_to_column()
newcols <- model_vars[!model_vars$rowname %in% BGC_new.var,] %>% dplyr::select(1)
newcols <- newcols$rowname %>% as.character
#Run next line if species list does not match model
##BEC_new2 <- cbind(BEC_new, setNames( lapply(newcols, function(x) x=NA), newcols) )
BEC_new2 <- BEC_new2 %>%  replace(is.na(.),0)
BEC_new2$Suborder <- NA %>% as.factor
BEC_new2$SiteUnit <- BEC_new2$SiteUnit %>% as.factor %>% droplevels()
#BEC_new2 <- BEC_new2 %>% dplyr::select(-SiteUnit)
```

```{r predict units and add info to hierarchy}
SU_names <- as.data.frame(BEC_new2$SiteUnit) %>%  distinct() %>% rowid_to_column('.row') %>% dplyr::rename("SiteUnit" = 2)

BEC_recipe <-
    recipe(Suborder ~ ., data = BEC_new2) %>%
     update_role(SiteUnit, new_role = "id variable") %>% 
    prep() 
    summary(BEC_recipe)
    
BEC_fmodel <- rand_forest(mtry = 8, min_n = 4, trees = 501) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "impurity") #or "permutations

BEC_workflow <- workflow() %>%
  add_model(BEC_fmodel) %>% 
    add_recipe(BEC_recipe)

newBEC <- predict(BEC_ranger_model, BEC_new2) %>% bind_cols(BEC_new2 %>% dplyr::select(SiteUnit))# %>%  rename(Name = SiteUnit) %>% distinct()
#newBEC$Name <- newBEC$Name  %>%  as.character
#####PROBLEM HERE
SUhier <- Vpro.hier
#SUhier <- fread("./outputs/UpdatedVPROHierarchyv12_updated.csv") %>% as.data.frame() ## import original tree for updating
SUhier.suborder <- SUhier %>% filter(Level ==4)

newBEC2 <- newBEC %>% rename(Name = .pred_class)
pred.ID <- newBEC2 %>% dplyr::select(Name) %>% distinct() %>% left_join(SUhier) %>% dplyr::select(Name, ID) ## Parent Code for new predictions
newBEC3 <- left_join(newBEC2, pred.ID, by = "Name") %>% dplyr::select(-Name) %>% rename(Parentnew = ID, Name = SiteUnit)
###updates units already in hierarchy
SUhier2 <- left_join(SUhier, newBEC3) %>% mutate(Parent = ifelse(!is.na(Parentnew), Parentnew, Parent)) %>% 
            dplyr::select(-Parentnew) %>% dplyr::select(ID, Name, Parent, Level)
SUhier.names <- SUhier2$Name %>% as.character
fwrite(SUhier2, "./outputs/UpdatedVPROHierarchyv13_updated_29Sept.csv")

```

```{r}

```


```{r append units not in hierarchy}
##append units not already in hierarchy
newunits <- newBEC3[!newBEC3$Name %in% SUhier.names,] %>% rename(Parent = Parentnew)
newunits$Level <- 11
#newunits$ID <- NA %>% as.integer

### return max value of SUhier2 ID then add running ID numbers to new units from max + 1
maxID <- max(SUhier2$ID) %>% as.integer
newnum <- nrow(newunits)
newunits$ID <-  (maxID +1):(maxID  + newnum)
newunits <- newunits %>% dplyr::select(ID, Name, Parent, Level)
SUhier3 <- rbind(SUhier2, newunits)


SUhier3$Parent <- ifelse(SUhier3$Parent == 1, "", SUhier3$Parent) %>% as.integer
###write out to Vpro format to review
fwrite(SUhier3, "./outputs/UpdatedVPROHierarchyv13_updated_28Sept.csv")
```


############not used below


```{r}
SUhier.matrix <- treeToTable(testReverse2)


Hier.clean2 <- SUhier2$table


Hier.update <- left_join(SUhier, newBEC) %>% mutate(suborder = ifelse(!is.na(.pred_class), .pred_class, suborder)) %>%
        dplyr::select(-.pred_class) %>% filter(!suborder == "")
Hier.update <- as.data.table(Hier.update)
fwrite(Hier.update, "./clean_tabs/BEC13_HierarchyMatrixModelled_v2.csv")
```

```{r predict new unplaced}


newBEC <- predict(BEC_ranger_model, BEC_new) %>% bind_cols(BEC_new %>% dplyr::select(SiteUnit)) 
newBEC$.pred_class <- newBEC$.pred_class  %>%  as.character

Hier.update <- left_join(SUhier, newBEC) %>% mutate(hierunit = ifelse(!is.na(.pred_class), .pred_class, hierunit)) %>%
        dplyr::select(-.pred_class) %>% filter(!hierunit == "")
Hier.update <- as.data.table(Hier.update)
fwrite(Hier.update, "./clean_tabs/BEC13_HierarchyMatrixModelled_v1.csv")
```


```{r create updated VPro hierarchy}
##convert matrix to tree, add in new units
Hier.update <- fread("./clean_tabs/BEC13_HierarchyMatrixModelled_v1.csv")
levelNames <- c("Formation", "Class", "Order", "Suborder", "Alliance", "Suball", "Assoc", "Subass", "Facies", "Working", "SiteUnit")
testReverse <- tableToTree(hierWide = copy(SUhier),levelNames) ## convert matrix to tree
newBEC2 <- newBEC %>% rename(Name = .pred_class)
pred.ID <- newBEC2 %>% select(Name) %>% distinct() %>% left_join(testReverse) %>% select(Name, ID) ## Parent Code for new predictions
newBEC3 <- left_join(newBEC2, pred.ID, by = "Name") %>% select(-Name) %>% rename(Parentnew = ID, Name = SiteUnit)

testReverse2 <- left_join(testReverse, newBEC3) %>% mutate(Parent, ifelse(!is.na(Parentnew), Parentnew, Parent)) %>% 
            select(-Parent, -Parentnew) %>% rename (Parent = 4) %>% select(ID, Name, Parent, Level)
testReverse2$Parent <- ifelse(testReverse2$Parent == 1, "", testReverse2$Parent)
fwrite(testReverse2, "./outputs/UpdatedVPROHierarchyv13.csv")

```

```{r some hierarchy stats}
classes <- unique(Hier.data$Class)
orders <- unique(Hier.data$Order)
suborders <- unique(Hier.data$Suborder)

### Choose hierarchical level for analysis
hierunit="Suborder"

class.dat <-
  Hier.data %>% dplyr::select(SiteUnit, hierunit, Species, MeanCov) %>% 
  pivot_wider(id_cols = c(SiteUnit, hierunit),
    names_from = Species,
    values_from = MeanCov) %>% 
  mutate(hierunit = ifelse(hierunit == "", "unplaced", hierunit)) %>% filter(!SiteUnit == "") %>% 
   replace(is.na(.),0) %>% 
  mutate_if(is.character, as.factor) %>% distinct() %>% droplevels()
#DataExplorer::create_report(class.dat)
```
Data pre-processing includes the following steps:  

```{r prep data, include = TRUE, echo = FALSE}

classID <- class.dat %>% dplyr::select(SiteUnit, hierunit)
#class.dat2 <- class.dat %>% select(-SiteUnit)
BEC_good <- class.dat %>% filter(!hierunit == "unplaced") %>% arrange(SiteUnit)
BEC_new <- class.dat %>% filter(hierunit == "unplaced") %>% arrange(SiteUnit)
SU_names <- as.data.frame(BEC_good$SiteUnit) %>%  distinct() %>% rowid_to_column('.row') %>% dplyr::rename("SiteUnit" = 2)
# 1: split into training and test data (0.75/0.25) training and testing 

# BEC_split <-  initial_split(BEC_all, 
#                               strata = hierunit,
#                               prop = 0.9)
#   
# BEC_train <- training(BEC_split)
# BEC_test <- testing(BEC_split)
BEC_recipe <-
    recipe(hierunit ~ ., data = BEC_good) %>%
     update_role(SiteUnit, new_role = "id variable") %>% 
  #update_role(-SiteUnit, new_role = "predictor") %>% 
    #step_corr(all_numeric()) %>%        # remove correlated covariates
    #step_dummy(all_nominal(),-all_outcomes()) %>%    
    #step_zv(all_numeric()) %>%          # remove values with no variance
   # step_smote(Class) %>%
    prep() 
    summary(BEC_recipe)
    
BEC_fmodel <- rand_forest(mtry = 5, min_n = 2, trees = 501) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "impurity") #or "permutations

# note in RF as a tree based model it is not required to scale and normalize covariates and may have negative influence on the model performance  
```

```{r build final model, include = FALSE}

# 2: set up cross validation for parameter tuning data sets # note vc is default to 10 fold
BEC_workflow <- workflow() %>%
  add_model(BEC_fmodel) %>% 
    add_recipe(BEC_recipe)

BEC_ranger_model <- fit(BEC_workflow, BEC_good)
BEC_ranger_model$fit

save(BEC_ranger_model, file = "./rFmodels/BECv12_Orders_rFmodel.Rdata")

BEC.missed <- predict(BEC_ranger_model, BEC_good) %>% bind_cols(BEC_good %>% dplyr::select(SiteUnit, hierunit)) %>% 
  mutate(Compare = ifelse(.pred_class == hierunit, "Same", "Diff")) %>% filter(Compare == "Diff")
BEC.mis.list <- BEC.missed$SiteUnit %>% as.character

SS_misID_SU <- SUTab[SUTab$SiteUnit %in% BEC.mis.list,]
fwrite(SS_misID_SU, "./outputs/WrongOrderPredicted_SU.csv")
```


```{r merge predictions into hierarchy }


```


## Tuning run to identify appropriate hyper parameters. Only needs to be run a single time for a particular model development

```{r model tuning, include = FALSE, echo = FALSE}

# BEC_tunemodel <- rand_forest(mtry = tune(), min_n = tune(), trees = 501) %>%
#   set_mode("classification") %>%
#   set_engine("ranger", importance = "impurity") #or "permutations
# 
# # randf_spec <- rand_forest(trees = 500) %>%
# #   set_engine("randomForest") %>%
# #   set_mode("classification")
# 
# BEC_tuneworkflow <- workflow() %>%
#   add_model(BEC_tunemodel) %>%
#     add_recipe(BEC_recipe)
# 
# ranger_tune_detail <-
#   grid_regular(
#     mtry(range = c(1, 20)),
#     min_n(range = c(2, 10)),
#     levels = 5)
# 
# cv_metrics <- metric_set(accuracy, roc_auc, j_index)
# 
# v = 10
# reps = 5
# set.seed(345)
# BEC_cvfold <- vfold_cv(BEC_all, 
#                          v = v, 
#                          repeats = reps, 
#                          strata = Class)
# # re-run the tuning with the explicit parameter sets
# set.seed(4556)
# doParallel::registerDoParallel()
# 
# ranger_regular_tune <-
#   tune_grid(BEC_tuneworkflow,
#             resamples = BEC_cvfold,
#             metrics = cv_metrics,
#             grid = ranger_tune_detail)
# 
# autoplot(ranger_regular_tune) ##plots the results
# 
# bestacc <- select_best(ranger_regular_tune, metric = "accuracy")
# bestroc <- select_best(ranger_regular_tune, metric = "roc_auc")
# bestj <- select_best(ranger_regular_tune, metric = "j_index")

```
##Import hierarchy matrix from 0_0_Plot_SU_Hierarchy script with original placements plus all unassigned units added to an unplaced Formation category
```{r import  hierarchy data}
##Import wide matrix as training data
SUhier <- fread("./outputs/BECv12_Hierarchy_Matrix.csv")
#SUhier <- fread("./clean_tabs/BECv13_Hierarchy_Matrix.csv")

```
## Run cross validation model of all data using tuning from previous step. Can only be run with enough replicates
```{r define cv model with tuned parameters, echo = TRUE }
# define model with set parameters from tuning

split_data <- initial_split(model_data, prop = 0.6, strata = defense)
training_data <- training(split_data)
testing_data <- testing(split_data)


BEC_fmodel <- rand_forest(mtry = 5, min_n = 2, trees = 501) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "impurity") #or "permutations

BEC_cv_workflow <- workflow() %>% 
  add_model(BEC_fmodel) %>% 
  add_recipe(BEC_recipe)

v = 10
reps = 5
  
set.seed(345)
BEC_cvfold <- vfold_cv(BEC_good, 
                         v = v, 
                         repeats = reps, 
                         strata = Class)
cv_metrics <- metric_set(accuracy, roc_auc, j_index, sens, spec)

set.seed(130)
doParallel::registerDoParallel()
BEC_cv_fitted <- 
  BEC_cv_workflow %>% 
  fit_resamples(resamples = BEC_cvfold, metrics = cv_metrics, control = control_resamples(save_pred = TRUE))
```

## Collect accuracy metrics and predictions and improperly predicted site units
```{r collect metrics and predictions, echo = TRUE }
# collect accuracy metrics  
cv_results <- BEC_cv_fitted  %>% collect_metrics(summarize = FALSE)
cv_results_sum <- BEC_cv_fitted  %>% collect_metrics()
BEC.pred <- BEC_cv_fitted %>% collect_predictions()# collect predictions
BEC.predx <- left_join( BEC.pred, SU_names)
### accuracy by siteunit
BEC.pred_acc <- BEC.pred %>% 
  group_by(Class) %>% 
  accuracy(Class, .pred_class)
#Summarize prediction
BEC.pred2 <- BEC_cv_fitted %>% collect_predictions(summarize = TRUE)# collect predictions
BEC.pred3 <- left_join(SU_names, BEC.pred2)

## Identify misplaced site units
MisID <- BEC.pred3 %>% select(SiteUnit, Class, .pred_class) %>% mutate(compare = if_else(Class == .pred_class, "Same", "Diff")) %>% filter(compare == "Diff")# %>% rename("SiteUnit" = SiteUnit)
fwrite(MisID, "Misplaced Site Series in Classes.csv")

##ID where there are different classifications predicted by fold **** this is not aligning SU_names properly
# BEC.pred_cv <- BEC_cv_fitted %>% collect_predictions(summarize = FALSE) %>% left_join(SU_names) %>% select(SiteUnit, Class, .pred_class) %>% group_by_all() %>% summarize(numcv = n())#distinct# dplyr::rename(BEC.pred = 3) %>% 
# setDT(BEC.pred_cv, key = c("SiteUnit"))
# BEC.pred_cv[, N := .N, by = key(BEC.pred_cv)]                # count rows per group
# class.diff <- BEC.pred_cv[N > 1]

```
## review and reassign site units to correct hierarchy units
```{r reassign units}
##Update class by MisID in BEC_all (to build final model) and in Hier.clean (to update the hierachy table)
UpdateClass <- MisID %>% select(SiteUnit, .pred_class)
Hier_update <- left_join(Hier.clean, UpdateClass, by  = "SiteUnit") %>% mutate(newClass = coalesce(.pred_class, Class)) %>% select(-Class, -.pred_class) %>% rename(Class = newClass) %>% select(ID, Class, everything())
```

```{r reverse function to write hierarhcy back into Vpro format}
levelNames <- c("Region", "Class", "Order", "Suborder", "Alliance", "Suball", 
"Assoc", "Subass", "Facies", "SiteUnit")

tableToTree <- function(hierWide, levelNames){
  levelID <- data.table(Name = levelNames, Level = 1:length(levelNames))
  
  levs <- melt(hierWide, id.vars = "ID")
  levs[,ID := NULL]
  levs <- unique(levs)
  levs[levelID, Level := i.Level, on = c(variable = "Name")]
  
  hierWide[,ID := NULL]
  hierWide[,Root := "TempRoot"]
  setcolorder(hierWide,c(ncol(hierWide),1:(ncol(hierWide)-1)))
  paths <- tidyr::unite(hierWide, "pathString", na.rm = T, sep = "_")
  tr <- as.Node(paths,pathDelimiter = "_")
  tr$Set(Lab = tr$Get("name"))
  tr$Set(name = 1:tr$totalCount)
  dat <- ToDataFrameNetwork(tr,"Lab",direction = "climb")
  dat <- as.data.table(dat)
  setnames(dat, old = c("from","to","Lab"), new = c("Parent","ID","Name"))
  dat[levs, Level := i.Level, on = c(Name = "value")]
  setcolorder(dat,c("ID","Name","Parent","Level"))
  return(dat)
}


testReverse <- tableToTree(hierWide = copy(Hier.clean),levelNames)
```


```{r C5 decision tree}
#For interpretation of hierarchy
#Cross-validation.

#BEC_C5_recipe <- recipe(Class ~ .,  data = BEC_all)# %>%
    #step_center(all_predictors()) %>%
 # step_scale(all_predictors())
BEC_C5_model <- decision_tree(min_n = 2) %>% #, trees = NULL, min_n = NULL)# trees = 5, min_n = 2min_n = 10specify that the model is a random forest
  #set_args(mtry = tune()) %>% specify that the `mtry` parameter needs to be tuned
  set_engine("C5.0") %>%  #, num.threads = (cores-1), importance = "impurity, ) %>% select the engine/package that underlies the model
   set_mode(mode = "classification")

 # set the workflow
BEC_C5_workflow <- workflow() %>%
  add_recipe(BEC_recipe) %>%
  add_model(BEC_C5_model)
#
BECmodel.C5 <- fit(BEC_C5_workflow, BEC_all)
out <- butcher(BECmodel.C5, verbose = TRUE)


C5model <- C5.0_train(class.dat, Class, minCases = 2)
C5model <- C5.0(x=class.dat[-1], y = class.dat$Class)
C5model
summary(C5model)
plot(C5model)

## build model for prediction

gc()

 BECmodel.var <- pull_workflow_fit(BECmodel.tidy)$fit
BEC.pred <-   as.data.frame(BECmodel.var$predicted)
MisID <- cbind(classID,BEC.pred)%>% dplyr::rename(BEC.pred = 3) %>% mutate(compare = if_else(Class == BEC.pred, "Same", "Diff"))
fwrite(MisID, "Misplaced Site Series in Classes.csv")
###Variable importance
varimp <- as.data.frame(BECmodel.var$importance)
covcount <- nrow(varimp)


#saveRDS(BGCmodel.tidy, file = paste("./BGC_models/WNAv12_Zone_", covcount, "_Var_tidyrf3.rds"))
#parsed <- parse_model(BGCmodel.var)
#write_yaml(parsed, "my_model.yml")
saveRDS(BECmodel.tidy, file= paste("./BEC_models/Forest_Classes_rf.rds"))

```


```{r}
BEC_split <- initial_split(class.dat, strata = Class, p = .8)
BEC_train <- training(BEC_split)
BEC_test <- testing(BEC_split)


BEC_model <- rand_forest(trees = 101, mtry = tune()) %>%# min_n = 10specify that the model is a random forest
  #set_args(mtry = tune()) %>% specify that the `mtry` parameter needs to be tuned
  set_engine("randomForest", num.threads = (cores-1)) %>%  #, importance = "impurity, ) %>% select the engine/package that underlies the model
   set_mode("classification")

 # set the workflow
BEC_workflow <- workflow() %>%
  add_recipe(BEC_recipe) %>%
  add_model(BEC_model)
#
## build model for prediction
BECmodel.train <- BEC_workflow %>% fit(BEC_train)
 BECmodel.var <- pull_workflow_fit(BECmodel.train)$fit
 BECmodel.var


 # trainIndex <- createDataPartition(class.dat$Class, p = .7,
#                                   list = FALSE,
#                                   times = 1)
# 
# 
# BEC_train <- XAll[ trainIndex,]
# BEC_train$BEC <- as.factor(BEC_train$BEC)
# BEC_test  <- XAll[-trainIndex,]# %>% droplevels()

BECmodel_train <- ranger(BEC ~ ., data = BEC_train[-1],
                           num.trees = 501,  seed = 12345,
                            splitrule =  "extratrees", #""gini",
                            #always.split.variables = c("DD5","CMD.total", "PPT_JAS"), #,
                            #split.select.weights = var.weight.vec,
                    mtry = 5,
                          #max.depth = .5,
                    min.node.size = 5,
                           importance = "permutation", write.forest = TRUE, classification = TRUE)

BECmodel_train

 test.pred <- predict(BECmodel.train, new_data = BEC_test[,-c(1)])
  BEC.pred <- as.data.frame(test.pred) %>% tibble::rownames_to_column() %>% dplyr::rename("BEC.pred" = ".pred_class")

BEC.test <- BEC_test %>% select(Class) %>% cbind(BEC.pred) %>%  select( -rowname) %>% mutate_if(is.character, as.factor)
levels(BEC.test$BEC.pred) <- levels(BEC.test$BEC)
BEC_accuracy <- BEC.test %>%                   # test set predictions
  accuracy(truth = Class, BEC.pred)
 table(BEC_accuracy)
```

```{r review misassigned SS}

```

```{r predict placement of new SS and plots}


```