---
title: "Capstone Two: 21 models tested on the WDBC data"
author: "Thomas J. Haslam"
date: "March 12, 2019"
output:
  html_document:
    pandoc_args: --number-sections
  pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, fig.fullwidth = TRUE, fig.align = "center")
```
# Overview
This project uses the well-known Breast Cancer Wisconsin (Diagnostic) Data Set ([WDBC](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))), available from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/index.php), Center for Machine Learning and Intelligent Systems, University of California, Irvine.[^1]
```{r lib_data}
# For Rmd, use install-if-missing guards rather than bare library() calls
if (!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if (!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if (!require(matrixStats)) install.packages("matrixStats", repos = "http://cran.us.r-project.org")
if (!require(readr)) install.packages("readr", repos = "http://cran.us.r-project.org")
if (!require(cluster)) install.packages("cluster", repos = "http://cran.us.r-project.org")
if (!require(fpc)) install.packages("fpc", repos = "http://cran.us.r-project.org")
if (!require(utils)) install.packages("utils", repos = "http://cran.us.r-project.org")
options(scipen = 999) # avoid scientific notation in printed output
```
## Data Set Characteristics
The data set is multivariate, consisting of 569 observations with 32 attributes (variables), with no missing values. It generally conforms to the [tidy format](https://en.wikipedia.org/wiki/Tidy_data):
each variable, a column; each observation, a row; each type of observational unit, a table.[^2]
```{r Data_Wrangle}
# Data Import & Wrangle(1)
# First problem: set the variable IDs (column names), derived from wdbc.names.txt
name_cols <- c("id","diagnosis","radius_mean","texture_mean",
"perimeter_mean","area_mean","smoothness_mean",
"compactness_mean","concavity_mean","concave_points_mean",
"symmetry_mean","fractal_dimension_mean","radius_se","texture_se",
"perimeter_se","area_se","smoothness_se","compactness_se",
"concavity_se","concave_points_se","symmetry_se","fractal_dimension_se",
"radius_worst","texture_worst","perimeter_worst",
"area_worst","smoothness_worst","compactness_worst",
"concavity_worst","concave_points_worst","symmetry_worst",
"fractal_dimension_worst")
# Read in the UCI data with column names
csv_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
wdbc_data <- read_csv(csv_url, col_names = name_cols)
# as.factor and set levels
wdbc_data <- mutate_if(wdbc_data, is.character, as.factor) %>%
mutate_at("diagnosis", factor, levels = c("M", "B")) # Set malignant as POSITIVE
## EDA Jazz
wdbc_mx <- as.matrix(wdbc_data[, 3:32]) # remove id & diagnosis
# Set the row names
row.names(wdbc_mx) <- wdbc_data$id
# Recapture as df using Tidyverse
tidy_2 <- bind_cols(enframe(names(wdbc_data[, 3:32]),
name = NULL, value = "Variable"),
enframe(colMeans2(wdbc_mx),
name = NULL, value = "Avg") ,
enframe(colSds(wdbc_mx),
name = NULL, value = "SD"),
enframe(colMins(wdbc_mx),
name = NULL, value = "Min"),
enframe(colMaxs(wdbc_mx),
name = NULL, value = "Max"),
enframe(colMedians(wdbc_mx),
name = NULL, value = "Median"))
true_values <- as.factor(wdbc_data$diagnosis) %>%
relevel("M") %>% set_names(wdbc_data$id) # Set malignant as POSITIVE
```
The first two variables are the ID number and Diagnosis ("M" = malignant, "B" = benign). The next 30 variables describe ten features of each cell nucleus examined: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. Since the overall concern is predicting whether a cell is malignant or benign, the primary ML task is classification. Of the 569 observations, the true class distribution is 357 benign ("B") and 212 malignant ("M").[^3]
```{r EDA_start}
set.seed(2019)
unscaled_K <- kmeans(wdbc_mx, centers = 2, nstart = 20)
wdbc_mx_sc <- scale(sweep(wdbc_mx, 2, colMeans(wdbc_mx))) # center & scale
# sweep allows for centering (and scaling) with arbitrary statistics.
# but `scale(wdbc_mx, center = TRUE)` would work just as well in this case
set.seed(2019)
scaled_K <- kmeans(wdbc_mx_sc, centers = 2, nstart = 20)
unscaled_k_pred <- if_else(unscaled_K$cluster == 1, "M", "B") %>%
as.factor() %>% relevel("M") %>% set_names(wdbc_data$id)
scaled_k_pred <- if_else(scaled_K$cluster == 1, "B", "M") %>%
as.factor() %>% relevel("M") %>% set_names(wdbc_data$id)
ID_check <- rbind(
rbind("True_Values" = true_values[90:97]),
rbind("Unscaled_K" = unscaled_k_pred[90:97]),
rbind("Scaled_K" = scaled_k_pred[90:97])
)
cfm_unscaled_k <- confusionMatrix(unscaled_k_pred, true_values)
cfm_scaled_k <- confusionMatrix(scaled_k_pred, true_values)
# key values as table output
key_values_K_cluster <- bind_cols(
enframe(cfm_unscaled_k$overall["Accuracy"], name = NULL, value = "unK_Acc" ),
enframe(cfm_unscaled_k$byClass["F1"], name = NULL, value = "unK_F1" ) ,
enframe(cfm_scaled_k$overall["Accuracy"], name = NULL, value = "scalK_Acc" ),
enframe(cfm_scaled_k$byClass["F1"], name = NULL, value = "scalK_F1" ) ) %>%
knitr::kable(caption = "Unscaled and Scaled K: Accuracy and F Measure results")
dirty_comp_table <- cbind(cbind(TrV = table(true_values)),
cfm_unscaled_k$table,
cfm_scaled_k$table) %>%
knitr::kable(caption = "L-R: True Values, Unscaled K, Scaled K" )
diagno <- as.numeric(wdbc_data$diagnosis == "M") # for plotting
wdbc_PCA <- prcomp(wdbc_mx, center = TRUE, scale = TRUE)
importance_df <- data.frame(Sum_Exp = summary(wdbc_PCA)$importance[3,]) %>%
rownames_to_column("PCA") # Cumulative Proportion
PCA_sum <- summary(wdbc_PCA) # PCA list: SD, Prop Var., Cum Prop.
# PCA_sum$importance[3,] == summary(wdbc_PCA)$importance[3,]
plot_PCA1_2 <- data.frame(PC1 = wdbc_PCA$x[,1], PC2 = wdbc_PCA$x[,2],
label = factor(wdbc_data$diagnosis )) %>%
ggplot(aes(PC1, PC2, fill = label)) +
geom_point(cex = 3, pch = 21) +
labs(fill = "Class", title = "True Value Groupings: PC1 / PC2",
subtitle = "63% of variance explained") +
theme(legend.position = c(0.88, 0.14))
plot_PCA4_5 <- data.frame(PC4 = wdbc_PCA$x[,4], PC5 = wdbc_PCA$x[,5],
label = factor(wdbc_data$diagnosis )) %>%
ggplot(aes(PC4, PC5, fill = label)) +
geom_point(cex = 3, pch = 21) +
labs(fill = "Class", title = "True Value Groupings: PC4 / PC5",
subtitle = "12% of variance explained") +
theme(legend.position = c(0.88, 0.14))
```
For modelling and evaluation purposes, I have set "M", the diagnosis of malignant, as the positive value, with `trainControl` for `caret` set to `summaryFunction = twoClassSummary`. As a result, the accuracy scores reflect the results for both "M" and "B" against their respective true values, not just the percentage of correct results for "M".
## Project Goals
This project uses an ensemble approach to evaluate 21 different ML algorithms (models) as follows: the models are evaluated for **Accuracy** and **F Measure** (which considers both precision and recall)[^4] on two different training and test splits, and three different preprocessing routines.
For the train/test splits, the first run uses 50/50; the second, 82/18. For the three preprocessing routines, Prep_0 (or NULL) uses no preprocessing. Prep_1 centers and scales the data. Prep_2 uses Principal Component Analysis (PCA), after first removing any near-zero-variance (NZV) predictors and centering and scaling the data. (In other words, Prep_2 uses a standard stack for the `caret` `preProcess` option: in this case, "`nzv`, `center`, `scale`, `pca`"; likewise common, "`zv`, `center`, `scale`, `pca`".)[^5]
### Research Questions
In total, each of the 21 models is run 6 times (2 splits; 3 preps each) for a total of 126 prediction/classification results. In evaluating the results, the relevant questions are as follows: Which models perform best overall? How do the different preprocessing routines affect each model? Which models deal best with limited data (the 50/50 split)? Which models learn the best (show significant improvement) when given more data (the 82/18 split)?
#### Ensuring Reproducibility
To ensure both valid comparisons between models and reproducibility of results, all models (for all runs and all preprocessing routines) share the same `caret` command for `trainControl`, as follows:
```{r example, echo = TRUE, eval = FALSE}
set.seed(2019)
seeds <- vector(mode = "list", length = 1000)
for(i in 1:1000) seeds[[i]] <- sample.int(1000, 800)
### Does NOT change: same for all models
myControl <- trainControl(
method = "cv", number = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE, # IMPORTANT!
verboseIter = FALSE,
seeds = seeds
)
```
Please note that when using `caret` to train multiple models, the `seeds` option must be set in `trainControl`[^6] to ensure reproducibility of results: it is not enough to call `set.seed()` outside of the modelling function. Otherwise, each time `caret` performs cross-validation (or bootstrapping or sampling), it will randomly select different observation rows from the training data.
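The 1000 x 800 list above deliberately over-provisions. A minimal right-sized sketch, assuming 10-fold CV and at most 100 tuning-parameter combinations per model (both sizes are assumptions, not taken from this report): `caret` expects a list of length B + 1, where B is the number of resamples; elements 1 to B supply one seed per candidate tuning combination, and the final element is a single integer used when fitting the last model.
```{r seeds_sketch, echo = TRUE, eval = FALSE}
# Sketch only: a right-sized seeds list for method = "cv", number = 10
set.seed(2019)
B <- 10                                        # number of resamples
seeds_small <- vector(mode = "list", length = B + 1)
for (i in 1:B) seeds_small[[i]] <- sample.int(10000, 100)  # 100 = assumed max tuning combos
seeds_small[[B + 1]] <- sample.int(10000, 1)   # single seed for the final model fit
```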
### Failure Tables
Finally, *Failure Tables* were generated for each set of model results. The relevant questions here are as follows: Were the failures in diagnosis random? Or were they specific to certain cell nucleus observations, as identified by the id variable? If so, were those observations simply much more challenging for the models in general, or only for certain types of models (for example, *linear* vs. *random forest* models)? Did data preprocessing, or the lack thereof, contribute to classification failure on particular observations?
The *Failure Tables* allow us to drill down in much more detail, if needed. For this project, it might seem overkill. But in the context of medical research, it can be highly valuable to identify which set of characteristics typically get misdiagnosed. The *Failure Tables* are a useful step in that direction, and in developing better models or refining existing ones.
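Conceptually, a failure table just lines up each model's predictions against the true values by `id` and keeps the disagreements. A minimal sketch, assuming the prediction data frame `pred_ft_1` and the id-bearing test set `test_set_id` built later in this report:
```{r failure_table_sketch, echo = TRUE, eval = FALSE}
# Sketch only: per model, flag the test-set ids that were misclassified.
# Assumes pred_ft_1 (one factor column of predictions per model) and
# test_set_id (test observations with the id column retained), both defined below.
failures_1 <- purrr::map_dfr(names(pred_ft_1), function(mod) {
  tibble(Model     = mod,
         id        = test_set_id$id,
         True      = test_set_id$diagnosis,
         Predicted = pred_ft_1[[mod]])
}) %>%
  filter(Predicted != True)
failures_1 %>% count(id, True, sort = TRUE)  # which observations fail, and how often
```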
# Methods / Analysis
The project starts with *EDA*, exploring the data particularly for variance. This will provide insight into what data preprocessing might offer (if anything) for modelling.
The modelling itself involves supervised learning[^7], testing the results against known true values. But the *EDA* stage makes use of unsupervised learning[^8], particularly cluster analysis, to explore how data processing might affect outcomes. (I learned this approach from Hank Roark's course "Unsupervised Learning in R"[^9] at [DataCamp](https://www.datacamp.com)).
The models are evaluated in terms of their *Accuracy* and *F Measure* scores. When only the accuracy results appear in a table, they are ranked in descending order (top to bottom) in accordance with the matching *F Measure* results.
The model selection itself was based largely on what I learned in the *Harvard edX Machine Learning course*[^10], with two algorithms eliminated during the development process: `gam` and `wsrf`. The baseline *Generalized Additive Model*[^11] `gam`, unlike `gamboost` and `gamLoess`, simply took too long to run and returned generally inferior results. The *Weighted Subspace Random Forest for Classification*[^12] `wsrf` also returned consistently inferior results, and I had other, more commonly used *random forest*[^13] models as valid alternatives.
## EDA: Base R, Tidyverse, Unsupervised ML
*Exploratory Data Analysis* (EDA), obviously, should precede data modelling. If we take the old-school approach, we might cast the dataframe `wdbc_data` into a matrix `wdbc_mx` and then check (to start) the mean and sd of each predictor variable with code such as `colMeans2(wdbc_mx)` or `apply(wdbc_mx, 2, sd)`. This returns useful but messy output, as follows:
```{r messy_output, echo = TRUE}
apply(wdbc_mx, 2, mean) # mean per variable
```
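The matching check for the per-variable standard deviations, mentioned above but not evaluated here:
```{r sd_check_sketch, echo = TRUE, eval = FALSE}
apply(wdbc_mx, 2, sd)   # sd per variable (base R)
colSds(wdbc_mx)         # the same values via matrixStats, used for the tidy summary below
```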
We see a high degree of variance, which makes a case for centering and scaling the data, as I will demonstrate shortly. But it might also be better to capture this information as `tidyverse` summary statistics. The `tidy` output follows, showing only the first 6 rows for the sake of brevity:
```{r tidy_output}
tidy_2 %>% head() %>%
knitr::kable(caption = "wdbc: Summary Stats [first 6 variables shown] ")
```
Moreover, capturing the data this way makes it easier for us to summarize the summary stats.
```{r tidy_sum_stats}
tidy_2 %>% summarize(min_mean = min(Avg),
max_mean = max(Avg),
avg_mean = mean(Avg),
sd_mean = sd(Avg),
min_sd = min(SD),
max_sd = max(SD),
avg_sd = mean(SD),
sd_sd = sd(SD)) %>%
knitr::kable(caption = "wdbc: Range of mean & sd values")
```
### High Variance: A Case for Centering and Scaling?
Examining the variance in the data set for all predictor values, we can see that the SD of the means is over 3 times the average of the means, and the range of means runs from approximately 0.004 to 880. The SD stats likewise show a considerable range.
To demonstrate how this much variance could affect ML classification, I will use `kmeans` to do unsupervised cluster analysis, first on the raw data and then on the centered and scaled data.
### Unsupervised ML Exploration
In this exercise, we are simply trying to identify distinguishable clusters in the data. We are not (yet) predicting the classification of "M" (malignant) or "B" (benign). Rather, we want to know what general groupings or patterns occur. I will cheat a little, however, and set the number of clusters to 2. So the question becomes which approach, no data preprocessing or centering and scaling the data, brings us closer to the right numbers for each group: 357 "B" (benign) and 212 "M" (malignant).
The results from `kmeans(wdbc_mx, centers = 2, nstart = 20)` have been saved as `unscaled_K`, plotted as follows:
```{r unscaled_k}
plotcluster(wdbc_mx, unscaled_K$cluster, main = "Unscaled K Results",
ylab = "", xlab = "Cluster 1: 131 assigned; Cluster 2: 438")
```
Running either `table(unscaled_K$cluster)` or `unscaled_K$size` gives us the breakdown of K1: 131; K2: 438. If we assume that the larger cluster 2 is "B" and cluster 1 is "M", an assumption we will test shortly, then `unscaled_K` has failed to identify a minimum of 81 malignant cells. The total number of failures rises further if cluster 1 also contains cells that are actually benign. So our best possible accuracy rate is 0.857645 (488/569).
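The best-case arithmetic behind that ceiling, as a quick sketch:
```{r ceiling_sketch, echo = TRUE, eval = FALSE}
unscaled_K$size    # 131 438: cluster sizes
212 - 131          # at least 81 malignant cells land in the assumed-benign cluster
(569 - 81) / 569   # best possible accuracy ceiling: 0.857645
```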
Let's test whether centering and scaling the data has a meaningful impact on the outcome. For comparison purposes, we will create a second matrix: `wdbc_mx_sc <- scale(sweep(wdbc_mx, 2, colMeans(wdbc_mx)))`.
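Incidentally, `sweep()` is used here only because it allows centering with arbitrary statistics; with the default column means and SDs, plain `scale()` produces the same matrix. A sketch of that equivalence:
```{r scale_sketch, echo = TRUE, eval = FALSE}
# Both center each column on zero and scale it to unit SD
wdbc_mx_sc  <- scale(sweep(wdbc_mx, 2, colMeans(wdbc_mx)))  # as used in this report
wdbc_mx_sc2 <- scale(wdbc_mx, center = TRUE, scale = TRUE)  # equivalent in this case
max(abs(wdbc_mx_sc - wdbc_mx_sc2))                          # effectively zero
```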
#### Unprocessed vs. Centered and Scaled
```{r comparison_un_vs_scaled, echo = FALSE}
t(wdbc_mx)[1:8, 1:5] %>%
as.data.frame() %>%
rownames_to_column("Variable") %>%
knitr::kable(caption = "Raw Data: wdbc_mx: First 8 Vars. for 5 Cases")
t(wdbc_mx_sc)[1:8, 1:5] %>%
as.data.frame() %>%
rownames_to_column("Variable") %>%
knitr::kable(caption = "C-S Prep: wdbc_mx_sc: First 8 Vars. for 5 Cases")
```
With the predictor variable values centered (on zero) and scaled (now measured in SDs from the mean), we will rerun `kmeans` now on `wdbc_mx_sc` and save the results as `scaled_K`.
```{r scaled_K}
plotcluster(wdbc_mx, scaled_K$cluster,
main = "Centered & Scaled K Results",
ylab = "", xlab = "Cluster 1: 380 assigned; Cluster 2: 189")
```
Running either `table(scaled_K$cluster)` or `scaled_K$size` gives us the breakdown of K1: 380; K2: 189. If we assume that the larger cluster 1 is "B" and cluster 2 is "M" (again, an assumption we will test shortly), then in contrast to `unscaled_K`, `scaled_K` has failed to identify a minimum of 23 malignant cells. So now our best possible accuracy rate is 0.9595782 (546/569), though the actual rate is likely lower.
Time to test our assumptions.
#### Check against true values
So we will convert our *unsupervised learning* run into a raw *supervised learning* attempt. First, I will establish the vector of true values; second, do a quick check to make sure the matrix id numbers match the true-value id numbers; and finally, run a `confusionMatrix` test for the *Accuracy* and *F Measure* scores.
```{r assumption_check, eval = FALSE, echo = TRUE}
true_values <- as.factor(wdbc_data$diagnosis) %>%
relevel("M") %>% set_names(wdbc_data$id) # Set malignant as POSITIVE
# Assign clusters to most likely true value classifications
unscaled_k_pred <- if_else(unscaled_K$cluster == 1, "M", "B") %>%
as.factor() %>% relevel("M") %>% set_names(wdbc_data$id) # match id to index#
scaled_k_pred <- if_else(scaled_K$cluster == 1, "B", "M") %>%
as.factor() %>% relevel("M") %>% set_names(wdbc_data$id) # match id to index#
```
The following table results from comparing `true_values[90:97]`, `unscaled_k_pred[90:97]`, and `scaled_k_pred[90:97]`, making sure we are indeed testing the same id numbers.
```{r matrix_check, echo = FALSE}
ID_check %>%
knitr::kable(caption = "IDs and Indexs Match: sample[90:97]" )
```
So far, so good. The ID numbers match the index order for both K runs, and `scaled_k` appears a bit more accurate than `unscaled_k`. Let's run the `confusionMatrix` for both, check the key results, and wrap up this EDA experiment.
#### confusionMatrix Results
```{r cfm_for_K_tests}
bind_cols(
enframe(cfm_unscaled_k$overall["Accuracy"], name = NULL, value = "unK_Acc" ),
enframe(cfm_unscaled_k$byClass["F1"], name = NULL, value = "unK_F1" ) ,
enframe(cfm_scaled_k$overall["Accuracy"], name = NULL, value = "scalK_Acc" ),
enframe(cfm_scaled_k$byClass["F1"], name = NULL, value = "scalK_F1" )
) %>% knitr::kable(caption = "Unscaled and Scaled K: Accuracy and F Measure results")
```
By centering and scaling the data, we improved the *Accuracy* by over 5% and the *F Measure* by over 11%. Importantly, in this context, the `scaled_K` model did much better at not mistaking malignant cells for benign: the potentially more dangerous outcome.
* `scaled_K`: 51 total failures versus 83 for `unscaled_K`.
* `scaled_K`: 45 more malignant cells correctly identified.
```{r K_comp_table}
# quick and dirty comp table
cbind(cbind(TrV = table(true_values)),
cfm_unscaled_k$table,
cfm_scaled_k$table) %>%
knitr::kable(caption = "L-R: True Values, Unscaled K, Scaled K" )
# 83 total vs 51 total
# 45 more malignant cells correctly id
```
So: proof of concept. For some ML approaches, as we learned in the *Harvard edX* ML course, centering and scaling the data yields better results.
Thus far, we have two preparations. The null model, so to speak: no preprocessing. And the first, centering and scaling. This brings us to another common data preprocessing routine for ML, *Principal Component Analysis* (PCA).[^5]
### Principal Component Analysis (PCA)
*PCA* is particularly well suited to dealing with large data sets that have many observations and likely collinearity among (at least some of) the predictor variables. *PCA* helps deal with the "curse of dimensionality"[^14] by using an orthogonal transformation to produce "a set of values of linearly uncorrelated variables", or principal components.[^15] Typically, if *PCA* is a good fit, only a small number of principal components will be needed to explain the vast majority of variance in the data set.
The `wdbc_data` does have 30 predictor variables for describing 10 features, and it likewise seems correlation must exist among such variables as `area_mean`, `radius_mean`, and `perimeter_mean`. But the data set only has 569 observations. So *PCA* might not add as much value as it would for a larger data set.
But because this is an experiment in, and a report on, model testing, we will include *PCA* as a data preprocessing routine.
#### PCA Biplot
For EDA purposes, running `wdbc_PCA <- prcomp(wdbc_mx, center = TRUE, scale = TRUE)` returns the following biplot:
```{r PCA_1, echo = TRUE}
biplot(wdbc_PCA, cex = 0.45)
```
The closer the lines are, the more correlated the variables. To no surprise, `fractal_dimension_mean` (bottom-center) and `radius_mean` (top-left) sit at nearly a 90-degree angle: no meaningful collinearity. In contrast, `radius_mean` and `area_mean` (both top-left) are so correlated that their lines and labels almost merge.
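A quick numeric check of those claims (a sketch, not evaluated here):
```{r cor_check_sketch, echo = TRUE, eval = FALSE}
# Pairwise correlations for the variables discussed above:
# radius_mean and area_mean should correlate near 1;
# their correlation with fractal_dimension_mean should be far weaker.
cor(wdbc_mx[, c("radius_mean", "area_mean", "fractal_dimension_mean")])
```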
The table below samples the first 5 principal components, showing how each has transformed the first eight predictor variables in the original data set.
```{r PCA_first_5}
wdbc_PCA$rotation[1:8, 1:5] %>%
as.data.frame() %>%
rownames_to_column("Variable") %>%
knitr::kable(caption = "PCA Matrix: First 8 Variables for First 5 PC")
```
#### Cumulative Variance Explained
If we plot the PCA summary results, we see that these same 5 principal components explain nearly 85% of the data variance, and the first 10 principal components explain over 95%.
```{r PCA_2}
graph_PCA <- importance_df[1:12, ] %>%
ggplot( aes(reorder(PCA, Sum_Exp), Sum_Exp)) +
geom_point() +
labs(title = "PCA Results: 10 components explain over 95% variance",
x = "Principal Component Analysis: 1-12 of 30" ,
y = "Variance Explained", subtitle = "WDBC (Wisconsin Diagnostic Breast Cancer) data set") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
geom_hline(aes(yintercept = 0.95), color = "orange", linetype = 2)
graph_PCA
```
The table below provides the detailed summary results for the same first 5 principal components.
```{r PCA_chart}
PCA_sum[["importance"]][, 1:5] %>%
as.data.frame() %>%
rownames_to_column("Results") %>%
knitr::kable(caption = "PCA Summary: First 5 Components")
```
##### PC1 & PC2 vs. PC4 & PC5
Between them, PC1 and PC2 explain 63% of the data variance; PC4 and PC5, 12%. We can plot the data directly against both pairs, as follows:
```{r PCAmulti_plot, warning = FALSE}
# http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
# Winston Chang
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
library(grid)
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots == 1) {
print(plots[[1]])
} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
multiplot(plot_PCA1_2,plot_PCA4_5 , cols = 2)
```
To no surprise, PC1 and PC2, which together explain the majority of the variance, show two fairly distinct groupings; PC4 and PC5, which explain considerably less variance, display no such clear separation.
#### PCA Summary
The law of diminishing returns sets in quickly with *PCA*. It breaks the `wdbc_data` down into 30 principal components, but by PC17, 99.1% (`0.9911300`) of the variance is explained: for all practical purposes, PC18 to PC30 add nothing and could safely be dropped from modelling.
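That cut-off can be read directly from the PCA summary already computed above; a sketch:
```{r pca_cutoff_sketch, echo = TRUE, eval = FALSE}
# Cumulative proportion of variance explained, per principal component
cum_var <- summary(wdbc_PCA)$importance[3, ]
cum_var["PC17"]             # ~0.9911
which(cum_var >= 0.99)[1]   # first component reaching 99% of variance explained
which(cum_var >= 0.95)[1]   # first component reaching 95% (10, per the plot above)
```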
In principle, this data set (if larger) should strongly benefit from PCA preprocessing.
## Modelling
As discussed above in the **Overview**, this project uses an ensemble of twenty-one ML algorithms (models), on two different training and test splits and with three different data preprocessing routines. So each model is run six times in total, and the results are evaluated for **Accuracy** (correctly predicting "M" or malignant) and **F Measure**.
```{r Model_Runs_All }
############################# 50/50 train/test split
Y <- wdbc_data$diagnosis
set.seed(2019)
test_index <- createDataPartition(Y, times = 1, p = 0.5, list = FALSE)
# Apply index
test_set_id <- wdbc_data[test_index, ] # will use ID later
train_set_id <- wdbc_data[-test_index, ]
# Remove id for testing
test_set <- test_set_id %>% select(-id)
train_set <- train_set_id %>% select(-id)
set.seed(2019)
test_index2 <- createDataPartition(Y, times = 1, p = 0.18, list = FALSE)
# Apply index
test2_id <- wdbc_data[test_index2, ]
train2_id <- wdbc_data[-test_index2, ]
# Remove id variable
test_2 <- test2_id %>% select(-id)
train_2 <- train2_id %>% select(-id)
######## trainControl and Data Preparations for all models
## NOTE: To have reproducible results for ensemble modeling, must use seeds argument in trainControl
set.seed(2019)
seeds <- vector(mode = "list", length = 1000)
for (i in 1:1000) seeds[[i]] <- sample.int(1000, 800)
### Re-usable trainControl for consistent comparison across models
#
### Does NOT change: same for all models
myControl <- trainControl(
method = "cv", number = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE, # IMPORTANT!
verboseIter = TRUE,
seeds = seeds
)
##
# Experiment with preprocessing: one NULL, two typical
##
prep_0 <- NULL
prep_1 <- c("center", "scale")
prep_2 <- c("nzv", "center", "scale", "pca")
###
## Select Models for Ensemble: 21
###
models <- c("adaboost", "avNNet", "gamboost", "gamLoess", "glm",
"gbm", "knn", "kknn", "lda", "mlp", "monmlp", "naive_bayes",
"qda", "ranger", "Rborist", "rf", "rpart", "svmLinear", "svmRadial",
"svmRadialCost", "svmRadialSigma")
mod_names <- enframe(models, value = "Model", name = NULL)
###############################################################################
## SAVE POINT:
## NEXT: Modelling and Results
###############################################################################
#
#
######################################## First Run : 50/50 ############################
set.seed(2019)
garbage_0 <- capture.output(
fits_0 <- lapply(models, function(model){
print(model)
train(diagnosis ~ ., data = train_set, method = model,
trControl = myControl, preProcess = prep_0)
})
)
names(fits_0) <- models
# Predictions 0
predictions_0 <- sapply(fits_0, function(object)
predict(object, newdata = test_set))
# Predictions for CFM & F Measure
pred_ft_0 <- predictions_0 %>% as.data.frame() %>%
mutate_if(., is.character, as.factor) %>%
mutate_all(., factor, levels = c("M", "B")) # Set malignant as POSITIVE
# Confusion Matrix for Prep_0
CFM_Prep_0 <- sapply(pred_ft_0 , function(object) {
CFM <- confusionMatrix(data = object, reference = test_set$diagnosis)
list(CFM)
})
########### Quick and Dirty extract for all CFM Lists!
ACC_dex <- c(6, 30, 54, 78, 102, 126, 150,174 ,198 ,222 ,246,
270 ,294 ,318 ,342, 366, 390, 414, 438, 462, 486) # Accuracy score
F1_dex <- c(19, 43, 67, 91, 115, 139, 163, 187, 211, 235, 259, 283,
307, 331, 355, 379, 403, 427, 451, 475, 499) # F1 score
############
CFM_mess_0 <- CFM_Prep_0 %>% unlist() %>% as.data.frame() # create an ordered mess
CFM_0_Keys <- bind_cols(mod_names,
Accuracy = round(as.numeric(as.character(CFM_mess_0[ACC_dex,])),4) ,
F_Measure = round(as.numeric(as.character(CFM_mess_0[F1_dex,])),4)
) %>%
mutate(Total = Accuracy + F_Measure) # grab values: convert from factor to numeric; round
#
## Prep_1 center, scale
set.seed(2019)
garbage_1 <- capture.output(
fits_1 <- lapply(models, function(model){
print(model)
train(diagnosis ~ ., data = train_set, method = model,
trControl = myControl, preProcess = prep_1)
})
)
names(fits_1) <- models
##
# Predictions
predictions_1 <- sapply(fits_1, function(object)
predict(object, newdata = test_set))
# Predictions for CFM & F Measure
pred_ft_1 <- predictions_1 %>% as.data.frame() %>%
mutate_if(., is.character, as.factor) %>%
mutate_all(., factor, levels = c("M" , "B"))
# Confusion Matrix List for Prep_1
CFM_Prep_1 <- sapply(pred_ft_1 , function(object) {
CFM <- confusionMatrix(data = object, reference = test_set$diagnosis)
list(CFM)
})
CFM_mess_1 <- CFM_Prep_1 %>% unlist() %>% as.data.frame() # mess!
CFM_1_Keys <- bind_cols(mod_names,
Accuracy = round(as.numeric(as.character(CFM_mess_1[ACC_dex,])), 4 ) ,
F_Measure = round(as.numeric(as.character(CFM_mess_1[F1_dex,])), 4 )
) %>%
mutate(Total = Accuracy + F_Measure)
#
#
## Prep 2: nzv, center, scale, pca
set.seed(2019)
garbage_2 <- capture.output(
fits_2 <- lapply(models, function(model){
print(model)
train(diagnosis ~ ., data = train_set, method = model,
trControl = myControl, preProcess = prep_2)
})
)
names(fits_2) <- models
# Predictions
predictions_2 <- sapply(fits_2, function(object)
predict(object, newdata = test_set))
pred_ft_2 <- predictions_2 %>% as_tibble() %>%
mutate_if(., is.character, as.factor) %>%
mutate_all(., factor, levels = c("M" , "B"))
# Confusion Matrix for Prep_2
CFM_Prep_2 <- sapply(pred_ft_2 , function(object) {
CFM <- confusionMatrix(data = object, reference = test_set$diagnosis)
list(CFM)
})
CFM_mess_2 <- CFM_Prep_2 %>% unlist() %>% as.data.frame()
CFM_2_Keys <- bind_cols(mod_names,
Accuracy = round(as.numeric(as.character(CFM_mess_2[ACC_dex,])), 4),
F_Measure = round(as.numeric(as.character(CFM_mess_2[F1_dex,])), 4)
) %>%
mutate(Total = Accuracy + F_Measure)
set.seed(2019)
garbage_3 <- capture.output(
fits_3.0 <- lapply(models, function(model){
print(model)
train(diagnosis ~ ., data = train_2, method = model,
trControl = myControl, preProcess = prep_0)
})
)
names(fits_3.0) <- models
# Predictions
predictions_3.0 <- sapply(fits_3.0, function(object)
predict(object, newdata = test_2))
pred_ft_3.0 <- predictions_3.0 %>% as_tibble() %>%
mutate_if(., is.character, as.factor) %>%
mutate_all(., factor, levels = c("M", "B"))
# Confusion Matrix for Prep_0
CFM_Prep_3.0 <- sapply(pred_ft_3.0 , function(object) {
CFM <- confusionMatrix(data = object, reference = test_2$diagnosis)
list(CFM)
})
CFM_mess_3.0 <- CFM_Prep_3.0 %>% unlist() %>% as.data.frame()
CFM_3.0_Keys <- bind_cols(mod_names,
Accuracy = round(as.numeric(as.character(CFM_mess_3.0[ACC_dex,])), 4),
F_Measure = round(as.numeric(as.character(CFM_mess_3.0[F1_dex,])), 4)
) %>%
mutate(Total = Accuracy + F_Measure)
#
## Prep_1 model center, scale
#
set.seed(2019)
garbage_3.1 <- capture.output(
fits_3.1 <- lapply(models, function(model){
print(model)
train(diagnosis ~ ., data = train_2, method = model,
trControl = myControl, preProcess = prep_1)
})
)
names(fits_3.1) <- models
# Predictions Prep_1
predictions_3.1 <- sapply(fits_3.1, function(object)
predict(object, newdata = test_2))
pred_ft_3.1 <- predictions_3.1 %>% as_tibble() %>%
mutate_if(., is.character, as.factor) %>%
mutate_all(., factor, levels = c("M", "B"))
# Confusion Matrix for Prep_1
CFM_Prep_3.1 <- sapply(pred_ft_3.1 , function(object) {
CFM <- confusionMatrix(data = object, reference = test_2$diagnosis)
list(CFM)
})
CFM_mess_3.1 <- CFM_Prep_3.1 %>% unlist() %>% as.data.frame()
CFM_3.1_Keys <- bind_cols(mod_names,
Accuracy = round(as.numeric(as.character(CFM_mess_3.1[ACC_dex,])), 4) ,
F_Measure = round(as.numeric(as.character(CFM_mess_3.1[F1_dex,])), 4)
) %>%
mutate(Total = Accuracy + F_Measure)
#
## Prep_2 model nzv, center, scale, pca
#
set.seed(2019)
garbage_3.2 <- capture.output(
fits_3.2 <- lapply(models, function(model){
print(model)
train(diagnosis ~ ., data = train_2, method = model,
trControl = myControl, preProcess = prep_2)
})
)
names(fits_3.2) <- models
# Predictions Prep_2
predictions_3.2 <- sapply(fits_3.2, function(object)
predict(object, newdata = test_2))
pred_ft_3.2 <- predictions_3.2 %>% as_tibble() %>%
mutate_if(., is.character, as.factor) %>%
mutate_all(., factor, levels = c("M", "B"))
# Confusion Matrix for Prep_2
CFM_Prep_3.2 <- sapply(pred_ft_3.2 , function(object) {
CFM <- confusionMatrix(data = object, reference = test_2$diagnosis)
list(CFM)
})
CFM_mess_3.2 <- CFM_Prep_3.2 %>% unlist() %>% as.data.frame()
CFM_3.2_Keys <- bind_cols(mod_names,
Accuracy = round(as.numeric(as.character(CFM_mess_3.2[ACC_dex,])), 4) ,
F_Measure = round(as.numeric(as.character(CFM_mess_3.2[F1_dex,])), 4)
) %>%
mutate(Total = Accuracy + F_Measure)
rm(garbage_0, garbage_1, garbage_2, garbage_3, garbage_3.1, garbage_3.2)
############################### Results #########################################
#
# Run One: 50/50
Accuracy_Table_1 <- bind_cols(Model = mod_names,
Acc_0 = CFM_0_Keys$Accuracy,
F1_0 = CFM_0_Keys$F_Measure,
Acc_1 = CFM_1_Keys$Accuracy,
F1_1 = CFM_1_Keys$F_Measure,
Acc_2 = CFM_2_Keys$Accuracy,
F1_2 = CFM_2_Keys$F_Measure) %>%
mutate(Top_PreProcess = (Acc_1 + Acc_2) / 2,
Top_Overall = (Acc_0 + Acc_1 + Acc_2) / 3)
## Averages
h_line_Acc_0 <- mean(Accuracy_Table_1$Acc_0)
h_line1_Acc_1 <- mean(Accuracy_Table_1$Acc_1)
h_line2_Acc_2 <- mean(Accuracy_Table_1$Acc_2)
Accuracy_Run_One_Viz <- Accuracy_Table_1 %>%
ggplot(aes(Model, Acc_0)) +
geom_jitter(color = "red", alpha = 0.6, width = 0.44, height = -0.1) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
geom_jitter(aes(y = Acc_1), color = "blue",
alpha = 0.6, width = 0.5, height = 0) +
geom_jitter(aes(y = Acc_2), color = "green" , alpha = 0.6, width = 0.44, height = 0) +
geom_hline(yintercept = h_line1_Acc_1, linetype = 2, color = "blue", alpha = 0.3) +
geom_hline(yintercept = h_line_Acc_0, linetype = 2, color = "red", alpha = 0.3) +
geom_hline(yintercept = h_line2_Acc_2, linetype = 2, color = "green", alpha = 0.5) +
labs(title = "All Models: Accuracy Scores: 50/50 Split",
subtitle = "Prep by color: Red 0; Blue 1; Green 2",
y = "Accuracy Rate", caption = "H-lines = Prep avg.")
## Replot
Accuracy_Table_1a <- Accuracy_Table_1
Accuracy_Table_1a$Acc_0[10] <- NA # induce NA to remove mlp outlier
h_line_Acc_0_check <- mean(Accuracy_Table_1a$Acc_0, na.rm = TRUE) # without MLP_0
Accuracy_Run_One_reViz <- Accuracy_Table_1a %>%
ggplot(aes(Model, Acc_0)) +
geom_jitter(color = "red", alpha = 0.6, width = 0.4, height = 0) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
geom_jitter(aes(y = Acc_1), color = "blue",
alpha = 0.6, width = 0.5, height = 0) +
geom_jitter(aes(y = Acc_2), color = "green", alpha = 0.6, width = 0.4, height = 0) +
geom_hline(yintercept = h_line_Acc_0, linetype = 2, color = "red", alpha = 0.3) +
geom_hline(yintercept = h_line1_Acc_1, linetype = 2, color = "blue", alpha = 0.3) +
geom_hline(yintercept = h_line2_Acc_2, linetype = 2, color = "green", alpha = 0.5) +
geom_hline(yintercept = h_line_Acc_0_check , linetype = 2, color = "orange", alpha = 0.5) +
labs(title = "All Models: Accuracy Scores: 50/50 Split",
subtitle = "Prep by color: Red 0; Blue 1; Green 2",
y = "Accuracy Rate",
caption = "H-lines = Prep avg.; Prep_0 for MLP not plotted: 0.6281")
## Top Seven Models Per Prep, Run One
Top_Seven_0 <- Accuracy_Table_1 %>% arrange(desc(Acc_0)) %>%
select(Model_0 = Model, Prep_Null = Acc_0) %>% slice(1:7)
Top_Seven_1 <- Accuracy_Table_1 %>% arrange(desc(Acc_1)) %>%
select(Model_1 = Model, Prep_1 = Acc_1) %>% slice(1:7)
Top_Seven_2 <- Accuracy_Table_1 %>% arrange(desc(Acc_2)) %>%
select(Model_2 = Model, Prep_2 = Acc_2) %>% slice(1:7)
Top_Overall <- Accuracy_Table_1 %>% arrange(desc(Top_Overall)) %>%
select(Model_Overall = Model, Avg_Acc = Top_Overall) %>% slice(1:7)
Top_Seven_50_Split <- bind_cols(Top_Seven_0, Top_Seven_1,
Top_Seven_2, Top_Overall)
Accuracy_Table_2 <- bind_cols(Model = mod_names,
Acc_3.0 = CFM_3.0_Keys$Accuracy,
F1_3.0 = CFM_3.0_Keys$F_Measure,
Acc_3.1 = CFM_3.1_Keys$Accuracy,
F1_3.1 = CFM_3.1_Keys$F_Measure,
Acc_3.2 = CFM_3.2_Keys$Accuracy,
F1_3.2 = CFM_3.2_Keys$F_Measure) %>%
mutate(Top_PreProcess = (Acc_3.1 + Acc_3.2) / 2,
Top_Overall = (Acc_3.0 + Acc_3.1 + Acc_3.2) / 3)
# Remove mlp NULL for chart
Accuracy_Table_2a <- Accuracy_Table_2
Accuracy_Table_2a$Acc_3.0[10] <- NA # induce NA
h_line_Acc_3.0 <- mean(Accuracy_Table_2$Acc_3.0)
h_line_Acc_3.1 <- mean(Accuracy_Table_2$Acc_3.1)
h_line_Acc_3.2 <- mean(Accuracy_Table_2$Acc_3.2)
h_line_check_3.0 <- mean(Accuracy_Table_2a$Acc_3.0,
na.rm = TRUE ) # remove mlp
Accuracy_Run_Two_reViz <- Accuracy_Table_2a %>%
ggplot(aes(Model, Acc_3.0)) +
geom_jitter(color = "red", alpha = 0.6, width = 0.4, height = 0) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
geom_jitter(aes(y = Acc_3.1), color = "blue",
alpha = 0.6, width = 0.5, height = 0) +
geom_jitter(aes(y = Acc_3.2), color = "green", alpha = 0.6, width = 0.4, height = 0) +
geom_hline(yintercept = h_line_Acc_3.0, linetype = 2, color = "red", alpha = 0.3) +
geom_hline(yintercept = h_line_Acc_3.1, linetype = 2, color = "blue", alpha = 0.3) +
geom_hline(yintercept = h_line_Acc_3.2, linetype = 2, color = "green", alpha = 0.5) +
geom_hline(yintercept = h_line_check_3.0, linetype = 2, color = "orange", alpha = 0.5) +
labs(title = "All Models: Accuracy Scores: 82/18 Split",
subtitle = "Prep by color: Red 3.0; Blue 3.1; Green 3.2",
y = "Accuracy Rate",
caption = "H-lines = Prep avg.; Prep_3.0 for MLP not plotted: 0.6281")
# Top Seven Models Run Two
Top_Seven_3.0 <- Accuracy_Table_2 %>% arrange(desc(Acc_3.0)) %>%
select(Model_3.0 = Model, Prep_Null = Acc_3.0) %>% slice(1:7)
Top_Seven_3.1 <- Accuracy_Table_2 %>% arrange(desc(Acc_3.1)) %>%
select(Model_3.1 = Model, Prep_1 = Acc_3.1) %>% slice(1:7)
Top_Seven_3.2 <- Accuracy_Table_2 %>% arrange(desc(Acc_3.2)) %>%
select(Model_3.2 = Model, Prep_2 = Acc_3.2) %>% slice(1:7)
Top_Overall_82 <- Accuracy_Table_2 %>% arrange(desc(Top_Overall)) %>%
select(Model_Overall = Model, Avg_Acc = Top_Overall) %>% slice(1:7)
Top_Seven_82_Split <- bind_cols(Top_Seven_3.0, Top_Seven_3.1,
Top_Seven_3.2, Top_Overall_82 )
# Comparing Run Results
Overall_Accuracy_Table <- bind_cols(
Accuracy_Table_1 %>%
select(Model, starts_with("Acc")),
Accuracy_Table_2 %>%
select(starts_with("Acc"))
) %>%
mutate(Top_PreProcess = (Acc_1 + Acc_2 + Acc_3.1 + Acc_3.2) / 4,
Top_Overall = (Acc_0 + Acc_1 + Acc_2 + Acc_3.0 + Acc_3.1 + Acc_3.2) / 6) %>%
arrange(desc(Top_Overall))
```
### Models and Preparations
The data preprocessing commands for `caret`, and the models for the ensemble, are as follows:
```{r modelling_info, echo = TRUE, eval = FALSE}
##
# Experiment with preprocessing: one NULL, two typical
##
prep_0 <- NULL
prep_1 <- c("center", "scale")
prep_2 <- c("nzv", "center", "scale", "pca")
###
## Select Models for Ensemble: 21
###
models <- c("adaboost", "avNNet", "gamboost", "gamLoess", "glm", "gbm",
"knn", "kknn", "lda", "mlp", "monmlp", "naive_bayes", "qda",
"ranger", "Rborist", "rf", "rpart", "svmLinear", "svmRadial",
"svmRadialCost", "svmRadialSigma")
mod_names <- enframe(models, value = "Model", name = NULL)
```
Each model, for each run, split, and prep, shares the same `trainControl` as discussed in the **Overview**, with cv (cross-validation) set to 10.
The generic function for running the ensemble is as follows:
```{r Model_func, echo = TRUE, eval = FALSE}
set.seed(2019)
fits_1 <- lapply(models, function(model){
print(model)
train(diagnosis ~ ., data = train_set, method = model,
trControl = myControl, preProcess = prep_1)
})
```
`fits_1`, in this case, indicates Run One on the 50/50 split, using `prep_1`: the data centered and scaled. The model results are then used to make predictions against the test set, and those predictions are saved as a dataframe in order to generate a `confusionMatrix` per model, stored as a large list.
```{r cfm_example, echo = TRUE, eval = FALSE}
# Predictions
predictions_1 <- sapply(fits_1, function(object)
predict(object, newdata = test_set))
# Predictions for CFM & F Measure
pred_ft_1 <- predictions_1 %>% as.data.frame() %>%
mutate_if(., is.character, as.factor) %>%
mutate_all(., factor, levels = c("M" , "B"))
# Confusion Matrix List for Prep_1
CFM_Prep_1 <- sapply(pred_ft_1 , function(object) {
CFM <- confusionMatrix(data = object, reference = test_set$diagnosis)
list(CFM)
})
```
The *Accuracy* and *F Measure* scores per model, and other results as needed, are then pulled from the `confusionMatrix` list: in the example above, `CFM_Prep_1`.
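The report pulls these scores by position after unlisting the confusion matrices; a more direct sketch, pulling them by name from the same list (here `CFM_Prep_1`):
```{r cfm_extract_sketch, echo = TRUE, eval = FALSE}
# Sketch only: extract Accuracy and F1 per model straight from the confusionMatrix objects
CFM_1_Keys_alt <- tibble(
  Model     = names(CFM_Prep_1),
  Accuracy  = round(sapply(CFM_Prep_1, function(x) x$overall["Accuracy"]), 4),
  F_Measure = round(sapply(CFM_Prep_1, function(x) x$byClass["F1"]), 4)
)
```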
## Train and Test Splitting
The first train/test split is an arbitrary 50/50, intended to examine how well particular models work with limited data. The second train/test split, meant to examine how well the models learn on more data, follows a debatable rule of thumb that the final validation set should "be inversely proportional to the square root of the number of free adjustable parameters."[^16] Since `1/sqrt(30)` rounds out to 18%, leaving 104 observations in the final test set, 82/18 seemed good enough.
```{r train_test, echo = TRUE, eval = FALSE}
# Create first stage train and test sets: 50% for model training; 50% for testing
# Second stage: 82% train; 18 % test
############################# 50/50 train/test split
# which models do well with limited data?
Y <- wdbc_data$diagnosis
set.seed(2019)
test_index <- createDataPartition(Y, times = 1, p = 0.5, list = FALSE)
# Apply index
test_set_id <- wdbc_data[test_index, ] # will use ID later
train_set_id <- wdbc_data[-test_index, ]
# Remove id for testing
test_set <- test_set_id %>% select(-id)
train_set <- train_set_id %>% select(-id)
############################# 82/18 train/test split test_ratio <- 1/sqrt(30)
# Which models have an ML advantage?
set.seed(2019)
test_index2 <- createDataPartition(Y, times = 1, p = 0.18, list = FALSE)
# Apply index
test2_id <- wdbc_data[test_index2, ]
train2_id <- wdbc_data[-test_index2, ]
# Remove id variable
test_2 <- test2_id %>% select(-id)
train_2 <- train2_id %>% select(-id)
```
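A quick check of the 18% figure used above (a sketch, assuming the splits created in this chunk):
```{r split_check_sketch, echo = TRUE, eval = FALSE}
1 / sqrt(30)                  # ~0.18: rule-of-thumb test fraction for 30 predictors
nrow(test_2); nrow(train_2)   # roughly 104 test vs. 465 training observations
```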
A basic assumption of ML is that as the amount of training data increases, the model improves in terms of accuracy as it *learns* from the data. As we will see, this does NOT hold true in every case: or, at least, models improve at disparate rates depending on the data characteristics, the preprocessing (if any), and other conditions and constraints.
# Results
## Run One: 50/50 Train/Test
We will start with the big picture for Run One: graphing the accuracy scores for the 50/50 train/test split across all data preprocessing routines. Then I will break down Run One by prep (the preprocessing routine). Finally, I will offer the overall summary statistics.
```{r chart_one_Run_one, warning = FALSE}
Accuracy_Run_One_Viz # includes outlier mlp prep_0
```
### High Performers and One Outlier
Right away, we see wildly different results per model according to the data preparation. In fact, using the *multi-layer perceptron* neural network model[^17], `mlp`, on the raw data (`Prep_0` or no preprocessing) returns an accuracy score equivalent to guessing: to the *No Information Rate*. The confusion matrix results below:
```{r MLP_CFM_0, echo = TRUE}
CFM_Prep_0$mlp
```
Importantly, MLP_0 failed to correctly identify any malignant cells: if the score reflected only the positive class ("M"), rather than the `twoClassSummary` treatment discussed above, it would be 0 (zero). In contrast, MLP_1 (center, scale) and MLP_2 (nzv, center, scale, pca) show significant improvement.
```{r MLP_Run_One}
Accuracy_Table_1 %>% filter(Model == "mlp") %>%
rename("PreProcess_Acc" = Top_PreProcess, "Overall_Acc" = Top_Overall) %>%
knitr::kable(caption = "mlp Results: Run One")
```
In contrast to `mlp` on the raw data, `svmLinear`, `svmRadialCost`, and `svmRadialSigma`, three of the *support vector machine*[^18] learning models natively supported by `caret`[^19], do surprisingly well with no data preprocessing and prove generally robust across all preps tested.
### Other Neural Network Models
But the issue with `mlp` does not seem to apply to *neural network* models generally: two other neural network models also did well across all preps: `avNNet` (*Neural Networks Using Model Averaging*)[^20] and `monmlp` (*Multi-Layer Perceptron Neural Network with Optional Monotonicity Constraints*).[^21]
```{r Other_Neural_Net_Run_one}
Accuracy_Table_1 %>% filter(Model == "avNNet") %>%
rename("PreProcess_Acc" = Top_PreProcess, "Overall_Acc" = Top_Overall) %>%
knitr::kable(caption = "avNNet Results: Run One")
Accuracy_Table_1 %>% filter(Model == "monmlp") %>%
rename("PreProcess_Acc" = Top_PreProcess, "Overall_Acc" = Top_Overall) %>%