\documentclass[letterpaper]{report}
\usepackage{hyperref}
\usepackage{fancyhdr}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{enumerate}
\usepackage{caption}
\setlength\parindent{0pt}
% start meta data
\title{Python Machine Learning\\ Equation Reference}
\author{Sebastian Raschka \\ \texttt{[email protected]}}
\date{ \vspace{2cm} 05\slash 04\slash 2015 (last updated: 11\slash 29\slash 2016) \\\begin{flushleft} \vspace{2cm} \noindent\rule{10cm}{0.4pt} \\ Code Repository and Resources: \href{https://github.com/rasbt/python-machine-learning-book}{https://github.com/rasbt/python-machine-learning-book} \vspace{2cm} \endgraf @book\{raschka2015python,\\
title=\{Python Machine Learning\},\\
author=\{Raschka, Sebastian\},\\
year=\{2015\},\\
publisher=\{Packt Publishing\} \} \end{flushleft}}
% end meta data
% start header and footer
\pagestyle{fancy}
\lhead{Sebastian Raschka}
\rhead{Python Machine Learning -- Equation Reference -- Ch. \thechapter}
\cfoot{\thepage} % centered footer
\renewcommand{\headrulewidth}{0.4pt}
\renewcommand{\footrulewidth}{0.4pt}
\renewcommand{\chaptermark}[1]{%
{}}
% end header and footer
\begin{document} % start main document
\maketitle
\tableofcontents
%%%%%%%%%%%%%%%
% CHAPTER 1
%%%%%%%%%%%%%%%
\chapter{Giving Computers the Ability to Learn from Data}
\section{Building intelligent machines to transform data into knowledge}
\section{The three different types of machine learning}
\section{Making predictions about the future with supervised learning}
\subsection{Classification for predicting class labels}
\subsection{Regression for predicting continuous outcomes}
\section{Solving interactive problems with reinforcement learning}
\section{Discovering hidden structures with unsupervised learning}
\subsection{Finding subgroups with clustering}
\subsection{Dimensionality reduction for data compression}
\section{An introduction to the basic terminology and notations}
\newpage
The Iris dataset, consisting of 150 samples and 4 features, can then be written as a $150 \times 4$ matrix $\mathbf{X} \in \mathbb{R}^{150 \times 4}:$
\[
\begin{bmatrix}
x_{1}^{(1)} & x_{2}^{(1)} & x_{3}^{(1)} & x_{4}^{(1)} \\
x_{1}^{(2)} & x_{2}^{(2)} & x_{3}^{(2)} & x_{4}^{(2)} \\
\vdots & \vdots & \vdots & \vdots \\
x_{1}^{(150)} & x_{2}^{(150)} & x_{3}^{(150)} & x_{4}^{(150)}
\end{bmatrix}
\]
For the rest of this book, unless noted otherwise, we will use the superscript $(i)$ to refer to the $i$th training sample, and the subscript $j$ to refer to the $j$th dimension of the training dataset.
We use lower-case, bold-face letters to refer to vectors ($\mathbf{x} \in \mathbb{R}^{n \times 1}$) and upper-case, bold-face letters to refer to matrices ($\mathbf{X} \in \mathbb{R}^{n \times m}$), where $n$ refers to the number of rows and $m$ to the number of columns, respectively. To refer to single elements in a vector or matrix, we write the letters in italics, $x^{(n)}$ or $x^{(n)}_{m}$, respectively. For example, $x^{(150)}_1$ refers to the first dimension of flower sample 150, the sepal length. Thus, each row in this feature matrix represents one flower instance and can be written as a four-dimensional row vector $\mathbf{x}^{(i)} \in \mathbb{R}^{1 \times 4}$
\[ \mathbf{x}^{(i)} = \bigg[x^{(i)}_1 \; x^{(i)}_2 \; x^{(i)}_3 \; x^{(i)}_4 \bigg]. \]
Each feature dimension is a 150-dimensional column vector $\mathbf{x}_{j} \in \mathbb{R}^{150 \times 1}$, for example
\[
\mathbf{x_j} = \begin{bmatrix}
x_{j}^{(1)} \\
x_{j}^{(2)} \\
\vdots \\
x_{j}^{(150)}
\end{bmatrix}
.\]
Similarly, we store the target variables (here: class labels) as a 150-dimensional column vector
\[
\mathbf{y} = \begin{bmatrix}
y^{(1)} \\
y^{(2)} \\
\vdots \\
y^{(150)}
\end{bmatrix}
, (y \in \{ \text{Setosa, Versicolor, Virginica} \}).\]
\newpage
\section{A roadmap for building machine learning systems}
\subsection{Preprocessing -- getting data into shape}
\subsection{Training and selecting a predictive model}
\subsection{Evaluating models and predicting unseen data instances}
\section{Using Python for machine learning}
\subsection{Installing Python packages}
\section{Summary}
%%%%%%%%%%%%%%%
% CHAPTER 2
%%%%%%%%%%%%%%%
\chapter{Training Machine Learning Algorithms for Classification}
\section{Artificial neurons -- a brief glimpse into the early history of machine learning}
We can then define an activation function $\phi(z)$ that takes a linear combination of certain
input values $\mathbf{x}$ and a corresponding weight vector $\mathbf{w}$ where $z$ is the so-called net
input ($z = w_1 x_1 + \dots + w_m x_m$):
\[
\mathbf{w} = \begin{bmatrix}
w_{1} \\
w_{2} \\
\vdots \\
w_{m}
\end{bmatrix}, \quad
\mathbf{x} = \begin{bmatrix}
x_{1} \\
x_{2} \\
\vdots \\
x_{m}
\end{bmatrix}.
\]
Now, if the activation of a particular sample $x^{(i)}$, that is, the output of $\phi(z)$, is greater than a defined threshold $\theta$, we predict class 1, and class $-1$ otherwise. In the perceptron algorithm, the activation function $\phi(\cdot)$ is a simple \textit{unit step function}, which is sometimes also called the \textit{Heaviside step function}:
\[ \phi(z) = \begin{cases}
1 & \text{ if } z \ge \theta \\
-1 & \text{ otherwise }.
\end{cases}
\]
For simplicity, we can bring the threshold $\theta$ to the left side of the equation and define a weight-zero as $w_0 = -\theta$ and $x_0=1$, so that we can write $z$ in a more compact form
\[
z = w_0 x_0 + w_1 x_1 + \dots + w_m x_m = \mathbf{w^T x}
\]
and
\[ \phi(z) = \begin{cases}
1 & \text{ if } z \ge 0 \\
-1 & \text{ otherwise }.
\end{cases}
\]
In the following sections, we will often make use of basic notations from linear algebra. For example, we will abbreviate the sum of the products of the values in $\mathbf{x}$ and $\mathbf{w}$ using a \textit{vector dot product}, whereas superscript $T$ stands for \textit{transpose}, which is an operation that transforms a column vector into a row vector and vice versa:
\[
z = w_0 x_0 + w_1 x_1 + \dots + w_m x_m = \sum_{j=0}^{m} w_j x_j = \mathbf{w}^T \mathbf{x}.
\]
For example:
\[
\big[1 \quad 2 \quad 3 \big] \times \begin{bmatrix}
4 \\
5 \\
6
\end{bmatrix} = 1 \times 4 + 2 \times 5 + 3 \times 6 = 32.
\]
Furthermore, the transpose operation can also be applied to a matrix to
reflect it over its diagonal, for example:
\[
\begin{bmatrix}
1 & 2 \\
3 & 4 \\
5 & 6
\end{bmatrix}^T = \begin{bmatrix}
1 & 3 & 5 \\
2 & 4 & 6
\end{bmatrix}
\]
Rosenblatt's initial perceptron rule is fairly simple and can be summarized by the following steps:
\begin{enumerate}
\item Initialize the weights to 0 or small random numbers.
\item For each training sample $\mathbf{x}^{(i)}$, perform the following steps:
\begin{enumerate}
\item Compute the output value $\hat{y}$.
\item Update the weights.
\end{enumerate}
\end{enumerate}
Here, the output value is the class label predicted by the unit step function that we defined earlier, and the simultaneous update of each weight $w_j$ in the weight vector $\mathbf{w}$ can be more formally written as:
\[
w_j := w_j + \Delta w_j
\]
The value of $\Delta w_j$, which is used to update the weight $w_j$, is calculated by the perceptron rule:
\[
\Delta w_j = \eta \bigg( y^{(i)} - \hat{y}^{(i)} \bigg)x_{j}^{(i)}
\]
Where $\eta$ is the learning rate (a constant between 0.0 and 1.0), $y^{(i)}$ is the true class label of the $i$th training sample, and $\hat{y}^{(i)}$ is the predicted class label. It is important to note that all weights in the weight vector are being updated simultaneously, which means that we don't recompute $\hat{y}^{(i)}$ before all of the weights $\Delta w_j$ were updated. Concretely, for a 2D dataset, we would write the update as follows:
\[
\Delta w_0 = \eta \bigg( y^{(i)} - \hat{y}^{(i)} \bigg)
\]
\[
\Delta w_1 = \eta \bigg( y^{(i)} - \hat{y}^{(i)} \bigg) x_{1}^{(i)}
\]
\[
\Delta w_2 = \eta \bigg( y^{(i)} - \hat{y}^{(i)} \bigg) x_{2}^{(i)}
\]
Before we implement the perceptron rule in Python, let us make a simple thought experiment to illustrate how beautifully simple this learning rule really is. In the two scenarios where the perceptron predicts the class label correctly, the weights remain unchanged:
\[
\Delta w_j = \eta \big( -1 - (-1) \big)x_{j}^{(i)} = 0
\]
\[
\Delta w_j = \eta \bigg( 1-1 \bigg)x_{j}^{(i)} = 0
\]
However, in the case of a wrong prediction, the weights are being pushed towards the direction of the positive or negative target class, respectively:
\[
\Delta w_j = \eta \big( 1 - (-1) \big)x_{j}^{(i)} = \eta(2)x_{j}^{(i)}
\]
\[
\Delta w_j = \eta \bigg( -1-1 \bigg)x_{j}^{(i)} = \eta(-2)x_{j}^{(i)}
\]
To get a better intuition for the multiplicative factor $x_{j}^{(i)}$, let us go through another
simple example, where:
\[
y^{(i)} = +1, \quad \hat{y}^{(i)} = -1, \quad \eta = 1
\]
Let's assume that $x_{j}^{(i)}=0.5$ and we misclassify this sample as $-1$. In this case, we would increase the corresponding weight by $1$ so that the net input $x_{j}^{(i)} \times w_{j}$ will be more positive the next time we encounter this sample and thus will be more likely to be above the threshold of the unit step function to classify the sample as $+1$:
\[
\Delta w_{j} = \big( 1 - (-1) \big) 0.5 = (2) 0.5 = 1
\]
The weight update is proportional to the value of $x_{j}^{(i)}$. For example, if we have another sample $x_{j}^{(i)}=2$ that is incorrectly classified as $-1$, we'd push the decision boundary by an even larger extent to classify this sample correctly the next time:
\[
\Delta w_{j} = \big( 1 - (-1) \big) 2 = (2) 2 = 4.
\]
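As a preview of the implementation discussed in the next section, the perceptron rule can be written in a few lines of NumPy. The following is a minimal sketch, not the book's final implementation; it assumes a float feature matrix \texttt{X} of shape $(n, m)$, a label vector \texttt{y} with values in $\{-1, 1\}$, a float weight vector \texttt{w} of length $m+1$ whose first entry plays the role of $w_0$, and a learning rate \texttt{eta}:
\begin{verbatim}
import numpy as np

def perceptron_epoch(X, y, w, eta=0.1):
    """One pass over the training set, applying the perceptron rule."""
    for xi, target in zip(X, y):
        net_input = w[0] + np.dot(xi, w[1:])        # z = w_0 + w^T x
        prediction = 1 if net_input >= 0.0 else -1  # unit step function
        update = eta * (target - prediction)        # eta * (y - y_hat)
        w[1:] += update * xi                        # Delta w_j = update * x_j
        w[0] += update                              # Delta w_0 = update * 1
    return w
\end{verbatim}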
\section{Implementing a perceptron learning algorithm in Python}
\subsection{Training a perceptron model on the Iris dataset}
\section{Adaptive linear neurons and the convergence of learning}
The key difference between the Adaline rule (also known as the Widrow-Hoff rule) and Rosenblatt's perceptron is that the weights are updated based on a linear activation function rather than a unit step function like in the perceptron. In Adaline, this linear activation function $\phi(z)$ is simply the identity function of the net input, so that
\[
\phi \big( \mathbf{w}^T \mathbf{x} \big) = \mathbf{w}^T \mathbf{x}
\]
\subsection{Minimizing cost functions with gradient descent}
One of the key ingredients of supervised machine learning algorithms is to define an objective function that is to be optimized during the learning process. This objective function is often a cost function that we want to minimize. In the case of Adaline, we can define the cost function $J(\cdot)$ to learn the weights as the Sum of Squared Errors (SSE) between the calculated outcomes and the true class labels
\[
J(\mathbf{w}) = \frac{1}{2} \sum_i \bigg(y^{(i)} - \phi \big(z^{(i)} \big) \bigg)^2.
\]
Using gradient descent, we can now update the weights by taking a step away from the gradient $\nabla J(\mathbf{w})$ of our cost function $J(\mathbf{\cdot})$:
\[
\mathbf{w} := \mathbf{w} + \Delta \mathbf{w}.
\]
To compute the gradient of the cost function, we need to compute the partial derivative of the cost function with respect to each weight $w_j$,
\[
\frac{\partial J}{\partial w_j} = - \sum_i \bigg( y^{(i)} - \phi \big(z^{(i)} \big) \bigg) x_{j}^{(i)},
\]
so that we can write the update of weight $w_j$ as
\[
\Delta w_j = - \eta \frac{\partial J}{\partial w_j} = \eta \sum_i \bigg( y^{(i)} - \phi \big(z^{(i)} \big) \bigg) x_{j}^{(i)}
\]
Since we update all weights simultaneously, our Adaline learning rule becomes
\[
\mathbf{w} := \mathbf{w} + \Delta \mathbf{w}.
\]
For those who are familiar with calculus, the partial derivative of the SSE cost function with respect to the $j$th weight in can be obtained as follows:
\begin{equation*}
\begin{split}
& \frac{\partial J}{\partial w_j} = \frac{\partial}{\partial w_j} \frac{1}{2} \sum_i \bigg( y^{(i)} - \phi \big( z^{(i)} \big) \bigg)^2 \\
& = \frac{1}{2} \frac{\partial}{\partial w_j} \sum_i \bigg( y^{(i)} - \phi \big( z^{(i)} \big) \bigg)^2 \\
& = \frac{1}{2} \sum_i 2 \big( y^{(i)} - \phi(z^{(i)})\big) \frac{\partial}{\partial w_j} \Big( y^{(i)} - \phi({z^{(i)}}) \Big) \\
& = \sum_i \big( y^{(i)} - \phi (z^{(i)}) \big) \frac{\partial}{\partial w_j} \Big( y^{(i)} - \sum_j \big(w_{j} x^{(i)}_{j} \big) \Big) \\
& = \sum_i \bigg( y^{(i)} - \phi \big( z^{(i)} \big) \bigg) \bigg( - x_{j}^{(i)} \bigg) \\
& = - \sum_i \bigg( y^{(i)} - \phi \big( z^{(i)} \big) \bigg) x_{j}^{(i)} \\
\end{split}
\end{equation*}
Performing a matrix-vector multiplication is similar to calculating a vector dot product where each row in the matrix is treated as a single row vector. This vectorized approach represents a more compact notation and results in a more efficient computation using NumPy. For example:
\[
\begin{bmatrix}
1 & 2 & 3\\
4 & 5 & 6
\end{bmatrix} \times \begin{bmatrix}
7 \\
8 \\
9
\end{bmatrix} = \begin{bmatrix}
1 \times 7 + 2 \times 8 + 3 \times 9 \\
4 \times 7 + 5 \times 8 + 6 \times 9
\end{bmatrix} = \begin{bmatrix}
50 \\
122
\end{bmatrix}
\]
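The derivation above translates directly into a vectorized weight update. The following is a minimal sketch, not the book's full implementation; it assumes a float feature matrix \texttt{X} of shape $(n, m)$, a label vector \texttt{y} of length $n$, and a weight vector \texttt{w} of length $m+1$ whose first entry is the bias weight $w_0$:
\begin{verbatim}
import numpy as np

def adaline_gd_step(X, y, w, eta=0.01):
    """One batch gradient-descent step for Adaline (identity activation)."""
    output = w[0] + np.dot(X, w[1:])      # phi(z^(i)) = z^(i) for all samples
    errors = y - output                   # y^(i) - phi(z^(i))
    w[1:] += eta * np.dot(X.T, errors)    # Delta w_j = eta * sum_i error_i * x_j^(i)
    w[0] += eta * errors.sum()            # bias update (x_0 = 1)
    cost = 0.5 * (errors ** 2).sum()      # SSE cost J(w)
    return w, cost
\end{verbatim}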
\subsection{Implementing an Adaptive Linear Neuron in Python}
Here, we will use a feature scaling method called standardization, which gives our data the property of a standard normal distribution. The mean of each feature
is centered at value 0 and the feature column has a standard deviation of 1. For example, to standardize the $j$th feature, we simply need to subtract the sample mean $\mu_j$ from every training sample and divide it by its standard deviation $\sigma_j$:
\[
\mathbf{x}'_j = \frac{\mathbf{x}_j - \mu_j}{\sigma_j}.
\]
Here, $\mathbf{x}_j$ is a vector consisting of the $j$th feature values of all $n$ training samples.
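In NumPy, this standardization can be performed with the built-in \texttt{mean} and \texttt{std} array methods. A small sketch on a toy two-feature array (the array contents are illustrative only):
\begin{verbatim}
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 9.0]])  # toy data with 2 features

X_std = np.copy(X)                                   # keep the original data intact
X_std[:, 0] = (X[:, 0] - X[:, 0].mean()) / X[:, 0].std()
X_std[:, 1] = (X[:, 1] - X[:, 1].mean()) / X[:, 1].std()
\end{verbatim}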
\subsection{Large scale machine learning and stochastic gradient descent}
A popular alternative to the batch gradient descent algorithm is stochastic gradient descent, sometimes also called iterative or on-line gradient descent. Instead of updating the weights based on the sum of the accumulated errors over all samples $\mathbf{x}^{(i)}$:
\[
\Delta \mathbf{w} = \eta \sum_i \bigg( y^{(i)} - \phi \big( z^{(i)}\big) \bigg) \mathbf{x}^{(i)}.
\]
we update the weights incrementally for each training sample:
\[
\Delta \mathbf{w} = \eta \bigg( y^{(i)} - \phi \big( z^{(i)}\big) \bigg) \mathbf{x}^{(i)}.
\]
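The stochastic variant differs from the batch version only in that the gradient is computed from one training sample at a time (ideally after shuffling the data in each epoch). A minimal sketch with the same array conventions as before:
\begin{verbatim}
import numpy as np

def adaline_sgd_epoch(X, y, w, eta=0.01):
    """One stochastic-gradient-descent pass over the training data."""
    for xi, target in zip(X, y):
        output = w[0] + np.dot(xi, w[1:])   # phi(z) for a single sample
        error = target - output
        w[1:] += eta * error * xi           # incremental weight update
        w[0] += eta * error
    return w
\end{verbatim}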
\section{Summary}
%%%%%%%%%%%%%%%
% CHAPTER 3
%%%%%%%%%%%%%%%
\chapter{A Tour of Machine Learning Classifiers Using Scikit-learn}
\section{Choosing a classification algorithm}
\section{First steps with scikit-learn}
\subsection{Training a perceptron via scikit-learn}
\section{Modeling class probabilities via logistic regression}
\subsection{Logistic regression intuition and conditional probabilities}
The odds ratio can be written as
\[
\frac{p}{(1-p)},
\]
where $p$ stands for the probability of the positive event. The term positive event does not necessarily mean good, but refers to the event that we want to predict, for example, the probability that a patient has a certain disease; we can think of the positive event as class label $y = 1$. We can then further define the logit function, which is simply the logarithm of the odds ratio (log-odds):
\[
\text{logit}(p) = \log \frac{p}{1-p}
\]
The logit function takes input values in the range 0 to 1 and transforms them to values over the entire real number range, which we can use to express a linear relationship between feature values and the log-odds:
\[
\text{logit} \big( p (y=1 | \mathbf{x}) \big) = w_0 x_0 + w_1 x_1 + \cdots + w_m x_m = \sum^{m}_{i=0} w_i x_i = \mathbf{w}^T \mathbf{x}.
\]
Here, $p(y=1 | \mathbf{x})$ is the conditional probability that a particular sample belongs to class 1 given its features $\mathbf{x}$. Now, what we are actually interested in is predicting the probability that a certain sample belongs to a particular class, which is the inverse form of the logit function. It is also called the logistic function, sometimes simply abbreviated to sigmoid function due to its characteristic S-shape:
\[
\phi(z) = \frac{1}{1+e^{-z}}.
\]
The output of the sigmoid function is then interpreted as the probability of a particular sample belonging to class 1
\[
\phi(z) = P(y=1 | \mathbf{x}; \mathbf{w})
\]
given its features $\mathbf{x}$ parameterized by the weights $\mathbf{w}$. For example, if we compute $\phi(z) = 0.8$ for a particular flower sample, it means that the chance that this sample is an Iris-Versicolor flower is 80 percent. Similarly, the probability that this flower is an Iris-Setosa flower can be calculated as $P(y=0 | \mathbf{x};\mathbf{w})=1 - P (y=1 | \mathbf{x}; \mathbf{w}) = 0.2$ or 20 percent. The predicted probability can then simply be converted into a binary outcome via a quantizer (unit step function):
\[ \hat{y}= \begin{cases}
1 & \text{ if } \phi(z) \ge 0.5 \\
0 & \text{ otherwise }.
\end{cases}
\]
If we look at the preceding sigmoid plot, this is equivalent to the following:
\[ \hat{y}= \begin{cases}
1 & \text{ if } \phi(z) \ge 0.0 \\
0 & \text{ otherwise }.
\end{cases}
\]
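Both the sigmoid activation and the quantizer are one-liners in NumPy; the following sketch uses illustrative function names:
\begin{verbatim}
import numpy as np

def sigmoid(z):
    """Logistic sigmoid phi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def quantize(z):
    """Predict 1 if phi(z) >= 0.5, which is equivalent to z >= 0.0."""
    return np.where(z >= 0.0, 1, 0)
\end{verbatim}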
\subsection{Learning the weights of the logistic cost function}
In the previous chapter, we defined the sum-squared-error cost function:
\[
J(\mathbf{w}) = \frac{1}{2} \sum_i \bigg( \phi \big( z^{(i)} \big) - y^{(i)} \bigg)^2.
\]
We minimized this in order to learn the weights w for our Adaline classification model. To explain how we can derive the cost function for logistic regression, let's first define the likelihood L that we want to maximize when we build a logistic regression model, assuming that the individual samples in our dataset are independent of one another. The formula is as follows:
\[
L(\mathbf{w}) = P(\mathbf{y} | \mathbf{x}; \mathbf{w}) = \prod_{i=1}^{n} P \big( y^{(i)} | x^{(i)}; \mathbf{w} \big) = \prod_{i=1}^{n} \bigg( \phi \big(z^{(i)} \big) \bigg) ^ {y^{(i)}} \bigg( 1 - \phi \big( z^{(i)} \big) \bigg)^{1-y^{(i)}}
\]
In practice, it is easier to maximize the (natural) log of this equation, which is called
the log-likelihood function:
\[
l(\mathbf{w}) = \log L(\mathbf{w}) = \sum_{i=1}^{n} \Bigg[ y^{(i)} \log \bigg(\phi \big( z^{(i)} \big) \bigg) + \bigg(1 - y^{(i)} \bigg) \log \bigg( 1 - \phi \big( z^{(i)} \big) \bigg) \Bigg]
\]
Firstly, applying the log function reduces the potential for numerical underflow, which can occur if the likelihoods are very small. Secondly, we can convert the product of factors into a summation of factors, which makes it easier to obtain the derivative of this function via the addition trick, as you may remember
from calculus.
Now we could use an optimization algorithm such as gradient ascent to maximize this log-likelihood function. Alternatively, let's rewrite the log-likelihood as a cost function $J(\cdot)$ that can be minimized using gradient descent as in \textit{Chapter 2, Training Machine Learning Algorithms for Classification}:
\[
J(\mathbf{w}) = \sum_{i=1}^{n} \Bigg[- y^{(i)} \log \bigg(\phi \big( z^{(i)} \big) \bigg) - \bigg(1 - y^{(i)} \bigg) \log \bigg( 1 - \phi \big( z^{(i)} \big) \bigg) \Bigg]
\]
To get a better grasp on this cost function, let's take a look at the cost that we
calculate for one single-sample instance:
\[
J\big( \phi(z), y; \mathbf{w} \big) = -y \log \big( \phi(z) \big) - (1-y) \log \big(1 - \phi(z) \big).
\]
Looking at the preceding equation, we can see that the first term becomes zero if
$y = 0$, and the second term becomes zero if $y = 1$, respectively:
\[
J \big( \phi(z), y; \mathbf{w} \big)= \begin{cases}
- \log \big( \phi(z) \big) \text{ if } y=1\\
- \log \big( 1 - \phi(z) \big) \text{ if } y=0
\end{cases}
\]
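For the whole training set, this cost can be computed in vectorized form. A minimal sketch, assuming \texttt{output} holds $\phi(z^{(i)})$ for all samples and \texttt{y} holds the corresponding 0/1 class labels:
\begin{verbatim}
import numpy as np

def logistic_cost(output, y):
    """J(w) = sum_i [ -y_i*log(phi(z_i)) - (1 - y_i)*log(1 - phi(z_i)) ]."""
    return -y.dot(np.log(output)) - (1 - y).dot(np.log(1 - output))
\end{verbatim}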
\subsection{Training a logistic regression model with scikit-learn}
If we were to implement logistic regression ourselves, we could simply substitute the cost function $J(\cdot)$ in our Adaline implementation from \textit{Chapter 2, Training Machine Learning Algorithms for Classification}, by the new cost function:
\[
J(\mathbf{w}) = \sum_{i=1}^{n} \Bigg[- y^{(i)} \log \bigg(\phi \big( z^{(i)} \big) \bigg) - \bigg(1 - y^{(i)} \bigg) \log \bigg( 1 - \phi \big( z^{(i)} \big) \bigg) \Bigg]
\]
We can show that the weight update in logistic regression via gradient descent is indeed equal to the equation that we used in Adaline in \textit{Chapter 2, Training Machine Learning Algorithms for Classification}. Let's start by calculating the partial derivative of the log-likelihood function with respect to the $j$th weight:
\[
\frac{\partial}{\partial w_j} l(\mathbf{w}) = \Bigg( y \frac{1}{\phi(z)} - (1-y) \frac{1}{1-\phi(z)} \Bigg) \frac{\partial}{\partial w_j} \phi(z)
\]
Before we continue, let's calculate the partial derivative of the sigmoid function first:
\[
\frac{\partial}{\partial z} \phi(z) = \frac{\partial}{\partial z} \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{\big( 1 + e^{-z}\big)^2} = \frac{1}{1 + e^{-z}} \bigg( 1 - \frac{1}{1 + e^{-z}} \bigg)
\]
\[
= \phi(z)(1-\phi(z)).
\]
Now we can resubstitute $\frac{\partial}{\partial z} \phi(z) = \phi(z)(1-\phi(z))$ in our first equation to obtain the following:
\begin{equation*}
\begin{split}
& \Bigg( y \frac{1}{\phi(z)} - (1-y) \frac{1}{1-\phi(z)} \Bigg) \frac{\partial}{\partial w_j} \phi(z) \\
& = \Bigg( y \frac{1}{\phi(z)} - (1-y) \frac{1}{1-\phi(z)} \Bigg) \phi(z) \big(1 - \phi(z)\big) \frac{\partial}{\partial w_j} z \\
& = \bigg( y \big( 1 - \phi(z) \big) - (1-y) \phi(z) \bigg) x_j \\
& = \big( y - \phi(z) \big) x_j
\end{split}
\end{equation*}
Remember that the goal is to find the weights that maximize the log-likelihood so that we would perform the update for each weight as follows:
\[
w_j := w_j + \eta \sum_{i=1}^{n} \bigg( y^{(i)} - \phi(z^{(i)}) \bigg) x_{j}^{(i)}
\]
Since we update all weights simultaneously, we can write the general update rule as follows:
\[
\mathbf{w} := \mathbf{w} + \Delta \mathbf{w}
\]
We define $\Delta \mathbf{w}$ as follows:
\[
\Delta \mathbf{w} = \eta \nabla l (\mathbf{w})
\]
Since maximizing the log-likelihood is equal to minimizing the cost function $J(\cdot)$ that we defined earlier, we can write the gradient descent update rule as follows:
\[
\Delta w_j = - \eta \frac{\partial J}{\partial w_j} = \eta \sum_{i=1}^{n} \bigg( y^{(i)} - \phi(z^{(i)}) \bigg)x_{j}^{(i)}
\]
\[
\mathbf{w} := \mathbf{w} + \Delta \mathbf{w}, \; \Delta \mathbf{w} = - \eta \nabla J(\mathbf{w})
\]
This is equal to the gradient descent rule in Adaline in \textit{Chapter 2, Training Machine Learning Algorithms for Classification}.
\subsection{Tackling overfitting via regularization}
The most common form of regularization is the so-called L2 regularization (sometimes also called L2 shrinkage or weight decay), which can be written as follows:
\[
\frac{\lambda}{2} \lVert \mathbf{w} \rVert^2 = \frac{\lambda}{2} \sum_{j=1}^m w_{j}^{2}
\]
Here, $\lambda$ is the so-called regularization parameter.
In order to apply regularization, we just need to add the regularization term to the cost function that we defined for logistic regression to shrink the weights:
\[
J(\mathbf{w}) = - \sum_{i=1}^{n} \bigg[ y^{(i)} \log \big( \phi(z^{(i)}) \big) + \big( 1 - y ^{(i)} \big) \log \big( 1 - \phi(z^{(i)}) \big) \bigg] + \frac{\lambda}{2} \lVert \mathbf{w}\rVert^2
\]
Then, we have the following regularized weight updates for weight $w_j$:
\[
\Delta w_j = - \eta \frac{\partial J}{\partial w_j} = \eta \sum_{i=1}^{n} \bigg( y^{(i)} - \phi(z^{(i)}) \bigg)x_{j}^{(i)} - \eta \lambda w_j,
\] for $j \in \{1, 2, ..., m \}$ (i.e., $j \neq 0 $) since we don't regularize the bias unit $w_0$. \\
Via the regularization parameter $\lambda$, we can then control how well we fit the training data while keeping the weights small. By increasing the value of $\lambda$, we increase the regularization strength.
The parameter \textit{C} that is implemented for the \textit{LogisticRegression} class in scikit-learn comes from a convention in support vector machines, which will be the topic of the next section. \textit{C} is directly related to the regularization parameter $\lambda$ , which is its inverse:
\[
C = \frac{1}{\lambda}
\]
So, we can rewrite the regularized cost function of logistic regression as follows:
\[
J(\mathbf{w}) = C \Bigg[ \sum_{i=1}^{n} \Big( -y^{(i)} \log \big( \phi(z^{(i)}) \big) - \big( 1 - y^{(i)} \big) \log \big( 1 - \phi(z^{(i)}) \big) \Big) \Bigg] + \frac{1}{2} \lVert \mathbf{w} \rVert^2
\]
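Adding the L2 penalty therefore amounts to a single extra term in the cost (and, analogously, an extra $- \eta \lambda w_j$ term in each weight update). A hedged sketch, excluding the bias weight from the penalty as discussed above:
\begin{verbatim}
import numpy as np

def regularized_logistic_cost(output, y, w, lam=1.0):
    """Logistic cost plus (lambda/2) * ||w||^2, with the bias w[0] excluded."""
    penalty = (lam / 2.0) * np.sum(w[1:] ** 2)
    return -y.dot(np.log(output)) - (1 - y).dot(np.log(1 - output)) + penalty
\end{verbatim}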
\section{Maximum margin classification with support vector machines}
\subsection{Maximum margin intuition}
To get an intuition for the margin maximization, let's take a closer look at those \textit{positive} and \textit{negative} hyperplanes that are parallel to the decision boundary, which can be expressed as follows:
\[
w_0 + \mathbf{w}^T \mathbf{x}_{pos} = 1 \quad (1)
\]
\[
w_0 + \mathbf{w}^T \mathbf{x}_{neg} = -1 \quad (2)
\]
If we subtract those two linear equations (1) and (2) from each other, we get:
\[
\Rightarrow \mathbf{w}^T \big( \mathbf{x}_{pos} - \mathbf{x}_{neg} \big) = 2
\]
We can normalize this by the length of the vector $\mathbf{w}$, which is defined as follows:
\[
\lVert \mathbf{w} \rVert = \sqrt{\sum_{j=1}^{m} w_{j}^{2}}
\]
So we arrive at the following equation:
\[
\frac{\mathbf{w}^T ( \mathbf{x}_{pos} - \mathbf{x}_{neg} )}{\lVert \mathbf{w} \rVert} = \frac{2}{\lVert \mathbf{w} \rVert}
\]
The left side of the preceding equation can then be interpreted as the distance between the positive and negative hyperplane, which is the so-called margin that we want to maximize.
Now the objective function of the SVM becomes the maximization of this margin by maximizing $\frac{2}{\lVert \mathbf{w} \rVert}$ under the constraint that the samples are classified correctly, which can be written as follows:
\[
w_0 + \mathbf{w}^T \mathbf{x}^{(i)} \ge 1 \text{ if } y^{(i)} = 1
\]
\[
w_0 + \mathbf{w}^T \mathbf{x}^{(i)} \le -1 \text{ if } y^{(i)} = -1
\]
These two equations basically say that all negative samples should fall on one side of the negative hyperplane, whereas all the positive samples should fall behind the positive hyperplane. This can also be written more compactly as follows:
\[
y^{(i)} \big( w_0 + \mathbf{w}^T \mathbf{x}^{(i)} \big) \ge 1 \quad \forall_i
\]
In practice, though, it is easier to minimize the reciprocal term $\frac{1}{2} \lVert \mathbf{w} \rVert^2$, which can be solved by quadratic programming.
\subsection{Dealing with the nonlinearly separable case using slack variables}
The motivation for introducing the slack variable $\xi$ was that the linear constraints need to be relaxed for nonlinearly separable data to allow convergence of the optimization in the presence of misclassifications, under appropriate cost penalization. The positive-valued slack variable is simply added to the linear constraints:
\[
\mathbf{w}^T \mathbf{x}^{(i)} \ge 1 - \xi^{(i)} \text{ if } y^{(i)} = 1
\]
\[
\mathbf{w}^T \mathbf{x}^{(i)} \le -1 + \xi^{(i)} \text{ if } y^{(i)} = -1
\]
So the new objective to be minimized (subject to the preceding constraints) becomes:
\[
\frac{1}{2} \lVert \mathbf{w} \rVert^2 + C \Big(\sum_i \xi^{(i)} \Big)
\]
\subsection{Alternative implementations in scikit-learn}
\section{Solving nonlinear problems using a kernel SVM}
As shown in the next figure, we can transform a two-dimensional dataset onto a new three-dimensional feature space where the classes become separable via the following projection:
\[
\phi(x_1, x_2) = (z_1, z_2, z_3) = (x_1, x_2, x_{1}^{2} + x_{2}^{2})
\]
\subsection{Using the kernel trick to find separating hyperplanes in higher dimensional space}
To solve a nonlinear problem using an SVM, we transform the training data onto a higher dimensional feature space via a mapping function $\phi(\cdot)$ and train a linear SVM model to classify the data in this new feature space. Then we can use the same mapping function $\phi(\cdot)$ to transform new, unseen data to classify it using the linear SVM model.
However, one problem with this mapping approach is that the construction of the new features is computationally very expensive, especially if we are dealing with high-dimensional data. This is where the so-called kernel trick comes into play. Although we didn't go into much detail about how to solve the quadratic programming task to train an SVM, in practice all we need is to replace the dot product
\[
\mathbf{x}^{(i) \; T} \mathbf{x}^{(j)} \text{ by } \phi \big( \mathbf{x}^{(i)} \big)^T \phi \big( \mathbf{x}^{(j)} \big)
\]
In order to save the expensive step of calculating this dot product between two points explicitly, we define a so-called kernel function:
\[
k \big( \mathbf{x}^{(i)}, \mathbf{x}^{(j)} \big) = \phi \big( \mathbf{x}^{(i)} \big)^T \phi \big( \mathbf{x}^{(j)} \big)
\]
One of the most widely used kernels is the \textit{Radial Basis Function kernel} (RBF kernel) or Gaussian kernel:
\[
k \big( \mathbf{x}^{(i)}, \mathbf{x}^{(j)} \big) = \exp \Bigg( - \frac{ \lVert \mathbf{x}^{(i)} - \mathbf{x}^{(j)} \rVert^2 }{2 \sigma^2} \Bigg)
\]
This is often simplified to:
\[
k \big( \mathbf{x}^{(i)}, \mathbf{x}^{(j)} \big) = \exp \bigg( -\gamma\ \lVert \mathbf{x}^{(i)} - \mathbf{x}^{(j)} \rVert^2 \bigg)
\]
Here, $\gamma = \frac{1}{2 \sigma^2}$ is a free parameter that is to be optimized.
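Evaluated for a single pair of samples, the RBF kernel is a one-liner; a minimal NumPy sketch:
\begin{verbatim}
import numpy as np

def rbf_kernel(x_i, x_j, gamma=0.1):
    """k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))
\end{verbatim}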
\section{Decision tree learning}
In order to split the nodes at the most informative features, we need to define an objective function that we want to optimize via the tree learning algorithm. Here, our objective function is to maximize the information gain at each split, which we define as follows:
\[
IG(D_p, f) = I(D_p) - \sum_{j=1}^{m} \frac{N_j}{N_p} I(D_j)
\]
Here, $f$ is the feature on which to perform the split; $D_p$ and $D_j$ are the datasets of the parent node $p$ and the $j$th child node; $I$ is our impurity measure; $N_p$ is the total number of samples at the parent node; and $N_j$ is the number of samples at the $j$th child node. As we can see, the information gain is simply the difference between the impurity of the parent node and the sum of the child node impurities: the lower the impurity of the child nodes, the larger the information gain. However, for simplicity and to reduce the combinatorial search space, most libraries (including scikit-learn) implement binary decision trees. This means that each parent node is split into two child nodes, $D_{left}$ and $D_{right}$:
\[
IG(D_p, f) = I(D_p) - \frac{N_{left}}{N_p} I(D_{left}) - \frac{N_{right}}{N_p} I (D_{right})
\]
Now, the three impurity measures or splitting criteria that are commonly used in
binary decision trees are \textit{Gini impurity} ($I_G$), \textit{Entropy} ($I_H$) and the \textit{classification error} ($I_E$). Let's start with the definition of Entropy for all non-empty classes $p(i | t) \neq 0$:
\[
I_H(t) = - \sum_{i=1}^{c} p(i | t) \log_2 p(i|t)
\]
Here, $p(i | t)$ is the proportion of the samples that belong to class $i$ for a particular node $t$. The entropy is therefore 0 if all samples at a node belong to the same class, and the entropy is maximal if we have a uniform class distribution. For example, in a binary class setting, the entropy is 0 if $p(i=1 | t) = 1$ or $p(i=0| t)=1$. If the classes are distributed uniformly with $p(i=1|t)=0.5$ and $p(i=0|t)=0.5$, the entropy is 1. Therefore, we can say that the entropy criterion attempts to maximize the mutual information in the tree.
Intuitively, the Gini impurity can be understood as a criterion to minimize the probability of misclassification:
\[
I_G(t) = \sum_{i=1}^{c} p(i | t) (1 - p (i | t)) = 1 - \sum_{i=1}^{c} p(i|t)^2
\]
Similar to entropy, the Gini impurity is maximal if the classes are perfectly mixed, for example, in a binary class setting ($c = 2 $):
\[
I_G(t) = 1 - \sum_{i=1}^{c} 0.5^2 = 0.5.
\]
...
Another impurity measure is the classification error:
\[
I_E(t) = 1 - \max \{ p(i|t) \}
\]
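For a binary node, all three impurity measures can be written as simple functions of the class-1 proportion $p = p(i=1|t)$, which is convenient for comparing them. A small sketch with illustrative function names:
\begin{verbatim}
import numpy as np

def gini(p):
    """Gini impurity of a binary node with class-1 proportion p."""
    return p * (1 - p) + (1 - p) * (1 - (1 - p))

def entropy(p):
    """Entropy (base 2) of a binary node; undefined at p = 0 or p = 1."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def classification_error(p):
    """Misclassification error of a binary node."""
    return 1 - max(p, 1 - p)
\end{verbatim}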
\subsection{Maximizing information gain -- getting the most bang for the buck}
\subsection{Building a decision tree}
\subsection{Combining weak to strong learners via random forests}
\section{K-nearest neighbors -- a lazy learning algorithm}
The \textit{minkowski} distance that we used in the previous code example is just a generalization of the Euclidean and Manhattan distances that can be written as follows:
\[
d \big(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\big) = \sqrt[p]{\sum_k \big| x_{k}^{(i)} - x_{k}^{(j)} \big|^p }
\]
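The Minkowski distance reduces to the Manhattan distance for $p=1$ and to the Euclidean distance for $p=2$; a minimal NumPy sketch:
\begin{verbatim}
import numpy as np

def minkowski_distance(x_i, x_j, p=2):
    """d(x_i, x_j) = (sum_k |x_ik - x_jk|^p)^(1/p)."""
    return np.sum(np.abs(x_i - x_j) ** p) ** (1.0 / p)
\end{verbatim}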
\section{Summary}
%%%%%%%%%%%%%%%
% CHAPTER 4
%%%%%%%%%%%%%%%
\chapter{Building Good Training Sets -- Data Pre-Processing}
\section{Dealing with missing data}
\subsection{Eliminating samples or features with missing values}
\subsection{Imputing missing values}
\subsection{Understanding the scikit-learn estimator API}
\section{Handling categorical data}
\subsection{Mapping ordinal features}
\subsection{Encoding class labels}
\subsection{Performing one-hot encoding on nominal features}
\section{Partitioning a dataset in training and test sets}
\section{Bringing features onto the same scale}
Now, there are two common approaches to bringing different features onto the same
scale: \textit{normalization} and \textit{standardization}. Those terms are often used quite loosely
in different fields, and the meaning has to be derived from the context. Most often,
\textit{normalization} refers to the rescaling of the features to a range of [0, 1], which is a
special case of min-max scaling. To normalize our data, we can simply apply the
min-max scaling to each feature column, where the new value $x_{norm}^{(i)}$ of a sample $x^{(i)}$ can be computed as follows:
\[
x_{norm}^{(i)} = \frac{x^{(i)} - x_{min}}{x_{max} - x_{min}}
\]
Here, $x^{(i)}$ is a particular sample, $x_{min}$ is the smallest value in a feature column, and $x_{max}$ the largest value, respectively.
[...] Furthermore, standardization maintains useful information about outliers and makes the algorithm less sensitive to them in contrast to min-max scaling, which scales
the data to a limited range of values.
The procedure of standardization can be expressed by the following equation:
\[
x_{std}^{(i)} = \frac{x^{(i)} - \mu_{x}}{\sigma_{x}}
\]
Here, $\mu_{x}$ is the sample mean of a particular feature column and $\sigma_{x}$ the corresponding standard deviation, respectively.
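Both rescaling schemes are one-liners on a NumPy feature column; a small sketch on a toy column (the values are illustrative only):
\begin{verbatim}
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])   # toy feature column

x_norm = (x - x.min()) / (x.max() - x.min())   # min-max scaling to [0, 1]
x_std = (x - x.mean()) / x.std()               # standardization: zero mean, unit variance
\end{verbatim}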
\section{Selecting meaningful features}
\subsection{Sparse solutions with L1 regularization}
We recall from \textit{Chapter 3, A Tour of Machine Learning Classfiers Using Scikit-learn}, that L2 regularization is one approach to reduce the complexity of a model by penalizing large individual weights, where we defined the L2 norm of our weight vector w as follows:
\[
L2: \lVert \mathbf{w} \rVert^{2}_{2} = \sum_{j=1}^{m} w^{2}_{j}
\]
Another approach to reduce the model complexity is the related \textit{L1 regularization}:
\[
L1: \lVert \mathbf{w} \rVert_{1} = \sum_{j=1}^{m} |w_j|
\]
\subsection{Sequential feature selection algorithms}
Based on the preceding definition of SBS, we can outline the algorithm in 4 simple steps:
\begin{enumerate}
\item Initialize the algorithm with $k=d$, where $d$ is the dimensionality of the full feature space $\mathbf{X}_d$
\item Determine the feature $x^{-}$ that maximizes the criterion $x^{-} = \text{arg max } J(\mathbf{X}_k - x)$, where $x \in \mathbf{X}_k$.
\item Remove the feature $x^-$ from the feature set: $\mathbf{X}_{k-1} := \mathbf{X}_k - x^{-}; \quad k:= k-1$.
\item Terminate if $k$ equals the number of desired features; if not, go to step 2.
\end{enumerate}
\section{Assessing feature importance with random forests}
\section{Summary}
%%%%%%%%%%%%%%%
% CHAPTER 5
%%%%%%%%%%%%%%%
\chapter{Compressing Data via Dimensionality Reduction}
\section{Unsupervised dimensionality reduction via principal component analysis}
When we use PCA for dimensionality reduction, we construct a $d \times k$-dimensional transformation matrix $\mathbf{W}$ that allows us to map a sample vector $\mathbf{x}$ onto a new $k$-dimensional feature subspace that has fewer dimensions than the original $d$-dimensional feature space:
\[
\mathbf{x} = [ x_1, x_2, \dots, x_d], \quad \mathbf{x} \in \mathbb{R}^d
\]
\[
\downarrow \mathbf{x W}, \quad \mathbf{W} \in \mathbb{R}^{d \times k}
\]
\[
\mathbf{z} = [z_1, z_2, \dots, z_k], \quad \mathbf{z} \in \mathbb{R}^k
\]
As a result of transforming the original $d$-dimensional data onto this new
$k$-dimensional subspace (typically $k \ll d$), the first principal component will have
the largest possible variance, and all subsequent principal components will have the largest possible variance given that they are uncorrelated (orthogonal) to the other principal components. Note that the PCA directions are highly sensitive to data scaling, and we need to standardize the features prior to PCA if the features were measured on different scales and we want to assign equal importance to all features.
Before looking at the PCA algorithm for dimensionality reduction in more detail, let's summarize the approach in a few simple steps:
\begin{enumerate}
\item Standardize the $d$-dimensional dataset.
\item Construct the covariance matrix.
\item Decompose the covariance matrix into its eigenvectors and eigenvalues.
\item Select $k$ eigenvectors that correspond to the $k$ largest eigenvalues, where $k$ is the dimensionality of the new feature subspace $(k \le d)$.
\item Construct a projection matrix $\mathbf{W}$ from the "top" $k$ eigenvectors.
\item Transform the $d$-dimensional input dataset $\mathbf{X}$ using the projection matrix $\mathbf{W}$ to obtain the new $k$-dimensional feature subspace.
\end{enumerate}
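The steps above can be sketched compactly with NumPy. The following is a hedged sketch, not the chapter's full example: it uses a random stand-in for a standardized data matrix and keeps $k=2$ components (all variable names are illustrative):
\begin{verbatim}
import numpy as np

X_std = np.random.RandomState(0).randn(100, 4)    # stand-in for a standardized dataset

cov_mat = np.cov(X_std.T)                         # d x d covariance matrix
eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)  # eigendecomposition (symmetric matrix)

order = np.argsort(eigen_vals)[::-1]              # eigenvalue indices, largest first
W = eigen_vecs[:, order[:2]]                      # projection matrix from top k=2 eigenvectors

Z = X_std.dot(W)                                  # samples in the new k-dimensional subspace
\end{verbatim}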
\subsection{Total and explained variance}
After completing the mandatory preprocessing steps by executing the preceding code, let's advance to the second step: constructing the covariance matrix. The symmetric $d \times d$-dimensional covariance matrix, where $d$ is the number of dimensions in the dataset, stores the pairwise covariances between the different features. For example, the covariance between two features $\mathbf{x}_j$ and $\mathbf{x}_k$ on the population level can be calculated via the following equation:
\[
\sigma_{jk} = \frac{1}{n} \sum_{i=1}^{n} \big( x_{j}^{(i)} - \mu_j \big) \big( x_{k}^{(i)} - \mu_k \big)
\]
Here, $\mu_j$ and $\mu_k$ are the sample means of feature $j$ and $k$, respectively. [...] For example, a covariance matrix of three features can then be written as
\[
\Sigma = \begin{bmatrix}
\sigma_{1}^2 & \sigma_{12} & \sigma_{13} \\
\sigma_{21} & \sigma_{2}^{2} & \sigma_{23} \\
\sigma_{31} & \sigma_{32} & \sigma_{3}^{2}
\end{bmatrix}
\]
[...] an eigenvector $\mathbf{v}$ satisfies the following condition:
\[
\Sigma \mathbf{v} = \lambda \mathbf{v}
\]
Here, $\lambda$ is a scalar: the eigenvalue.
...
The variance explained ratio of an eigenvalue $\lambda_j$ is simply the fraction of an eigenvalue $\lambda_j$ and the total sum of the eigenvalues:
\[
\frac{\lambda_j}{\sum_{j=1}^{d} \lambda_j}
\]
\subsection{Feature transformation}
Using the projection matrix, we can now transform a sample $\mathbf{x}$ onto the PCA subspace obtaining $\mathbf{x}'$, a now two-dimensional sample vector consisting of two new features:
\[
\mathbf{x}' = \mathbf{xW}
\]
\subsection{Principal component analysis in scikit-learn}
\section{Supervised data compression via linear discriminant analysis}
Before we take a look into the inner workings of LDA in the following subsections, let's summarize the key steps of the LDA approach:
\begin{enumerate}
\item Standardize the $d$-dimensional dataset ($d$ is the number of features).
\item For each class, compute the $d$-dimensional mean vector.
\item Construct the between-class scatter matrix $\mathbf{S}_B$ and the within-class scatter matrix $\mathbf{S}_W$.
\item Compute the eigenvectors and corresponding eigenvalues of the matrix $\mathbf{S}_{W}^{-1} \mathbf{S}_B$.
\item Choose the $k$ eigenvectors that correspond to the $k$ largest eigenvalues to construct a $d \times k$-dimensional transformation matrix $\mathbf{W}$; the eigenvectors are the columns of this matrix.
\item Project the samples onto the new feature subspace using the transformation matrix $\mathbf{W}$.
\end{enumerate}
\subsection{Computing the scatter matrices}
Each mean vector $\mathbf{m}_i$ stores the mean feature value $\mu_m$ with respect to the samples of class $i$:
\[
\mathbf{m}_i = \frac{1}{n_i} \sum_{\mathbf{x} \in D_i} \mathbf{x}_m
\]
This results in three mean vectors:
\[
\mathbf{m}_i = \begin{bmatrix}
\mu_{i, \text{alcohol}} \\
\mu_{i, \text{malic-acid}} \\
\mu_{i, \text{proline}}
\end{bmatrix}^T, i \in \{ 1, 2, 3 \}
\]
Using the mean vectors, we can now compute the within-class scatter matrix $\mathbf{S}_W$
\[
\mathbf{S}_W = \sum^{c}_{i=1} \mathbf{S}_i
\]
This is calculated by summing up the individual scatter matrices $S_i$ of each
individual class $i$:
\[
\mathbf{S}_i = \sum_{\mathbf{x} \in D_i} (\mathbf{x} - \mathbf{m}_i) (\mathbf{x} - \mathbf{m}_i)^T
\]
The assumption that we are making when we are computing the scatter matrices is that the class labels in the training set are uniformly distributed. [...] Thus, we want to scale the individual scatter matrices $\mathbf{S}_i$ before we sum them up as the scatter matrix $\mathbf{S}_W$. When we divide the scatter matrices by the number of class samples $N_i$, we can see that computing the scatter matrix is in fact the same as computing the covariance matrix $\Sigma_i$; the covariance matrix is a normalized version of the scatter matrix:
\[
\Sigma_i = \frac{1}{N_i} \mathbf{S}_i = \frac{1}{N_i} \sum_{\mathbf{x} \in D_i} (\mathbf{x} - \mathbf{m}_i) (\mathbf{x} - \mathbf{m}_i)^T
\]
After we have computed the scaled within-class scatter matrix (or covariance matrix), we can move on to the next step and compute the between-class scatter matrix $\mathbf{S}_B$
\[
\mathbf{S}_B = \sum^{c}_{i=1} N_i (\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T
\]
Here, $\mathbf{m}$ is the overall mean that is computed, including samples from all classes.
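Computing the scaled within-class scatter matrix and the between-class scatter matrix follows the equations above almost literally. A minimal sketch, assuming a feature matrix \texttt{X} of shape $(n, d)$ and an integer class-label vector \texttt{y} (names are illustrative):
\begin{verbatim}
import numpy as np

def scatter_matrices(X, y):
    """Return the scaled within-class scatter S_W and between-class scatter S_B."""
    d = X.shape[1]
    mean_overall = np.mean(X, axis=0)
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for label in np.unique(y):
        X_c = X[y == label]                        # samples of class i
        mean_c = np.mean(X_c, axis=0)              # mean vector m_i
        S_W += np.cov(X_c.T, bias=True)            # (1/N_i) sum (x - m_i)(x - m_i)^T
        diff = (mean_c - mean_overall).reshape(d, 1)
        S_B += X_c.shape[0] * diff.dot(diff.T)     # N_i (m_i - m)(m_i - m)^T
    return S_W, S_B
\end{verbatim}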
\subsection{Selecting linear discriminants for the new feature subspace}
\subsection{Projecting samples onto the new feature space}
\[
\mathbf{X'} = \mathbf{XW}