---
title: "ECG Signal Classification with the k-NN Algorithm"
format:
  html:
    html-math-method: katex
  pdf: default
  # pdf:
  #   include-in-header:
  #     text: \providecommand{\grtlessH}{\underset{{H_0}}{\overset{H_{1}}{\gtrless}}}
  ipynb: default
toc: true
number-depth: 2
number-sections: true
jupyter: python3
execute:
  freeze: true
  eval: false
---
# Objective
Implement and use the k-NN algorithm for classification of various signals.
# Theoretical aspects
## ML decision with multiple samples in Gaussian noise
In a detection problem with Gaussian noise, we have seen in the lectures that the
decision with the Maximum Likelihood criterion comes down to **choosing the smallest distance**:
$$
\providecommand{\grtlessH}{\underset{{H_0}}{\overset{H_{1}}{\gtrless}}}
| r(t_0) - s_0(t_0) | \grtlessH | r(t_0) - s_1(t_0) |
$$
What happens when we have **multiple samples**?
1. We have vectors of samples: $\mathbf{r}, \mathbf{s_0}, \mathbf{s_1}$ (bold font denotes a vector)
2. The distance between any two vectors $\mathbf{a}$ and $\mathbf{b}$ is the **Euclidean distance**:
$$
d(\mathbf{a}, \mathbf{b}) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \dots + (a_n - b_n)^2}
$$
3. We use the same distance-based rule:
$$
\providecommand{\grtlessH}{\underset{{H_0}}{\overset{H_{1}}{\gtrless}}}
d(\mathbf{r}, \mathbf{s_0}) \grtlessH d(\mathbf{r}, \mathbf{s_1})
$$
The smallest distance wins.
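As a minimal illustration with made-up vectors, the rule can be written directly in Matlab:

```matlab
% Minimum-distance (ML) decision between two known signals (made-up data).
s0 = [ 1  1  1  1];            % hypothetical signal under H0
s1 = [ 1 -1  1 -1];            % hypothetical signal under H1
r  = s1 + 0.3*randn(1, 4);     % received vector: s1 plus Gaussian noise

d0 = norm(r - s0);             % Euclidean distance to s0
d1 = norm(r - s1);             % Euclidean distance to s1

if d1 < d0
    disp('Decide H1')          % the smallest distance wins
else
    disp('Decide H0')
end
```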
## The k-NN algorithm
Suppose we have a set of **training signals** whose classes are known beforehand.
For example:
- 100 signals of class A (e.g. ECG heart signals from healthy persons)
- 100 signals of class B (e.g. ECG heart signals from ill persons)
- maybe more classes
We have a new signal $X$, and we need to decide to which class it belongs (class A, class B, etc.).
We can use **the k-NN algorithm**:
1. Compute the distances from $X$ to all the signals in the training set.
2. Find the **$k$ closest neighbors** and take the class of the majority of them
(majority voting). Decide that this is the class of $X$.
A visual illustration[^1] is below:
![Illustration of the k-NN algorithm](img/kNN.png)
[^1]: image from ["KNN Classification using Scikit-learn", Avinash Navlani](https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn)
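In code, the two steps translate almost directly. Below is a minimal sketch, assuming the training signals are the rows of a matrix `trainset` with numeric class labels in a vector `labels` (both names are illustrative, not from the lab files):

```matlab
% k-NN by hand: distances to all training signals, then majority vote.
k = 5;
dists = sqrt(sum((trainset - x).^2, 2));  % Euclidean distance from row vector x
                                          % to every row (implicit expansion)
[~, idx] = sort(dists);                   % order training signals by distance
nearest  = labels(idx(1:k));              % classes of the k closest signals
class    = mode(nearest);                 % majority vote (numeric labels)
```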
## Datasets organization
Usually, we have at our disposal a large collection of signals whose classes are known.
The data is randomly split into:
- a **training set**: this data is used for the majority voting
- a **test set**: used only for **evaluation** of the algorithm performance.
This data should never be used for training (the algorithm should never
have seen this data before the testing).
- (optional) a **cross-validation set**: a subset of the training set,
used to determine which values of $k$ work best
The datasets are obtained by randomly splitting all the signals available at the beginning.
A common split is around $70\%$ for the training set, $15\%$ for the cross-validation set,
and $15\%$ for the test set.
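For illustration, such a random split can be obtained by permuting the signal indices. A minimal sketch, assuming the signals are stored one per row in a matrix `data` (an illustrative name):

```matlab
% Random 70/15/15 split of the available signals (sketch).
N   = size(data, 1);             % data: one signal per row (illustrative)
p   = randperm(N);               % random permutation of the indices
nTr = round(0.70 * N);           % 70% for training
nCv = round(0.15 * N);           % 15% for cross-validation
train_idx = p(1 : nTr);
cv_idx    = p(nTr+1 : nTr+nCv);
test_idx  = p(nTr+nCv+1 : end);  % remaining ~15% for testing
```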
# Practical setup
## k-NN algorithm in Matlab
Matlab has several implementations of the k-NN algorithm available.
For our purposes we can use the function `knnsearch()`.
The call below finds the 10 nearest neighbors in `X` for each vector in `Y`;
it returns the indices of the nearest neighbors in `Idx` and the distances in `D`.
Both `X` and `Y` contain the vectors as rows.
```matlab
[Idx, D] = knnsearch(X, Y, 'K', 10); % find the 10 nearest neighbors for each vector in Y
```
One can also use the `fitcknn()` function, which is tailored for classification.
`fitcknn()` creates the model, and `predict()` outputs the predicted class for one
or more test vectors.
```matlab
Mdl = fitcknn(X, Y, 'NumNeighbors', 3);  % create a k-NN model with 3 neighbors
label = predict(Mdl, test_data);         % predicted class of each test vector
```
- `X` = training data, each row is a vector
- `Y` = classes of the training data, each value is the class of the corresponding row of `X`
- `test_data` = test data, each row is a vector whose class we predict
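As a quick self-contained illustration (toy 2-D data, not the lab signals), the whole train-and-predict pipeline looks like this:

```matlab
% Toy example: two 2-D classes, classify one test point with 3-NN.
X = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];   % training vectors (one per row)
Y = [1; 1; 1; 2; 2; 2];               % class label of each row of X
Mdl   = fitcknn(X, Y, 'NumNeighbors', 3);
label = predict(Mdl, [5.5 5.5])       % expected output: 2
```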
## Data for this laboratory
In this laboratory we will use ECG signal data from the [MIT-BIH Arrhythmia database](https://physionet.org/content/mitdb/1.0.0/).
![Sample ECG signal from the MIT-BIH Arrhythmia database](img/MITBIH.png)
(image from https://archive.physionet.org/physiobank/database/mitdb/)
The excerpt provided for this lab contains electrocardiographic (ECG) signals from **4 classes**, with **120 signals per class**.
The 4 classes are:
- one class with ECG signals from healthy persons
- three classes with three different types of arrhythmia (irregular/abnormal heart beat)
The ECG signals provided here are preprocessed:
- each signal is segmented into **segments** corresponding to **one heart beat** each
- the signals are resized to a **fixed length** of 256 samples
- the signals are aligned so that the peak of the R wave is located at the center of the signal
- the DC (continuous) component of all signals has been removed
- the signals have been normalized to unit norm
The signals are randomly split into two sets:
- training set: `ECG_train.mat`, 400 signals = 4 classes $\times$ 100 signals each
- test set: `ECG_test.mat`, 80 signals = 4 classes $\times$ 20 signals each
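As a starting point for the exercises, the files can be inspected and loaded as below. Note that the variable names stored inside the `.mat` files are not specified in this text, so check them first; the field name `X` in the comment is a hypothetical example:

```matlab
% Inspect the variables stored in the data files before using them:
whos('-file', 'ECG_train.mat')
whos('-file', 'ECG_test.mat')

train = load('ECG_train.mat');   % struct holding the stored variables
test  = load('ECG_test.mat');

% Assuming a (hypothetical) field X with one 256-sample signal per row:
% figure; plot(train.X(1, :)); title('First training signal');
```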
# Exercises
1. Load the data files `ECG_train.mat` and `ECG_test.mat`. Explore the dataset:
- display 3 signals from each class contained in the training set. Try to figure out some visual differences.
    - display the first signal from the test dataset. Try to determine visually to which class it belongs.
2. Take the first signal from the test set and manually classify it with k-NN using $k = 5$:
    - compute the Euclidean distance to all the vectors in the training set
    - find the 5 smallest values and the classes of the 5 corresponding vectors
    - decide the class of the test signal (in words, not in code)
3. Predict the class of the same signal using the built-in functions `fitcknn()` and `predict()`.
4. Compute the confusion matrix of the classification (a sketch of the evaluation loop is given after this list).
    - Use the built-in functions `fitcknn()` and `predict()` to predict the class of all test signals.
    Use different values for $k$: $k=1$, then $k=5$, then $k=15$.
    - Compare the classification results against the ground truth and compute the **confusion matrix** $A$:
    $A_{ij}$ = how many signals of class $i$ are predicted to be in class $j$.
5. Repeat exercise 4, this time adding a variable amount of Gaussian noise to the test signals.
How does the performance change?
6. Repeat exercise 5, this time adding a DC component to the test signals.
How does the performance change?
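For exercises 4 to 6, a minimal sketch of the evaluation loop is shown below. The variable names `trainset`, `train_labels`, `testset`, and `test_labels` are illustrative assumptions, not names from the provided files:

```matlab
% Sketch: k-NN evaluation with a confusion matrix (illustrative names).
k   = 5;
Mdl = fitcknn(trainset, train_labels, 'NumNeighbors', k);

% Optional perturbations of the test signals (exercises 5 and 6):
sigma = 0.1;                                      % noise level to vary
Xtest = testset + sigma * randn(size(testset));   % add Gaussian noise
% Xtest = testset + 0.5;                          % or add a DC component

pred = predict(Mdl, Xtest);          % predicted class of each test signal
A = confusionmat(test_labels, pred)  % A(i,j): class i predicted as class j
```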
Old exercises:
2. Implement a function ``[class] = myKNN(signal, k, trainset)`` for performing k-NN classification of a signal:
- the function takes as input an unclassified signal `signal`, the parameter value `k`, and the training set matrix `trainset`
- the function computes the Euclidean distance between `signal` and each vector from the training set
- the output `class` is the class held by the majority of the $k$ nearest neighbours of the signal
# Final questions
1. What does the confusion matrix look like in the ideal case
(perfect classification)?
2. Is there a problem if the classes are imbalanced (different numbers of signals per class)?