- Create a new directory `Data` and place the CSV files containing the data of the two classes (one file per class) in it.
- Add the relevant column names to the list `features` in `binClassifier.py`.
- Assign the split values to `split1` and `split2` in `binClassifier.py`.
- On running `binClassifier.py`, the dataset is shuffled and sampled 100 times, and the mean, minimum and maximum accuracies are printed (class-wise and overall), as sketched below.
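For a rough picture of that evaluation loop, here is a minimal runnable sketch. The toy `data` array, the `run_trial` helper, and the exact roles of `split1`/`split2` are assumptions made for illustration, not the script's actual code:

```python
import numpy as np

# Toy stand-ins so the sketch runs: 200 random 2-feature points, split 150/50.
data = np.random.rand(200, 2)
split1, split2 = 150, 200

def run_trial(train, test):
    # Placeholder scorer; binClassifier.py would fit the Bayes classifier
    # on train, score it on test, and return the accuracy.
    return 0.5

accuracies = []
for trial in range(100):
    np.random.shuffle(data)                           # reshuffle before each sample
    train, test = data[:split1], data[split1:split2]  # assumed role of the split values
    accuracies.append(run_trial(train, test))

print("mean accuracy:", np.mean(accuracies))
print("min accuracy:", np.min(accuracies))
print("max accuracy:", np.max(accuracies))
```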
The probability density function used is the multivariate normal distribution. The likelihood p(x|wi) is given by

p(x|wi) = (1 / ((2π)^(d/2) |Σ|^(1/2))) ∗ exp(−(1/2) (x − μ)^t Σ^(−1) (x − μ))

where x is the d-dimensional feature vector, μ is the mean vector, Σ is the covariance matrix, |Σ| is the determinant of the covariance matrix, Σ^(−1) is the inverse of the covariance matrix, and (x − μ)^t is the transpose of the (x − μ) vector. For each class wi, the mean vector and covariance matrix are estimated from that class's data, which gives the class-conditional density p(x|wi) for i = 1, 2 (this being a binary classifier).
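A minimal NumPy sketch of this density follows; it is a plain translation of the formula above, not necessarily how `binClassifier.py` implements it:

```python
import numpy as np

def likelihood(x, mu, sigma):
    """Multivariate normal density p(x|wi) for a d-dimensional point x,
    given the class mean vector mu and covariance matrix sigma."""
    d = len(mu)
    diff = x - mu
    # Normalising constant: 1 / ((2*pi)^(d/2) * |sigma|^(1/2))
    norm = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(sigma))
    # Exponent: -(1/2) (x - mu)^t sigma^(-1) (x - mu)
    expo = -0.5 * diff @ np.linalg.inv(sigma) @ diff
    return norm * np.exp(expo)
```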
The a priori probabilities P(w1) and P(w2) are calculated as

P(wi) = (number of data points in wi) / (total number of data points)
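In code this is just the relative class counts (variable names are illustrative):

```python
def priors(n1, n2):
    """A priori probabilities P(w1), P(w2) from the two class sizes."""
    total = n1 + n2
    return n1 / total, n2 / total
```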
The evidence for each data point (in the test set) is calculated using the equation

p(x) = P(w1) ∗ p(x|w1) + P(w2) ∗ p(x|w2)

(this being a two-category case).
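Building on the `likelihood` sketch above, the evidence could be computed as follows, where `mu1`/`sigma1` and `mu2`/`sigma2` stand for the per-class estimates (again an illustrative sketch):

```python
def evidence(x, mu1, sigma1, mu2, sigma2, p_w1, p_w2):
    """p(x) = P(w1) * p(x|w1) + P(w2) * p(x|w2) for the two-category case."""
    return (p_w1 * likelihood(x, mu1, sigma1)
            + p_w2 * likelihood(x, mu2, sigma2))
```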
Using Bayes' rule, the posterior (conditional) probability is found for each of the two classes:

posterior probability = a priori probability ∗ likelihood / evidence

i.e., P(wi|x) = P(wi) ∗ p(x|wi) / p(x).
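Continuing the same sketch, Bayes' rule becomes a one-liner:

```python
def posterior(x, mu, sigma, p_w, p_x):
    """P(wi|x) = P(wi) * p(x|wi) / p(x)."""
    return p_w * likelihood(x, mu, sigma) / p_x
```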
With the posterior probabilities computed for each of the two classes, a prediction can be made based on the values of P(w1|x) and P(w2|x). Since this is a minimum-error-rate classifier, the discriminant function gi(x) is defined as P(wi|x): if P(w1|x) ≥ P(w2|x) (i.e., g1(x) ≥ g2(x)), the predicted class is w1, and w2 otherwise.
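Putting the sketches above together, the decision rule might read:

```python
def predict(x, mu1, sigma1, mu2, sigma2, p_w1, p_w2):
    """Minimum-error-rate decision: w1 if g1(x) >= g2(x), else w2."""
    p_x = evidence(x, mu1, sigma1, mu2, sigma2, p_w1, p_w2)
    g1 = posterior(x, mu1, sigma1, p_w1, p_x)
    g2 = posterior(x, mu2, sigma2, p_w2, p_x)
    return 1 if g1 >= g2 else 2
```

Note that the evidence p(x) is the same for both classes, so it cancels in the comparison; comparing P(wi) ∗ p(x|wi) directly yields the same prediction.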