Distinguishing spam messages with a Naive Bayes classifier (Discrete Mathematics)
Check whether an input string is negative or non-negative, using a model learned from labeled data.
(1) A trainer uses the given dataset to build a prediction model, stored as a large table. (2) A predictor uses that prediction model to evaluate whether a given string is negative or non-negative.
- Tokenization
- Normalization
- Stopword removal
- Vocabulary reduction
- Construct prediction model
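The first three trainer steps can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the trainer's actual code: the stop-word list, the `MAX_TOKEN` limit, and the function names `normalize_token`, `is_stopword`, and `preprocess` are hypothetical, and normalization is assumed to mean lowercasing and dropping non-letter characters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKEN 64  /* hypothetical upper bound on token length */

/* Hypothetical stop-word list; a real trainer would likely load a longer one. */
const char *STOPWORDS[] = {"a", "an", "the", "is", "to", "and", "of"};

/* Normalization (assumed): lowercase letters in place, drop everything else. */
void normalize_token(char *tok) {
    size_t w = 0;
    for (size_t r = 0; tok[r] != '\0'; r++) {
        unsigned char c = (unsigned char)tok[r];
        if (isalpha(c)) tok[w++] = (char)tolower(c);
    }
    tok[w] = '\0';
}

/* Returns 1 if tok is in the stop-word list, 0 otherwise. */
int is_stopword(const char *tok) {
    size_t n = sizeof STOPWORDS / sizeof STOPWORDS[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(tok, STOPWORDS[i]) == 0) return 1;
    return 0;
}

/* Tokenize on whitespace, normalize each token, and drop stop words.
   Writes at most max_out tokens into out and returns how many were kept. */
int preprocess(const char *line, char out[][MAX_TOKEN], int max_out) {
    assert(line != NULL && out != NULL && max_out >= 0);
    char buf[1024];  /* strtok modifies its input, so work on a copy */
    strncpy(buf, line, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';

    int kept = 0;
    for (char *tok = strtok(buf, " \t\r\n"); tok != NULL && kept < max_out;
         tok = strtok(NULL, " \t\r\n")) {
        normalize_token(tok);
        if (tok[0] == '\0' || is_stopword(tok)) continue;
        strncpy(out[kept], tok, MAX_TOKEN - 1);
        out[kept][MAX_TOKEN - 1] = '\0';
        kept++;
    }
    return kept;
}
```

Under these assumptions, calling `preprocess` on a line such as "The flight is DELAYED, again!" keeps the tokens flight, delayed, and again. Vocabulary reduction and model construction would then operate on the surviving tokens.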
- Enter the file name
- Text is entered line by line and stored in a "string" file.
- Tokenize the "string" file (output: s_token.txt)
- Normalize the "s_token.txt" file (output: s_norm.txt)
- Remove stop words from the "s_norm.txt" file (output: s_stop.txt)
- Read all the words in the "s_stop.txt" file into an array.
- Look up, word by word, the negative and non-negative probabilities in the prediction model.
- Log scaling
- Store result (result.txt)
make => ./trainer => gcc predictor.c -o p => ./p
- These results confirm that precision and recall change with the threshold value used in the trainer's vocabulary reduction.
- When the classification threshold is decreased, the precision value increases and the recall value decreases.
- When the classification threshold is increased, the precision value decreases and the recall value increases.
- Precision and recall therefore trade off against each other.
- As a result, performance depends on how the classifier's threshold is chosen.
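For reference, precision and recall can be computed from predicted and actual labels as in the sketch below. It assumes that "negative" is the class being detected (label 1) and that non-negative is label 0; the `Confusion` struct and the function names are hypothetical and not taken from the project.

```c
#include <assert.h>
#include <stddef.h>

/* Counts of the four possible outcomes for a binary classifier. */
typedef struct {
    int tp;  /* predicted negative, actually negative */
    int fp;  /* predicted negative, actually non-negative */
    int fn;  /* predicted non-negative, actually negative */
    int tn;  /* predicted non-negative, actually non-negative */
} Confusion;

/* Tally outcomes; labels are 1 for negative and 0 for non-negative (assumed). */
Confusion count_outcomes(const int *predicted, const int *actual, size_t n) {
    assert(predicted != NULL && actual != NULL);
    Confusion c = {0, 0, 0, 0};
    for (size_t i = 0; i < n; i++) {
        if (predicted[i] && actual[i]) c.tp++;
        else if (predicted[i] && !actual[i]) c.fp++;
        else if (!predicted[i] && actual[i]) c.fn++;
        else c.tn++;
    }
    return c;
}

/* Precision = TP / (TP + FP); defined as 0 when nothing was predicted negative. */
double precision(Confusion c) {
    int denom = c.tp + c.fp;
    return denom == 0 ? 0.0 : (double)c.tp / denom;
}

/* Recall = TP / (TP + FN); defined as 0 when there are no actual negatives. */
double recall(Confusion c) {
    int denom = c.tp + c.fn;
    return denom == 0 ? 0.0 : (double)c.tp / denom;
}
```

Running the predictor at several threshold values and recording both numbers for each run is one way to make the trade-off visible.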
- If a feature value in the input never appeared in the training data, its estimated probability is 0, and multiplying by it makes the final probability 0. To avoid this, I apply Laplace smoothing to all probabilities.
- Every probability the classifier produces is less than 1, so multiplying many of them drives the product toward zero until the values are too small to tell apart. By taking the logarithm of every probability and summing the logs instead, I prevent underflow.
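The two fixes above can be written out as formulas. The notation is an assumption for illustration, since the exact counts used by the trainer are not stated here: $n_{w,c}$ is how often word $w$ was seen in class $c$, $N_c$ is the total count for class $c$, $K$ is the number of possible feature values (2 for a presence/absence model, the vocabulary size for a word-count model), and $\alpha = 1$ gives add-one (Laplace) smoothing.

```latex
% Laplace-smoothed estimate: never exactly 0, even when n_{w,c} = 0
\hat{P}(w \mid c) = \frac{n_{w,c} + \alpha}{N_c + \alpha K}, \qquad \alpha = 1

% Log scaling: replace the product of probabilities by a sum of logs
\log\Bigl( P(c) \prod_{i=1}^{m} \hat{P}(w_i \mid c) \Bigr)
  = \log P(c) + \sum_{i=1}^{m} \log \hat{P}(w_i \mid c)

% Decision rule: choose the class with the larger log score
\hat{c} = \arg\max_{c \,\in\, \{\text{negative},\ \text{non-negative}\}}
  \Bigl[ \log P(c) + \sum_{i=1}^{m} \log \hat{P}(w_i \mid c) \Bigr]
```

As a rough illustration of the underflow problem, a product of 400 factors of 0.01 equals $10^{-800}$, far below the smallest positive double (about $4.9 \times 10^{-324}$), so it rounds to 0. The corresponding log sum is $400 \ln 0.01 \approx -1842.07$, which is easy to represent and compare.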
If you have any questions, don't hesitate to send an e-mail to [email protected]