Classify-spam-comment

Distinguish spam messages using a Naive Bayes classifier (Discrete Mathematics)

Description

Decide whether an input string is negative or non-negative, using a model learned from training data.

(1) Use the given dataset to build a trainer that produces a prediction model, stored as a large table. (2) Use the prediction model to evaluate whether a given string is negative or non-negative.

Approach

1. Implement the trainer module (trainer.c)

Tokenization
Normalization
Stopword removal
Vocabulary reduction
Construct the prediction model
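The first three trainer steps can be sketched in a single pass over the text. This is a minimal illustration, not the code in trainer.c: the function name `preprocess`, the `MAX_TOKEN_LEN` limit, and the short stop-word list are all assumptions made for the example, and stemming is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stop-word list; the list used by the real trainer is not shown here. */
static const char *STOPWORDS[] = { "a", "an", "the", "is", "to", "of", "and" };
#define N_STOPWORDS (sizeof STOPWORDS / sizeof STOPWORDS[0])

#define MAX_TOKEN_LEN 64   /* longer tokens are truncated to 63 characters */

static int is_stopword(const char *word)
{
    for (size_t i = 0; i < N_STOPWORDS; i++)
        if (strcmp(word, STOPWORDS[i]) == 0)
            return 1;
    return 0;
}

/*
 * Splits `text` on non-alphabetic characters (tokenization), lowercases
 * each token (normalization), drops stop words, and copies at most
 * `max_tokens` surviving tokens into `out`.
 * Returns the number of tokens stored.
 */
size_t preprocess(const char *text, char out[][MAX_TOKEN_LEN], size_t max_tokens)
{
    assert(text != NULL && out != NULL);
    size_t count = 0, len = 0;
    char word[MAX_TOKEN_LEN];

    for (const char *p = text; ; p++) {
        unsigned char c = (unsigned char)*p;
        if (isalpha(c)) {
            if (len < MAX_TOKEN_LEN - 1)
                word[len++] = (char)tolower(c);
        } else {
            if (len > 0) {                 /* a token just ended */
                word[len] = '\0';
                if (!is_stopword(word) && count < max_tokens)
                    strcpy(out[count++], word);
                len = 0;
            }
            if (c == '\0')
                break;
        }
    }
    return count;
}
```

Called on "The PRIZE is free, claim it!" with this stop-word list, the function keeps the tokens prize, free, claim, and it.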

2. Implement the predictor module (predictor.c)

Enter the file name.
Text is read line by line and stored in a "string" file.
Tokenize the "string" file (s_token.txt).
Normalize the "s_token.txt" file (s_norm.txt).
Remove stop words from "s_norm.txt" (s_stop.txt).
Load all the words in "s_stop.txt" into an array.
Look up the negative and non-negative probabilities of each word in the prediction model.
Apply log scaling.
Store the result (result.txt).

Evaluation

make => ./trainer => gcc predictor.c -o p => ./p

The results show that precision and recall change with the threshold value used in the trainer's vocabulary reduction.
When the classification threshold is decreased, precision increases and recall decreases.
When the classification threshold is increased, precision decreases and recall increases.
Precision and recall therefore trade off against each other.
As a result, the classifier's performance depends on how its threshold is chosen.
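For reference, precision and recall can be computed from predicted and actual labels as in the sketch below. It treats "negative" as the class of interest (label 1); the struct and function names are hypothetical and not taken from the repository.

```c
#include <assert.h>
#include <stddef.h>

struct pr_result {
    double precision;   /* TP / (TP + FP) */
    double recall;      /* TP / (TP + FN) */
};

/*
 * predicted[i] and actual[i] are 1 for "negative" and 0 for "non-negative".
 * When a denominator is zero the corresponding metric is reported as 0.
 */
struct pr_result precision_recall(const int *predicted, const int *actual, size_t n)
{
    assert(predicted != NULL && actual != NULL);
    size_t tp = 0, fp = 0, fn = 0;

    for (size_t i = 0; i < n; i++) {
        if (predicted[i] && actual[i])
            tp++;                       /* correctly flagged */
        else if (predicted[i] && !actual[i])
            fp++;                       /* flagged by mistake */
        else if (!predicted[i] && actual[i])
            fn++;                       /* missed */
    }

    struct pr_result r;
    r.precision = (tp + fp) ? (double)tp / (double)(tp + fp) : 0.0;
    r.recall    = (tp + fn) ? (double)tp / (double)(tp + fn) : 0.0;
    return r;
}
```

Running this once per threshold setting gives the pairs of values needed to see the trade-off described above.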

Limitation

If a feature value in the new input never appeared among the previously learned features, its probability is 0, and multiplying by it makes the final probability 0. To avoid this, I apply Laplace smoothing to every probability.
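A minimal sketch of add-one (Laplace) smoothing is shown below. It assumes probabilities are estimated from word occurrence counts per class; the repository may count differently (for example, per message), and the function name is hypothetical.

```c
#include <assert.h>

/*
 * Add-one (Laplace) estimate of P(word | class).
 *   word_count  : occurrences of the word in the class
 *   class_total : total word occurrences in the class
 *   vocab_size  : number of distinct words in the vocabulary
 * An unseen word (word_count == 0) still gets a small non-zero probability.
 */
double laplace_probability(unsigned long word_count,
                           unsigned long class_total,
                           unsigned long vocab_size)
{
    assert(vocab_size > 0);
    return ((double)word_count + 1.0) /
           ((double)class_total + (double)vocab_size);
}
```

For instance, with 98 word occurrences in a class and a 2-word vocabulary, an unseen word is assigned 1/100 instead of 0.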
Each probability produced by the classifier is less than 1, so multiplying many of them drives the product toward zero until the values are too small to tell apart. Taking the logarithm of every probability prevents this underflow.
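In the usual formulation, taking logarithms turns the product of probabilities into a sum without changing which class scores highest. The symbols here are introduced for illustration: c is a class (negative or non-negative) and w_1, ..., w_n are the words of the input.

```latex
\hat{c}
  = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
  = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right]
```

As an example, 100 words that each have probability 10^{-4} give a product of 10^{-400}, which is below the smallest positive double (about 4.9 x 10^{-324}) and is stored as 0. The corresponding sum of natural logarithms is about -921, which a double represents without difficulty.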

If you have any questions, don't hesitate to send an e-mail to [email protected]



siwany/Classify-Spam-Comment
