This repository contains the analysis and models used to differentiate between "Scam" and "Non-scam" messages. The data was initially extracted from images using OCR (Optical Character Recognition) as part of the ScamGuard project, which focused on developing an image-based identification app for phishing and scam messages. This research builds upon the previous work by exploring sentiment analysis and machine learning models to enhance scam detection capabilities.
- Project Description
- Methodology
- Results
- Research Limitations
- Recommendations
- How to Run the Code
- License
- Acknowledgments
- Contact Information
This project aims to analyze and classify messages as "Scam" or "Non-scam" using various text and sentiment analysis techniques. We evaluated different machine learning models, including logistic regression and XGBoost, to determine the most effective approach for scam detection.
- Text Vectorization: Utilized TF-IDF vectorization to convert text into numerical features.
- Feature Engineering: Added features such as sentiment scores and text length to the model.
- Model Training: Tested various models, including logistic regression and XGBoost, to find the most accurate classifier.
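The first two steps above can be sketched without any dependencies. This is a minimal illustration, not the project's actual code: the helper names `tokenize` and `tfidf_features` and the sample messages are hypothetical, and it uses plain unsmoothed TF-IDF, whereas a library vectorizer (for example scikit-learn's `TfidfVectorizer`) applies its own smoothing and normalization. A sentiment score would be appended to each row in the same way as text length; it is left out here because the source does not name the sentiment tool used.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,!?:;\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def tfidf_features(messages):
    """Return (vocabulary, rows); each row holds TF-IDF weights plus text length."""
    docs = [tokenize(m) for m in messages]
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of messages that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Plain (unsmoothed) inverse document frequency.
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    rows = []
    for message, doc in zip(messages, docs):
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty messages
        weights = [counts[tok] / total * idf[tok] for tok in vocab]
        # Engineered feature: character length of the raw message.
        rows.append(weights + [float(len(message))])
    return vocab, rows


# Hypothetical sample messages, for illustration only.
messages = [
    "Claim your free money now, click the link",
    "Lazada sale starts today",
    "Free link to claim your prize",
]
vocab, rows = tfidf_features(messages)
```

Each row in `rows` can then be passed to a classifier such as logistic regression or XGBoost as one feature vector.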
- Accuracy
- Precision, Recall, and F1-Score
- Confusion Matrix
- Histograms of Sentiment Scores
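For reference, the first three items above can all be derived from the four confusion-matrix counts. The sketch below assumes "Scam" is treated as the positive class; the function names and the toy labels are hypothetical and are not the project's data or results.

```python
def confusion_counts(y_true, y_pred, positive="Scam"):
    """Count true/false positives and negatives for a binary labelling."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration only.
y_true = ["Scam", "Scam", "Scam", "Non-scam",
          "Non-scam", "Non-scam", "Non-scam", "Scam"]
y_pred = ["Scam", "Scam", "Non-scam", "Non-scam",
          "Non-scam", "Scam", "Non-scam", "Scam"]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
```

Reporting precision and recall alongside accuracy matters here because a missed scam (false negative) and a blocked legitimate message (false positive) have different costs.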
- XGBoost Model: Achieved the highest accuracy in classifying scam and non-scam messages.
- Accuracy: 93.75%
- Scam Messages: Mean sentiment score of 0.18.
- Non-scam Messages: Mean sentiment score of 0.16.
- Scam Messages: Top words include "free," "money," and "link."
- Non-scam Messages: Top words include "lazada," "peso," and "sale."
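A top-words list like the one above could be produced with a simple frequency count per class. This is a sketch under assumptions: the stop-word set, the `top_words` helper, and the sample messages are hypothetical, and the project may have used a different tokenizer or stop-word list.

```python
from collections import Counter

# Small illustrative stop-word set; a real analysis would use a fuller list.
STOPWORDS = {"a", "the", "to", "your", "for", "and", "is", "on", "at", "now"}


def top_words(messages, n=3):
    """Return the n most frequent non-stop-word tokens as (word, count) pairs."""
    counts = Counter()
    for message in messages:
        for token in message.lower().split():
            token = token.strip(".,!?:;\"'()")
            if token and token not in STOPWORDS:
                counts[token] += 1
    return counts.most_common(n)


# Hypothetical sample messages, for illustration only.
scam = [
    "Free money! Click the link",
    "Claim free money now",
    "Free link for your prize",
]
non_scam = [
    "Lazada sale today",
    "Lazada sale: 50 peso off",
    "Pay 100 peso at Lazada",
]

scam_top = top_words(scam)
non_scam_top = top_words(non_scam)
```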
- Dataset Constraints: Because the dataset was derived from images through OCR preprocessing, it may contain inaccuracies or artifacts introduced during text extraction, which could affect both the sentiment analysis and model performance.
- Model Generalization: The study focused on text extracted from specific types of images (scam-related messages), which may limit the generalizability of the findings to other forms of text data or different types of scams not covered in this dataset.
- Regional Specificity: The data used in this study is Philippine-based, which might not be representative of scam messages in other regions. The regional context and language nuances may impact the applicability of the results to other geographic locations or cultural contexts.
- Enhanced Filtering Techniques: Implement advanced filters and update them regularly to detect and filter out scam messages based on key terms and sentiment patterns.
- User Education: Educate users about common scam tactics and red flags through awareness campaigns and best practice guidelines.
- Sentiment and Emotional Analysis Integration: Utilize sentiment and emotion-based classification as part of the screening process for scam detection.
- Local Language Considerations: Adapt filters for local languages and collaborate with local experts to refine detection techniques.
- Clone the repository:
git clone https://github.com/your-username/repository-name.git
- Navigate to the project directory:
cd repository-name
- Install the required dependencies:
pip install -r requirements.txt
- Run the analysis script:
python analyze.py
The dataset used in this research is private and intended for research purposes only. To obtain access to the data, please contact Heroshi Joe Abejuela directly at [email protected].
See the LICENSE file for details.
- Heroshi Joe Abejuela: Researcher and author of this study.
- ScamGuard Project: Previous work on an image-based identification app for phishing and scam messages.
For any inquiries or further information, please contact:
- Heroshi Joe Abejuela
- Email: [email protected]
- GitHub: HiroshiJoe