This research project uses hacker forum data for proactive cyber threat intelligence (CTI). It applies state-of-the-art machine learning and deep learning approaches to automatically classify hacker forum data into predefined categories, and it develops interactive visualizations that let CTI practitioners explore the collected data for proactive and timely CTI. The results show that, among all the models, the deep learning model RNN GRU gives the best classification results, with 99.025% accuracy and 96.56% precision.
Update: The high accuracy and precision scores reported above were caused by overfitting. After rerunning the updated code, the results show that RNN GRU still gives the best classification results, now with a more reasonable accuracy of 98.8% and precision of 96.6%. Among the ML models, SVM shows an accuracy of 97.3% and a precision of 76.4%, whereas Random Forest shows an accuracy of 99.6% and a precision of 96.7%.
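The gap between accuracy and precision in the SVM figures is typical when some categories are rare. The sketch below is only an illustration of that effect, not the project's evaluation code: the labels are made-up toy data, the helper names `accuracy` and `macro_precision` are hypothetical, and macro averaging is an assumption, since the averaging method used for the reported precision is not stated here.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of posts whose predicted category matches the true one."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    predicted = Counter(y_pred)  # how often each class was predicted
    true_pos = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    per_class = [
        true_pos[label] / predicted[label] if predicted[label] else 0.0
        for label in labels
    ]
    return sum(per_class) / len(per_class)


# Toy data (hypothetical): 95 "other" posts and 5 "malware" posts.
y_true = ["other"] * 95 + ["malware"] * 5
# The classifier wrongly flags 4 "other" posts as "malware"
# and misses 1 of the 5 real "malware" posts.
y_pred = ["other"] * 91 + ["malware"] * 4 + ["malware"] * 4 + ["other"] * 1

print(f"accuracy:        {accuracy(y_true, y_pred):.3f}")
print(f"macro precision: {macro_precision(y_true, y_pred):.3f}")
```

Because the majority class dominates the accuracy score, a few false positives on the rare class barely move accuracy but cut that class's precision in half, which pulls the macro average down.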
- Python 3.7
- TensorFlow
- Anaconda
- scikit-learn (sklearn)
- Pandas
- Keras
- Seaborn
- NumPy
- NLTK
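
As a quick way to confirm that the libraries listed above are available in the active environment, a small check like the following may help. It is a minimal sketch: the function name `check_dependencies` and the list `REQUIRED_MODULES` are hypothetical, and the import names are assumed to be the usual ones for these packages (for example `sklearn` for scikit-learn).

```python
from importlib.util import find_spec

# Import names assumed for the libraries listed above.
REQUIRED_MODULES = [
    "tensorflow",
    "sklearn",
    "pandas",
    "keras",
    "seaborn",
    "numpy",
    "nltk",
]


def check_dependencies(modules=REQUIRED_MODULES):
    """Map each module name to True if it can be found, else False."""
    return {name: find_spec(name) is not None for name in modules}


# Report the status of each dependency without importing it.
for name, installed in check_dependencies().items():
    status = "ok" if installed else "MISSING"
    print(f"{name:12s} {status}")
```

Using `find_spec` only locates each package rather than importing it, so the check stays fast even for heavy libraries such as TensorFlow.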