Isolation Forest, also known as iForest, is an algorithm for anomaly detection. Traditional model-based methods construct a profile of normal instances and flag the instances that do not conform to that profile as anomalies. Because these methods are optimized for profiling normal instances rather than for detecting anomalies, they may raise false alarms. Their high computational complexity may also make them unsuitable for large data sets. iForest instead isolates anomalies directly, which makes it more robust in these situations.
This project contains a wrapper around the iForest implementation in scikit-learn, together with some data preprocessing tools. It is designed specifically for categorical data. The idea of iForest comes from the research of Liu, Ting and Zhou.[1]
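To make the idea concrete, here is a minimal sketch of scikit-learn's `IsolationForest` on synthetic numerical data. The data, the seed, and the three hand-placed outliers are assumptions made up for illustration; they are not part of this project.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Synthetic data (illustrative only): 200 points near the origin
# plus 3 hand-placed points far away from the cluster.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -10.0]])
X = np.vstack([normal, outliers])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)

# Lower decision_function values mean a point is easier to isolate,
# that is, more likely to be an anomaly.
scores = forest.decision_function(X)
ranked = np.argsort(scores)  # row positions, most anomalous first
print("most anomalous rows:", ranked[:3])
```

Points that sit far from the bulk of the data need fewer random splits to be separated, so they receive lower scores than typical points.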
This section gives brief instructions on how to configure and run this program.
Python (>= 3.3)
NumPy (>= 1.8.2) and SciPy (>= 0.13.3)
If you don't have NumPy or SciPy on your system, you can install them with the following command.[2]
python3 -m pip install --user numpy scipy
Pandas
Pandas is a data analysis library for Python. To install Pandas, you can use the following command.[4]
pip3 install pandas
scikit-learn
The library scikit-learn contains a set of simple tools for machine learning. To install it, you can use the following command.[3]
pip3 install scikit-learn
- Other missing packages
If any other packages required by the listed prerequisites are missing, please follow their installation instructions to install them.
The main function is in main.py. You may run this script directly. You can also import iForest into your own projects.
This project consists of several files. This section describes the purpose of each file and the core functions it contains.
Script main.py contains the main function. It shows a sample procedure for using iForest. You can also use this script to run the test in the paper [1].
This function shows the procedure of using iForest to detect anomalies. It returns the data frame of anomalies and takes the following arguments as input:
- filename: String type. The path of the data file. The file should be in CSV format.
- threshold: The threshold of the decision boundary in iForest. It ranges from -1 to 0. In my tests, this parameter is usually set to a value between -0.4 and -0.1.
This is the main function to call, but it is mainly used to measure detection time and accuracy. It calls CatForest for all iForest-related procedures.
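As a rough illustration of measuring detection time and accuracy, the sketch below times an arbitrary detector call and compares its output with known anomaly positions. The helpers `measure` and `precision_recall` are hypothetical names, and precision and recall are assumed metrics; the project's own measurement code may differ.

```python
import time


def measure(detector, *args):
    """Hypothetical helper: run detector(*args) and return (result, seconds)."""
    start = time.perf_counter()
    result = detector(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


def precision_recall(flagged, true_anomalies):
    """Hypothetical helper: compare flagged row positions with known anomalies."""
    flagged, true_anomalies = set(flagged), set(true_anomalies)
    hits = len(flagged & true_anomalies)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(true_anomalies) if true_anomalies else 0.0
    return precision, recall


# Stand-in detector for illustration: it simply returns fixed row positions.
flagged, seconds = measure(lambda rows: rows, [1, 2, 3])
precision, recall = precision_recall(flagged, [2, 3, 4])
print(f"{seconds:.6f}s precision={precision:.2f} recall={recall:.2f}")
```

Using `time.perf_counter` rather than `time.time` gives a monotonic, high-resolution clock, which suits short timing runs.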
This script contains a wrapper around the iForest implementation in scikit-learn.
This function wraps the IsolationForest class in scikit-learn. It returns an array identifying the samples whose decision function values are smaller than threshold.
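A wrapper of this kind could look roughly like the sketch below. The function name `below_threshold_indices`, its parameters, and the synthetic data are assumptions for illustration, not the project's actual code. The scale of the decision scores should also be checked against the installed scikit-learn version before choosing a threshold.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def below_threshold_indices(features, threshold, random_state=0):
    """Hypothetical helper: fit an isolation forest on `features` and return
    the row positions whose decision score is smaller than `threshold`."""
    forest = IsolationForest(n_estimators=100, random_state=random_state)
    forest.fit(features)
    scores = forest.decision_function(features)
    return np.where(scores < threshold)[0]


# Illustrative data: 100 ordinary rows plus one far-away row at position 100.
rng = np.random.RandomState(1)
X = np.vstack([rng.normal(size=(100, 2)), [[10.0, 10.0]]])
idx = below_threshold_indices(X, threshold=-0.1)
print(idx)
```

Returning integer row positions keeps the result usable with positional indexing such as `DataFrame.iloc`.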
This script preprocesses the data. It contains the following two functions.
This function mainly converts categorical data to numerical data. The original iForest algorithm works on numerical data only, so categorical data has to be preprocessed.
Note that the columns are currently hard-coded in this script, which makes it difficult to adapt to different types of data. I will write a parser in the future to read the metadata from files.
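One possible way to do such a conversion without hard-coding columns is sketched below with `pandas.factorize`. The helper `encode_categoricals` and the sample columns are hypothetical; the encoding used by this project may be different.

```python
import pandas as pd


def encode_categoricals(frame, categorical_columns):
    """Hypothetical helper: return a copy of `frame` in which each listed
    column is replaced by integer codes (one code per distinct value)."""
    encoded = frame.copy()
    for column in categorical_columns:
        codes, _ = pd.factorize(encoded[column])
        encoded[column] = codes
    return encoded


# Illustrative data only.
sample = pd.DataFrame(
    {"protocol": ["tcp", "udp", "tcp", "icmp"], "bytes": [10, 20, 30, 40]}
)
numeric = encode_categoricals(sample, ["protocol"])
print(numeric["protocol"].tolist())  # [0, 1, 0, 2]
```

Integer codes impose an arbitrary order on the categories. One-hot encoding with `pandas.get_dummies` avoids that at the cost of more columns.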
This function adds flags to existing data. It introduces domain knowledge to the data and is only used for future testing purposes.
This script is a file loader.
This function reads the data from CSV files. It returns the file contents and top-n-keys. The input parameter filename should be the path to a valid CSV file.
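The sketch below shows one way a loader of this shape might work. It assumes that the top-n keys are the n most frequent values of a key column. That interpretation, the helper name `read_with_top_keys`, and the sample data are assumptions for illustration only.

```python
import io

import pandas as pd


def read_with_top_keys(source, key_column, n):
    """Hypothetical helper: read a CSV and return (frame, top_keys), where
    top_keys are the n most frequent values in `key_column`."""
    frame = pd.read_csv(source)
    top_keys = frame[key_column].value_counts().head(n).index.tolist()
    return frame, top_keys


# An in-memory CSV stands in for a real file path.
csv_text = (
    "user,action\n"
    "alice,login\nbob,login\nalice,read\n"
    "alice,write\nbob,read\ncarol,login\n"
)
data, topnkey = read_with_top_keys(io.StringIO(csv_text), "user", 2)
print(topnkey)  # ['alice', 'bob']
```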
This section introduces the basic steps for applying iForest to your data.
You can read the data as shown in the following example.
data, topnkey = readindividual(filename)
The data is preprocessed for each value in the range of topnkey.
total_num_examples = 0
for num in range(1, len(topnkey)):
    category, user, predict = preprocessing(data, topnkey, num)
    total_num_examples += len(user)
This process is also done for each value in the range of topnkey. It should be placed inside the loop in Step 2.
u = iforest(category, user, threshold)
anomalies = predict.iloc[u,:]
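The selection step relies on positional indexing. The sketch below uses a made-up data frame and a made-up index array to show how `iloc` picks rows by position regardless of the index labels, and how per-iteration results could be collected with `pandas.concat`. Collecting the results this way is an assumption, not something the project is known to do.

```python
import numpy as np
import pandas as pd

# Illustrative frame whose index labels differ from the row positions.
predict = pd.DataFrame(
    {"user": ["a", "b", "c", "d"], "events": [3, 120, 5, 98]},
    index=[10, 11, 12, 13],
)

u = np.array([1, 3])            # row positions flagged as anomalous
anomalies = predict.iloc[u, :]  # selects the 2nd and 4th rows
print(anomalies["user"].tolist())  # ['b', 'd']

# Results from several loop iterations can be gathered in a list
# and combined into one data frame at the end.
collected = [anomalies, predict.iloc[np.array([0]), :]]
all_anomalies = pd.concat(collected)
print(all_anomalies.index.tolist())  # [11, 13, 10]
```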
- Implement a parser for reading the metadata of the data in the CSV file.
- Use domain knowledge to verify the classification.
[1] Isolation Forest https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf
[2] SciPy Official Site https://www.scipy.org/about.html
[3] Installing scikit-learn http://scikit-learn.org/stable/install.html
[4] Pandas https://pandas.pydata.org