RNN implementation in Keras to predict malware from machine activity data - code for experiments in Early Stage Malware Prediction Using Recurrent Neural Networks
RNN_prediction.mov
(^^The percentage certainty shown by the model is for demo purposes only and should not be taken as an indicator of model reliability!)
Data (data_2.csv) available here http://doi.org/10.17035/d.2018.0050524986
Experiments from the paper are set out in order in run_experiments
Implementation uses Keras v2.0.6 and Python >= 3.4
If you use our code in your research please cite:
@article{RHODE2018578,
title = "Early-stage malware prediction using recurrent neural networks",
journal = "Computers & Security",
volume = "77",
pages = "578 - 594",
year = "2018",
issn = "0167-4048",
doi = "https://doi.org/10.1016/j.cose.2018.05.010",
url = "http://www.sciencedirect.com/science/article/pii/S0167404818305546",
author = "Matilda Rhode and Pete Burnap and Kevin Jones",
}
The basic Experiment class in experiments > Experiments takes a dictionary of hyperparameters and data as objects (either as a tuple for k-fold cross validation or as four separate test/train inputs/labels items). Results are stored in a folder as a comma separated values file.
- parameters: A dictionary of hyperparameters such as those in experiments > Configs. The keys relate to the RNN implementation and the values can be either a list or a dictionary of possible values with associated relative weights for choosing them. The latter is intended to aid biased random searches, e.g. params = {..., "dropout": {0.2: 0.5, 0.1: 0.25, 0.3: 0.25}, ...} will bias the random search to choose 0.2 half of the time, and 0.1 or 0.3 a quarter of the time each.
- search_algorithm: {"grid", "random"}
  - Grid search explores every possible combination of the parameters supplied to the Experiment and keeps running until all options have been exhausted.
  - Random search randomly selects a configuration from the possible combinations of parameters; the choice of parameters can be biased by using dictionaries whose values represent relative weights between the keys (see Configurations / RNN hyperparameters for more). Random search runs until the num_experiments parameter in Experiment.run() is reached (default = 100).
- x_train: sequential (3D) tensor of training input data supplied for a train-test experiment
- y_train: (2D) tensor of training label data supplied for a train-test experiment, corresponding to the indices of the x_train data
- x_test: sequential (3D) tensor of test input data supplied for a train-test experiment
- y_test: (2D) tensor of test label data supplied for a train-test experiment, corresponding to the indices of the x_test data
- data: tuple of (input, label) data for a k-fold cross-validation experiment
- folds: integer determining k in k-fold validation, defaults to 10. Must be an integer (or left at the default) for a k-fold validation experiment, supplied along with the data tuple
- thresholding: Boolean determining whether a k-fold test is cut short when accuracy falls below threshold, defaults to False. When thresholding=True, the threshold automatically increases whenever the average accuracy across the k folds is greater than the current threshold; the new threshold is the minimum of the set of k-fold accuracies.
- threshold: float in [0, 1) giving the accuracy cut-off during a k-fold experiment. If a fold achieves lower than threshold, the remaining folds are not run and the next configuration begins. Automatically raised to the minimum of the set of k-fold accuracies whenever a k-fold experiment achieves a higher average accuracy (across the k folds) than the current threshold.
- folder_name: string naming the folder in which the csv results file is stored
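The thresholding behaviour described above can be sketched in plain Python. This is an illustrative sketch, not the repo's actual API: `run_kfold_with_threshold` and `fold_accuracies` are hypothetical names, and the real Experiment trains a model per fold rather than receiving accuracies directly.

```python
def run_kfold_with_threshold(fold_accuracies, threshold=0.0, thresholding=True):
    """Simulate the early-stopping k-fold loop described above.

    fold_accuracies stands in for the accuracy each trained fold would
    achieve. Returns the accuracies actually evaluated and the (possibly
    raised) threshold to use for the next configuration.
    """
    completed = []
    for acc in fold_accuracies:
        completed.append(acc)
        if thresholding and acc < threshold:
            break  # abandon the remaining folds, move to the next configuration
    # If all folds ran and their mean beats the threshold, raise the
    # threshold to the worst fold accuracy seen (as described above).
    if completed and len(completed) == len(fold_accuracies):
        mean_acc = sum(completed) / len(completed)
        if mean_acc > threshold:
            threshold = min(completed)
    return completed, threshold
```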
Increase the temporal distance between input features. Add "steps" to the parameters dictionary to increase the time interval between data snapshots; the value should be an integer <= sequence_length.
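The effect of a step value can be pictured as a strided slice over the time axis of the input tensor. This is a sketch of the idea, not the repo's implementation; the array shapes are illustrative.

```python
import numpy as np

# toy batch: 4 samples, 10 time steps, 11 machine-activity features
x = np.arange(4 * 10 * 11).reshape(4, 10, 11)

step = 2                     # keep every 2nd snapshot,
x_stepped = x[:, ::step, :]  # doubling the interval between inputs
print(x_stepped.shape)       # (4, 5, 11)
```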
Average the results of multiple RNN models. The Experiment will only search the sequence_length space, and will take the first value provided for all other hyperparameters if more than one is supplied.
- Pass a list of parameter dictionaries in place of parameters to the Ensemble_configurations class to average the results of multiple models. Only the first element in each list of possible parameters will be used if more than one is supplied.
- batch_size: an int can be passed to Ensemble_configurations to use the same batch_size across models
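Model averaging here amounts to taking the mean of the class probabilities produced by each model. A minimal sketch with stand-in prediction arrays (in the real experiments each array would come from a trained Keras model's predict):

```python
import numpy as np

# stand-in malware probabilities from three models on the same 5 test samples
preds = [
    np.array([0.9, 0.2, 0.6, 0.4, 0.8]),
    np.array([0.7, 0.3, 0.5, 0.6, 0.9]),
    np.array([0.8, 0.1, 0.7, 0.5, 0.7]),
]

ensemble = np.mean(preds, axis=0)        # average the probabilities
labels = (ensemble >= 0.5).astype(int)   # 1 = malicious, 0 = benign
print(labels)                            # [1 0 1 1 1]
```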
Average the results of classifying all sub-sequences and the entire data sequence. The Experiment will only search the sequence_length space, and will take the first value provided for all other hyperparameters if more than one is supplied.
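The sub-sequence ensemble can be pictured as classifying every prefix of the sequence and averaging the results. A hypothetical sketch: `predict_fn` stands in for a trained model's prediction function, and the toy callable below is not a real classifier.

```python
import numpy as np

def subsequence_average(x, predict_fn):
    """Average predictions over all prefixes x[:, :1], x[:, :2], ..., x[:, :T]."""
    T = x.shape[1]
    preds = [predict_fn(x[:, :t + 1, :]) for t in range(T)]
    return np.mean(preds, axis=0)

# toy stand-in "model" that scores the mean activity of each sub-sequence
x = np.ones((2, 4, 3))  # 2 samples, 4 time steps, 3 features
out = subsequence_average(x, lambda s: s.mean(axis=(1, 2)))
print(out)  # [1. 1.]
```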
Leave all possible combinations of input features out of training to see the impact of their omission. Trains a model, then sequentially omits all possible combinations of 1, 2, 3, ..., n features, where n = total number of features, giving 2047 combinations for the 11 features used in the paper.
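The 2047 figure is just the number of non-empty subsets of 11 features (2^11 - 1), which can be checked with itertools:

```python
from itertools import combinations

n_features = 11
# every non-empty subset of feature indices to leave out
subsets = [c for r in range(1, n_features + 1)
           for c in combinations(range(n_features), r)]
print(len(subsets))  # 2047 == 2**11 - 1
```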
Leave a single feature out of testing and training.
- supply "leave_out_features" to parameters dictionary to omit a single feature from training and testing
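Dropping a single feature amounts to deleting one column from the feature axis of the input tensor. A sketch of the idea (shapes illustrative, not the repo's code):

```python
import numpy as np

x = np.random.rand(8, 20, 11)  # samples x time steps x features
leave_out = 3                  # index of the feature to omit
x_reduced = np.delete(x, leave_out, axis=2)
print(x_reduced.shape)         # (8, 20, 10)
```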
Takes a dictionary of parameters, the training data, and the testing data as input. The data is used to determine the shape of the RNN layers. Possible options for configurations are outlined in Configurations / RNN hyperparameters.
The configuration dictionaries used in the paper are stored in experiments > Configs. The possible parameters which can be edited and passed into an experiment are detailed in the table below. N.B. these are wider than the limitations of the random search configuration; see the commented code for details of each hyperparameter.
Parameter | Possible values | Notes |
---|---|---|
layer_type | "GRU", "LSTM" | fixed as "GRU" in Configs |
loss | "binary_crossentropy" | - |
kernel_initializer | "lecun_uniform" | can also be any of the initialisers listed in Keras |
recurrent_initializer | "lecun_uniform" | can also be any of the initialisers listed in Keras |
depth | integer >= 1 | - |
bidirectional | Boolean | - |
hidden_neurons | integer >= 1 | - |
learning_rate | 0 <= float <= 1 | defaults to 0.001 if the "adam" optimiser is used |
optimiser | "adam", "sgd" | - |
dropout | 0 <= float < 1 | - |
b_l1_reg | 0 <= float < 1 | - |
b_l2_reg | 0 <= float < 1 | - |
r_l1_reg | 0 <= float < 1 | - |
r_l2_reg | 0 <= float < 1 | - |
epochs | integer > 1 | - |
sequence_length | 1 < integer < 300 | - |
batch_size | 1 < integer < 59 | - |
description | string describing the parameters | only needed for Ensemble_configurations |
step | integer >= 1 | only needed for Increase_Snaphot_Experiment |
leave_out_feature | 0 <= integer < number of input features (here 11) | not necessary for the code to work |
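Putting the table together, a complete configuration dictionary might look like the following. The specific values here are illustrative examples within the ranges above, not the paper's tuned configuration:

```python
example_config = {
    "layer_type": ["GRU"],
    "loss": ["binary_crossentropy"],
    "kernel_initializer": ["lecun_uniform"],
    "recurrent_initializer": ["lecun_uniform"],
    "depth": [1, 2, 3],
    "bidirectional": [False],
    "hidden_neurons": [64, 128],
    "learning_rate": [0.001],
    "optimiser": ["adam"],
    "dropout": [0.1, 0.2, 0.3],
    "b_l1_reg": [0.0],
    "b_l2_reg": [0.0],
    "r_l1_reg": [0.0],
    "r_l2_reg": [0.0],
    "epochs": [100],
    "sequence_length": [20],
    "batch_size": [32],
}
```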
Hyperparameters should be supplied as a dictionary with the parameter name as the key and the value(s) stored in a list or as the keys of a dictionary. If using dictionaries, also supply relative weights representing the frequency with which each value should be chosen in a random search (the weights are ignored in a grid search). Lists and dictionaries can be mixed together, e.g.:
```python
{
    # more parameters up here
    "dropout": [0, 0.1, 0.2, 0.3],
    "optimiser": {"adam": 0.75, "sgd": 0.25},  # equivalent to {"adam": 3, "sgd": 1} as weights are relative
    "epochs": list(range(1, 1000)),
    # more parameters down here
}
```
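One way such a mixed dictionary could be consumed is sketched below: a weighted random draw for random search, and a full cross-product for grid search. The function names are illustrative, not the repo's API.

```python
import random
from itertools import product

params = {
    "dropout": [0, 0.1, 0.2, 0.3],
    "optimiser": {"adam": 0.75, "sgd": 0.25},  # relative weights
}

def draw_random_config(params, rng=random):
    """Pick one value per hyperparameter; dict values use their weights."""
    config = {}
    for name, options in params.items():
        if isinstance(options, dict):
            values, weights = zip(*options.items())
            config[name] = rng.choices(values, weights=weights, k=1)[0]
        else:
            config[name] = rng.choice(options)
    return config

def grid_configs(params):
    """Enumerate every combination; weights are ignored, as in grid search."""
    names = list(params)
    pools = [list(params[n]) for n in names]  # iterating a dict yields its keys
    return [dict(zip(names, combo)) for combo in product(*pools)]

print(len(grid_configs(params)))  # 4 dropout values x 2 optimisers = 8
```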
- get_all(): returns the search space used for random search in the paper
- get_A(), get_B(), get_C(): returns configurations A, B, and C respectively as outlined in the paper
- get_A_B_C(): returns configurations A, B, and C as values in a dictionary, keys are "A", "B", and "C"