Add RNN example to perform DGA detection with LSTMs #240

rcurtin · 2025-01-09T13:20:55Z

This PR adds an example that uses two RNNs to solve the DGA detection problem. In short, it is a binary classification problem: determine whether or not a domain name was generated by a domain generation algorithm (DGA).

The model uses two RNNs: one trained on benign domains, and one trained on malicious domains. For prediction, we compute the likelihood of a domain under each model, and then we take the class of the model that assigns the higher likelihood. This is the generalized likelihood ratio test, or something very much like it. We used this strategy at Symantec, and a more complicated version of it in a paper.
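To make the decision rule concrete, here is a minimal sketch of it (not the code in this PR). It assumes each model has already produced, for every character position in the domain, the log-probability it assigned to the character that actually occurred. The names `SequenceLogLikelihood` and `Classify` are hypothetical, and breaking ties in favor of "benign" is an arbitrary choice made for the sketch.

```cpp
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper: the log-likelihood of a whole domain under one model is
// the sum of the per-character log-probabilities that model assigned.
double SequenceLogLikelihood(const std::vector<double>& stepLogProbs)
{
  assert(!stepLogProbs.empty());
  return std::accumulate(stepLogProbs.begin(), stepLogProbs.end(), 0.0);
}

// Hypothetical decision rule: pick the class whose model gives the higher
// log-likelihood.  Ties go to "benign" (an assumption of this sketch).
std::string Classify(const std::vector<double>& benignStepLogProbs,
                     const std::vector<double>& maliciousStepLogProbs)
{
  const double benignLL = SequenceLogLikelihood(benignStepLogProbs);
  const double maliciousLL = SequenceLogLikelihood(maliciousStepLogProbs);
  return (maliciousLL > benignLL) ? "malicious" : "benign";
}
```

Comparing summed log-probabilities is equivalent to comparing the likelihoods themselves, but avoids underflow on long domains.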

The network structure is simple: it uses float data instead of double data and consists of 50 LSTM units followed by a 39-element linear layer and log-softmax.

I plan to use this as a demonstration of how to deploy a predictive model inside of a Docker container.

When the model is trained and run, here is the output:

$ ./lstm_dga_detection_train dga_domains.csv 
File 'dga_domains.csv' has 80000 benign domains with a maximum length of 64, and 80000 malicious domains with a maximum length of 36.
Epoch 1/5
72000/72000 [====================================================================================================] 100% - 381.067s/epoch; 5ms/step; loss: 28.8829
Epoch 2/5
72000/72000 [====================================================================================================] 100% - 457.332s/epoch; 6ms/step; loss: 26.0513
Epoch 3/5
72000/72000 [====================================================================================================] 100% - 508.525s/epoch; 7ms/step; loss: 25.8233
Epoch 4/5
72000/72000 [====================================================================================================] 100% - 535.384s/epoch; 7ms/step; loss: 25.6701
Epoch 5/5
72000/72000 [====================================================================================================] 100% - 759.608s/epoch; 10ms/step; loss: 25.4866
Epoch 1/5
72000/72000 [====================================================================================================] 100% - 759.118s/epoch; 10ms/step; loss: 58.4937
Epoch 2/5
72000/72000 [====================================================================================================] 100% - 729.818s/epoch; 10ms/step; loss: 49.2438
Epoch 3/5
72000/72000 [====================================================================================================] 100% - 843.702s/epoch; 11ms/step; loss: 49.1188
Epoch 4/5
72000/72000 [====================================================================================================] 100% - 932.335s/epoch; 12ms/step; loss: 49.368
Epoch 5/5
72000/72000 [====================================================================================================] 100% - 903.086s/epoch; 12ms/step; loss: 49.4524
Model performance:
  Training accuracy: 141157 of 144000 correct (98.0257%).
  Test accuracy:     15683 of 16000 correct (98.0187%).

The size of the resulting models is small:

$ ls -lh *.bin
-rw-rw-r-- 1 ryan ryan 80K Jan  8 23:21 lstm_dga_detector_benign.bin
-rw-rw-r-- 1 ryan ryan 80K Jan  8 23:21 lstm_dga_detector_malicious.bin

The prediction program works as shown below: I type a domain name, and then the prediction is printed (or an error, if the domain name is invalid):

$ ./lstm_dga_detection_predict lstm_dga_detector_benign.bin lstm_dga_detector_malicious.bin 
www.mlpack.org
benign
asd98udvsa908usad98uf234.org
malicious
this IS a domain with invalid characters
Domain 'this IS a domain with invalid characters' has invalid character ' '!
$
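The validity check in that last case could look roughly like the sketch below. The allowed character set here (lowercase letters, digits, `-`, and `.`) is an assumption for illustration only and may not match what the example actually accepts; `IsAllowedDomainChar` and `FindInvalidCharacter` are hypothetical names.

```cpp
#include <cstddef>
#include <string>

// Assumed alphabet, for illustration only: lowercase ASCII letters, digits,
// hyphen, and dot.
bool IsAllowedDomainChar(const char c)
{
  return (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') ||
      c == '-' || c == '.';
}

// Returns the index of the first character outside the assumed alphabet, or
// std::string::npos if every character is allowed.
std::size_t FindInvalidCharacter(const std::string& domain)
{
  for (std::size_t i = 0; i < domain.size(); ++i)
  {
    if (!IsAllowedDomainChar(domain[i]))
      return i;
  }

  return std::string::npos;
}
```

Under that assumption, the first offending character in the invalid input above is the space at index 4, which is consistent with the error message shown.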

For the code to run correctly, the following PRs must first be merged in mlpack:

github-actions · 2025-01-09T13:21:08Z

👈 Launch a binder notebook on branch rcurtin/examples/dga-detection

shrit · 2025-01-09T18:03:19Z

cpp/lstm/dga_detection/lstm_test.cpp

@@ -0,0 +1,123 @@
+/**
+ * @file lstm_dga_detection_train.cpp


The file name in the `@file` header does not match this file's name.

Also, what is the point of this file, given that we already have the train file above?

Ack, I thought I removed it! Sorry about that. Fixed in 6169ab1.

shrit · 2025-01-09T18:03:56Z

scripts/download_data_set.py

    test_labels = requests.get(
        "https://datasets.mlpack.org/mnist/t10k-labels-idx1-ubyte.gz")
    progress_bar("test_labels.gz", test_labels)
    ungzip("test_labels.gz", "test_labels.ubytes")
-  
+


Thanks 👍

rcurtin · 2025-01-09T18:06:26Z

This should wait for merge on those other 3 PRs.

github-actions

Second approval provided automatically after 24 hours. 👍

rcurtin added 3 commits January 1, 2025 21:09

Add implementation of DGA detector with RNNs.

0e9ead6

Update DGA detection example to its final working state.

12640c5

Add a script to download the DGA data.

0edc2af

rcurtin mentioned this pull request Jan 9, 2025

Add support for weights to data::Split() and allow cube types. mlpack/mlpack#3869

Merged

shrit approved these changes Jan 9, 2025

Remove file that I thought I removed before...

6169ab1

github-actions bot approved these changes Jan 10, 2025
