- Kapline is a platform that uses machine learning to detect and classify malicious applications (Android APKs only)
- This is a project for the subjects Technologies for Advanced Programming and Social Media Management at UniCT
- The core of the project is built on top of quark-engine and quark-rules
- Based on the dataset, the model is trained on 5 classes: Benign, Riskware, Adware, SMS, and Banking
The pipeline is structured as follows:

Stage | Service |
---|---|
Source | User via Telegram Bot |
Ingestion | Fluentd |
Transport | Apache Kafka |
Storage (input) | httpd |
Processing | Apache Spark |
Storage (output) | Elasticsearch |
Visualization | Grafana |
- The frontend is provided by a Telegram bot (for simplicity)
- The Telegram bot container and the `httpd` container share a volume where the uploaded files are stored

The bot sends a message to Fluentd in this format:

```json
{
  "userid": long,
  "filename": string,
  "md5": string
}
```

The field `filename` will be used later to retrieve the file from `httpd`.
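For illustration, here is a minimal sketch of how the bot could emit this event, assuming Fluentd's standard HTTP input; the tag (`apk.uploaded`), the port, and the use of `requests` are assumptions, not taken from the repository:

```python
import hashlib

import requests

FLUENTD_URL = "http://fluentd:9880/apk.uploaded"  # hypothetical tag on in_http's default port

def notify_fluentd(userid: int, filename: str, apk_path: str) -> None:
    """Announce a freshly saved APK to Fluentd after writing it to the shared volume."""
    with open(apk_path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    event = {"userid": userid, "filename": filename, "md5": md5}
    requests.post(FLUENTD_URL, json=event, timeout=5).raise_for_status()
```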
- Ingestion is provided by Fluentd
- Fluentd exposes a route on which it awaits input events
- In this step, the field `"timestamp": date` is added
- This component writes the message to a Kafka topic named `apk_pointers`
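To verify what lands on the topic, here is a quick consumer sketch; the broker address and the choice of the `kafka-python` client are assumptions:

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "apk_pointers",
    bootstrap_servers="kafka:9092",  # assumption: broker hostname from docker-compose
    value_deserializer=lambda raw: json.loads(raw),
)
for message in consumer:
    # Expected shape: {"userid": ..., "filename": ..., "md5": ..., "timestamp": ...}
    print(message.value)
```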
Data processing is powered by Apache Spark. The workflow is:

- The file is retrieved from `http://httpd/{filename}`
- quark-engine is run on the retrieved file and all crimes are scored (see the sketch after the message structure below)
- The malware family is predicted through machine learning
- The predicted label is sent to the Telegram user who requested the analysis
- A new message is written to a Kafka topic called `analyzed`
The structure of the message is the following:

```json
{
  "timestamp": date,
  "md5": string,
  "features": list[double],
  "size": long,
  "predictedLabel": string
}
```
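As referenced above, here is a minimal sketch of the crime-scoring step. It uses quark-engine's documented Python interface (`quark.report.Report`), but the rule directory path and the exact report field names are assumptions:

```python
from quark.report import Report  # pip install quark-engine

RULE_DIR = "/quark-rules/rules"  # assumption: wherever the quark-rules checkout is mounted

def score_crimes(apk_path: str) -> list[float]:
    """Run quark-engine on one APK and return one score per rule (the feature vector)."""
    report = Report()
    report.analysis(apk_path, RULE_DIR)
    json_report = report.get_report("json")
    # Assumption: each crime entry exposes a confidence such as "100%".
    return [float(str(crime["confidence"]).rstrip("%")) for crime in json_report["crimes"]]
```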
Now the message will be enriched with some statistics:

- The rules are grouped by label (see quark-rules/label_desc.csv and utils/extract_labels.py)
- Some partial scores are calculated, one per label containing at least 4 rules (sketched below)
- The data is loaded into Elasticsearch
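A minimal sketch of that enrichment, assuming the label-to-rule mapping has already been derived from quark-rules/label_desc.csv via utils/extract_labels.py; averaging as the partial score is an assumption:

```python
def partial_scores(features: list[float], rules_by_label: dict[str, list[int]]) -> dict[str, float]:
    """Compute one partial score per label, skipping labels with fewer than 4 rules.

    `rules_by_label` maps a label (e.g. "network") to the indices of its rules
    inside the feature vector.
    """
    return {
        f"{label}_score": sum(features[i] for i in indices) / len(indices)
        for label, indices in rules_by_label.items()
        if len(indices) >= 4
    }
```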
The structure of a record in Elasticsearch is:

```json
{
  "@timestamp": date,
  "calendar_score": double,
  "calllog_score": double,
  "network_score": double,
  ...
  "max_score": double,
  "md5": string,
  "size": long,
  "predictedLabel": string
}
```
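For illustration, such a record could be written with the official Elasticsearch Python client; the index name, field values, and client configuration below are all assumptions (how the Spark job actually writes these records is not shown here):

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("https://elasticsearch:9200", verify_certs=False)  # assumption: self-signed TLS

doc = {  # illustrative values only
    "@timestamp": "2024-01-01T12:00:00Z",
    "network_score": 3.5,
    "max_score": 4.0,
    "md5": "d41d8cd98f00b204e9800998ecf8427e",
    "size": 123456,
    "predictedLabel": "Adware",
}
es.index(index="kapline", document=doc)  # hypothetical index name
```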
The dataset was generated by running /utils/extractor.py on the MalDroid dataset. A model was then trained through logistic regression, using the score of each rule as a feature.
You can find the Jupyter notebook used for training in spark/model_training.ipynb.
N.B.: At the time I trained the model there were 204 rules, so 204 features.
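For reference, here is a minimal sketch of that training setup in Spark ML; the dataset path and column names are assumptions, and the real code lives in spark/model_training.ipynb:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, VectorAssembler

spark = SparkSession.builder.appName("kapline-training").getOrCreate()
# Assumption: a CSV produced by /utils/extractor.py with columns rule_0 ... rule_203 + "family".
df = spark.read.csv("dataset.csv", header=True, inferSchema=True)

feature_cols = [f"rule_{i}" for i in range(204)]  # 204 rules at training time
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=feature_cols, outputCol="features"),
    StringIndexer(inputCol="family", outputCol="label"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
model.transform(test).select("family", "prediction").show(5)
```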
Service | URL |
---|---|
Bot | @nameofthebot |
httpd | http://httpd |
Elasticsearch | https://elasticsearch:9200 |
Grafana | https://grafana:3000 |
All environment variables in `.env` must be set before running docker-compose:

```sh
cp .env.dist .env
```

Run with:

```sh
docker-compose up
```
Just contact the bot and send the APK(s) you want to analyze!
N.B.: Telegram's Bot API limits the maximum file size a bot can download to 20 MB.