Detect malware with static and dynamic analysis data of exe files.

Amaan-Haque/MalwareDetectionMachineLearning

Malware Classification

Libraries Used

json pandas os numpy sklearn pickle sys

How to run

Run MalwareDetector.py with the directory that contains the analysis files as its command-line argument.
E.g. python3 MalwareDetector.py ./DATASET

Coding Conventions

  1. The column legit = 0 implies malware and legit = 1 implies benign

  2. The contents of Dynamic_Analysis_Data_Part1 and Dynamic_Analysis_Data_Part2 are combined into a single folder, Dynamic_Analysis_Data_Part1.

  3. The feature-extraction source code used on the training data will work when the input is placed in the same directory structure as the training data.

Feature Extraction

Static

Features Extracted

  1. DllCharacteristics
    A flags field in the PE optional header that tells the Windows loader how the image should be loaded and executed, for example whether it can be relocated at load time (ASLR) or is compatible with data-execution prevention (DEP).
  2. DebugSize
    Denotes the size of the debug-directory table. Microsoft-related executable files usually have a debug directory, so many clean programs have a non-zero value for DebugSize.
  3. ImageVersion
    Denotes the version of the file. It is user-definable and not related to the function of the program. Many clean programs have more versions and a larger image-version set. Most malware has an ImageVersion value of 0.
  4. IatRVA
    Denotes the relative virtual address of the import address table. The value of this feature is 4096 for most clean files and 0 or a very large value for virus files. Many malware samples may not use import functions or might obfuscate their import tables.
  5. ExportSize
    Denotes the size of the export table. Usually only DLLs, not executable programs, have export tables. Hence the value of this feature may be non-zero for clean files.
  6. ResourceSize
    Denotes the size of the resource section. Some virus files may have no resources. Clean files may have larger resources.
  7. NumberOfSections
    Denotes the number of sections. The value of this feature varies in both virus and clean files, and it is not clear from inspection how this feature helps separate malware from clean files.
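
All seven fields sit at fixed positions in the PE headers, so they can be read with a handful of offset lookups. The following is a minimal sketch, not the repository's extraction code: the function name static_features is hypothetical, ImageVersion is assumed to mean MajorImageVersion, and the offsets are those of the standard PE32 and PE32+ optional header.

```python
import struct

# Data-directory indices defined by the PE/COFF format.
DIR_EXPORT, DIR_RESOURCE, DIR_DEBUG, DIR_IAT = 0, 2, 6, 12


def static_features(data: bytes) -> dict:
    """Read the seven static features from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ header")
    (pe_off,) = struct.unpack_from("<I", data, 0x3C)  # e_lfanew
    if data[pe_off:pe_off + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")

    coff = pe_off + 4                      # COFF file header (20 bytes)
    (n_sections,) = struct.unpack_from("<H", data, coff + 2)

    opt = coff + 20                        # optional header
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic == 0x10B:                     # PE32
        dirs = opt + 96
    elif magic == 0x20B:                   # PE32+ (64-bit)
        dirs = opt + 112
    else:
        raise ValueError("unknown optional-header magic")

    (image_version,) = struct.unpack_from("<H", data, opt + 44)
    (dll_chars,) = struct.unpack_from("<H", data, opt + 70)
    (n_dirs,) = struct.unpack_from("<I", data, dirs - 4)  # NumberOfRvaAndSizes

    def directory(index):
        """Return (rva, size) of one data directory, or (0, 0) if absent."""
        if index >= n_dirs:
            return (0, 0)
        return struct.unpack_from("<II", data, dirs + 8 * index)

    return {
        "DllCharacteristics": dll_chars,
        "DebugSize": directory(DIR_DEBUG)[1],
        "ImageVersion": image_version,
        "IatRVA": directory(DIR_IAT)[0],
        "ExportSize": directory(DIR_EXPORT)[1],
        "ResourceSize": directory(DIR_RESOURCE)[1],
        "NumberOfSections": n_sections,
    }
```

It would be called on the bytes of an executable, e.g. static_features(open(path, "rb").read()), where path is whatever file is being analysed.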

Dynamic

Features Extracted

dll loaded

  • The dll_loaded data is logged under behaviour > summary > dll_loaded

  • A Dynamic-Link Library (DLL) is a module that contains functions (called exported functions or exports) that can be used by another program (such as an executable or another DLL). An executable can use the functions implemented in a DLL by importing them from the DLL.

  • Common and most important DLLs

    • Kernel32.dll = a very common DLL that contains core functionality such as access to and manipulation of memory, files and hardware
    • advapi32.dll = provides access to advanced core Windows components such as the service manager and the registry
    • user32.dll = contains the user-interface components, such as buttons, scroll bars, and components for controlling and responding to user actions
    • gdi32.dll = contains functions for displaying and manipulating graphics
    • ws2_32.dll = a networking DLL; a program that accesses it most likely connects to a network or performs network-related tasks
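
As a rough illustration of how the dll_loaded entries could be pulled out of one analysis report, here is a minimal sketch. It assumes the report is a JSON file that nests the list under the keys behaviour, summary and dll_loaded exactly as spelled above; the function names loaded_dlls and load_report are hypothetical, and the key spelling should be adjusted if the report files differ.

```python
import json
import ntpath


def load_report(path: str) -> dict:
    """Parse one JSON analysis report from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def loaded_dlls(report: dict) -> list:
    """Return the lower-cased file names of the DLLs listed in a report."""
    summary = report.get("behaviour", {}).get("summary", {})
    names = []
    for dll_path in summary.get("dll_loaded", []):
        # ntpath splits Windows-style paths correctly on any host OS.
        names.append(ntpath.basename(dll_path).lower())
    return names


# Hand-written sample in the assumed report layout.
sample = {"behaviour": {"summary": {"dll_loaded": [
    "C:\\Windows\\system32\\KERNEL32.dll",
    "ws2_32.dll",
]}}}
print(loaded_dlls(sample))
```

Lower-casing and stripping the directory means that the same library loaded from different paths, or with different capitalisation, is treated as one feature.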

Registry Keys

  • A lot of information can be extracted from the dynamic analysis by looking at the registry keys that a program modifies. Windows uses the registry to store low-level settings and program settings in one place, so the modification or deletion of certain keys may hint at the intent of the software.
  • We extracted the modified registry keys from two specific heads
    • Behaviour>summary>regkey_written
    • Behaviour>summary>regkey_deleted
  • We combined the two lists and then reduced the result to the 100 most frequently modified registry keys.
Count : 100
Content:
[('(default)', 2259), ('replaceapps', 426), ('clsid', 238), ('autodetect', 231), ('uncasintranet', 231), ('extension.key', 225), ('extensions.commasep', 225), ('extensions.spacesep', 225), ('start page', 207), ('extension.handler', 192), ('mediatype.description', 192), ('mediatype.descriptionid', 192), ('mediatype.icon', 192), ('extension.mime', 189), ('trappolltimemillisecs', 183), ('maxfilesize', 181), ('filedirectory', 181), ('enablefiletracing', 181), ('enableconsoletracing', 181), ('consoletracingmask', 181), ('filetracingmask', 181), ('mcihandler', 180), ('attributes', 174), ('start', 131), ('languagelist', 131), ('programscache', 91), ('superiorapps', 90), ('{871c5380-42a0-1069-a2ea-08002b30309d}', 73), ('friendlytypename', 72), ('restrictanonymous', 71), ('enabledcom', 71), ('1', 71), ('version', 69), ('extensions', 69), ('1601', 68), ('mime types', 66), ('infotip', 62), ('globalassocchangedcounter', 62), ('nointerneticon', 62), ('hidefolderverbs', 62), ('hideondesktopperuser', 62), ('wantsparsedisplayname', 62), ('ie4uinit.exe,-731', 61), ('infotext', 56), ('superhidden', 56), ('showsuperhidden', 55), ('search page', 53), ('unregmp2.exe,-4', 53), ('searchassistant', 51), ('devicecenter.dll,-1000', 50), ('9', 50), ('window title', 50), ('sud.dll,-1', 49), ('explorer.exe,-7021', 48), ('proxyenable', 42), ('savedlegacysettings', 42), ('search bar', 42), ('threadingmodel', 39), ('animation', 39), ('requiredfile', 39), ('printer', 36), ('cdafile2', 36), ('hidefileext', 34), ('checkedvalue', 33), ('defaultvalue', 32), ('checksetting', 32), ('{k7c0db872a3f777c0}', 31), ('use search asst', 28), ('source filter', 27), ('meltme', 26), ('0', 25), ('{fbf23b40-e3f0-101b-8488-00aa003e56f8} {000214f9-0000-0000-c000-000000000046} 0xffff', 25), ('10', 25), ('windows media player service', 24), ('compaq service drivers', 24), ('compatibilityflags', 24), ('vercache', 24), ('mimetype', 24), ('enabled', 23), ('5', 23), ('searchlist', 21), ('enableremoteconnect', 21), 
('deadgwdetectdefault', 21), ('usedomainnamedevolution', 21), ('forwardbroadcasts', 21), ('allowunqualifiedquery', 21), ('nameserver', 21), ('dontadddefaultgatewaydefault', 21), ('autoshareserver', 21), ('transportbindname', 21), ('domain', 21), ('autosharewks', 21), ('enableicmpredirect', 21), ('enablesecurityfilters', 21), ('ipenablerouter', 21), ('prioritizerecorddata', 21), ('id', 21), ('ffpfastforwardingcachesize', 20), ('largebuffersize', 20), ('priorityboost', 20)]
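
A list of this shape could be produced with a frequency count over all reports. The sketch below is an assumption about the procedure rather than the repository's code: top_registry_names is a hypothetical name, each report is assumed to be an already-parsed dict with the regkey_written and regkey_deleted lists under behaviour > summary, and only the lower-cased last component of each registry path is counted, which is what the entries above appear to be.

```python
from collections import Counter


def top_registry_names(reports, n=100):
    """Count the last component of every written or deleted registry path."""
    counts = Counter()
    for report in reports:
        summary = report.get("behaviour", {}).get("summary", {})
        for head in ("regkey_written", "regkey_deleted"):
            for key_path in summary.get(head, []):
                # Keep only the text after the final backslash, lower-cased.
                name = key_path.rsplit("\\", 1)[-1].lower()
                counts[name] += 1
    return counts.most_common(n)
```

Every occurrence is counted here; counting each name at most once per report would be an equally plausible choice and only requires collecting the names of a report into a set before updating the counter.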

Domain Name

  • The entries under network>domains tell us which domains the software connects to.
  • We collected the multiple values under the header network>domains in the test files and applied the following text processing
    • Multiple domains are space separated and have similar names, so tokenization was used to remove whitespace and to compile a list of the 100 most common domain names
    • This feature was ultimately not used because of the heavy preprocessing required to produce a uniform list within the contest time.
Count : 100
Content :
[('cc.iitk.ac.in', 13098), ('iitk.ac.in', 13098), ('junta.iitk.ac.in', 13098), ('mirror5.internetdownloadmanager.com', 4961), ('secure.internetdownloadmanager.com', 4961), ('registeridm.com', 4961), ('mirror3.internetdownloadmanager.com', 4961), ('www.internetdownloadmanager.com', 4961), ('test.internetdownloadmanager.com', 4961), ('teredo.ipv6.microsoft.com', 4961), ('dns.msftncsi.com', 4956), ('_googlecast._tcp.local', 4948), ('clientservices.googleapis.com', 4713), ('www.google.com', 4691), ('www.google.co.in', 4505), ('wpad.cse.iitk.ac.in', 3013), ('wpad.openstacklocal', 1945), ('isatap.cse.iitk.ac.in', 1514), ('isatap.openstacklocal', 1067), ('zexhuvkamyrvm.cse.iitk.ac.in', 1018), ('aymwknwl.cse.iitk.ac.in', 1015), ('nylnoou.cse.iitk.ac.in', 1015), ('zexhuvkamyrvm.openstacklocal', 735), ('aymwknwl.openstacklocal', 731), ('nylnoou.openstacklocal', 731), ('amyrvmcgszqobp.cse.iitk.ac.in', 338), ('uasmzexh.cse.iitk.ac.in', 260), ('nwljtfnyl.cse.iitk.ac.in', 251), ('amyrvmcgszqobp.openstacklocal', 216), ('iobpfpratk.cse.iitk.ac.in', 214), ('akdvrzacozqw.cse.iitk.ac.in', 211), ('ztijowokk.cse.iitk.ac.in', 209), ('uasmzexh.openstacklocal', 171), ('nwljtfnyl.openstacklocal', 169), ('zqwkagzti.cse.iitk.ac.in', 154), ('okklrgb.cse.iitk.ac.in', 154), ('fpratklcvakdvrz.cse.iitk.ac.in', 154), ('iobpfpratk.openstacklocal', 152), ('tsziobplqsatk.cse.iitk.ac.in', 151), ('ztijowokk.openstacklocal', 151), ('akdvrzacozqw.openstacklocal', 151), ('irdvrza.cse.iitk.ac.in', 146), ('qwkageobjowokx.cse.iitk.ac.in', 145), ('smzexhuvkamyr.cse.iitk.ac.in', 144), ('gszqobphcsau.cse.iitk.ac.in', 144), ('tfnylno.cse.iitk.ac.in', 143), ('qwkageobjowokx.openstacklocal', 113), ('irdvrza.openstacklocal', 113), ('tsziobplqsatk.openstacklocal', 113), ('tinypic.com', 96), ('match.com', 96), ('daum.net', 96), ('smzexhuvkamyr.openstacklocal', 91), ('gszqobphcsau.openstacklocal', 91), ('tfnylno.openstacklocal', 89), ('dvrzadatqwka.cse.iitk.ac.in', 84), ('bjowokx.cse.iitk.ac.in', 84), 
('obplqsatklc.cse.iitk.ac.in', 84), ('asftbxh.cse.iitk.ac.in', 77), ('osjrvmcgtciobp.cse.iitk.ac.in', 77), ('jtfnrzi.cse.iitk.ac.in', 77), ('kosjrvmcgtciob.cse.iitk.ac.in', 70), ('mhtklcjrelvr.cse.iitk.ac.in', 70), ('rzioouasftbx.cse.iitk.ac.in', 70), ('ilikearts.com', 70), ('artsbizworld.com', 70), ('realquickmedia.com', 70), ('fpratklcvakdvrz.openstacklocal', 67), ('okklrgb.openstacklocal', 67), ('zqwkagzti.openstacklocal', 67), ('oouasmzexh.cse.iitk.ac.in', 67), ('qgszayip.cse.iitk.ac.in', 66), ('rtqicigvsbjjbik.cse.iitk.ac.in', 65), ('apwbcbrrdfu.cse.iitk.ac.in', 65), ('jtfnrzi.openstacklocal', 65), ('asftbxh.openstacklocal', 64), ('ymwknwljtfn.cse.iitk.ac.in', 64), ('osjrvmcgtciobp.openstacklocal', 63), ('sqgszqotdwcsau.cse.iitk.ac.in', 58), ('dvrzadatqwka.openstacklocal', 51), ('obplqsatklc.openstacklocal', 50), ('bjowokx.openstacklocal', 50), ('tvmqgszqobp.cse.iitk.ac.in', 49), ('rtqzmagvsbjdkok.cse.iitk.ac.in', 47), ('atylcbrrdqe.cse.iitk.ac.in', 47), ('igvsbjjbikzthg.cse.iitk.ac.in', 44), ('rrdfukagrt.cse.iitk.ac.in', 43), ('ayipwcsapw.cse.iitk.ac.in', 43), ('qgszayip.openstacklocal', 41), ('apwbcbrrdfu.openstacklocal', 40), ('rtqicigvsbjjbik.openstacklocal', 40), ('oouasmzexh.openstacklocal', 40), ('mediaartsplaza.com', 40), ('theheroarts.com', 40), ('superartsacademy.com', 40), ('ikea.com', 39), ('ymwknwljtfn.openstacklocal', 38), ('sitesell.com', 38), ('google.ae', 38), ('knwlpyhnyl.cse.iitk.ac.in', 36)]
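
The tokenization step described above can be sketched as follows. This is an illustration under assumptions, not the code that produced the list: tokenize_domains and top_domains are hypothetical names, and each input is assumed to be one whitespace-separated string of domain names taken from a report.

```python
from collections import Counter


def tokenize_domains(raw: str) -> list:
    """Split a whitespace-separated string into normalized domain names."""
    tokens = (token.lower().rstrip(".") for token in raw.split())
    return [token for token in tokens if token]


def top_domains(raw_values, n=100):
    """Return the n most common domain names across all raw strings."""
    counts = Counter()
    for raw in raw_values:
        counts.update(tokenize_domains(raw))
    return counts.most_common(n)
```

Lower-casing and removing a trailing dot are the only normalizations applied here; grouping the randomly named subdomains visible in the list above would need additional rules.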

Feature Engineering

Bag of Words [Sentiment Analysis]

Computing the bag-of-words representation for a corpus of documents consists of the following three steps:

  1. Tokenization. Split each document into the words that appear in it (called tokens), for example by splitting them on whitespace and punctuation.

  2. Vocabulary building. Collect a vocabulary of all words that appear in any of the documents, and number them (say, in alphabetical order).

  3. Encoding. For each document, count how often each of the words in the vocabulary appears in this document.
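
The three steps can be written out in a few lines of plain Python. This is a minimal sketch for illustration: the function names are hypothetical, tokens are taken to be runs of two or more word characters, and the two sample documents are made up.

```python
import re
from collections import Counter

# A token is a run of at least two letters, digits or underscores.
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(document: str) -> list:
    """Step 1: split a document into lower-cased tokens."""
    return TOKEN.findall(document.lower())


def build_vocabulary(documents) -> dict:
    """Step 2: map every distinct token to an index, in alphabetical order."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def encode(document: str, vocabulary: dict) -> list:
    """Step 3: count how often each vocabulary word occurs in the document."""
    counts = Counter(tokenize(document))
    row = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            row[vocabulary[word]] = count
    return row


docs = ["C:\\Windows\\system32\\user32.dll", "kernel32.dll user32.dll"]
vocab = build_vocabulary(docs)
rows = [encode(doc, vocab) for doc in docs]
print(vocab)  # {'dll': 0, 'kernel32': 1, 'system32': 2, 'user32': 3, 'windows': 4}
print(rows)   # [[1, 0, 1, 1, 1], [2, 1, 0, 1, 0]]
```

scikit-learn's CountVectorizer performs the same three steps and exposes the word-to-index mapping as its vocabulary_ attribute; whether the repository uses it is not stated here.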

Vocabulary Found [On a random split of the data]

Vocabulary size: 433
Vocabulary content:
 {'mpr': 223, 'dll': 110, 'imm32': 171, 'windows': 404, 'system32': 359, 'user32': 375, 'netmsg': 258, 'api': 36, 'ms': 224, 'win': 402, 'service': 329, 'management': 208, ............................... 'console': 70, 'namedpipe': 251, 'rtlsupport': 317, 'handle': 153, 'memory': 215, 'misc': 219, 'debug': 95, 'errorhandling': 125, 'file': 135, 'kernelbase': 190, 'profile': 295, 'util': 380, 'libraryloader': 195, 'isdone': 183, 'idp': 164, 'idmshellext': 162, 'idmnetmon': 161}
