json
pandas
os
numpy
sklearn
pickle
sys
Run MalwareDetector.py, passing the directory that contains the analysis files as the command-line argument.
E.g. python3 MalwareDetector.py ./DATASET
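As a rough illustration of the invocation above, the sketch below shows one way a script could read the directory argument and gather the analysis files. It assumes the analysis files are JSON reports with a `.json` extension; `collect_report_paths` and `main` are hypothetical names and are not taken from MalwareDetector.py.

```python
import os
import sys


def collect_report_paths(root_dir):
    """Return the sorted paths of all .json files found under root_dir."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            # Assumption: analysis reports are stored as .json files.
            if name.lower().endswith(".json"):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def main(argv):
    """Validate the command line and return the report paths to process."""
    if len(argv) != 2:
        raise SystemExit("usage: python3 MalwareDetector.py <analysis-directory>")
    return collect_report_paths(argv[1])
```

A script built this way would call `main(sys.argv)` at the bottom and then process each returned path in turn.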
- The column `legit` = 0 implies malware and `legit` = 1 implies benign.
- The contents of `Dynamic_Analysis_Data_Part1` and `Dynamic_Analysis_Data_Part1` are combined together appropriately into a single folder `Dynamic_Analysis_Data_Part1`.
- The source code used to extract features from the training data will work when it is placed in the specific directory structure of the training data.
- DllCharacteristics: DLLs provide linkage and execution information of code when loaded into the window. This feature is also used to analyze imported libraries and the types of linkage used when executing executable files.
- DebugSize: Denotes the size of the debug-directory table. Usually, Microsoft-related executable files have a debug directory. Hence many clean programs may have a non-zero value for DebugSize.
- ImageVersion: Denotes the version of the file. It is user-definable and not related to the function of the program. Many clean programs have more versions and a larger image-version set. Most malware have an ImageVersion value of 0.
- IatRVA: Denotes the relative virtual address of the import-address table. The value of this feature is 4096 for most clean files and 0 or a very large value for virus files. Many malware may not use import functions or might obfuscate their import tables.
- ExportSize: Denotes the size of the export table. Usually, only DLLs, not executable programs, have export tables. Hence the value of this feature may be non-zero for clean files.
- ResourceSize: Denotes the size of the resource section. Some virus files may have no resources. Clean files may have larger resources.
- NumberOfSections: Denotes the number of sections. The value of this feature varies in both virus and clean files, and it is not clear from inspection how this feature helps separate malware and clean files.
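The rules of thumb in the feature list above can be written down as a few simple checks. The sketch below is only a readable summary of those observations, not the classifier itself; the function name `static_feature_notes` and the sample values are hypothetical.

```python
def static_feature_notes(features):
    """Return readable notes for the rules of thumb in the feature list."""
    notes = []
    if features.get("ImageVersion") == 0:
        notes.append("ImageVersion is 0, as in most malware")
    if features.get("IatRVA") == 4096:
        notes.append("IatRVA is 4096, as in most clean files")
    elif features.get("IatRVA") == 0:
        notes.append("IatRVA is 0, as in many virus files")
    if features.get("DebugSize", 0) > 0:
        notes.append("DebugSize is non-zero, as in many clean programs")
    if features.get("ResourceSize") == 0:
        notes.append("ResourceSize is 0, as in some virus files")
    return notes


# Hypothetical feature values, made up for illustration.
sample_features = {
    "ImageVersion": 0,
    "IatRVA": 0,
    "DebugSize": 0,
    "ResourceSize": 0,
}
sample_notes = static_feature_notes(sample_features)
```

None of these checks is decisive on its own, which is why the features are fed to a trained model rather than used as hard rules.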
- The `dll_loaded` data is logged under `behaviour` > `summary` > `dll_loaded`.
- A Dynamic-Link Library (DLL) is a module that contains functions (called exported functions or exports) that can be used by another program (such as an executable or another DLL). An executable can use the functions implemented in a DLL by importing them from the DLL.
- The most common and important DLLs:
- Kernel32.dll = A very common DLL that contains core functionality such as access to and manipulation of memory, files, and hardware.
- advapi32.dll = Provides access to advanced core Windows components such as the service manager and the registry.
- user32.dll = Contains all the user-interface components, such as buttons, scroll bars, and components for controlling and responding to user actions.
- gdi32.dll = Contains functions for displaying and manipulating graphics.
- ws2_32.dll = A networking DLL. A program that accesses it most likely connects to a network or performs network-related tasks.
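As a minimal sketch of how the `dll_loaded` entries could be read from a parsed report and checked against the DLLs listed above: the key path follows the spelling used in this document and may need adjusting for your reports, and `get_nested`, `loaded_dll_names`, and the sample report are hypothetical.

```python
import ntpath


def get_nested(report, keys, default=None):
    """Walk a nested dict along keys, returning default if a key is missing."""
    node = report
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def loaded_dll_names(report, path=("behaviour", "summary", "dll_loaded")):
    """Return the lowercased file names of the DLLs listed in the report."""
    entries = get_nested(report, path, default=[])
    # ntpath handles Windows-style backslash paths on any platform.
    return {ntpath.basename(entry).lower() for entry in entries}


# Hypothetical report fragment, made up for illustration.
sample_report = {
    "behaviour": {
        "summary": {
            "dll_loaded": [
                "C:\\Windows\\system32\\USER32.dll",
                "kernel32.dll",
                "C:\\Windows\\system32\\ws2_32.dll",
            ]
        }
    }
}

names = loaded_dll_names(sample_report)
uses_network_dll = "ws2_32.dll" in names
```

Normalizing to lowercase file names means the same DLL is recognized whether it is logged with a full path or in a different case.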
- A lot of information can be extracted from the dynamic analysis by looking at the registry keys that a piece of software modifies. Windows uses registry keys to store low-level settings and program settings in one place. The modification or deletion of certain keys may hint at the intent of the software.
- We extracted the modified registry keys from two specific heads:
  - `behaviour` > `summary` > `regkey_written`
  - `behaviour` > `summary` > `regkey_deleted`
- We combined this information and then reduced it to a list of the 100 most frequently modified registry keys.
Count : 100
Content:
[('(default)', 2259), ('replaceapps', 426), ('clsid', 238), ('autodetect', 231), ('uncasintranet', 231), ('extension.key', 225), ('extensions.commasep', 225), ('extensions.spacesep', 225), ('start page', 207), ('extension.handler', 192), ('mediatype.description', 192), ('mediatype.descriptionid', 192), ('mediatype.icon', 192), ('extension.mime', 189), ('trappolltimemillisecs', 183), ('maxfilesize', 181), ('filedirectory', 181), ('enablefiletracing', 181), ('enableconsoletracing', 181), ('consoletracingmask', 181), ('filetracingmask', 181), ('mcihandler', 180), ('attributes', 174), ('start', 131), ('languagelist', 131), ('programscache', 91), ('superiorapps', 90), ('{871c5380-42a0-1069-a2ea-08002b30309d}', 73), ('friendlytypename', 72), ('restrictanonymous', 71), ('enabledcom', 71), ('1', 71), ('version', 69), ('extensions', 69), ('1601', 68), ('mime types', 66), ('infotip', 62), ('globalassocchangedcounter', 62), ('nointerneticon', 62), ('hidefolderverbs', 62), ('hideondesktopperuser', 62), ('wantsparsedisplayname', 62), ('ie4uinit.exe,-731', 61), ('infotext', 56), ('superhidden', 56), ('showsuperhidden', 55), ('search page', 53), ('unregmp2.exe,-4', 53), ('searchassistant', 51), ('devicecenter.dll,-1000', 50), ('9', 50), ('window title', 50), ('sud.dll,-1', 49), ('explorer.exe,-7021', 48), ('proxyenable', 42), ('savedlegacysettings', 42), ('search bar', 42), ('threadingmodel', 39), ('animation', 39), ('requiredfile', 39), ('printer', 36), ('cdafile2', 36), ('hidefileext', 34), ('checkedvalue', 33), ('defaultvalue', 32), ('checksetting', 32), ('{k7c0db872a3f777c0}', 31), ('use search asst', 28), ('source filter', 27), ('meltme', 26), ('0', 25), ('{fbf23b40-e3f0-101b-8488-00aa003e56f8} {000214f9-0000-0000-c000-000000000046} 0xffff', 25), ('10', 25), ('windows media player service', 24), ('compaq service drivers', 24), ('compatibilityflags', 24), ('vercache', 24), ('mimetype', 24), ('enabled', 23), ('5', 23), ('searchlist', 21), ('enableremoteconnect', 21), 
('deadgwdetectdefault', 21), ('usedomainnamedevolution', 21), ('forwardbroadcasts', 21), ('allowunqualifiedquery', 21), ('nameserver', 21), ('dontadddefaultgatewaydefault', 21), ('autoshareserver', 21), ('transportbindname', 21), ('domain', 21), ('autosharewks', 21), ('enableicmpredirect', 21), ('enablesecurityfilters', 21), ('ipenablerouter', 21), ('prioritizerecorddata', 21), ('id', 21), ('ffpfastforwardingcachesize', 20), ('largebuffersize', 20), ('priorityboost', 20)]
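A list like the one above can be produced with a frequency count over the two heads. The sketch below assumes that each logged entry is a backslash-separated registry path and that only its lowercased last component is counted, which is a guess based on the entries shown; the function names and sample reports are hypothetical.

```python
from collections import Counter


def registry_value_name(key_path):
    """Return the lowercased last component of a backslash-separated path."""
    return key_path.rstrip("\\").rsplit("\\", 1)[-1].lower()


def top_registry_names(reports, n=100):
    """Count names across regkey_written and regkey_deleted; keep the top n."""
    counts = Counter()
    for report in reports:
        summary = report.get("behaviour", {}).get("summary", {})
        for head in ("regkey_written", "regkey_deleted"):
            for key_path in summary.get(head, []):
                counts[registry_value_name(key_path)] += 1
    return counts.most_common(n)


# Hypothetical reports, made up for illustration.
sample_reports = [
    {
        "behaviour": {
            "summary": {
                "regkey_written": [
                    "HKEY_CURRENT_USER\\Software\\Example\\(Default)",
                    "HKEY_LOCAL_MACHINE\\SOFTWARE\\Example\\Start Page",
                ],
                "regkey_deleted": [
                    "HKEY_CURRENT_USER\\Software\\Other\\(Default)",
                ],
            }
        }
    },
    {
        "behaviour": {
            "summary": {
                "regkey_written": [
                    "HKEY_CURRENT_USER\\Software\\Example\\(Default)",
                ]
            }
        }
    },
]

top_names = top_registry_names(sample_reports)
```

`Counter.most_common` returns `(name, count)` tuples sorted by descending count, which matches the shape of the list printed above.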
Domain Name
- The domains listed under `network` > `domains` tell us which domains the software connects to.
- We collected the multiple values under the header `network` > `domains` in the test files and performed the following text processing:
  - Multiple domains are space separated and have similar names, so tokenization was done to remove whitespace and to compile a list of the top 100 most common domain names.
- This feature was not considered further because of the heavy preprocessing needed to produce a uniform list within the contest time.
Count : 100
Content :
[('cc.iitk.ac.in', 13098), ('iitk.ac.in', 13098), ('junta.iitk.ac.in', 13098), ('mirror5.internetdownloadmanager.com', 4961), ('secure.internetdownloadmanager.com', 4961), ('registeridm.com', 4961), ('mirror3.internetdownloadmanager.com', 4961), ('www.internetdownloadmanager.com', 4961), ('test.internetdownloadmanager.com', 4961), ('teredo.ipv6.microsoft.com', 4961), ('dns.msftncsi.com', 4956), ('_googlecast._tcp.local', 4948), ('clientservices.googleapis.com', 4713), ('www.google.com', 4691), ('www.google.co.in', 4505), ('wpad.cse.iitk.ac.in', 3013), ('wpad.openstacklocal', 1945), ('isatap.cse.iitk.ac.in', 1514), ('isatap.openstacklocal', 1067), ('zexhuvkamyrvm.cse.iitk.ac.in', 1018), ('aymwknwl.cse.iitk.ac.in', 1015), ('nylnoou.cse.iitk.ac.in', 1015), ('zexhuvkamyrvm.openstacklocal', 735), ('aymwknwl.openstacklocal', 731), ('nylnoou.openstacklocal', 731), ('amyrvmcgszqobp.cse.iitk.ac.in', 338), ('uasmzexh.cse.iitk.ac.in', 260), ('nwljtfnyl.cse.iitk.ac.in', 251), ('amyrvmcgszqobp.openstacklocal', 216), ('iobpfpratk.cse.iitk.ac.in', 214), ('akdvrzacozqw.cse.iitk.ac.in', 211), ('ztijowokk.cse.iitk.ac.in', 209), ('uasmzexh.openstacklocal', 171), ('nwljtfnyl.openstacklocal', 169), ('zqwkagzti.cse.iitk.ac.in', 154), ('okklrgb.cse.iitk.ac.in', 154), ('fpratklcvakdvrz.cse.iitk.ac.in', 154), ('iobpfpratk.openstacklocal', 152), ('tsziobplqsatk.cse.iitk.ac.in', 151), ('ztijowokk.openstacklocal', 151), ('akdvrzacozqw.openstacklocal', 151), ('irdvrza.cse.iitk.ac.in', 146), ('qwkageobjowokx.cse.iitk.ac.in', 145), ('smzexhuvkamyr.cse.iitk.ac.in', 144), ('gszqobphcsau.cse.iitk.ac.in', 144), ('tfnylno.cse.iitk.ac.in', 143), ('qwkageobjowokx.openstacklocal', 113), ('irdvrza.openstacklocal', 113), ('tsziobplqsatk.openstacklocal', 113), ('tinypic.com', 96), ('match.com', 96), ('daum.net', 96), ('smzexhuvkamyr.openstacklocal', 91), ('gszqobphcsau.openstacklocal', 91), ('tfnylno.openstacklocal', 89), ('dvrzadatqwka.cse.iitk.ac.in', 84), ('bjowokx.cse.iitk.ac.in', 84), 
('obplqsatklc.cse.iitk.ac.in', 84), ('asftbxh.cse.iitk.ac.in', 77), ('osjrvmcgtciobp.cse.iitk.ac.in', 77), ('jtfnrzi.cse.iitk.ac.in', 77), ('kosjrvmcgtciob.cse.iitk.ac.in', 70), ('mhtklcjrelvr.cse.iitk.ac.in', 70), ('rzioouasftbx.cse.iitk.ac.in', 70), ('ilikearts.com', 70), ('artsbizworld.com', 70), ('realquickmedia.com', 70), ('fpratklcvakdvrz.openstacklocal', 67), ('okklrgb.openstacklocal', 67), ('zqwkagzti.openstacklocal', 67), ('oouasmzexh.cse.iitk.ac.in', 67), ('qgszayip.cse.iitk.ac.in', 66), ('rtqicigvsbjjbik.cse.iitk.ac.in', 65), ('apwbcbrrdfu.cse.iitk.ac.in', 65), ('jtfnrzi.openstacklocal', 65), ('asftbxh.openstacklocal', 64), ('ymwknwljtfn.cse.iitk.ac.in', 64), ('osjrvmcgtciobp.openstacklocal', 63), ('sqgszqotdwcsau.cse.iitk.ac.in', 58), ('dvrzadatqwka.openstacklocal', 51), ('obplqsatklc.openstacklocal', 50), ('bjowokx.openstacklocal', 50), ('tvmqgszqobp.cse.iitk.ac.in', 49), ('rtqzmagvsbjdkok.cse.iitk.ac.in', 47), ('atylcbrrdqe.cse.iitk.ac.in', 47), ('igvsbjjbikzthg.cse.iitk.ac.in', 44), ('rrdfukagrt.cse.iitk.ac.in', 43), ('ayipwcsapw.cse.iitk.ac.in', 43), ('qgszayip.openstacklocal', 41), ('apwbcbrrdfu.openstacklocal', 40), ('rtqicigvsbjjbik.openstacklocal', 40), ('oouasmzexh.openstacklocal', 40), ('mediaartsplaza.com', 40), ('theheroarts.com', 40), ('superartsacademy.com', 40), ('ikea.com', 39), ('ymwknwljtfn.openstacklocal', 38), ('sitesell.com', 38), ('google.ae', 38), ('knwlpyhnyl.cse.iitk.ac.in', 36)]
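The tokenization and counting described above can be sketched as follows. The sketch assumes each collected value is a string of whitespace-separated domain names and lowercases them before counting; `tokenize_domains`, `top_domains`, and the sample values are hypothetical.

```python
from collections import Counter


def tokenize_domains(raw_values):
    """Split whitespace-separated domain strings into lowercased tokens."""
    tokens = []
    for value in raw_values:
        # str.split() with no argument drops leading, trailing,
        # and repeated whitespace.
        tokens.extend(part.lower() for part in value.split())
    return tokens


def top_domains(raw_values, n=100):
    """Return the n most common domain names with their counts."""
    return Counter(tokenize_domains(raw_values)).most_common(n)


# Hypothetical collected values, made up for illustration.
sample_values = [
    "a.example.com  b.example.com",
    "A.Example.com",
    " a.example.com\n",
]
domain_counts = top_domains(sample_values)
```

Note that a plain count like this keeps randomly generated host names as separate entries, which hints at why producing a uniform list needed more preprocessing.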
Computing the bag-of-words representation for a corpus of documents consists of the following three steps:
- Tokenization. Split each document into the words that appear in it (called tokens), for example by splitting on whitespace and punctuation.
- Vocabulary building. Collect a vocabulary of all words that appear in any of the documents, and number them (say, in alphabetical order).
- Encoding. For each document, count how often each word in the vocabulary appears in that document.
Vocabulary size: 433
Vocabulary content:
{'mpr': 223, 'dll': 110, 'imm32': 171, 'windows': 404, 'system32': 359, 'user32': 375, 'netmsg': 258, 'api': 36, 'ms': 224, 'win': 402, 'service': 329, 'management': 208, ............................... 'console': 70, 'namedpipe': 251, 'rtlsupport': 317, 'handle': 153, 'memory': 215, 'misc': 219, 'debug': 95, 'errorhandling': 125, 'file': 135, 'kernelbase': 190, 'profile': 295, 'util': 380, 'libraryloader': 195, 'isdone': 183, 'idp': 164, 'idmshellext': 162, 'idmnetmon': 161}
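The three bag-of-words steps described above can be sketched in plain Python. The token pattern (lowercased runs of two or more word characters) and the alphabetical numbering are assumptions chosen to resemble the vocabulary shown; scikit-learn's `CountVectorizer` performs the same three steps, though this sketch does not claim to reproduce the code that generated the vocabulary above. The sample documents are hypothetical.

```python
import re
from collections import Counter

# Assumption: a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(document):
    """Step 1: split a document into lowercased tokens."""
    return TOKEN_PATTERN.findall(document.lower())


def build_vocabulary(documents):
    """Step 2: number every distinct token in alphabetical order."""
    words = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def encode(document, vocabulary):
    """Step 3: count how often each vocabulary word appears in a document."""
    counts = Counter(tokenize(document))
    row = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            row[vocabulary[word]] = count
    return row


# Hypothetical documents, made up for illustration.
documents = [
    r"C:\Windows\system32\user32.dll",
    r"C:\Windows\system32\imm32.dll user32.dll",
]
vocabulary = build_vocabulary(documents)
matrix = [encode(doc, vocabulary) for doc in documents]
```

Each row of `matrix` has one column per vocabulary entry, so every document becomes a fixed-length count vector regardless of how many DLL paths it contains.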