Replace hashing function with faster implementation #128

Open
exp0se opened this issue Aug 20, 2015 · 5 comments
@exp0se
Contributor

exp0se commented Aug 20, 2015

Hey,
I found that the hashing function used throughout Kansa is inefficient and slow. It's okay for small operations like hashing processes, but very slow when you hash a whole path or an entire disk.
Basically, the problem is the ReadAllBytes function that you are using. In my tests I got a significant performance improvement by using a function based on StreamReader IO instead.
Can you take a look at this commit - exp0se@13b9aba
I messed up my commit and a whole bunch of other stuff also got committed - you only need to look at the hashing-related changes.
Would you consider replacing the Kansa hashing function? I could send you a pull request later if you agree.

@jvaldezjr1
Contributor

I think it's a simple one-line change, if I'm understanding this right:

#$fileData = [System.IO.File]::ReadAllBytes($FileName)
  $fileData = ([IO.StreamReader]$FileName).BaseStream
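For context, here is a minimal sketch of how that one-line change might look as a complete function. `Get-StreamHash` and its parameter are hypothetical names for illustration, not Kansa's actual code, and the sketch assumes PowerShell 3+ on .NET 4 or later (where `Dispose()` is public on the hash object). It uses `[System.IO.File]::OpenRead` rather than a `StreamReader`, since only the raw bytes are needed, and it closes the file explicitly, which the one-liner above never does.

```powershell
# Hypothetical helper (not from Kansa): hash a file by handing a stream to
# ComputeHash instead of loading the whole file with ReadAllBytes.
function Get-StreamHash {
    param(
        [Parameter(Mandatory = $true)][string]$FileName
    )

    # Resolve to a full path, because .NET's working directory can differ
    # from PowerShell's current location.
    $path = (Resolve-Path -LiteralPath $FileName).ProviderPath

    $sha = [System.Security.Cryptography.SHA256]::Create()
    # OpenRead returns a read-only FileStream.
    $stream = [System.IO.File]::OpenRead($path)
    try {
        # ComputeHash(Stream) reads the file through an internal buffer.
        $hashBytes = $sha.ComputeHash($stream)
        # Format as an uppercase hex string without separators.
        return ([System.BitConverter]::ToString($hashBytes)) -replace '-', ''
    }
    finally {
        # Release the file handle and the hash object even if hashing fails.
        $stream.Dispose()
        $sha.Dispose()
    }
}
```

Calling `Get-StreamHash -FileName C:\some\file.dll` (a placeholder path) would return the SHA256 digest as a hex string.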

I did some research here: http://learn-powershell.net/2013/03/25/use-powershell-to-calculate-the-hash-of-a-file/
On my Windows 7 host, I've been testing a portion of the ProcsNModules script that hashes the DLLs associated with each process using SHA256. My first test, without IO.StreamReader:
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 13
Milliseconds      : 308
Ticks             : 133089160
TotalDays         : 0.00015403837962963
TotalHours        : 0.00369692111111111
TotalMinutes      : 0.221815266666667
TotalSeconds      : 13.308916
TotalMilliseconds : 13308.916

and again with IO.StreamReader:
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 12
Milliseconds      : 927
Ticks             : 129271589
TotalDays         : 0.000149619894675926
TotalHours        : 0.00359087747222222
TotalMinutes      : 0.215452648333333
TotalSeconds      : 12.9271589
TotalMilliseconds : 12927.1589

That's not much of an improvement, but in a larger environment the gain might be bigger? @exp0se, is that the function you were referring to?

@davehull
Owner

davehull commented Sep 9, 2015

What does the performance difference look like across a bunch of tests? Those times are so close, it could be due to other activity on the host.
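One way to get numbers across a bunch of tests is to repeat each measurement and compare the average, minimum and maximum rather than a single run. The sketch below is only an illustration: `Measure-Repeated` is a hypothetical helper, and the zero-filled temp file and the run count are arbitrary assumptions standing in for real DLLs.

```powershell
# Hypothetical harness: run a script block several times and summarize
# the elapsed milliseconds.
function Measure-Repeated {
    param(
        [Parameter(Mandatory = $true)][scriptblock]$ScriptBlock,
        [int]$Runs = 5
    )
    $times = foreach ($i in 1..$Runs) {
        (Measure-Command -Expression $ScriptBlock).TotalMilliseconds
    }
    # Returns an object with Count, Average, Minimum and Maximum.
    $times | Measure-Object -Average -Minimum -Maximum
}

# Assumption: an 8 MB zero-filled temp file stands in for a real target.
$file = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllBytes($file, (New-Object byte[] 8MB))
$sha = [System.Security.Cryptography.SHA256]::Create()

# Approach 1: read the whole file into a byte array, then hash it.
$readAll = Measure-Repeated -ScriptBlock {
    $null = $sha.ComputeHash([System.IO.File]::ReadAllBytes($file))
}

# Approach 2: hand the hash object a stream.
$streamed = Measure-Repeated -ScriptBlock {
    $s = [System.IO.File]::OpenRead($file)
    try { $null = $sha.ComputeHash($s) } finally { $s.Dispose() }
}

'ReadAllBytes: avg {0:N1} ms (min {1:N1}, max {2:N1})' -f $readAll.Average, $readAll.Minimum, $readAll.Maximum
'Stream:       avg {0:N1} ms (min {1:N1}, max {2:N1})' -f $streamed.Average, $streamed.Minimum, $streamed.Maximum

Remove-Item -LiteralPath $file
```

The first run of either approach may also pay for a cold disk cache, so the minimum is often a fairer comparison than a single timing.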


@jvaldezjr1
Contributor

True, I was cross-eyed late last night when looking at it. I can try running a few modules that do hashing and see. I don't have access to a larger environment to test this on at the moment.

@jvaldezjr1
Contributor

I did a little more research on MSDN and our favorite search engine. Some discussion I found mentioned that the real difference between the two is how they actually handle files. Both classes are in the System.IO namespace. I'll sum up what I've found; you can probably verify with some guys in Redmond which is better. IO.File deals with arrays of bytes, and can write text to files from pre-allocated buffers or arrays of strings. This implies a huge memory hit when the file being read is large (gigabytes). ReadAllBytes specifically reads the entire file into an array, then closes the file. I also found discussion saying that the File class methods are wrappers around StreamReader/Writer methods.

StreamReader/Writer can read and write strings (and bytes, through the underlying stream), and you get around the memory hit by reading a line at a time (which takes multiple calls for larger files). What I can't figure out is whether the code here is doing essentially the same thing, allocating the entire file as a stream (based on the way the StreamReader is constructed), because the underlying base stream is still being accessed. If so, and if the file is small, it may not really matter. Maybe it's a case of "six of one, half a dozen of the other"?
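On the question of whether a stream allocates the entire file: a stream only hands out as many bytes as each read asks for, so memory use is bounded by the buffer size rather than the file size. The sketch below makes that explicit by feeding a hash object one fixed-size buffer at a time. `Get-ChunkedHash` and the 64 KB default are assumptions for illustration; as far as I know, `ComputeHash` does something similar internally when it is given a stream, so this is a way to see the mechanism rather than a replacement for it.

```powershell
# Hypothetical helper: hash a file in fixed-size chunks so that only one
# buffer's worth of the file is held in memory at a time.
function Get-ChunkedHash {
    param(
        [Parameter(Mandatory = $true)][string]$FileName,
        [int]$BufferSize = 64KB
    )

    $path = (Resolve-Path -LiteralPath $FileName).ProviderPath
    $sha = [System.Security.Cryptography.SHA256]::Create()
    $stream = [System.IO.File]::OpenRead($path)
    $buffer = New-Object byte[] $BufferSize
    try {
        # Read returns the number of bytes placed in the buffer, 0 at end of file.
        while (($read = $stream.Read($buffer, 0, $buffer.Length)) -gt 0) {
            # Feed this chunk into the running hash; no output buffer is needed.
            $null = $sha.TransformBlock($buffer, 0, $read, $null, 0)
        }
        # Finalize with an empty block, then read the digest from .Hash.
        $null = $sha.TransformFinalBlock($buffer, 0, 0)
        return ([System.BitConverter]::ToString($sha.Hash)) -replace '-', ''
    }
    finally {
        $stream.Dispose()
        $sha.Dispose()
    }
}
```

The result should match what ReadAllBytes plus ComputeHash gives for the same file; only the peak memory use differs.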

@exp0se
Contributor Author

exp0se commented Sep 26, 2015

Sorry for the late reply - I was busy at work.
Here is what I tested:
Kansa Get-FileHashes module

PS C:\Users\exp0se\Downloads\Kansa-master> Measure-Command {.\Modules\Disk\Get-FileHashes.ps1 MD5 C:\Windows}


Days              : 0
Hours             : 0
Minutes           : 42
Seconds           : 52
Milliseconds      : 561
Ticks             : 25725613939
TotalDays         : 0,0297750161331019
TotalHours        : 0,714600387194444
TotalMinutes      : 42,8760232316667
TotalSeconds      : 2572,5613939
TotalMilliseconds : 2572561,3939

I suspect the problem here is Workflows rather than the hashing function. It also consumes tons of resources (a few gigs of RAM, and a lot of CPU as well), making it impossible to run during working hours.

Here is my alternative function from exp0se@e078012

PS C:\Users\exp0se\Downloads\Kansa-master> Measure-Command {.\Modules\Disk\Get-FileHash.ps1 C:\Windows MD5}
Get-ChildItem : Access to the path 'C:\Windows\CSC\v2.0.6' is denied.
At C:\Users\exp0se\Downloads\Kansa-master\Modules\Disk\Get-FileHash.ps1:155 char:1
+ Get-ChildItem -Path $BasePath -Recurse |Where-Object {$_.Name -match $extRegex}| ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : PermissionDenied: (C:\Windows\CSC\v2.0.6:String) [Get-ChildItem], UnauthorizedAccessException
    + FullyQualifiedErrorId : DirUnauthorizedAccessError,Microsoft.PowerShell.Commands.GetChildItemCommand

WARNING: Cannot calculate hash for directory: C:\Windows\Panther\setup.exe
Get-ChildItem : Access to the path 'C:\Windows\System32\LogFiles\WMI\RtBackup' is denied.
At C:\Users\exp0se\Downloads\Kansa-master\Modules\Disk\Get-FileHash.ps1:155 char:1
+ Get-ChildItem -Path $BasePath -Recurse |Where-Object {$_.Name -match $extRegex}| ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : PermissionDenied: (C:\Windows\Syst...es\WMI\RtBackup:String) [Get-ChildItem], UnauthorizedAccessException
    + FullyQualifiedErrorId : DirUnauthorizedAccessError,Microsoft.PowerShell.Commands.GetChildItemCommand



Days              : 0
Hours             : 0
Minutes           : 3
Seconds           : 24
Milliseconds      : 590
Ticks             : 2045902634
TotalDays         : 0,00236794286342593
TotalHours        : 0,0568306287222222
TotalMinutes      : 3,40983772333333
TotalSeconds      : 204,5902634
TotalMilliseconds : 204590,2634
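The access-denied errors and the directory warning in that run come from the enumeration step, not from hashing. A possible way to quiet them, sketched here under assumptions: `Get-HashableFile` is a hypothetical name, the extension regex is a placeholder for whatever `$extRegex` holds in the script, and the `-File` switch requires PowerShell 3 or later.

```powershell
# Hypothetical helper: list only files under a base path, skip directories
# (even ones named like files, such as a folder called setup.exe), and
# summarize unreadable paths in a single warning.
function Get-HashableFile {
    param(
        [Parameter(Mandatory = $true)][string]$BasePath,
        [string]$ExtRegex = '\.(exe|dll)$'   # placeholder pattern
    )

    # -File drops directories; -ErrorVariable collects access-denied errors
    # instead of writing one error record per path to the console.
    $found = Get-ChildItem -LiteralPath $BasePath -Recurse -File `
                 -ErrorAction SilentlyContinue -ErrorVariable denied |
             Where-Object { $_.Name -match $ExtRegex }

    if ($denied.Count -gt 0) {
        Write-Warning "Skipped $($denied.Count) path(s) that could not be read."
    }
    return $found
}
```

The output could then be piped to whichever hashing function is in use, for example `Get-HashableFile -BasePath C:\Windows | Get-FileHash -Algorithm MD5`.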

Some hashing function benchmarks
Built-in function in PowerShell 4

PS C:\Users\exp0se\Downloads\SysinternalsSuite> Measure-Command { ls | Get-FileHash -Algorithm MD5 }


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 144
Ticks             : 1440400
TotalDays         : 1,66712962962963E-06
TotalHours        : 4,00111111111111E-05
TotalMinutes      : 0,00240066666666667
TotalSeconds      : 0,14404
TotalMilliseconds : 144,04

Function from my code:

PS C:\Users\exp0se\Downloads\SysinternalsSuite> Measure-Command { ls | Get-FileHashCustom -Algorithm MD5 }


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 205
Ticks             : 2053184
TotalDays         : 2,37637037037037E-06
TotalHours        : 5,70328888888889E-05
TotalMinutes      : 0,00342197333333333
TotalSeconds      : 0,2053184
TotalMilliseconds : 205,3184

Get-Hashes function from the Kansa module Disk\Get-FileHashes

PS C:\Users\exp0se\Downloads\SysinternalsSuite> Measure-Command { Get-Hashes -BasePath . -HashType MD5 }


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 147
Ticks             : 1473163
TotalDays         : 1,70504976851852E-06
TotalHours        : 4,09211944444444E-05
TotalMinutes      : 0,00245527166666667
TotalSeconds      : 0,1473163
TotalMilliseconds : 147,3163

Turns out my function is even slower. I guess we need to rename this issue to "fix the Get-FileHashes module" rather than "replace hashing function", since I initially tested it with the full Get-FileHashes module.
Still, I would prefer ReadAllBytes to be replaced with IO.StreamReader everywhere, since reading everything into memory is not good practice.
