Replace hashing function with faster implementation #128
Comments
I think it's a simple one-line change, if I'm understanding this right:
#$fileData = [System.IO.File]::ReadAllBytes($FileName)
$fileData = ([IO.StreamReader]$FileName).BaseStream
I did some research here: http://learn-powershell.net/2013/03/25/use-powershell-to-calculate-the-hash-of-a-file/ and again with IO.StreamReader: That's not much of an increase, but in a larger environment, perhaps a bigger improvement? @exp0se is that the function you were referring to?
What does the performance difference look like across a bunch of tests? Those times are so close, it could be due to other activity on the host.
True, I was cross-eyed late last night when looking at it. I can try to run a few modules that do hashing and see. I don't have access to a larger environment to test this on at the moment.
I did a little more research on MSDN and our favorite search engine. Some discussion I found mentioned that the real difference between the two is how they actually handle files. Both classes are in the System.IO namespace. I'll sum up what I've found; you can probably verify with some guys in Redmond which is better.
IO.File deals with arrays of bytes, and can write text to files from pre-allocated buffers or arrays of strings. This implies a huge memory hit when the file being read is large (in GB). ReadAllBytes specifically reads the entire file into an array, then closes the file. I also found discussion that the File class methods are wrappers around StreamReader/Writer methods.
StreamReader/Writer can read and write strings and bytes, and you get around the memory hit by reading a line at a time (which will take multiple calls for larger files).
What I can't figure out here is whether the code is doing essentially the same thing, allocating the entire file as a stream (based on the way the StreamReader is constructed), because the underlying base stream is still being accessed. If so, and the file is small, it may not really matter. Maybe it's a case of "six of one, half a dozen of the other"?
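To make the comparison above concrete, here is a minimal sketch of the two approaches side by side. The function names `Get-BufferedMd5` and `Get-StreamMd5` are hypothetical and not part of Kansa, and the sketch assumes PowerShell 3 or later. The first variant loads the whole file into a byte array before hashing; the second hands an open `FileStream` to `ComputeHash`, which reads from the stream as it goes, so the file should not need to fit in memory at once.

```powershell
# Hypothetical helper names for illustration; neither function is part of Kansa.

function Get-BufferedMd5 {
    param([Parameter(Mandatory = $true)] [string] $Path)
    # .NET file APIs use the process working directory, not the PowerShell
    # location, so resolve the path to a full file system path first.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $md5 = [System.Security.Cryptography.MD5]::Create()
    try {
        # The entire file is read into one byte array before hashing.
        $data = [System.IO.File]::ReadAllBytes($fullPath)
        [System.BitConverter]::ToString($md5.ComputeHash($data)) -replace '-', ''
    }
    finally {
        $md5.Dispose()
    }
}

function Get-StreamMd5 {
    param([Parameter(Mandatory = $true)] [string] $Path)
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $md5 = [System.Security.Cryptography.MD5]::Create()
    $stream = [System.IO.File]::OpenRead($fullPath)
    try {
        # ComputeHash(Stream) consumes the open stream; no full-size array is allocated here.
        [System.BitConverter]::ToString($md5.ComputeHash($stream)) -replace '-', ''
    }
    finally {
        # Close the file handle even if hashing throws.
        $stream.Dispose()
        $md5.Dispose()
    }
}
```

Both functions should return the same hex string for the same file, so any difference between them would show up in memory use and timing rather than in output. Unlike the `([IO.StreamReader]$FileName).BaseStream` one-liner, this sketch disposes the stream explicitly so the handle is released.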
Sorry for the late reply, I was busy at work.
PS C:\Users\exp0se\Downloads\Kansa-master> Measure-Command {.\Modules\Disk\Get-FileHashes.ps1 MD5 C:\Windows}
Days : 0
Hours : 0
Minutes : 42
Seconds : 52
Milliseconds : 561
Ticks : 25725613939
TotalDays : 0,0297750161331019
TotalHours : 0,714600387194444
TotalMinutes : 42,8760232316667
TotalSeconds : 2572,5613939
TotalMilliseconds : 2572561,3939
I suspect the problem here is Workflows rather than the hashing function. It also consumes tons of resources (a few gigs of RAM, a lot of CPU as well), making it impossible to run during working hours.
Here is my alternative function from exp0se@e078012
PS C:\Users\exp0se\Downloads\Kansa-master> Measure-Command {.\Modules\Disk\Get-FileHash.ps1 C:\Windows MD5}
Get-ChildItem : Access to the path 'C:\Windows\CSC\v2.0.6' is denied.
At C:\Users\exp0se\Downloads\Kansa-master\Modules\Disk\Get-FileHash.ps1:155 char:1
+ Get-ChildItem -Path $BasePath -Recurse |Where-Object {$_.Name -match $extRegex}| ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : PermissionDenied: (C:\Windows\CSC\v2.0.6:String) [Get-ChildItem], UnauthorizedAccessException
+ FullyQualifiedErrorId : DirUnauthorizedAccessError,Microsoft.PowerShell.Commands.GetChildItemCommand
WARNING: Cannot calculate hash for directory: C:\Windows\Panther\setup.exe
Get-ChildItem : Access to the path 'C:\Windows\System32\LogFiles\WMI\RtBackup' is denied.
At C:\Users\exp0se\Downloads\Kansa-master\Modules\Disk\Get-FileHash.ps1:155 char:1
+ Get-ChildItem -Path $BasePath -Recurse |Where-Object {$_.Name -match $extRegex}| ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : PermissionDenied: (C:\Windows\Syst...es\WMI\RtBackup:String) [Get-ChildItem], UnauthorizedAccessException
+ FullyQualifiedErrorId : DirUnauthorizedAccessError,Microsoft.PowerShell.Commands.GetChildItemCommand
Days : 0
Hours : 0
Minutes : 3
Seconds : 24
Milliseconds : 590
Ticks : 2045902634
TotalDays : 0,00236794286342593
TotalHours : 0,0568306287222222
TotalMinutes : 3,40983772333333
TotalSeconds : 204,5902634
TotalMilliseconds : 204590,2634
Some hashing function benchmarks:
PS C:\Users\exp0se\Downloads\SysinternalsSuite> Measure-Command { ls | Get-FileHash -Algorithm MD5 }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 144
Ticks : 1440400
TotalDays : 1,66712962962963E-06
TotalHours : 4,00111111111111E-05
TotalMinutes : 0,00240066666666667
TotalSeconds : 0,14404
TotalMilliseconds : 144,04
Function from my code:
PS C:\Users\exp0se\Downloads\SysinternalsSuite> Measure-Command { ls | Get-FileHashCustom -Algorithm MD5 }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 205
Ticks : 2053184
TotalDays : 2,37637037037037E-06
TotalHours : 5,70328888888889E-05
TotalMinutes : 0,00342197333333333
TotalSeconds : 0,2053184
TotalMilliseconds : 205,3184
Get-Hashes function from the Kansa module Disk\Get-Filehashes:
PS C:\Users\exp0se\Downloads\SysinternalsSuite> Measure-Command { Get-Hashes -BasePath . -HashType MD5 }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 147
Ticks : 1473163
TotalDays : 1,70504976851852E-06
TotalHours : 4,09211944444444E-05
TotalMinutes : 0,00245527166666667
TotalSeconds : 0,1473163
TotalMilliseconds : 147,3163
Turns out my function is even slower. Well, I guess we need to rename the issue to "fix the Get-Filehashes module" rather than "replace hashing function", since I initially tested it with the full Get-Filehashes module.
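Single runs this close together (144 ms, 147 ms, and 205 ms) could easily be within the noise from other activity on the host, as raised earlier in the thread. A rough sketch for repeating a measurement and summarizing the spread follows. `Measure-Average` is a hypothetical helper, not part of Kansa or PowerShell, and it assumes PowerShell 3 or later.

```powershell
# Hypothetical helper for illustration: runs a script block several times
# and reports average, minimum, and maximum wall-clock time in milliseconds.
function Measure-Average {
    param(
        [Parameter(Mandatory = $true)] [scriptblock] $ScriptBlock,
        [int] $Runs = 5
    )
    # Collect one TotalMilliseconds value per run.
    $times = foreach ($i in 1..$Runs) {
        (Measure-Command -Expression $ScriptBlock).TotalMilliseconds
    }
    $stats = $times | Measure-Object -Average -Minimum -Maximum
    [pscustomobject]@{
        Runs      = $Runs
        AverageMs = $stats.Average
        MinMs     = $stats.Minimum
        MaxMs     = $stats.Maximum
    }
}
```

For example, `Measure-Average -Runs 10 -ScriptBlock { ls | Get-FileHash -Algorithm MD5 }` would repeat the first benchmark ten times. The first run may be slower than later ones because of file system caching, so comparing the minimum as well as the average is probably worthwhile.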
Hey,
I found that the hashing function used throughout Kansa is inefficient and slow. It's okay for small operations like getting hashes of processes, but very slow when you use hashing for a path or for a whole disk.
Basically, the problem is with the ReadAllBytes function that you are using. In my tests I was able to get a significant performance improvement by using a function based on IO.StreamReader instead.
Can you take a look at this commit: exp0se@13b9aba
I messed up my commit and a whole bunch of other stuff also got committed, so you only need to look at the hashing-related changes.
Can you consider replacing the Kansa hashing function? I could send you a pull request later if you agree.