increase hashing buffer size #1357

Closed · wants to merge 1 commit

Conversation

@i512 i512 commented Oct 1, 2024

Hi guys, thanks for your project.
Full hashing was taking a long time on my HDD, and Task Manager reported a fairly low read speed of ~25 MB/s, so I suspected the small buffer might be the cause.

Tested with a 16 MB buffer: read speed increased to ~150 MB/s, which is what I would expect.
20 GB of duplicates is scanned in 3 minutes vs. 13 on the current release (without cache).

Interestingly, if I run the same test on an NVMe drive there is almost no difference: 17 s vs. 18 s, at a ~2.4 GB/s read speed.
I guess Windows reads small blocks from the drive without doing much readahead, so increasing the buffer is a simple fix.
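
A minimal sketch of what the change amounts to (not the actual czkawka code; the constant name and the use of blake3 here are placeholders):

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

// 16 MB instead of a small chunk; bigger sequential reads let the HDD stream.
const HASH_BUFFER_SIZE: usize = 16 * 1024 * 1024;

fn hash_file(path: &Path) -> std::io::Result<blake3::Hash> {
    let mut file = File::open(path)?;
    // Heap-allocated: an array this big on the stack would overflow it.
    let mut buffer = vec![0u8; HASH_BUFFER_SIZE];
    let mut hasher = blake3::Hasher::new();
    loop {
        let n = file.read(&mut buffer)?;
        if n == 0 {
            break;
        }
        hasher.update(&buffer[..n]);
    }
    Ok(hasher.finalize())
}
```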

@i512 i512 marked this pull request as draft October 1, 2024 18:41
@i512 i512 force-pushed the enlarge-hashing-buffer branch from ace9a9e to fd135b4 on October 1, 2024 19:15
@i512 i512 force-pushed the enlarge-hashing-buffer branch 2 times, most recently from 2a83fe3 to 1a920c9 on October 3, 2024 21:07
@i512 i512 marked this pull request as ready for review October 3, 2024 21:40
@qarmin (Owner) commented Oct 4, 2024

This is a little strange. Using the stack should be faster than allocating a buffer on the heap each time (but the HDD is probably the bigger bottleneck here).
A previous version used a 128 KB array here, but it caused a lot of stack overflows, so I decreased the value to fix that.

This may improve performance for bigger files, but I suspect that allocating 16 MB of memory even for small 5 KB files will decrease performance in that situation.

To verify this, a benchmark for the hash_calculation function should be created.
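
A rough sketch of such a benchmark with the criterion crate (the hash_file helper, the file sizes, and the buffer sizes are placeholders, not the real hash_calculation; note also that the OS page cache will hide real HDD behavior unless it is flushed between runs):

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
use std::io::{Read, Write};

// Placeholder mirroring the hashing loop, parameterized by buffer size.
fn hash_file(path: &std::path::Path, buf_size: usize) -> std::io::Result<blake3::Hash> {
    let mut file = std::fs::File::open(path)?;
    let mut buffer = vec![0u8; buf_size];
    let mut hasher = blake3::Hasher::new();
    loop {
        let n = file.read(&mut buffer)?;
        if n == 0 {
            break;
        }
        hasher.update(&buffer[..n]);
    }
    Ok(hasher.finalize())
}

fn bench_hash_calculation(c: &mut Criterion) {
    let mut group = c.benchmark_group("hash_calculation");
    for &file_size in &[5 * 1024, 1024 * 1024, 64 * 1024 * 1024] {
        // One temporary file per size, filled with zeros.
        let mut file = tempfile::NamedTempFile::new().unwrap();
        file.write_all(&vec![0u8; file_size]).unwrap();
        for &buf_size in &[16 * 1024, 256 * 1024, 16 * 1024 * 1024] {
            group.bench_with_input(
                BenchmarkId::new(format!("file_{file_size}B"), buf_size),
                &buf_size,
                |b, &buf_size| b.iter(|| hash_file(file.path(), buf_size).unwrap()),
            );
        }
    }
    group.finish();
}

criterion_group!(benches, bench_hash_calculation);
criterion_main!(benches);
```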

I think a mixed solution based on file size would work better, but without a clear benchmark with different parameters, we cannot be sure.
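
For example, a minimal sketch of that mixed idea, with entirely untuned thresholds that the benchmark above would need to validate:

```rust
// Pick the hashing buffer size from the file size (thresholds are guesses).
fn pick_buffer_size(file_size: u64) -> usize {
    if file_size <= 64 * 1024 {
        16 * 1024 // tiny files: keep the allocation cheap, one read anyway
    } else if file_size <= 8 * 1024 * 1024 {
        256 * 1024 // medium files
    } else {
        16 * 1024 * 1024 // big files: large sequential reads help HDDs
    }
}
```

hash_calculation would then allocate vec![0u8; pick_buffer_size(file_size)] once per file instead of using a fixed-size buffer.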

@i512 (Author) commented Oct 5, 2024

Sorry, I'm new to Rust; I didn't realize the buffer would be allocated each time. That's not good.

> I think a mixed solution based on file size would work better, but without a clear benchmark with different parameters, we cannot be sure.

This can work, but it will require tuning, yes.

I think the usual approach is to reuse the buffers between jobs, but I can't figure out how to do this yet. I tried creating the buffers in map_init, but it did not work as I expected. So I'm probably going to close this for now.
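
For reference, roughly what I was attempting, as a sketch only (rayon may call the init closure more than once per thread, but each buffer is still reused across many files):

```rust
use rayon::prelude::*;
use std::io::Read;
use std::path::{Path, PathBuf};

// Same hashing loop as before, but borrowing a caller-owned scratch buffer.
fn hash_with_buffer(path: &Path, buffer: &mut [u8]) -> std::io::Result<blake3::Hash> {
    let mut file = std::fs::File::open(path)?;
    let mut hasher = blake3::Hasher::new();
    loop {
        let n = file.read(buffer)?;
        if n == 0 {
            break;
        }
        hasher.update(&buffer[..n]);
    }
    Ok(hasher.finalize())
}

// Hash files in parallel, reusing one 16 MB scratch buffer per rayon worker
// instead of allocating a fresh Vec for every file.
fn hash_all(paths: &[PathBuf]) -> Vec<std::io::Result<blake3::Hash>> {
    paths
        .par_iter()
        .map_init(
            || vec![0u8; 16 * 1024 * 1024], // init closure: per-worker buffer
            |buffer, path| hash_with_buffer(path, buffer),
        )
        .collect()
}
```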

Reducing the number of threads to 1 (in the current release with the 16 KB buffer) actually improves the read speed with big files in my case. Oh, I see that this was already mentioned in #835, nice.

@i512 i512 closed this Oct 5, 2024