Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

save import hash info to state #10531

Merged
merged 4 commits into from
Aug 20, 2024
Merged

save import hash info to state #10531

merged 4 commits into from
Aug 20, 2024

Conversation

dberenbaum
Copy link
Collaborator

This saves the hash info from imports to the output state db so that we don't rehash imports. Without this PR, imported files were always hashed when the output was saved. Existing imported files were even hashed again when doing operations like dvc import -f.

This should improve the speed of all imports, and it also reverts the changes in #10388 that caused slowdowns in some scenarios. That previous PR wrongly assumed the slowdown occurred during download, when it actually slowed during output.save(). That PR happened to work in some scenarios because the workaround updated the state db.

@dberenbaum dberenbaum marked this pull request as draft August 19, 2024 21:03
dvc/dependency/repo.py Outdated Show resolved Hide resolved
@dberenbaum dberenbaum marked this pull request as ready for review August 19, 2024 21:45
@skshetry
Copy link
Member

skshetry commented Aug 20, 2024

I did some minor adjustments to the PR (removed fs.walk() and instead, we get the results directly from fs.download(), also makes it cleaner as we don't have to handle path manipulation).

I also ran the dvc-bench benchmark for test_import in https://github.com/iterative/dvc-bench/actions/runs/10466916928/job/28984684233#step:9:44.

visualization

Name (time in s)
test_import-import[import-skip-hash-2] 278.6997 (1.0)
test_import-import[main] 392.8382 (1.41)

This is now back to 2.58.2 level.

@skshetry skshetry merged commit 81db9b4 into main Aug 20, 2024
26 checks passed
@skshetry skshetry deleted the import-skip-hash branch August 20, 2024 07:56
@dberenbaum
Copy link
Collaborator Author

Great changes, thanks! I think if we had a benchmark using large files, you would see an improvement over 2.58.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants