Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add: do not delete stage files before add #6239

Merged
merged 3 commits into from
Jul 3, 2021

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Jun 28, 2021

Because of the way we collect stages and cache them, we were not able to
collect them for the add without removing them from the workspace.
As doing so, we'd have two same/similar stages - one collected from the
workspace and the other just created from the dvc add in-memory.

This would raise errors during graph checks, so we started to delete them
and reset them (which is very recently, see #2886 and #3349).

By deleting the file before we even do any checks, we are making DVC
fragile, and results in data loss for the users with even simple mistakes.
This should make it more reliable and robust.

And, recently, we have started to keep states of a lot of things, that by
resetting them on each stage, we waste a lot of performance, especially
on gitignores. We cache the Dulwich's IgnoreManager, which when reset
too many times will waste a lot of our time just collecting them again
next time (see #6227).

It's hard to say how much this improves, as this very much depends on
no. of gitignores in the repo (which can be assumed to be quite a number
for a DVC repo) and the number of files that we are adding
(eg: -R adding a large directory).

On a directory with 10,000 files (in a dataset-registry repo),
creating stages on dvc add -R went from 64 files/sec to 1.1k files/sec.

Also relevant discussion: #3362

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Because of the way we collect stages and cache them, we were not able to
collect them for the `add` without removing them from the workspace.
As doing so, we'd have two same/similar stages - one collected from the
workspace and the other just created from the `dvc add` in-memory.

This would raise errors during graph checks, so we started to delete them
and reset them (which is very recently, see iterative#2886 and iterative#3349).

By deleting the file before we even do any checks, we are making DVC
fragile, and results in data loss for the users with even simple mistakes.
This should make it more reliable and robust.

And, recently, we have started to keep state of a lot of things, that by
resetting them on each stage, we waste a lot of performance, especially
on gitignores. We cache the dulwich's IgnoreManager, which when resetted
too many times, will waste a lot of our time just collecting them again
next time (see iterative#6227).

It's hard to say how much this improves, as this very much depends on
no. of gitignores in the repo (which can be assumed to be quite in number
for a dvc repo) and the amount of files that we are adding
(eg: `-R` adding a large directory).

On a directory with 10,000 files (in a datadet-registry repo),
creating stages on `dvc add -R` went from 64 files/sec to 1.1k files/sec.
@skshetry skshetry added enhancement Enhances DVC performance improvement over resource / time consuming tasks labels Jun 28, 2021
@skshetry skshetry requested review from efiop and pared June 28, 2021 10:24
@skshetry skshetry requested a review from a team as a code owner June 28, 2021 10:24
@skshetry skshetry self-assigned this Jun 28, 2021
@shcheklein
Copy link
Member

Can we add some tests for this (failed op doesn't remove file for example?). Or prevent regression from happening in some other way in the future.

Comment on lines +101 to +105
# remove existing stages that are to-be replaced with these
# new stages for the graph checks.
old_stages = set(repo.stages) - set(stages)
try:
repo.check_modified_graph(stages)
repo.check_modified_graph(stages, list(old_stages))
Copy link
Member Author

@skshetry skshetry Jun 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be better reflected by Index here, as we are creating a new index. Index is a collection of stages, so if some stages underneath are changed, is it the same index? So, immutability is at the heart of the index and would have been better reflected here.

new_index = repo.index.update(stages)
try:
    new_index.check_graph()

@skshetry
Copy link
Member Author

skshetry commented Jun 29, 2021

Can we add some tests for this (failed op doesn't remove file for example?). Or prevent regression from happening in some other way in the future.

Thanks, @shcheklein , I have added some tests. 🙂

@skshetry skshetry marked this pull request as draft July 1, 2021 15:31
@skshetry skshetry marked this pull request as ready for review July 1, 2021 15:32
Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are doing stuff like this because we wanted to not leave a dvc-file in case of failure to not lead the users to believe that the command was actually successful, but that was a wrong approach. What we should've really done is worked on better indication of success/failure and avoid losing the information.

@efiop efiop merged commit 8d5f9ab into iterative:master Jul 3, 2021
@skshetry skshetry deleted the dont-delete-dvcfile branch July 4, 2021 02:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhances DVC performance improvement over resource / time consuming tasks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants