Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repro: ignore --downstream if used together with --pipeline #9692

Merged
merged 2 commits into from
Jul 4, 2023

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Jul 3, 2023

On top of #9691.

Looking at the docs, it seems --pipeline should reproduce from the start of the pipeline, and ignore --downstream altogether.

There is also a bug here, when both are combined, repro runs the leaf nodes (stages that no one depend on).

@skshetry skshetry requested review from dberenbaum and efiop July 3, 2023 14:07
@skshetry skshetry added the bugfix fixes bug label Jul 3, 2023
pipelines = get_pipelines(active)
used_pipelines = [get_pipeline(pipelines, stage) for stage in stages]
# create a disjointed union of all the pipelines
active = nx.compose_all(used_pipelines)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am building graph for all cases, except --single-item. This might be useful if we implement #8647 and parallel repro.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was it doing before? I don't quite follow how the code changes relate to --downstream.

@dberenbaum, if you have a dag like following:

graph TD
A-->B
B-->C
Loading

When you do dvc repro --pipeline B, it'd run in order [A, B, C] as expected. But on --downstream, it would just run C, the last node.

DVC was searching for the leaf node of each pipeline, which in this case is C. By default, it finds all upstream nodes of that leaf node and runs them, which is a pipeline.

As a side effect of this logic, on the --downstream, it'd try to run downstream of C node.
Which means it only ran leaf nodes.


pipelines = get_pipelines(active)
pipelines = [get_pipeline(pipelines, stage) for stage in stages]
for pline in pipelines:
    leaves.extend(node for node in pline if pline.in_degree(node) == 0)

This was the previous code, that 1) finds all the pipelines, 2) filter pipelines that only have those nodes/stages, and 3) tries to find all the leaf node in that pipeline. (in_degree == 0 check means that it has no arc pointing towards it, but our repo.graph is reversed, so take it as having no arc pointing out.)

Regarding the code changes, we don't need to compute leaf node, then find upstream nodes at all. We can walk over the pipeline directly, so we filter like before:

used_pipelines = [get_pipeline(pipelines, stage) for stage in stages]

The other changes here is that there may be other disconnected pipelines, which are merged together, so that we can work with a single graph.

active = nx.compose_all(used_pipelines)

Then, we just walk like before, but we don't pass downstream at all.

return get_steps(active)

@codecov
Copy link

codecov bot commented Jul 3, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.09 🎉

Comparison is base (2becfd9) 90.54% compared to head (aec0110) 90.63%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9692      +/-   ##
==========================================
+ Coverage   90.54%   90.63%   +0.09%     
==========================================
  Files         480      480              
  Lines       36385    36725     +340     
  Branches     5227     5345     +118     
==========================================
+ Hits        32944    33286     +342     
- Misses       2851     2853       +2     
+ Partials      590      586       -4     
Impacted Files Coverage Δ
dvc/repo/reproduce.py 98.63% <100.00%> (-0.04%) ⬇️

... and 13 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@dberenbaum
Copy link
Collaborator

Looking at the docs, it seems --pipeline should reproduce from the start of the pipeline, and ignore --downstream altogether.

What was it doing before? I don't quite follow how the code changes relate to --downstream.

@skshetry skshetry force-pushed the pipeline-ignore-downstream branch from c1d959e to aec0110 Compare July 4, 2023 11:45
Copy link
Contributor

@daavoo daavoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the docs, it seems --pipeline should reproduce from the start of the pipeline, and ignore --downstream altogether.

There is also a bug here, when both are combined, repro runs the leaf nodes (stages that no one depend on).

Can we add an explicit test for the change?

@skshetry
Copy link
Member Author

skshetry commented Jul 4, 2023

Looking at the docs, it seems --pipeline should reproduce from the start of the pipeline, and ignore --downstream altogether.

There is also a bug here, when both are combined, repro runs the leaf nodes (stages that no one depend on).

Can we add an explicit test for the change?

They are exclusive, and more of an edgecase. I don't think we need to test that.

@skshetry skshetry merged commit d69972d into iterative:main Jul 4, 2023
37 checks passed
@skshetry skshetry deleted the pipeline-ignore-downstream branch July 4, 2023 13:18
@shcheklein
Copy link
Member

@skshetry any docs updates after all these changes to the pipelines execution?

@skshetry
Copy link
Member Author

skshetry commented Jul 4, 2023

Not really, I only refactored the graph building logic, and most of the work is pre-requisite for #9705.

@skshetry
Copy link
Member Author

skshetry commented Jul 4, 2023

Also fixed 3-4 bugs along the way. But that's it. :)

@shcheklein
Copy link
Member

Also fixed 3-4 bugs along the way. But that's it. :)

yep, it's very usual thing these days, unfortunately (in all products). If it's 3-4 bugs, I would also think about adding tests though, as @daavoo mentioned. Since we had them it means we are not covering the behavior, right?

@skshetry
Copy link
Member Author

skshetry commented Jul 4, 2023

Testing for this specific case is not necessary, as it's noop here (we can't test everything).
The individual functions are much more unit-testable with all these refactoring.

Besides that, I have also added this as a living document for what should happen in the case with --all-pipeline/--pipeline.

dvc/dvc/repo/reproduce.py

Lines 176 to 178 in 63de16c

if all_pipelines or pipeline:
single_item = False
downstream = False

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix fixes bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants