RFC: Add execution concurrency #5659

katrogan · 2024-08-13T13:14:48Z

Tracking issue

Why are the changes needed?

What changes were proposed in this pull request?

How was this patch tested?

Setup process

Screenshots

Check all the applicable boxes

I updated the documentation accordingly.
All new and existing tests passed.
All commits are signed-off.

Related PRs

Docs link

Signed-off-by: Katrina Rogan <[email protected]>

codecov · 2024-08-13T13:24:48Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 36.17%. Comparing base (d797f08) to head (da704b4).
Report is 25 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff             @@
##           master    #5659       +/-   ##
===========================================
+ Coverage    9.74%   36.17%   +26.42%     
===========================================
  Files         214     1303     +1089     
  Lines       39190   109672    +70482     
===========================================
+ Hits         3820    39671    +35851     
- Misses      35031    65855    +30824     
- Partials      339     4146     +3807

Flag	Coverage Δ
unittests-datacatalog	`51.37% <ø> (ø)`
unittests-flyteadmin	`55.28% <ø> (?)`
unittests-flytecopilot	`12.17% <ø> (ø)`
unittests-flytectl	`62.17% <ø> (?)`
unittests-flyteidl	`7.12% <ø> (+0.04%)`	⬆️
unittests-flyteplugins	`53.35% <ø> (?)`
unittests-flytepropeller	`41.76% <ø> (?)`
unittests-flytestdlib	`55.35% <ø> (?)`

pingsutw · 2024-08-16T08:34:15Z

rfc/system/RFC-0000-execution-concurrency.md

+- created_at
+
+#### Open Questions
+- Should we always attempt to schedule pending executions in ascending order of creation time?


Maybe make it configurable? FIFO, FILO

katrogan · 2024-08-19

yeah I wasn't sure! any suggestions here? we could introduce an enum, choose FIFO to begin with, and expand support later

granthamtaylor · 2024-08-19

I have mixed thoughts on making the queue's order of execution configurable.

If we support a limited number of parallel executions (more than 1), executions would naturally start in FIFO order until that limit is reached.

Providing an option to begin executing FILO after that limit is reached feels confusing to me.

However, that brings a different question to mind: if multiple workflows are queued up, should we provide an option to enable loud notifications?

In other words, if backlogged executions could impact downstream operations, can we let users receive loud notifications that include the number of queued executions?

I can imagine a use case where: holiday shopping -> increased purchase volume -> increased data size -> multiple, consecutive execution delays -> cascading backlog of executions. In this scenario, the owners of the workflow may be out on leave and unaware of the growing backlog.

katrogan · 2024-08-19

interesting, we have workflow notifications enabled for terminal states, but we've talked more about richer, customizable notifications and I think this slots neatly into that

I think for a v1, having the default behavior be FIFO with an extended description/explanation for the pending state may provide some visibility to start off with

eapolinario · 2024-08-19

Can we add this suggestion of having an enum listing the policies to the Implementation details section?

fiedlerNr9 · 2024-09-03

I can add a customer's feedback here: the desired behaviour is to actually replace (terminate) the current executions with subsequent executions. That sounds like too much for the initial scope, but I'm still interested in whether this could be added later with the current approach.

katrogan · 2024-09-04

@fiedlerNr9 added a section under Alternatives. I don't think this is precluded by this implementation, but it's not in scope for this proposal at the moment

eapolinario

This is looking pretty good. I'd feel more comfortable if we fleshed out the implementation a bit more, but otherwise, I feel like we're on the same page.

eapolinario · 2024-08-19T21:38:42Z

rfc/system/RFC-0000-execution-concurrency.md

+
+```
+
+### FlyteAdmin


During last week's contributors meeting, someone asked about having this concurrency control work across versions. Can we either discuss it in this PR or explicitly list that use case as unsupported in the RFC?

corleyma · 2024-08-23

I can say that something that works across versions would be really useful for us.

fg91 · 2024-08-29

For us too, because we very often use `pyflyte run`, which means we often don't have two executions of the same version.

This could be made configurable here:

    concurrency=Concurrency(
        max=1,  # defines how many executions with this launch plan can run in parallel
        policy=ConcurrencyPolicy.WAIT,  # defines the policy to apply when the max concurrency is reached
        level=ConcurrencyLevel.Version,  # or ConcurrencyLevel.LaunchPlan
    )

katrogan · 2024-09-04

thanks @eapolinario @corleyma @fg91 for the feedback. I don't think this will be too much of a lift, but I added a proposal for different levels of precision here too
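To make the version-level versus launch-plan-level distinction concrete, here is a minimal Python sketch. It is not flytekit's API: the class shapes, the defaults, and the `concurrency_key` helper are assumptions for illustration. Only the names `Concurrency`, `ConcurrencyPolicy.WAIT`, `ABORT`, `ConcurrencyLevel.Version`, and `ConcurrencyLevel.LaunchPlan` come from this thread.

```python
from dataclasses import dataclass
from enum import Enum


class ConcurrencyPolicy(Enum):
    WAIT = "wait"    # queue the new execution until a slot frees up
    ABORT = "abort"  # fail the new execution request


class ConcurrencyLevel(Enum):
    Version = "version"         # count executions per launch plan version
    LaunchPlan = "launch_plan"  # count executions across all versions


@dataclass(frozen=True)
class Concurrency:
    max: int
    policy: ConcurrencyPolicy = ConcurrencyPolicy.WAIT  # assumed default
    level: ConcurrencyLevel = ConcurrencyLevel.Version  # assumed default


def concurrency_key(project, domain, name, version, level):
    """Hypothetical helper: the key under which running executions are counted."""
    if level is ConcurrencyLevel.Version:
        return (project, domain, name, version)
    # At LaunchPlan level the version is dropped, so all versions share one limit.
    return (project, domain, name)
```

With `ConcurrencyLevel.LaunchPlan`, two executions registered under different versions (as happens with repeated `pyflyte run` calls) map to the same key and therefore count against the same limit.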

eapolinario · 2024-08-19T21:39:29Z

rfc/system/RFC-0000-execution-concurrency.md

+      1. or fail the request when the concurrency policy is set to `ABORT`
+   1. Do not create the workflow CRD
+
+Introduce an async reconciliation loop in FlyteAdmin to poll for all pending executions:


Do we have prior art for this kind of reconciliation loop in flyteadmin?

katrogan · 2024-08-20

yes, the scheduler!
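A rough sketch of what one pass of such a reconciliation loop could look like, written in Python for brevity. This is not FlyteAdmin code: the `Execution` shape, the phase strings, and `reconcile_once` are hypothetical, and setting the phase to `RUNNING` stands in for creating the workflow CRD. It assumes pending executions are considered in ascending `created_at` order (FIFO).

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Execution:
    id: str
    launch_plan: str       # hypothetical grouping key for the concurrency limit
    created_at: datetime
    phase: str = "PENDING"


def reconcile_once(executions, max_concurrency):
    """One polling pass: promote PENDING executions in FIFO order while slots remain.

    max_concurrency maps a launch plan to its limit; launch plans without an
    entry are treated as unlimited. Returns the ids launched in this pass.
    """
    running = Counter(e.launch_plan for e in executions if e.phase == "RUNNING")
    pending = sorted(
        (e for e in executions if e.phase == "PENDING"),
        key=lambda e: e.created_at,
    )
    launched = []
    for e in pending:
        limit = max_concurrency.get(e.launch_plan)
        if limit is not None and running[e.launch_plan] >= limit:
            continue  # no free slot; stays PENDING until a later pass
        e.phase = "RUNNING"  # stand-in for creating the workflow CRD
        running[e.launch_plan] += 1
        launched.append(e.id)
    return launched
```

A real loop would call something like this on a polling interval and read pending executions from the database rather than from an in-memory list.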

eapolinario · 2024-08-19T21:42:24Z

rfc/system/RFC-0000-execution-concurrency.md

+- created_at
+
+#### Open Questions
+- Should we always attempt to schedule pending executions in ascending order of creation time?


Can we add this suggestion of having an enum listing the policies to the Implementation details section?

eapolinario · 2024-08-19T21:45:19Z

rfc/system/RFC-0000-execution-concurrency.md

+
+## 4 Metrics & Dashboards
+
+*What are the main metrics we should be measuring? For example, when interacting with an external system, it might be the external system latency. When adding a new table, how fast would it fill up?*


How is this feature going to be rolled out? Should we have an explicit list of metrics used to monitor the health of the feature (e.g. the total number of attempts for a given launch plan)?

katrogan · 2024-08-20

Interesting question. I think the number of scheduling attempts here is determined by the polling interval, right? But it could be useful to understand the time spent in PENDING
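As an illustration of the "time spent in PENDING" idea, a small Python sketch follows. The function name and the `started_at` field are assumptions; the thread only mentions `created_at`.

```python
from datetime import datetime
from typing import Optional


def pending_seconds(created_at: datetime, started_at: Optional[datetime], now: datetime) -> float:
    """Seconds an execution spent in PENDING, or has spent so far if not yet started."""
    end = started_at if started_at is not None else now
    return (end - created_at).total_seconds()
```

A value like this could be emitted as a histogram per launch plan, which would also surface growing backlogs.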

eapolinario · 2024-08-19T21:46:05Z

rfc/system/RFC-0000-execution-concurrency.md

+
+## 5 Drawbacks
+
+*Are there any reasons why we should not do this? Here we aim to evaluate risk and check ourselves.*


Do we have any reservations about more load on the DB (even with indexes, etc)?

katrogan · 2024-08-20

good point, we already have a ton of indices on executions - there is definitely a tradeoff to adding a new one

katrogan · 2024-08-20T13:45:31Z

> I'd feel more comfortable if we fleshed out the implementation a bit more, but otherwise, I feel like we're on the same page.

Sounds good, I just wanted overall alignment before diving into the implementation. Will do that next, and thank you for all the feedback so far

Signed-off-by: Katrina Rogan <[email protected]>

katrogan · 2024-08-22T20:05:11Z

added some more implementation details, mind taking another look @eapolinario?

fg91 · 2024-08-29T16:54:18Z

rfc/system/RFC-0000-execution-concurrency.md

+}
+```
+
+Furthermore, we may want to introduce a max pending period to fail executions that have been in `PENDING` for too long


👍 I agree that this would be good.
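One way a max pending period could be checked, sketched in Python under assumptions: the tuple layout, the `max_pending` parameter, and the function name are hypothetical and not taken from the RFC.

```python
from datetime import datetime, timedelta


def expired_pending_ids(executions, now, max_pending):
    """Return ids of PENDING executions whose age exceeds max_pending.

    executions is an iterable of (id, phase, created_at) tuples.
    """
    return [
        exec_id
        for exec_id, phase, created_at in executions
        if phase == "PENDING" and now - created_at > max_pending
    ]
```

The reconciliation loop could run a check like this on each pass and fail the returned executions instead of scheduling them.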

fg91 · 2024-08-29T16:55:21Z

rfc/system/RFC-0000-execution-concurrency.md

+## 8 Unresolved questions
+
+- Should we always attempt to schedule pending executions in ascending order of creation time?
+    - Decision: We'll use FIFO scheduling by default but can extend scheduling behavior with an enum going forward.


If this has been decided (I'm ok with it), could you please reword the text above, where this is still discussed as an open question? 🙏

katrogan · 2024-09-04

updated the discussion above!

fg91 · 2024-08-29T16:55:55Z

rfc/system/RFC-0000-execution-concurrency.md

Can you please change the filename to include the PR number?

katrogan · 2024-09-04

done, thanks

corleyma · 2024-08-29T19:29:40Z

I'd add one thing, possibly out of scope for this RFC: it would be really nice to be able to define a "max execution concurrency" on the backend, either propeller-wide or per project/domain. Flyte would benefit from more controls that allow operators to protect quality of service and aren't dependent on workflow authors to set reasonable limits.

katrogan · 2024-08-29T19:49:26Z

hi @corleyma thanks for reviewing! re your comment on platform-max execution concurrency, that's really intriguing - would you want to start a separate discussion on that here: https://github.com/flyteorg/flyte/discussions so we don't lose track of the suggestion?

execution namespace quota is meant to help address quality of service and fairness in a multitenant system but it would be cool to flesh out other mechanisms for managing overall executions

Signed-off-by: Katrina Rogan <[email protected]>

corleyma · 2024-09-04T22:56:00Z

> execution namespace quota is meant to help address quality of service and fairness in a multitenant system but it would be cool to flesh out other mechanisms for managing overall executions

execution namespace quota can help protect against workloads that would otherwise utilize too many cluster resources, but it doesn't really help protect e.g. flyte propeller from too many concurrent executions.

I am happy to start a separate conversation though!

nikp1172 · 2024-09-09T07:55:50Z

rfc/system/RFC-5659-execution-concurrency.md

+  ConcurrencyPolicy policy = 2;
+}
+
+enum ConcurrencyPolicy {


Should we also have a replace option, to stop the previous execution and replace it with the current one? This is what the Kubernetes CronJob concurrency policy does.

https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#concurrency-policy

It would be great if the behaviour could be made as close to this as possible.
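For illustration, here is a Python sketch of how a replace policy could sit alongside `WAIT` and `ABORT`. Kubernetes CronJobs offer `Allow`, `Forbid`, and `Replace`. The `REPLACE` member, the action strings, and the choice to terminate the oldest running execution when `max` is greater than 1 are assumptions, not part of the RFC.

```python
from enum import Enum


class ConcurrencyPolicy(Enum):
    WAIT = "wait"
    ABORT = "abort"
    REPLACE = "replace"  # hypothetical; the RFC lists replacement under Alternatives


def on_new_execution(running_ids, max_concurrency, policy):
    """Decide what happens to a new execution request.

    running_ids lists running execution ids, oldest first.
    Returns (action_for_new_execution, ids_to_terminate).
    """
    if len(running_ids) < max_concurrency:
        return ("RUN", [])
    if policy is ConcurrencyPolicy.WAIT:
        return ("PENDING", [])
    if policy is ConcurrencyPolicy.ABORT:
        return ("FAIL", [])
    # REPLACE: terminate the oldest running execution to free one slot (assumption).
    return ("RUN", [running_ids[0]])
```

With `max_concurrency=1` this matches the CronJob `Replace` behaviour of swapping the running job for the new one.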

katrogan added 4 commits August 13, 2024 15:14

checkpoint

1d4d4b7

Signed-off-by: Katrina Rogan <[email protected]>

formatting

5f9a60f

Signed-off-by: Katrina Rogan <[email protected]>

formatting

9b66651

Signed-off-by: Katrina Rogan <[email protected]>

grammar

a108074

Signed-off-by: Katrina Rogan <[email protected]>

eapolinario mentioned this pull request Aug 15, 2024

[Feature] Support serializing Scheduled Executions #420

Open

13 tasks

pingsutw reviewed Aug 16, 2024

katrogan marked this pull request as ready for review August 19, 2024 07:54

eapolinario reviewed Aug 19, 2024

katrogan added 3 commits August 20, 2024 15:49

Merge branch 'master' into rfc/execution-concurrency

0b4ee1f

review comments, still need to flesh out impl

121066c

Signed-off-by: Katrina Rogan <[email protected]>

details

4a88d7c

Signed-off-by: Katrina Rogan <[email protected]>

fg91 reviewed Aug 29, 2024

katrogan mentioned this pull request Sep 3, 2024

[Feature] Prevent concurrent execution #267

Open

13 tasks

katrogan added 3 commits September 4, 2024 16:12

More feedback, update filename

550c571

Signed-off-by: Katrina Rogan <[email protected]>

comment

61debe2

Signed-off-by: Katrina Rogan <[email protected]>

details

da704b4

Signed-off-by: Katrina Rogan <[email protected]>

nikp1172 reviewed Sep 9, 2024

katrogan commented Aug 13, 2024

codecov bot commented Aug 13, 2024

pingsutw Aug 16, 2024

katrogan Aug 19, 2024

granthamtaylor Aug 19, 2024

katrogan Aug 19, 2024

eapolinario Aug 19, 2024

fiedlerNr9 Sep 3, 2024

katrogan Sep 4, 2024

eapolinario left a comment

eapolinario Aug 19, 2024

corleyma Aug 23, 2024

fg91 Aug 29, 2024

katrogan Sep 4, 2024

eapolinario Aug 19, 2024

katrogan Aug 20, 2024

eapolinario Aug 19, 2024

eapolinario Aug 19, 2024

katrogan Aug 20, 2024

eapolinario Aug 19, 2024

katrogan Aug 20, 2024

katrogan commented Aug 20, 2024

katrogan commented Aug 22, 2024

fg91 Aug 29, 2024

fg91 Aug 29, 2024

katrogan Sep 4, 2024

fg91 Aug 29, 2024

katrogan Sep 4, 2024

corleyma commented Aug 29, 2024

katrogan commented Aug 29, 2024

corleyma commented Sep 4, 2024

nikp1172 Sep 9, 2024

