Fix: reduce “Waiting on concurrency group” #8

yonran · 2023-08-18T09:47:34Z

fix: make deploy steps from different builds unordered

When setting the attribute concurrency_group on deploy steps, set concurrency_method: eager. By default (ordered), a BuildKite build’s deploy steps will wait for all the smoke_test and build steps of a previous build. This is a problem in rollbacks, which run much faster than normal builds because they have no smoke_test or build steps, so they are likely to reach their deploy steps before the previous build does.

This will make concurrency: 1 act in a similar manner to controlling concurrency by limiting the number of deploy queues. See BuildKite Docs: Controlling Concurrency: Controlling command Order.

fix: Allow concurrent validation_test steps

(this commit was removed)

Set the concurrency and concurrency_group attributes only on deploy steps, not validation_test, so that a validation_test step does not block another BuildKite build’s deploy step. Note that one build’s validation_test may fail if another build deploys some other code, but that is something we will just have to be aware of when two deploys happen right after each other.

~~This will make the behavior more similar to the previous behavior. validation_test runs on a different queue than deploy so it never used to block deploy before we added concurrency_group.~~

When setting the attribute concurrency_group on deploy steps, set concurrency_method: eager. By default (ordered), a BuildKite build’s deploy steps will wait for all the smoke_test and build steps of a previous build. This is a problem in rollbacks, which run much faster than normal builds because they have no smoke_test or build steps, so they are likely to reach their deploy steps before the previous build does.

Set the concurrency and concurrency_group attributes only on deploy steps, not validation_test, so that a validation_test step does not block another BuildKite build’s deploy step. Note that one build’s validation test may fail if another build deploys some other code, but that is something we will just have to be aware of when two deploys happen right after each other.

stanimoto · 2023-08-18T15:52:59Z

Note that one build’s validation_test may fail if another build deploys some other code, but that is something we will just have to be aware of when two deploys happen right after each other.

On this point, I believe this was introduced to solve that very problem, but what was the reason for having to revert? If there is a validation test that is running, but we have a deployment we want to do immediately (for rollback, etc) and do not want to be blocked, then I think we should probably cancel the validation test at that time. Is there some other reason?

yonran · 2023-08-18T16:13:40Z

Note that one build’s validation_test may fail if another build deploys some other code, but that is something we will just have to be aware of when two deploys happen right after each other.

On this point, I believe this was introduced to solve that very problem, but what was the reason for having to revert?

Yes, just to make deploys not wait for the previous validation_test.

If there is a validation test that is running, but we have a deployment we want to do immediately (for rollback, etc) and do not want to be blocked, then I think we should probably cancel the validation test at that time. Is there some other reason?

Yes, that works too. I am willing to revert that commit.

stanimoto · 2023-08-18T16:34:16Z

Note that one build’s validation_test may fail if another build deploys some other code, but that is something we will just have to be aware of when two deploys happen right after each other.

On this point, I believe this was introduced to solve that very problem, but what was the reason for having to revert?

Yes, just to make deploys not wait for the previous validation_test.

Okay, thank you. I think we would want to keep the current behavior if possible. Do we need to set eager to the validation_test steps as well, and if so would it happen that the deploy from different builds could take a concurrency spot before the validation_test step runs?

This reverts commit c25c111. This was a minor speedup and we don't need it since we can easily cancel validation test steps if necessary.

yonran · 2023-08-18T20:47:25Z

Okay, thank you. I think we would want to keep the current behavior if possible. Do we need to set eager to the validation_test steps as well, and if so would it happen that the deploy from different builds could take a concurrency spot before the validation_test step runs?

Yes, I believe we should set eager on validation_test steps. Otherwise, BuildKite will see that previous build has smoke_test, build, deploy (eager),validation_test (ordered), and the next build has deploy (eager), validation_test (ordered), then I believe the next build’s deploy can run before the previous build’s deploy, but the next build’s validation_test will wait for all the previous build’s steps to run.

As for what happens when there are two steps in sequence that are both eager, unfortunately the documentation is silent. It only says that eager “removes the ordering condition for that resource”. I think it will happily interleave previous deploy, next deploy, previous validation_test, next validation_test.

In testing, I observed that when you use concurrency_group with eager, the jobs can interleave:

2:48:07 Build 5 started
2:48:09 Build 7 (rollback) started
2:48:10 Build 7’s deploy was scheduled; Build 5’s deploy was Limited waiting for concurrency group
2:50:40 Build 5’s deploy was scheduled; Build 7’s validation_test was Limited waiting for concurrency group
2:53:07 Build 5’s validation_test was scheduled

stanimoto · 2023-08-18T23:42:44Z

Thanks a lot for testing. So, if I understand correctly, if eager is set for either step, we can no longer expect deploy and validation_test of the same build to work as a single set, which is also similar to the previous behavior (or it's a bit worse if eager is set for both steps because the deploy from different builds completely blocks the validation_test because of concurrency group, and the validation_test will run against the deploy from different builds after all). Then, setting eager is probably not a long-term solution either, and the only way would be to change the timing of the job creation. If this does sound right to you as well, then it would be fine with me to remove concurrency group from the validation_test for the short-term.

yonran · 2023-08-21T16:59:54Z

Thanks a lot for testing. So, if I understand correctly, if eager is set for either step, we can no longer expect deploy and validation_test of the same build to work as a single set, which is also similar to the previous behavior (or it's a bit worse if eager is set for both steps because the deploy from different builds completely blocks the validation_test because of concurrency group, and the validation_test will run against the deploy from different builds after all). Then, setting eager is probably not a long-term solution either, and the only way would be to change the timing of the job creation. If this does sound right to you as well, then it would be fine with me to remove concurrency group from the validation_test for the short-term.

Correct. I think usually it will do the right thing of delaying the next deployment until the previous validation_test finishes. But there is a possibility of interleaving the next deploy (or some steps within a multistep deploy) before the previous validation_test.

I think making validation_test concurrency:1 still does some good even though it is not reliable anymore.

For our use case, I think “ordered” is what we want most of the time. One build should deploy to environment:qa (and the environment:prod build gets blocked) before the next build. It’s only rollback deploys that should cut in line before the previous build’s deploy job. I don’t know a way of letting only rollbacks run before the previous build’s deploy other than manually cancelling the previous build or changing to “eager”.

vipulnaik · 2023-08-21T20:28:29Z

@yonran, could we force "ordered" for deploys to QA but not for deploys to prod? Since rollbacks are prod-only, would that allow a rollback-only exception?

yonran · 2023-08-21T23:12:20Z

@yonran, could we force "ordered" for deploys to QA but not for deploys to prod? Since rollbacks are prod-only, would that allow a rollback-only exception?

Yes we could do that. That would ensure that automatic qa deploys and tests are still ordered, while cautious prod deploys use the riskier str for “Deploy #\d+ to prod.example.com” builds.

Thinking further, perhaps we could actually make the prod deploy + validation_test ordered too if we dynamically insert barriers when a deploy is starting (see Buildkite Blog (2020-11): Concurrency Gates or BuildKite Docs: Controlling concurrency: Concurrency and parallelism) (add one barrier after the block step but before the terraform apply step, and then add another barrier after the last validation_test. So the previous build would not create the barriers until it is about to do a deploy. Something to try.

bradchoate · 2023-08-24T16:00:58Z

For our use case, I think “ordered” is what we want most of the time.

Given that, why is concurrency_method set to eager in pipeline.go if it isn't explicitly declared?

Yonathan Randolph added 2 commits August 18, 2023 02:21

yonran requested a review from stanimoto August 18, 2023 09:47

Revert "fix: Allow concurrent validation_test steps"

fb9f50c

This reverts commit c25c111. This was a minor speedup and we don't need it since we can easily cancel validation test steps if necessary.

bradchoate approved these changes Aug 28, 2023

View reviewed changes

yonran merged commit dcdda65 into master Sep 1, 2023
2 checks passed

yonran deleted the deploy-prevent-concurrency branch September 1, 2023 09:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: reduce “Waiting on concurrency group” #8

Fix: reduce “Waiting on concurrency group” #8

yonran commented Aug 18, 2023 •

edited

Loading

stanimoto commented Aug 18, 2023

yonran commented Aug 18, 2023

stanimoto commented Aug 18, 2023

yonran commented Aug 18, 2023 •

edited

Loading

stanimoto commented Aug 18, 2023

yonran commented Aug 21, 2023 •

edited

Loading

vipulnaik commented Aug 21, 2023

yonran commented Aug 21, 2023 •

edited

Loading

bradchoate commented Aug 24, 2023

Fix: reduce “Waiting on concurrency group” #8

Fix: reduce “Waiting on concurrency group” #8

Conversation

yonran commented Aug 18, 2023 • edited Loading

fix: make deploy steps from different builds unordered

fix: Allow concurrent validation_test steps

stanimoto commented Aug 18, 2023

yonran commented Aug 18, 2023

stanimoto commented Aug 18, 2023

yonran commented Aug 18, 2023 • edited Loading

stanimoto commented Aug 18, 2023

yonran commented Aug 21, 2023 • edited Loading

vipulnaik commented Aug 21, 2023

yonran commented Aug 21, 2023 • edited Loading

bradchoate commented Aug 24, 2023

yonran commented Aug 18, 2023 •

edited

Loading

yonran commented Aug 18, 2023 •

edited

Loading

yonran commented Aug 21, 2023 •

edited

Loading

yonran commented Aug 21, 2023 •

edited

Loading