Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: reduce “Waiting on concurrency group” #8

Merged
merged 3 commits into from
Sep 1, 2023

Conversation

yonran
Copy link
Contributor

@yonran yonran commented Aug 18, 2023

fix: make deploy steps from different builds unordered

When setting the attribute concurrency_group on deploy steps, set concurrency_method: eager. By default (ordered), a BuildKite build’s deploy steps will wait for all the smoke_test and build steps of a previous build. This is a problem in rollbacks, which run much faster than normal builds because they have no smoke_test or build steps, so they are likely to reach their deploy steps before the previous build does.

This will make concurrency: 1 act in a similar manner to controlling concurrency by limiting the number of deploy queues. See BuildKite Docs: Controlling Concurrency: Controlling command Order.

fix: Allow concurrent validation_test steps

(this commit was removed)

Set the concurrency and concurrency_group attributes only on deploy steps, not validation_test, so that a validation_test step does not block another BuildKite build’s deploy step. Note that one build’s validation_test may fail if another build deploys some other code, but that is something we will just have to be aware of when two deploys happen right after each other.

This will make the behavior more similar to the previous behavior. validation_test runs on a different queue than deploy so it never used to block deploy before we added concurrency_group.

Yonathan Randolph added 2 commits August 18, 2023 02:21
When setting the attribute concurrency_group on deploy steps, set concurrency_method: eager. By default (ordered), a BuildKite build’s deploy steps will wait for all the smoke_test and build steps of a previous build. This is a problem in rollbacks, which run much faster than normal builds because they have no smoke_test or build steps, so they are likely to reach their deploy steps before the previous build does.
Set the concurrency and concurrency_group attributes only on deploy steps, not validation_test, so that a validation_test step does not block another BuildKite build’s deploy step. Note that one build’s validation test may fail if another build deploys some other code, but that is something we will just have to be aware of when two deploys happen right after each other.
@yonran yonran requested a review from stanimoto August 18, 2023 09:47
@stanimoto
Copy link
Contributor

Note that one build’s validation_test may fail if another build deploys some other code, but that is something we will just have to be aware of when two deploys happen right after each other.

On this point, I believe this was introduced to solve that very problem, but what was the reason for having to revert? If there is a validation test that is running, but we have a deployment we want to do immediately (for rollback, etc) and do not want to be blocked, then I think we should probably cancel the validation test at that time. Is there some other reason?

@yonran
Copy link
Contributor Author

yonran commented Aug 18, 2023

Note that one build’s validation_test may fail if another build deploys some other code, but that is something we will just have to be aware of when two deploys happen right after each other.

On this point, I believe this was introduced to solve that very problem, but what was the reason for having to revert?

Yes, just to make deploys not wait for the previous validation_test.

If there is a validation test that is running, but we have a deployment we want to do immediately (for rollback, etc) and do not want to be blocked, then I think we should probably cancel the validation test at that time. Is there some other reason?

Yes, that works too. I am willing to revert that commit.

@stanimoto
Copy link
Contributor

Note that one build’s validation_test may fail if another build deploys some other code, but that is something we will just have to be aware of when two deploys happen right after each other.

On this point, I believe this was introduced to solve that very problem, but what was the reason for having to revert?

Yes, just to make deploys not wait for the previous validation_test.

Okay, thank you. I think we would want to keep the current behavior if possible. Do we need to set eager to the validation_test steps as well, and if so would it happen that the deploy from different builds could take a concurrency spot before the validation_test step runs?

This reverts commit c25c111. This was a minor speedup and we don't need it since we can easily cancel validation test steps if necessary.
@yonran
Copy link
Contributor Author

yonran commented Aug 18, 2023

Okay, thank you. I think we would want to keep the current behavior if possible. Do we need to set eager to the validation_test steps as well, and if so would it happen that the deploy from different builds could take a concurrency spot before the validation_test step runs?

Yes, I believe we should set eager on validation_test steps. Otherwise, BuildKite will see that previous build has smoke_test, build, deploy (eager),validation_test (ordered), and the next build has deploy (eager), validation_test (ordered), then I believe the next build’s deploy can run before the previous build’s deploy, but the next build’s validation_test will wait for all the previous build’s steps to run.

As for what happens when there are two steps in sequence that are both eager, unfortunately the documentation is silent. It only says that eager “removes the ordering condition for that resource”. I think it will happily interleave previous deploy, next deploy, previous validation_test, next validation_test.

In testing, I observed that when you use concurrency_group with eager, the jobs can interleave:

  • 2:48:07 Build 5 started
  • 2:48:09 Build 7 (rollback) started
  • 2:48:10 Build 7’s deploy was scheduled; Build 5’s deploy was Limited waiting for concurrency group
  • 2:50:40 Build 5’s deploy was scheduled; Build 7’s validation_test was Limited waiting for concurrency group
  • 2:53:07 Build 5’s validation_test was scheduled

@stanimoto
Copy link
Contributor

Thanks a lot for testing. So, if I understand correctly, if eager is set for either step, we can no longer expect deploy and validation_test of the same build to work as a single set, which is also similar to the previous behavior (or it's a bit worse if eager is set for both steps because the deploy from different builds completely blocks the validation_test because of concurrency group, and the validation_test will run against the deploy from different builds after all). Then, setting eager is probably not a long-term solution either, and the only way would be to change the timing of the job creation. If this does sound right to you as well, then it would be fine with me to remove concurrency group from the validation_test for the short-term.

@yonran
Copy link
Contributor Author

yonran commented Aug 21, 2023

Thanks a lot for testing. So, if I understand correctly, if eager is set for either step, we can no longer expect deploy and validation_test of the same build to work as a single set, which is also similar to the previous behavior (or it's a bit worse if eager is set for both steps because the deploy from different builds completely blocks the validation_test because of concurrency group, and the validation_test will run against the deploy from different builds after all). Then, setting eager is probably not a long-term solution either, and the only way would be to change the timing of the job creation. If this does sound right to you as well, then it would be fine with me to remove concurrency group from the validation_test for the short-term.

Correct. I think usually it will do the right thing of delaying the next deployment until the previous validation_test finishes. But there is a possibility of interleaving the next deploy (or some steps within a multistep deploy) before the previous validation_test.

I think making validation_test concurrency:1 still does some good even though it is not reliable anymore.

For our use case, I think “ordered” is what we want most of the time. One build should deploy to environment:qa (and the environment:prod build gets blocked) before the next build. It’s only rollback deploys that should cut in line before the previous build’s deploy job. I don’t know a way of letting only rollbacks run before the previous build’s deploy other than manually cancelling the previous build or changing to “eager”.

@vipulnaik
Copy link

@yonran, could we force "ordered" for deploys to QA but not for deploys to prod? Since rollbacks are prod-only, would that allow a rollback-only exception?

@yonran
Copy link
Contributor Author

yonran commented Aug 21, 2023

@yonran, could we force "ordered" for deploys to QA but not for deploys to prod? Since rollbacks are prod-only, would that allow a rollback-only exception?

Yes we could do that. That would ensure that automatic qa deploys and tests are still ordered, while cautious prod deploys use the riskier str for “Deploy #\d+ to prod.example.com” builds.

Thinking further, perhaps we could actually make the prod deploy + validation_test ordered too if we dynamically insert barriers when a deploy is starting (see Buildkite Blog (2020-11): Concurrency Gates or BuildKite Docs: Controlling concurrency: Concurrency and parallelism) (add one barrier after the block step but before the terraform apply step, and then add another barrier after the last validation_test. So the previous build would not create the barriers until it is about to do a deploy. Something to try.

@bradchoate
Copy link
Member

For our use case, I think “ordered” is what we want most of the time.

Given that, why is concurrency_method set to eager in pipeline.go if it isn't explicitly declared?

@yonran yonran merged commit dcdda65 into master Sep 1, 2023
2 checks passed
@yonran yonran deleted the deploy-prevent-concurrency branch September 1, 2023 09:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants