Prototype of a rate-limiter intended to favor workflows getting history over polling for new workflows. #1221

Draft
Groxx wants to merge 1 commit into base: master
Conversation

@Groxx (Contributor) commented Mar 1, 2023

This is to address an explosion of GetWorkflowExecutionHistory requests in one of our internal domains.

"Explosion" to the tune of: normally a couple hundred per second, but during this issue we saw up to ~100,000/s. Thankfully our server-side ratelimiters shed that just fine, and the client service didn't have a noticeable cpu increase either (to be fair, they were likely mostly waiting in back-off).

A larger description will come after I get some more sleep, but the quick and dirty summary is:

  • they had many "live" workflows
  • they started to build up a decision-schedule queue
  • slowing them down
  • overloading caches, causing a lot of un-cached decisions
  • ... leading to a lot of history iterators in new workflows looping, trying to load history, and getting ratelimited...
  • ... causing more to loop and try to load history...
  • ... slowing things down further and making it worse.

Ultimately the root issue is that these history-loading requests are not limited at all except by the sticky cache size... which there are good reasons to keep as high as is practical, but doing so risks extreme request rates like this.

Decision tasks were regularly >10 minutes, just trying to load history.

So this is an attempt to prevent that from happening.
It's not yet complete; so far it just contains the limiter I'm planning, plus tests.


My plan for mitigating / solving that explosion is to allow probably 10 history requests per poller, intentionally starving polls quite badly when a lot of history is being requested.

And then do something like NewWeightedPoller(pollWeight: 10, historyWeight: 1, maxResources: 29) to allow 9 history requests even when both polls are running, but let history stop polls if there are a lot of them.

The 10 / 29 / etc. values will likely need to be found experimentally, though. That's just a blind guess that's sure to stop the problem, but not necessarily to perform well.
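As a rough illustration of those numbers, a weighted semaphore gives the basic shape (a minimal sketch using golang.org/x/sync/semaphore; the names here, like the NewWeightedPoller shape above, are hypothetical and not this PR's API):

```go
// Sketch only: polls cost 10, history requests cost 1, 29 resources total.
// Two concurrent polls (2x10) leave 9 slots for history requests, and a pile
// of in-flight history requests can delay a poll, matching the plan above.
package sketch

import (
	"context"

	"golang.org/x/sync/semaphore"
)

const (
	pollWeight    = 10
	historyWeight = 1
	maxResources  = 29
)

var shared = semaphore.NewWeighted(maxResources)

func doPoll(ctx context.Context) error {
	if err := shared.Acquire(ctx, pollWeight); err != nil {
		return err // context canceled/expired while waiting
	}
	defer shared.Release(pollWeight)
	// ... issue the poll request ...
	return nil
}

func loadHistoryPage(ctx context.Context) error {
	if err := shared.Acquire(ctx, historyWeight); err != nil {
		return err
	}
	defer shared.Release(historyWeight)
	// ... load one page of history ...
	return nil
}
```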

```go
if bw.pollLimiter == nil || bw.pollLimiter.Wait(bw.limiterContext) == nil {
	// TODO: block here on a shared semaphore with history-loading?
```
@Groxx (Contributor, Author) commented Mar 1, 2023

this is where I'm planning on putting it. I just need to wire it up to config and construct it in the worker-init, and do something to get it into the history-iterator loop - probably not too hard.

at this point the error-throttler has already delayed things, and we're inside both the auto-scaler and the per-second limiter, so an immediate poll is correct, and a delayed poll is... potentially exceeding per-second, but not exceeding concurrent.

this is not the most hygienic location, and I think this whole thing might be worth revamping, but I don't believe it will make things behave worse in the meantime.
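For concreteness, the wiring at that TODO could look roughly like this (bw.pahLimiter is a hypothetical field, not yet in this PR; Poll is assumed to mirror the GetHistory signature below):

```go
// Hypothetical wiring at the TODO above: claim a shared poll slot only after
// the existing per-second limiter has been satisfied.
if bw.pollLimiter == nil || bw.pollLimiter.Wait(bw.limiterContext) == nil {
	if bw.pahLimiter != nil { // assumed field holding the shared poll-and-history limiter
		ok, done := bw.pahLimiter.Poll(bw.limiterContext)
		if !ok {
			return // limiterContext ended while waiting for a slot
		}
		defer done()
	}
	// ... existing poll-and-process logic ...
}
```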

Comment on lines +241 to +242
```go
// run N times in parallel to reduce noise.
// none may have zeros, and it should be heavily skewed in favor of history requests
```
@Groxx (Contributor, Author) commented Mar 1, 2023

I've used this kind of "run, measure, retry if I can't find what I expect" approach in other concurrency-fairness-probing tests I've written (not in this repo), and it has worked out fairly well - though in those I've generally done several million or billion iterations.

I'm definitely open to ideas to improve it though. This was sort of a blind duplication, because I knew noise would be an issue here (it's inherently noisy). I'm sure there are better ways.
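The shape of that pattern, for reference (a generic sketch; runRace is a hypothetical stand-in, not this PR's test code):

```go
package pahlimiter_test

import "testing"

// Generic "run, measure, retry if inconclusive" fairness probe: any single run
// can be noisy, so only fail after several runs all miss the expectation.
// runRace is a hypothetical stand-in: it should race poll and history
// acquisitions against the limiter and report how many of each succeeded.
func probeSkew(t *testing.T, runRace func() (polls, histories int)) {
	const attempts = 5
	for i := 0; i < attempts; i++ {
		polls, histories := runRace()
		// "none may have zeros, and it should be heavily skewed in favor of history"
		if polls > 0 && histories > 5*polls {
			return // found the expected skew; pass without retrying
		}
		t.Logf("attempt %d inconclusive: polls=%d, histories=%d", i, polls, histories)
	}
	t.Fatal("no attempt showed the expected skew toward history requests")
}
```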

@@ -0,0 +1,305 @@
```go
// Package pahlimiter contains a PollAndHistoryLimiter, used to share resources between polls and history loading,
// to prevent flooding the server with history requests that will not complete in a reasonable time.
package pahlimiter
```

Can we generalize this a little more? Looks like you have some good building blocks here (a coordinator that manages a shared resource pool, a set of resource consumers, configs to tune the algorithm, etc.) that can be defined a little more abstractly to reduce redundancy.

Imagine adding a third resource consumer in here. It would lead to changing pretty much everything in this package, including the name of the package. Although polls and history fetches are the main calls that we worry about, there are also other calls (heartbeats, respond*, resetsticky, etc.) which we might consider capturing later on. I don't want to force you to over-engineer this, but it'd be good if we can at least make the user-facing API of the package more generalized. We should also be able to remove some redundancy in the init() function, as well as across the Poll() and GetHistory() functions, which are nearly identical.
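One possible shape for that more general surface (an entirely hypothetical sketch; nothing like this exists in the PR):

```go
package pahlimiter

import "context"

// Hypothetical generalization: resource kinds are weighted names, so adding a
// third consumer means a new constant and weight, not new methods and selects.
type Resource int

const (
	PollResource Resource = iota
	HistoryResource
	// later, perhaps: heartbeats, respond*, resetsticky, ...
)

type SharedLimiter interface {
	// Acquire blocks until the resource's weight is available or ctx ends.
	// done releases the claim; calls after the first have no effect.
	Acquire(ctx context.Context, r Resource) (ok bool, done func())
}

// NewSharedLimiter would replace the per-consumer constructors and methods.
func NewSharedLimiter(weights map[Resource]int, maxResources int) (SharedLimiter, error) {
	// ... one coordinator over a shared pool, as in this PR's building blocks ...
	return nil, nil // sketch only
}
```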

@Groxx (Contributor, Author) commented:

to make selects flexible in quantity involves some rather annoying reflection (or codegen)... but yeah, I think this is relatively generalizable. I'll think about it anyway; it may take some fiddling to make something "good".
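For reference, the reflection route means building the select cases at runtime with reflect.Select, roughly like this (waiters, ctx, and the import of reflect are assumed from the surrounding coordinator code):

```go
// The "rather annoying reflection": a select over N channels chosen at runtime.
cases := make([]reflect.SelectCase, 0, len(waiters)+1)
for _, ch := range waiters { // one pending-request channel per resource kind
	cases = append(cases, reflect.SelectCase{
		Dir:  reflect.SelectRecv,
		Chan: reflect.ValueOf(ch),
	})
}
cases = append(cases, reflect.SelectCase{ // plus shutdown/cancellation
	Dir:  reflect.SelectRecv,
	Chan: reflect.ValueOf(ctx.Done()),
})
chosen, _, _ := reflect.Select(cases) // blocks until some case can proceed
_ = chosen // index into waiters, or len(waiters) for cancellation
```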

```go
// The done func will release the resource - it will always be returned and can be called multiple times,
// only the first will have an effect.
// TODO: see if this is necessary... but it's easy and safe.
GetHistory(context.Context) (ok bool, done func())
```

Where exactly are you planning to call this one?

@Groxx (Contributor, Author) commented Mar 2, 2023

not entirely decided tbh.

  • a trivial answer is "inside the get-history backoff callback", but that leaves substantial gaps for polls to occur.
  • around the whole backoff callback allows us to consume this resource during retries, which seems reasonable, though it might lead to confusing metrics (pollers dropping without apparent cause, etc).
    • this is what I'm leaning towards though.
  • around the whole "process a decision task" is potentially difficult, but would be the most complete protection... but I'm not confident it would be safe, as we leak some workflow goroutines in unknown scenarios (I suspect corrupt histories or panics), and that could lead to permanently consumed resources. probably desirable, but IMO too risky without some kind of detection.
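On the TODO in the snippet above (the multi-call-safe done func): that behavior is cheap to keep via sync.Once. A sketch, with hypothetical limiter internals (l.historySlots as a buffered channel acting as a semaphore):

```go
// Sketch of the multi-call-safe done func: sync.Once makes the release
// idempotent, so callers can defer or call it repeatedly without over-releasing.
func (l *limiter) GetHistory(ctx context.Context) (ok bool, done func()) {
	select {
	case <-l.historySlots: // hypothetical: buffered channel used as a semaphore
		var once sync.Once
		return true, func() {
			once.Do(func() { l.historySlots <- struct{}{} })
		}
	case <-ctx.Done():
		return false, func() {} // still return a callable done, per the doc comment
	}
}
```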

```go
//
// All that said: this is NOT built to be a reliable blocker of polls for at least two reasons:
//   - History iterators do not hold their resources between loading (and executing) pages of history, causing a gap
//     where a poller could claim resources despite the service being "too busy" loading history from a human's view.
```

When this happens, it'll lead to an increased number of decision tasks that time out, because the bandwidth will be split across many workflows that are all trying to fetch history concurrently. Given a fixed bandwidth, when the number of actors increases, the average latency increases. Once latency is high enough to cause tasks to time out before even the replay is done, the worker throughput will drop to zero, since the Respond() calls to the cadence server will be rejected due to task timeout.

In order to prevent that, we need to guarantee that a decision task that started fetching history is prioritized over a decision task that hasn't started fetching history. So, I think what that means is that the replayer should hold the resources until it fetches all existing history pages rather than releasing & re-acquiring for each page.
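Concretely, the acquisition would move from per-page to per-replay, roughly like this (a sketch; the iterator and replay names are hypothetical, and GetHistory is the interface quoted above):

```go
// Sketch: hold the history resource for the task's entire replay rather than
// re-acquiring per page, so started tasks outrank polls for new ones.
ok, done := limiter.GetHistory(ctx)
if !ok {
	return ctx.Err() // never got a slot before the context ended
}
defer done() // released only after every page has been fetched and applied
for iter.HasNext() { // hypothetical paged-history iterator
	page, err := iter.Next()
	if err != nil {
		return err
	}
	replay(page) // hypothetical: feed the page to the workflow replayer
}
return nil
```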

@davidporter-id-au (Contributor) commented:

I would add a few brief lines about why this code is client-side and not server-side, similar to what we discussed offline here.

```go
t.Parallel()

t.Run("should not limit poll requests", func(t *testing.T) {
	t.Parallel()
```

is the duplicate use of t.Parallel here intentional?

@Groxx (Contributor, Author) commented Mar 2, 2023

yep. the "outer" one allows this top-level test to run concurrently with other top-level tests. the "inner" ones allow each of its sub-tests to run concurrently with each other inside this test.

both are needed to allow all sub-tests in this file to run simultaneously.
e.g. remove the outer one and TestHistoryLimited will start, run all sub-tests, and complete them all before TestUnlimited is allowed to unblock its parallel tests.

(pretty sure about that interpretation anyway. it has been a while since I checked, but iirc tests do not "end" and allow the next to run until all their children are complete, unless parallel. regardless, calling it 2x at any point is a fatal error, so the test framework guarantees these are not truly duplicated or it would not pass)
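Spelled out with this file's test names (a generic illustration, not the actual test body):

```go
package pahlimiter_test

import "testing"

// Both levels matter: the outer call lets this test run alongside the other
// top-level tests; the inner calls let its sub-tests run alongside each other.
func TestHistoryLimited(t *testing.T) {
	t.Parallel() // outer: runs concurrently with TestUnlimited, etc.

	t.Run("should not limit poll requests", func(t *testing.T) {
		t.Parallel() // inner: runs concurrently with this test's other sub-tests
		// ...
	})
	// further sub-tests call t.Parallel() the same way
}
```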
