Prototype of a rate-limiter intended to favor workflows getting history over polling for new workflows. #1221

Draft
Groxx wants to merge 1 commit into base: master
Conversation

@Groxx (Contributor) commented Mar 1, 2023

This is to address an explosion of GetWorkflowExecutionHistory requests in one of our internal domains.

"Explosion" to the tune of: normally a couple hundred per second, but during this issue we saw up to ~100,000/s. Thankfully our server-side ratelimiters shed that just fine, and the client service didn't have a noticeable cpu increase either (to be fair, they were likely mostly waiting in back-off).

A larger description will come after I get some more sleep, but the quick and dirty summary is:

  • they had many "live" workflows
  • they started to build up a decision-schedule queue
  • slowing them down
  • overloading caches, causing a lot of un-cached decisions
  • ... leading to a lot of history iterators in new workflows looping, trying to load history, and getting ratelimited...
  • ... causing more to loop and try to load history...
  • ... slowing things down further and making it worse.

Ultimately the root issue is that these history-loading requests are not limited at all except by the sticky cache size... which there are good reasons to keep as high as is practical, but doing so risks extreme request rates like this.

Decision tasks were regularly >10 minutes, just trying to load history.

So this is an attempt to prevent that from happening.
It's not yet complete; so far it just contains the limiter I'm planning, plus tests.


My plan for mitigating / solving that explosion is to allow probably 10 history requests per poller, intentionally starving polls quite badly when a lot of history is being requested.

And then do something like NewWeightedPoller(pollWeight: 10, historyWeight: 1, maxResources: 29) to allow 9 history requests even when both polls are running, but let history stop polls if there are a lot of them.

The 10 / 29 / etc. values will likely need to be found experimentally, though. That's just a blind guess that's sure to stop the problem, but not necessarily to perform well.
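As a rough illustration of those numbers, a weighted semaphore gives the basic shape (a minimal sketch using golang.org/x/sync/semaphore; the names here, like the NewWeightedPoller shape above, are hypothetical and not this PR's API):

```go
// Sketch only: polls cost 10, history requests cost 1, 29 resources total.
// Two concurrent polls (2x10) leave 9 slots for history requests, and a pile
// of in-flight history requests can delay a poll, matching the plan above.
package sketch

import (
	"context"

	"golang.org/x/sync/semaphore"
)

const (
	pollWeight    = 10
	historyWeight = 1
	maxResources  = 29
)

var shared = semaphore.NewWeighted(maxResources)

func doPoll(ctx context.Context) error {
	if err := shared.Acquire(ctx, pollWeight); err != nil {
		return err // context canceled/expired while waiting
	}
	defer shared.Release(pollWeight)
	// ... issue the poll request ...
	return nil
}

func loadHistoryPage(ctx context.Context) error {
	if err := shared.Acquire(ctx, historyWeight); err != nil {
		return err
	}
	defer shared.Release(historyWeight)
	// ... load one page of history ...
	return nil
}
```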

```go
if bw.pollLimiter == nil || bw.pollLimiter.Wait(bw.limiterContext) == nil {
	// TODO: block here on a shared semaphore with history-loading?
```
@Groxx (Contributor, Author) commented Mar 1, 2023

this is where I'm planning on putting it. I just need to wire it up to config and construct it in the worker-init, and do something to get it into the history-iterator loop - probably not too hard.

at this point the error-throttler has already delayed things, and we're inside both the auto-scaler and the per-second limiter, so an immediate poll is correct, and a delayed poll is... potentially exceeding per-second, but not exceeding concurrent.

this is not the most hygienic location, and I think this whole thing might be worth revamping, but I don't believe it will make things behave worse in the meantime.
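For concreteness, the wiring at that TODO could look roughly like this (bw.pahLimiter is a hypothetical field, not yet in this PR; Poll is assumed to mirror the GetHistory signature below):

```go
// Hypothetical wiring at the TODO above: claim a shared poll slot only after
// the existing per-second limiter has been satisfied.
if bw.pollLimiter == nil || bw.pollLimiter.Wait(bw.limiterContext) == nil {
	if bw.pahLimiter != nil { // assumed field holding the shared poll-and-history limiter
		ok, done := bw.pahLimiter.Poll(bw.limiterContext)
		if !ok {
			return // limiterContext ended while waiting for a slot
		}
		defer done()
	}
	// ... existing poll-and-process logic ...
}
```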

Comment on lines +241 to +242
```go
// run N times in parallel to reduce noise.
// none may have zeros, and it should be heavily skewed in favor of history requests
```
@Groxx (Contributor, Author) commented Mar 1, 2023

I've used this kind of "run, measure, retry if I can't find what I expect" approach in other concurrency-fairness-probing tests I've written (not in this repo), and it has worked out fairly well - though in those I've generally done several million or billion iterations.

I'm definitely open to ideas to improve it though. This was sort of a blind duplication, because I knew noise would be an issue here (it's inherently noisy). I'm sure there are better ways.
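The shape of that pattern, for reference (a generic sketch; runRace is a hypothetical stand-in, not this PR's test code):

```go
package pahlimiter_test

import "testing"

// Generic "run, measure, retry if inconclusive" fairness probe: any single run
// can be noisy, so only fail after several runs all miss the expectation.
// runRace is a hypothetical stand-in: it should race poll and history
// acquisitions against the limiter and report how many of each succeeded.
func probeSkew(t *testing.T, runRace func() (polls, histories int)) {
	const attempts = 5
	for i := 0; i < attempts; i++ {
		polls, histories := runRace()
		// "none may have zeros, and it should be heavily skewed in favor of history"
		if polls > 0 && histories > 5*polls {
			return // found the expected skew; pass without retrying
		}
		t.Logf("attempt %d inconclusive: polls=%d, histories=%d", i, polls, histories)
	}
	t.Fatal("no attempt showed the expected skew toward history requests")
}
```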

@@ -0,0 +1,305 @@
```go
// Package pahlimiter contains a PollAndHistoryLimiter, used to share resources between polls and history loading,
// to prevent flooding the server with history requests that will not complete in a reasonable time.
package pahlimiter
```

Can we generalize this a little more? Looks like you have some good building blocks here (a coordinator that manages a shared resource pool, a set of resource consumers, configs to tune the algorithm, etc.) that can be defined a little more abstractly to reduce redundancy.

Imagine adding a third resource consumer in here. It would lead to changing pretty much everything in this package, including the name of the package. Although polls and history fetches are the main calls that we worry about, there are also other calls (heartbeats, respond*, resetsticky, etc.) which we might consider capturing later on. I don't want to force you to over-engineer this, but it'd be good if we can at least make the user-facing API of the package more generalized. We should also be able to remove some redundancy in the init() function, as well as across the Poll() and GetHistory() functions, which are nearly identical.
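One possible shape for that more general surface (an entirely hypothetical sketch; nothing like this exists in the PR):

```go
package pahlimiter

import "context"

// Hypothetical generalization: resource kinds are weighted names, so adding a
// third consumer means a new constant and weight, not new methods and selects.
type Resource int

const (
	PollResource Resource = iota
	HistoryResource
	// later, perhaps: heartbeats, respond*, resetsticky, ...
)

type SharedLimiter interface {
	// Acquire blocks until the resource's weight is available or ctx ends.
	// done releases the claim; calls after the first have no effect.
	Acquire(ctx context.Context, r Resource) (ok bool, done func())
}

// NewSharedLimiter would replace the per-consumer constructors and methods.
func NewSharedLimiter(weights map[Resource]int, maxResources int) (SharedLimiter, error) {
	// ... one coordinator over a shared pool, as in this PR's building blocks ...
	return nil, nil // sketch only
}
```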

@Groxx (Contributor, Author) commented:

to make selects flexible in quantity involves some rather annoying reflection (or codegen)... but yeah, I think this is relatively generalizable. I'll think about it anyway; it may take some fiddling to make something "good".
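For reference, the reflection route means building the select cases at runtime with reflect.Select, roughly like this (waiters, ctx, and the import of reflect are assumed from the surrounding coordinator code):

```go
// The "rather annoying reflection": a select over N channels chosen at runtime.
cases := make([]reflect.SelectCase, 0, len(waiters)+1)
for _, ch := range waiters { // one pending-request channel per resource kind
	cases = append(cases, reflect.SelectCase{
		Dir:  reflect.SelectRecv,
		Chan: reflect.ValueOf(ch),
	})
}
cases = append(cases, reflect.SelectCase{ // plus shutdown/cancellation
	Dir:  reflect.SelectRecv,
	Chan: reflect.ValueOf(ctx.Done()),
})
chosen, _, _ := reflect.Select(cases) // blocks until some case can proceed
_ = chosen // index into waiters, or len(waiters) for cancellation
```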

```go
// The done func will release the resource - it will always be returned and can be called multiple times,
// only the first will have an effect.
// TODO: see if this is necessary... but it's easy and safe.
GetHistory(context.Context) (ok bool, done func())
```

Where exactly are you planning to call this one?

@Groxx (Contributor, Author) commented Mar 2, 2023

not entirely decided tbh.

  • a trivial answer is "inside the get-history backoff callback", but that leaves substantial gaps for polls to occur.
  • around the whole backoff callback allows us to consume this resource during retries, which seems reasonable, though it might lead to confusing metrics (pollers dropping without apparent cause, etc).
    • this is what I'm leaning towards though.
  • around the whole "process a decision task" is potentially difficult, but would be the most complete protection... but I'm not confident it would be safe, as we leak some workflow goroutines in unknown scenarios (I suspect corrupt histories or panics), and that could lead to permanently consumed resources. probably desirable, but IMO too risky without some kind of detection.
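On the TODO in the snippet above (the multi-call-safe done func): that behavior is cheap to keep via sync.Once. A sketch, with hypothetical limiter internals (l.historySlots as a buffered channel acting as a semaphore):

```go
// Sketch of the multi-call-safe done func: sync.Once makes the release
// idempotent, so callers can defer or call it repeatedly without over-releasing.
func (l *limiter) GetHistory(ctx context.Context) (ok bool, done func()) {
	select {
	case <-l.historySlots: // hypothetical: buffered channel used as a semaphore
		var once sync.Once
		return true, func() {
			once.Do(func() { l.historySlots <- struct{}{} })
		}
	case <-ctx.Done():
		return false, func() {} // still return a callable done, per the doc comment
	}
}
```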

```go
//
// All that said: this is NOT built to be a reliable blocker of polls for at least two reasons:
//   - History iterators do not hold their resources between loading (and executing) pages of history, causing a gap
//     where a poller could claim resources despite the service being "too busy" loading history from a human's view.
```

When this happens, it'll lead to an increased number of decision tasks that time out, because the bandwidth will be split across many workflows that are all trying to fetch history concurrently. Given a fixed bandwidth, when the number of actors increases, the average latency increases. Once latency is high enough to cause tasks to time out before even the replay is done, the worker throughput will drop to zero, since the Respond() calls to the cadence server will be rejected due to task timeout.

In order to prevent that, we need to guarantee that a decision task that started fetching history is prioritized over a decision task that hasn't started fetching history. So, I think what that means is that the replayer should hold the resources until it fetches all existing history pages rather than releasing & re-acquiring for each page.
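Concretely, the acquisition would move from per-page to per-replay, roughly like this (a sketch; the iterator and replay names are hypothetical, and GetHistory is the interface quoted above):

```go
// Sketch: hold the history resource for the task's entire replay rather than
// re-acquiring per page, so started tasks outrank polls for new ones.
ok, done := limiter.GetHistory(ctx)
if !ok {
	return ctx.Err() // never got a slot before the context ended
}
defer done() // released only after every page has been fetched and applied
for iter.HasNext() { // hypothetical paged-history iterator
	page, err := iter.Next()
	if err != nil {
		return err
	}
	replay(page) // hypothetical: feed the page to the workflow replayer
}
return nil
```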

@davidporter-id-au (Contributor) commented:

I would add a few brief lines about why this code is client-side and not server-side, similar to what we discussed offline here.

```go
t.Parallel()

t.Run("should not limit poll requests", func(t *testing.T) {
	t.Parallel()
```

is the duplicate use of t.Parallel here intentional?

@Groxx (Contributor, Author) commented Mar 2, 2023

yep. the "outer" one allows this top-level test to run concurrently with other top-level tests. the "inner" ones allow each of its sub-tests to run concurrently with each other inside this test.

both are needed to allow all sub-tests in this file to run simultaneously.
e.g. remove the outer one and TestHistoryLimited will start, run all sub-tests, and complete them all before TestUnlimited is allowed to unblock its parallel tests.

(pretty sure about that interpretation anyway. it has been a while since I checked, but iirc tests do not "end" and allow the next to run until all their children are complete, unless parallel. regardless, calling it 2x at any point is a fatal error, so the test framework guarantees these are not truly duplicated or it would not pass)
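Spelled out with this file's test names (a generic illustration, not the actual test body):

```go
package pahlimiter_test

import "testing"

// Both levels matter: the outer call lets this test run alongside the other
// top-level tests; the inner calls let its sub-tests run alongside each other.
func TestHistoryLimited(t *testing.T) {
	t.Parallel() // outer: runs concurrently with TestUnlimited, etc.

	t.Run("should not limit poll requests", func(t *testing.T) {
		t.Parallel() // inner: runs concurrently with this test's other sub-tests
		// ...
	})
	// further sub-tests call t.Parallel() the same way
}
```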
