Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] add TTL to rollup-jobs #369

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

speezepearson
Copy link
Contributor

Now that rollup-jobs have retry logic, Gil points out that we probably want some safeguard against the jobs rattling around forever. Maybe we should assign each job an Instant to give up after?

Failure modes this defends against:

  • An error in the job-splitting logic results in a job being retried indefinitely, thereby clogging up one or more of our RollupExecutor actors.

Failure modes it doesn't:

  • Kairos gets super slow for some reason, making a lot of jobs time out when they "shouldn't", resulting in a backlog of millions of tiny jobs that will clog the system even after Kairos comes back up.

Failure modes it makes worse:

  • Rollups fall far behind, without MPortal going down, so we age jobs out of a legitimate backlog when the right thing to do is chew through them.

I'm not delighted by this tradeoff. It seems fundamentally difficult to distinguish between "a large legitimate backlog because we're behind" and "a large illegitimate backlog caused by some job being broken." Thoughts on this would be appreciated; or maybe I'll revisit this and see some clever solution later this week.

@orborde
Copy link
Contributor

orborde commented May 5, 2020

Going mostly off the change description:

This change confuses me because it doesn't seem to affect the "bottom line" of what happens when rollups are falling behind or when a rollup is stuck in a retry loop indefinitely.

Prior to this change, it seems like that bottom line is "there will be periods in the rollup data series where data is missing, and users querying that interval will see gaps in the graph, at least until someone unwedges the rollup infrastructure and allows it to backfill".

After this change, it ... seems like the bottom line is the same? Except that you'd need to kick off the backfill manually, since some of the rollup jobs probably aged out due to this TTL?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants