
Query Frontend: Job weights #4076

Draft · wants to merge 19 commits into main
Conversation

joe-elliott
Member

What this PR does:
The query frontend treats all jobs as the same size when it farms them out to the queriers. This can cause querier instability b/c some jobs actually require quite a bit more resources to execute. By assigning weights to jobs we can reduce the amount of work each querier is asked to do at once (a rough sketch follows the list below), which will hopefully:

  1. reduce querier OOMs/timeouts/retries
  2. reduce querier latency
  3. increase total throughput
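
A rough sketch of the idea, using hypothetical names rather than Tempo's actual types:

package jobsketch

// job is illustrative only: it carries a weight that reflects how
// expensive the job is expected to be for a querier.
type job struct {
    name   string
    weight int
}

// weightFor assigns a rough cost per job type; the exact values are the
// part that still needs balancing (see the TODO section).
func weightFor(jobType string) int {
    switch jobType {
    case "trace_by_id":
        return 2 // trace-by-ID jobs were noticeably heavier in testing
    default:
        return 1
    }
}

With something like this, a batch's capacity is measured in accumulated weight rather than job count, which is what the queue change discussed below builds on.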

Other changes

  • Removed the roundtripper httpgrpc bridge and pushed the concept of pipeline.Request all the way down into the cortex frontend code. This can be a nice perf improvement b/c translating http -> httpgrpc is costly and we are pushing it to the last moment. Currently for some queries we are translating thousands of jobs and then throwing them away.
  • Removed redundant parseQuery and createFetchSpansRequest to consolidate on the Compile function in pkg/traceql
  • Check for context error before going through retry logic in retryWare. This causes retry metrics to be more accurate in the event of many cancelled jobs (a rough sketch of this check follows this list).
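
The retryWare change boils down to the check sketched below; this is a generic retry loop with assumed names, not the actual middleware code:

package retrysketch

import "context"

// doWithRetries is an illustrative sketch, not Tempo's actual retryWare:
// it checks the caller's context before each attempt so a cancelled job
// returns immediately instead of being counted as another retry, which
// keeps retry metrics closer to real failures.
func doWithRetries(ctx context.Context, maxRetries int, attempt func() error) error {
    var err error
    for i := 0; i <= maxRetries; i++ {
        if ctxErr := ctx.Err(); ctxErr != nil {
            return ctxErr
        }
        if err = attempt(); err == nil {
            return nil
        }
    }
    return err
}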

TODO

  • Fix existing tests
  • Add tests for two bits of functionality marked PRTODO
  • Balance weights. Potentially make them configurable.

Testing so far

  • Setting the trace by ID weight to 2 showed considerable performance improvement over main
  • The search weights seemed overly tuned. They reduced batch sizes considerably, causing an overall lower query latency. We should ease up on these weights.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

}
totalWeight += weight

if totalWeight >= requestedCount {
Contributor


I think this makes sense. I suppose what we're saying here is that we request of this batch a certain high water mark of work that we're willing to take, and the weight increases the notion of complexity for a single item above this threshold. Implicitly here I suppose is that weight and requestedCount are of the same unit of measure.

Member Author


Implicitly here I suppose is that weight and requestedCount are of the same unit of measure.

yes! currently all jobs fill a single "slot" in the batch. the "weight" is basically just making it fill more slots.
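
In other words, something like this simplified, hypothetical sketch of slot filling:

package batchsketch

// job is a stand-in for whatever the queue actually batches; only the
// weight matters for this sketch.
type job struct {
    weight int
}

// buildBatch appends jobs until the accumulated weight reaches the
// requested count, so a job with weight 2 consumes two "slots" while a
// weight-1 job consumes one.
func buildBatch(jobs []job, requestedCount int) []job {
    batch := make([]job, 0, requestedCount)
    totalWeight := 0
    for _, j := range jobs {
        batch = append(batch, j)
        totalWeight += j.weight
        if totalWeight >= requestedCount {
            break
        }
    }
    return batch
}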

}
}

if conditions > 4 { // yay, magic!
Contributor


A fine starting point. I was wondering if each condition is weight++, and maybe each regex is weight+2 or some such. It means for the queue logic that if any condition is present, we'll never consume the entire requested batch. 🤔
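
Something along those lines could look like the following hedged sketch (not the PR's actual weighting code; names and increments are assumptions):

package weightsketch

// weightForSearch bumps the weight once per condition and twice more per
// regex condition, rather than using a single magic threshold; the exact
// increments are illustrative and would need the tuning called out in
// the TODOs.
func weightForSearch(conditions, regexConditions int) int {
    weight := 1
    weight += conditions
    weight += 2 * regexConditions
    return weight
}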

if query.Has("query") {
    traceQLQuery = query.Get("query")
}
if traceQLQuery != "" {
Member Author


nit: for ease of reading I prefer:

if traceQLQuery == "" {
   req.SetWeight(TraceQLSearchWeight)
   return
}

...

this reduces nesting of the code below and very clearly communicates the logic taken when the query is not found
