Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Signficantly improve inhibitor performance via new cache datastructure #4134

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

Spaceman1701
Copy link
Contributor

@Spaceman1701 Spaceman1701 commented Nov 22, 2024

This PR improves or matches performance of the existing inhibitor in almost all benchmark cases. When each inhibition rule has many inhibiting alerts, performance is improved by several orders of magnitude. No change to the inhibitor interface is necessary.

                                                     │    main.txt    │    icache-for-upstream-final.txt    │
                                                     │     sec/op     │    sec/op     vs base               │
Mutes/1_inhibition_rule,_1_inhibiting_alert-4            1.865µ ±  4%   1.597µ ±  7%  -14.35% (p=0.002 n=6)
Mutes/10_inhibition_rules,_1_inhibiting_alert-4          2.007µ ±  7%   1.464µ ±  5%  -27.04% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1_inhibiting_alert-4         1.941µ ±  7%   1.447µ ± 16%  -25.46% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-4        1.884µ ±  3%   1.422µ ± 10%  -24.55% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-4       1.879µ ± 13%   1.419µ ±  4%  -24.49% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-4          1.973µ ±  8%   1.404µ ±  4%  -28.86% (p=0.002 n=6)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-4         3.726µ ± 10%   1.348µ ± 13%  -63.84% (p=0.002 n=6)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-4       21.485µ ±  5%   1.411µ ±  3%  -93.43% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-4     184.245µ ±  9%   1.489µ ±  6%  -99.19% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-4    20.453µ ± 14%   1.409µ ±  5%  -93.11% (p=0.002 n=6)
Mutes/10_inhibition_rules,_last_rule_matches-4           2.226µ ±  8%   1.912µ ±  8%  -14.13% (p=0.002 n=6)
Mutes/100_inhibition_rules,_last_rule_matches-4          6.959µ ±  2%   6.293µ ±  5%   -9.57% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_last_rule_matches-4         55.08µ ±  4%   48.42µ ±  3%  -12.10% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_last_rule_matches-4        534.1µ ±  7%   488.9µ ±  4%   -8.45% (p=0.002 n=6)
geomean                                                  8.267µ         3.181µ        -61.52%                                            8.267µ         3.364µ        -59.31%

The new implementation replaces the InhibitRule's scache with an icache (meaning "intersection cache") which pre-calculates the equal label values a target alert would need to be inhibited by each inhibitor alert. Practically, this is implemented by calculating a fingerprint for just the equalLabels of each inhibiting alert and storing that alert in a map keyed by the equalLabels fingerprint. This allows O(1) access to the set of inhibiting alerts for an InhibitRule in the Inhibitor.Mutes method. The new data structure is essentially just:

map[model.Fingerpint]map[model.Fingerprint]*types.Alert
// equaLabelsFp -> fp -> alert

To insert a new inhibitor alert "A":

  1. does "A" match the source labels? If not, exit
  2. make a LabelSet of just the subset of A's labels which are in the InhibitRule's equal_labels. Compute the fingerpint for this LabelSet and call it "E"
  3. find the entry in the icache for E and insert A into it using A's fingerprint

To check if an alert "T" is inhibited:

  1. does T match the target labels? If not, exit. T is not inhibited
  2. make a LabelSet of just the subset of T's labels which are in the InhibitRule's equal_labels. Compute the fingerpint for this LabelSet and call it "E"
  3. does an entry exist in the InhibitRule's icache for E? If not, exit. T is not inhibited
  4. Find the first non-resolved alert in the icache entry and return it. T is inhibited by this alert.

The new implementation has a few other minor improvements which help performance: it reduces the number of time an inhibitor alert's fingerprint is calculated by caching it in the icache, it pre-calculates whether an inhibitor matches the source and target matchers for an inhibit rules, it calls time.Now() less frequently (similar to #4119), and it avoids checking if a target alert matches the source matchers unless necessary.

In the real world, this change helps most when an inhibit rule covers many inhibitors and target alerts. For example, a rule where all critical alerts inhibit all warning alerts with the same instance and alertname can result in many possible inhibitors attached to a rule where each inhibitor alert only inhibits one target alert. The old implementation of the inhibitor is extremely inefficient in this case: assuming N inhibiting alerts and a fixed number of equalLabels, it performs O(N) string comparisons per target. The new implementation is effectively O(1) in this case (except in a pathological case where all alerts in the inhibitor are resolved, since the new implementation still has to iterate past resolved alerts).

The new implementation does require slightly more memory per inhibiting alert than the old one: we store the alert's fingerprint and a boolean to check if the inhibitor matches both the source and the target. This works out to be about 16 extra bytes per cached inhibitor alert after accounting for padding. For 10,000 alerts, that's about 160KB. I think this is a totally acceptable increase in memory requirements.

We've been running a very similar patch in our production environment which adds all inhibiting alerts to the marker for more than a month. In our environment this still results in a more then 4x reduction in time spent in the Inhibitor.Mutes method.

This change was motivated by observations which show a large amount of CPU spent doing string comparisons in the inhibitor. It seems like we're not the only ones who have been looking for this, and I think this might be a good alternative to #3933.

@Spaceman1701 Spaceman1701 force-pushed the icache-with-first-result branch from 6b19c38 to 30e3747 Compare November 22, 2024 21:39
@Spaceman1701
Copy link
Contributor Author

cc @grobinson-grafana or @gotjosh is there any interest in this PR? Or something I can do to make this easier to review?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant