
feat(metricprovider): add support for multiple formulas in dd metricprovider #3892

Open · wants to merge 1 commit into master from support-datadog-multiple-formulas

Conversation

@y-rabie (Contributor) commented Oct 13, 2024

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issue number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

This regards the Enhancement Proposal I've opened. It was not discussed, but I went ahead and made the changes.
I've written unit tests for it, and I built and tested the behavior with multiple formulas and with a single formula.

These changes are backward-compatible.

Outstanding changes that need to be addressed are in the UI: visualizing the values of all elements of the result vector, each in a separate chart, instead of just the first value.

codecov bot commented Oct 13, 2024

Codecov Report

Attention: Patch coverage is 88.73239% with 8 lines in your changes missing coverage. Please review.

Project coverage is 83.85%. Comparing base (b0d74e5) to head (0906090).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| metricproviders/datadog/datadog.go | 87.69% | 5 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3892      +/-   ##
==========================================
- Coverage   83.87%   83.85%   -0.03%     
==========================================
  Files         163      163              
  Lines       18564    18609      +45     
==========================================
+ Hits        15571    15604      +33     
- Misses       2120     2128       +8     
- Partials      873      877       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

github-actions bot commented Oct 13, 2024

Published E2E Test Results

  4 files    4 suites   3h 11m 17s ⏱️
113 tests 101 ✅  7 💤 5 ❌
462 runs  425 ✅ 28 💤 9 ❌

For more details on these failures, see this check.

Results for commit 0833186.

♻️ This comment has been updated with latest results.

github-actions bot commented Oct 13, 2024

Published Unit Test Results

2 296 tests   2 296 ✅  2m 59s ⏱️
  128 suites      0 💤
    1 files        0 ❌

Results for commit 0833186.

♻️ This comment has been updated with latest results.

@zachaller self-requested a review on October 17, 2024 15:51
@zachaller added this to the v1.8 milestone on Oct 17, 2024
@zachaller modified the milestones: v1.8, v1.9 on Dec 5, 2024
@y-rabie force-pushed the support-datadog-multiple-formulas branch 2 times, most recently from 593a099 to 33e3f77 on December 6, 2024 12:45
@zachaller force-pushed the support-datadog-multiple-formulas branch from 33e3f77 to 5ddd301 on December 6, 2024 18:07
@y-rabie (Contributor, Author) commented Dec 9, 2024

@zachaller I see this is labelled as v1.9; can it possibly be added to the v1.8 milestone?

@zachaller (Collaborator) commented

Sorry, yeah. My main reason for not putting this in 1.8 was just time (I need to get 1.8 out; it's way overdue) and not having access to, or knowledge of, Datadog as a metrics provider; hence the recent addition of plugins for metrics. It's just a lot for us to maintain.

I tend to lean on @meeech for extra input on Datadog as a provider. Getting some docs on this feature would also be required.

I did already cut 1.8.0-rc1 and don't really want to add a feat into an already-cut release, but let me stew on it a bit, and let's get this into master until then.

@zachaller force-pushed the support-datadog-multiple-formulas branch from 5ddd301 to fc758ff on December 11, 2024 21:12
@meeech (Contributor) left a comment

Hey thanks for this. Cool stuff.

So, a little nit, and a bigger question around how default and isNil have been changed. They close over for a reason, and I think it may have to do with the point at which they are actually evaluated. But maybe it was done that way for no reason?

Also, I agree with @zachaller: the docs definitely need updating around this.

tbh it's not even clear to me what the use case for this is, or what issue it's solving (do you want fewer API requests to DD? do you just not like the idea of this being 2 separate metric calls?), but I'm probably missing something.

So, for sure, a docs update at a minimum, with example usage in an analysis template.

metricproviders/datadog/datadog.go (review thread outdated; resolved)
```diff
@@ -100,8 +100,8 @@ func EvalCondition(resultValue any, condition string) (bool, error) {
 	"asFloat": asFloat,
 	"isNaN": math.IsNaN,
 	"isInf": isInf,
-	"isNil": isNilFunc(resultValue),
-	"default": defaultFunc(resultValue),
+	"isNil": isNil,
```
@meeech (Contributor) commented

What's up with this change? (I looked; there are no commits I can refer to for rationale.) I'm worried we might break something, because I can't speak to test confidence across all the metric providers (@zachaller, let me know what your confidence level is there), and default is used a lot for sure. The original looks like some curried func / closure(?) to me, whereas now it isn't. I feel like a change like this can pass simple tests and still mess up in real usage.

@meeech (Contributor) commented

I did some code archaeology and can't find the rationale for doing it that way initially. It's just that, usually, when you close over a value there is a reason.

@y-rabie (Contributor, Author) commented Dec 11, 2024

Yep, I was baffled by the way it was done, because it made no actual sense to me. I had to change it, however, because this extra level of indirection made it impossible to pass a dynamic result vector to defaultFunc or isNil during expression evaluation.

So with the original function, for a result vector of length 2 (assume non-nil for now) and a condition like default(result[0], 0) > 10 && default(result[1], 0) < 4, defaultFunc takes the result vector as an argument and returns a function that just returns this result vector either way (i.e., it discards the first argument _ any). Evaluating an expression like the above would result in default(result[0], 0) returning the whole result vector, so the comparison ends up being vector > integer.

The new implementation, by contrast, actually considers the first argument, which in our case is the element result[0] or result[1] passed to it.

Originally, this didn't cause a problem, since the result was a single value rather than a vector, so doing it either way returned the same value.
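
To make the difference concrete, here is a minimal, hypothetical sketch (not code from this PR; it assumes the expr-lang/expr library that EvalCondition appears to build its env for, with made-up values and env wiring):

```go
package main

import (
	"fmt"

	"github.com/expr-lang/expr" // assumed import path; older releases used github.com/antonmedv/expr
)

func main() {
	result := []float64{12.0, 3.0}

	// Old style: "default" is a closure over the whole result vector and
	// discards the value actually passed in the expression.
	oldEnv := map[string]any{
		"result": result,
		"default": func(_ any, defaultValue any) any {
			return result // captured vector; first argument ignored
		},
	}
	// default(result[0], 0) evaluates to the whole []float64, so the
	// comparison becomes vector > integer and evaluation fails.
	if _, err := expr.Eval("default(result[0], 0) > 10", oldEnv); err != nil {
		fmt.Println("old style:", err)
	}

	// New style: "default" operates on its first argument, so each element
	// of the result vector can be defaulted and compared individually.
	newEnv := map[string]any{
		"result": result,
		"default": func(value any, defaultValue any) any {
			if value == nil {
				return defaultValue
			}
			return value
		},
	}
	out, err := expr.Eval("default(result[0], 0) > 10 && default(result[1], 0) < 4", newEnv)
	fmt.Println("new style:", out, err) // new style: true <nil>
}
```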

```diff
-func isNilFunc(resultValue any) func(any) bool {
-	return func(_ any) bool {
-		return isNil(resultValue)
-	}
-}
+func defaultFunc(resultValue any, defaultValue any) any {
```
@meeech (Contributor) commented

see comment above

@y-rabie (Contributor, Author) commented Dec 11, 2024

@meeech Sure, I'll update the docs with clear examples, but for now, a clearer description of an example use case for this is having a condition on two different (datadog) metrics simultaneously. I think the example in the enhancement proposal above should be sufficiently clear, but reframing it here:

Currently, you can have multiple queries (every query is one dd metric), combine those queries into a formula, then write a condition over that formula.

This new field makes it possible to have multiple queries (thus multiple dd metrics) and assert a combined condition over them, each individually, by making every one of the queries its own formula, i.e.:

```yaml
- metric: dummy
  failureCondition: result[0] > 10 && result[1] < 5
  provider:
    datadog:
      queries:
        a: "sum:something{}"
        b: "avg:another{}"
      formulas:
        - "a"
        - "b"
```

(And obviously, the formulas themselves can be more complex.)

I think the example in the proposal is really a common one, which is: fail if the rate of success (x) is below a threshold, but only if the total volume (y) is significant enough. This coupling is not possible with only one formula/query per analysis metric, because if you put the separate conditions in two separate analysis metrics, what you get is: if x fails OR y fails, the whole analysis fails. What we need is: if x fails (low rate) AND y fails (significant volume), the whole analysis fails.
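
For concreteness, a sketch of that rate/volume example as an analysis metric. This is hypothetical, not from the PR: the metric name, queries, and thresholds are invented, and the apiVersion: v2 field is my assumption about where queries/formulas are supported.

```yaml
- metric: success-rate-with-volume-guard
  # fail only when the success rate is low AND the traffic volume is significant
  failureCondition: result[0] < 0.95 && result[1] > 1000
  provider:
    datadog:
      apiVersion: v2  # assumed: queries/formulas are a v2-API feature
      queries:
        success: "sum:requests.success{*}.as_count()"
        total: "sum:requests.total{*}.as_count()"
      formulas:
        - "success / total"  # x: rate of success
        - "total"            # y: total volume
```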

This is the simplest example of the more complex conditions that multiple formulas make possible. You can basically do any datadog queries you'd normally write in dd monitors here.

@y-rabie force-pushed the support-datadog-multiple-formulas branch from fc758ff to 0833186 on December 11, 2024 22:54
@meeech (Contributor) commented Dec 11, 2024

@y-rabie sorry about that. I clicked the enhancement proposal link in the checklist and missed the link in the body :/ I will have a look at it.

@meeech (Contributor) commented Dec 11, 2024

So you make some interesting points (I was going to add this to the proposal, but it's maybe better to just keep the conversation here at this point).

> which is: fail if the rate of success (x) is below a threshold, but only if the total volume (y) is significant enough.

If it's not significant enough, then why are we trusting this number? This seems like a case where setting up an inconclusive result makes sense, so a human can come in and judge. Like, if I have a higher-than-allowed error rate but not a ton of traffic, is this something I want to see blindly proceed? Or maybe an analysis run step makes sense in that scenario, one that waits until we cross the collection threshold? I guess I don't see the scenario where my error rate is higher than desired but, since I have low traffic, I ignore it. If anything, it tells me my soak times aren't long enough.

But the more interesting point you mention is this:

> You can basically do any datadog queries you'd normally write in dd monitors here.

I assume you would still write those monitors? (Which leads to having to keep them in step with your analysis templates.) Rather than adding this complexity around multiple formulas, would it be more interesting and useful to add support for this endpoint, https://api.datadoghq.com/api/v1/monitor/{monitor_id}, so you could make analysis templates that are keyed to a monitor?

@y-rabie (Contributor, Author) commented Dec 11, 2024

@meeech I agree, those are valid points regarding the example. I mean, maybe analysis steps are not always ideal when you need a continuously running analysis during the whole rollout.

But I digress. I realize critiquing the example helps in judging whether there is ever a use case for this, but I just don't think it's far-fetched that I'd need this sort of flexible, combined condition over two (or more) different metrics. My brain can't come up with another example now. But generally, the most basic thing is that I should be able to have this sort of ANDing, instead of ORing, of failures, and this cannot be done at the level of the Analysis metrics (nor should it be, I think).

I'll give it more thought and I'll add any other example I can come up with 😄

Really interesting thought about the monitors part. I'm afraid that, semantically, it might be overloading the Analysis concept in Argo? It's metrics and measurements. Argo integrates with datadog as a metric provider; having something like monitors for datadog alone breaks this concept. If it were added, should there be something to prevent Alertmanager alerts from being integrated with Argo down the line?

@zachaller (Collaborator) commented

Just throwing this out there: I am not exactly sure what Datadog monitors are, but based on what you two have said, it sounds to me like they could/should be implemented as an analysis run plugin rather than added directly to the current dd provider. I would say the same thing about Alertmanager and the built-in prometheus provider.

@meeech (Contributor) commented Dec 12, 2024

> I mean, maybe analysis steps are not always ideal when you need a continuously running analysis during the whole rollout.

Agree. In this scenario, I was envisioning them being used in tandem, specifically around ensuring you have hit your required traffic count while the error-count analysis is running in the background.

> ... My brain can't come up with another example now. ...

This is a trap we all fall into. :D I think we can always rationalize why we might need something (and I don't debate the utility of ANDing and ORing).

I guess I'm pushing back here only because this adds complexity to the dd metric provider (which then has to be maintained moving forward), and it would add complexity to the analysis templates themselves.

I don't think it overloads the analysis concept. For me, the analysis concept is: take this outside signal and use it to evaluate the health of my release. It's totally reasonable to say I don't want anything releasing when I'm in an emergency state. What I'm describing could probably already be accomplished with the webhook provider; that's as generic as they come.

The only reason I suggest putting it in the DD metric provider is that it's a good fit, since it already has the access it needs. Otherwise, it could be implemented with the new analysis provider plugin.
