
feat(metricprovider): add support for multiple formulas in dd metricprovider #3892

Open · wants to merge 1 commit into master from support-datadog-multiple-formulas

Conversation

@y-rabie (Contributor) commented Oct 13, 2024

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issue number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

This regards the Enhancement Proposal I've opened. It was not discussed, but I went ahead and made the changes.
I've written unit tests for it, and I built and tested the behavior with multiple formulas and with a single formula.

These changes are backward-compatible.

Outstanding changes that need to be addressed are in the UI: visualizing the values of all elements of the result vector, each in a separate chart, instead of just the first value.

codecov bot commented Oct 13, 2024

Codecov Report

Attention: Patch coverage is 88.73239% with 8 lines in your changes missing coverage. Please review.

Project coverage is 83.85%. Comparing base (b0d74e5) to head (0906090).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| metricproviders/datadog/datadog.go | 87.69% | 5 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3892      +/-   ##
==========================================
- Coverage   83.87%   83.85%   -0.03%     
==========================================
  Files         163      163              
  Lines       18564    18609      +45     
==========================================
+ Hits        15571    15604      +33     
- Misses       2120     2128       +8     
- Partials      873      877       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

github-actions bot commented Oct 13, 2024

Published E2E Test Results

  4 files    4 suites   3h 11m 17s ⏱️
113 tests 101 ✅  7 💤 5 ❌
462 runs  425 ✅ 28 💤 9 ❌

For more details on these failures, see this check.

Results for commit 0833186.

♻️ This comment has been updated with latest results.

github-actions bot commented Oct 13, 2024

Published Unit Test Results

2 296 tests   2 296 ✅  2m 59s ⏱️
  128 suites      0 💤
    1 files        0 ❌

Results for commit 0833186.

♻️ This comment has been updated with latest results.

@zachaller self-requested a review on October 17, 2024 15:51
@zachaller added this to the v1.8 milestone on Oct 17, 2024
@zachaller modified the milestones: v1.8, v1.9 on Dec 5, 2024
@y-rabie force-pushed the support-datadog-multiple-formulas branch 2 times, most recently from 593a099 to 33e3f77 on December 6, 2024 12:45
@zachaller force-pushed the support-datadog-multiple-formulas branch from 33e3f77 to 5ddd301 on December 6, 2024 18:07
@y-rabie (Contributor, Author) commented Dec 9, 2024

@zachaller I see this is labelled as v1.9; can it possibly be added to the v1.8 milestone?

@zachaller (Collaborator) commented

Sorry, yeah. My main reason for not putting this in 1.8 was just time (I need to get 1.8 out; it's way overdue) and not having access to, or knowledge of, Datadog as a metrics provider; hence the recent addition of plugins for metrics. It's just a lot for us to maintain.

I tend to lean on @meeech for extra input on Datadog as a provider. Getting some docs on this feature would also be required.

I did already cut 1.8.0-rc1 and don't really want to add a feat into an already-cut release, but let me stew on it a bit, and let's get this into master until then.

@zachaller force-pushed the support-datadog-multiple-formulas branch from 5ddd301 to fc758ff on December 11, 2024 21:12
@meeech (Contributor) left a comment

Hey thanks for this. Cool stuff.

So, a little nit, and a bigger question around how default and isNil have been changed. They close over for a reason, and I think it may have to do with the point at which they are actually evaluated. But maybe it was done that way for no reason?

Also, I agree with @zachaller: the docs definitely need updating around this.

tbh it's not even clear to me what the use case for this is, or what issue it's solving (do you want fewer API requests to DD? do you just not like the idea of this being 2 separate metric calls?), but I'm probably missing something.

So, for sure, a docs update at a minimum, with example usage in an analysis template.

metricproviders/datadog/datadog.go (review thread outdated; resolved)
```diff
@@ -100,8 +100,8 @@ func EvalCondition(resultValue any, condition string) (bool, error) {
 	"asFloat": asFloat,
 	"isNaN": math.IsNaN,
 	"isInf": isInf,
-	"isNil": isNilFunc(resultValue),
-	"default": defaultFunc(resultValue),
+	"isNil": isNil,
```
@meeech (Contributor) commented

What's up with this change? (I looked; there are no commits I can refer to for rationale.) I'm worried we might break something, because I can't speak to test confidence across all the metric providers (@zachaller, let me know what your confidence level is there), and default is used a lot for sure. The original looks like some curried func / closure(?) to me, whereas now it isn't. I feel like a change like this can pass simple tests and still mess up in real usage.

@meeech (Contributor) commented

I did some code archaeology and can't find the rationale for doing it that way initially. It's just that, usually, when you close over a value there is a reason.

@y-rabie (Contributor, Author) commented Dec 11, 2024

Yep, I was baffled by the way it was done, because it made no actual sense to me. I had to change it, however, because this extra level of indirection made it impossible to pass a dynamic result vector to defaultFunc or isNil during expression evaluation.

So with the original function, for a result vector of length 2 (assume non-nil for now) and a condition like default(result[0], 0) > 10 && default(result[1], 0) < 4, defaultFunc takes the result vector as an argument and returns a function that just returns this result vector either way (i.e., it discards the first argument _ any). Evaluating an expression like the above would result in default(result[0], 0) returning the whole result vector, so the comparison ends up being vector > integer.

The new implementation, by contrast, actually considers the first argument, which in our case is the element result[0] or result[1] passed to it.

Originally, this didn't cause a problem, since the result was a single value rather than a vector, so doing it either way returned the same value.
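
To make the difference concrete, here is a minimal, hypothetical sketch (not code from this PR; it assumes the expr-lang/expr library that EvalCondition appears to build its env for, with made-up values and env wiring):

```go
package main

import (
	"fmt"

	"github.com/expr-lang/expr" // assumed import path; older releases used github.com/antonmedv/expr
)

func main() {
	result := []float64{12.0, 3.0}

	// Old style: "default" is a closure over the whole result vector and
	// discards the value actually passed in the expression.
	oldEnv := map[string]any{
		"result": result,
		"default": func(_ any, defaultValue any) any {
			return result // captured vector; first argument ignored
		},
	}
	// default(result[0], 0) evaluates to the whole []float64, so the
	// comparison becomes vector > integer and evaluation fails.
	if _, err := expr.Eval("default(result[0], 0) > 10", oldEnv); err != nil {
		fmt.Println("old style:", err)
	}

	// New style: "default" operates on its first argument, so each element
	// of the result vector can be defaulted and compared individually.
	newEnv := map[string]any{
		"result": result,
		"default": func(value any, defaultValue any) any {
			if value == nil {
				return defaultValue
			}
			return value
		},
	}
	out, err := expr.Eval("default(result[0], 0) > 10 && default(result[1], 0) < 4", newEnv)
	fmt.Println("new style:", out, err) // new style: true <nil>
}
```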

```diff
-func isNilFunc(resultValue any) func(any) bool {
-	return func(_ any) bool {
-		return isNil(resultValue)
-	}
-}
+func defaultFunc(resultValue any, defaultValue any) any {
```
@meeech (Contributor) commented

see comment above

@y-rabie (Contributor, Author) commented Dec 11, 2024

@meeech Sure, I'll update the docs with clear examples, but for now, a clearer description of an example use case for this is having a condition on two different (datadog) metrics simultaneously. I think the example in the enhancement proposal above should be sufficiently clear, but reframing it here:

Currently, you can have multiple queries (every query is one dd metric), combine those queries into a formula, then write a condition over that formula.

This new field makes it possible to have multiple queries (thus multiple dd metrics) and assert a combined condition over them, each individually, by making every one of the queries its own formula, i.e.:

```yaml
- metric: dummy
  failureCondition: result[0] > 10 && result[1] < 5
  provider:
    datadog:
      queries:
        a: "sum:something{}"
        b: "avg:another{}"
      formulas:
        - "a"
        - "b"
```

(And obviously, the formulas themselves can be more complex.)

I think the example in the proposal is really a common one, which is: fail if the rate of success (x) is below a threshold, but only if the total volume (y) is significant enough. This coupling is not possible with only one formula/query per analysis metric, because if you put the separate conditions in two separate analysis metrics, what you get is: if x fails OR y fails, the whole analysis fails. What we need is: if x fails (low rate) AND y fails (significant volume), the whole analysis fails.
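
For concreteness, a sketch of that rate/volume example as an analysis metric. This is hypothetical, not from the PR: the metric name, queries, and thresholds are invented, and the apiVersion: v2 field is my assumption about where queries/formulas are supported.

```yaml
- metric: success-rate-with-volume-guard
  # fail only when the success rate is low AND the traffic volume is significant
  failureCondition: result[0] < 0.95 && result[1] > 1000
  provider:
    datadog:
      apiVersion: v2  # assumed: queries/formulas are a v2-API feature
      queries:
        success: "sum:requests.success{*}.as_count()"
        total: "sum:requests.total{*}.as_count()"
      formulas:
        - "success / total"  # x: rate of success
        - "total"            # y: total volume
```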

This is the simplest example of the more complex conditions that multiple formulas make possible. You can basically do any datadog queries you'd normally write in dd monitors here.

@y-rabie force-pushed the support-datadog-multiple-formulas branch from fc758ff to 0833186 on December 11, 2024 22:54
@meeech (Contributor) commented Dec 11, 2024

@y-rabie sorry about that. I clicked the enhancement proposal link in the checklist and missed the link in the body :/ I will have a look at it.

@meeech (Contributor) commented Dec 11, 2024

So you make some interesting points (I was going to add this to the proposal, but it's maybe better to just keep the conversation here at this point).

> which is: fail if the rate of success (x) is below a threshold, but only if the total volume (y) is significant enough.

If it's not significant enough, then why are we trusting this number? This seems like a case where setting up an inconclusive result makes sense, so a human can come in and judge. Like, if I have a higher-than-allowed error rate but not a ton of traffic, is this something I want to see blindly proceed? Or maybe an analysis run step makes sense in that scenario, one that waits until we cross the collection threshold? I guess I don't see the scenario where my error rate is higher than desired but, since I have low traffic, I ignore it. If anything, it tells me my soak times aren't long enough.

But the more interesting point you mention is this:

> You can basically do any datadog queries you'd normally write in dd monitors here.

I assume you would still write those monitors? (Which leads to having to keep them in step with your analysis templates.) Rather than adding this complexity around multiple formulas, would it be more interesting and useful to add support for this endpoint, https://api.datadoghq.com/api/v1/monitor/{monitor_id}, so you could make analysis templates that are keyed to a monitor?

@y-rabie (Contributor, Author) commented Dec 11, 2024

@meeech I agree, those are valid points regarding the example. I mean, maybe analysis steps are not always ideal when you need a continuously running analysis during the whole rollout.

But I digress. I realize critiquing the example helps in judging whether there is ever a use case for this, but I just don't think it's far-fetched that I'd need this sort of flexible, combined condition over two (or more) different metrics. My brain can't come up with another example now. But generally, the most basic thing is that I should be able to have this sort of ANDing, instead of ORing, of failures, and this cannot be done at the level of the Analysis metrics (nor should it be, I think).

I'll give it more thought and I'll add any other example I can come up with 😄

Really interesting thought about the monitors part. I'm afraid that, semantically, it might be overloading the Analysis concept in Argo? It's metrics and measurements. Argo integrates with datadog as a metric provider; having something like monitors for datadog alone breaks this concept. If it were added, should there be something to prevent Alertmanager alerts from being integrated with Argo down the line?

@zachaller (Collaborator) commented

Just throwing this out there: I am not exactly sure what Datadog monitors are, but based on what you two have said, it sounds to me like they could/should be implemented as an analysis run plugin rather than added directly to the current dd provider. I would say the same thing about Alertmanager and the built-in prometheus provider.

@meeech (Contributor) commented Dec 12, 2024

> I mean, maybe analysis steps are not always ideal when you need a continuously running analysis during the whole rollout.

Agree. In this scenario, I was envisioning them being used in tandem, specifically around ensuring you have hit your required traffic count while the error-count analysis is running in the background.

> ... My brain can't come up with another example now. ...

This is a trap we all fall into. :D I think we can always rationalize why we might need something (and I don't debate the utility of ANDing and ORing).

I guess I'm pushing back here only because this adds complexity to the dd metric provider (which then has to be maintained moving forward), and it would add complexity to the analysis templates themselves.

I don't think it overloads the analysis concept. For me, the analysis concept is: take this outside signal and use it to evaluate the health of my release. It's totally reasonable to say I don't want anything releasing when I'm in an emergency state. What I'm describing could probably already be accomplished with the webhook provider; that's as generic as they come.

The only reason I suggest putting it in the DD metric provider is that it's a good fit, since it already has the access it needs. Otherwise, it could be implemented with the new analysis provider plugin.
