New Relic canary verification #3771

Open
akorzy-pl opened this issue Aug 5, 2024 · 5 comments · May be fixed by #3795
Labels
enhancement New feature or request

Comments

@akorzy-pl
Contributor

akorzy-pl commented Aug 5, 2024

Summary

Analysis templates for New Relic canary verification using response time and error rate.

Motivation

Canary verification is most useful when it's automatic. Statistical analysis using averages and standard deviations can detect anomalies in response time and error rate. New Relic is an observability tool with its own query language, NRQL. Generic, reusable queries for canary analysis would be useful to the community.

Proposal

The implementation described below has been used for around nine months in production at Priceline.com.

Use Cases

Users of New Relic who want to increase software quality by using automatic canary verification.

Security Considerations

The New Relic token needs to be stored in a secret in each namespace where canaries are to be deployed.
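
As a rough illustration, the templates in the Implementation section read a secret named newrelic with the keys personal-api-key and account-id. A minimal sketch of such a secret could look like the following. The namespace and both values are placeholders (assumptions, not part of the proposal), and how the secret is actually provisioned is left to the user.

```yaml
# Hypothetical example: the namespace and both values are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: newrelic            # matches secretKeyRef.name in the templates
  namespace: my-namespace   # assumption: a namespace where canaries are deployed
type: Opaque
stringData:
  personal-api-key: "<New Relic API key>"    # read by the new-relic.personal-api-key arg
  account-id: "<New Relic account ID>"       # read by the new-relic.account-id arg
```

One such secret would be needed in every namespace that runs these analyses.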

Risks and Mitigations

Some tuning of the parameters may be required for a given application to prevent false positives and negatives.

Goals

Provide analysis templates for response time and error rate analysis.

Non-Goals

Deploying the New Relic token is outside the scope of this proposal.

Implementation

Response time analysis template:

kind: ClusterAnalysisTemplate
apiVersion: argoproj.io/v1alpha1
metadata:
  name: new-relic.canary.response-time.verification
spec:
  args:
  - name: appName
  - name: rolloutName
  - name: analysisInitialDelay
    value: "4m"  # Initial delay before the analysis starts (use a duration in the Argo Rollouts format)
  - name: analysisInterval
    value: "1m"  # Interval between subsequent analysis executions (use a duration in the Argo Rollouts format)
  - name: analysisTimeWindow
    value: "70 seconds"  # Time window for the data query to New Relic (use a duration in the NRQL format)
  - name: analysisCount
    value: "60"  # Number of times the analysis will run
  - name: responseTimeDeviationThreshold
    value: "6.0"  # Threshold for acceptable response time deviation (in multiples of standard deviation)
  - name: responseTimeResolution
    value: "0.3"  # Resolution component for the response time calculation (in milliseconds)
  - name: limitOfTransactionNames
    value: "16"  # Maximum number of endpoints to analyze
  - name: inconclusiveLimit
    value: "3"  # Maximum allowed inconclusive executions before marking the entire analysis as inconclusive
  - name: customConditionPrefix
    value: ""  # Prefix of custom query condition, e.g. to filter by a custom tag: "and tags.DC = '"
  - name: customConditionValue
    value: ""  # Value of custom query condition, e.g. the value of the tag: "us-east1"
  - name: customConditionSuffix
    value: ""  # Suffix of custom query condition, usually a closing apostrophe: "'"
  - name: stablePodHash
  - name: latestPodHash
  - name: new-relic.personal-api-key
    valueFrom:
      secretKeyRef:
        name: newrelic
        key: personal-api-key
  - name: new-relic.account-id
    valueFrom:
      secretKeyRef:
        name: newrelic
        key: account-id
  metrics:
  - name: "NR Canary Response Time"
    successCondition: "len(result) > 0 && all(result, {.responseTimeDeviation < {{ args.responseTimeDeviationThreshold }}})"
    failureCondition: "false"
    initialDelay: "{{ args.analysisInitialDelay }}"
    interval: "{{ args.analysisInterval }}"
    count: "{{ args.analysisCount }}"
    inconclusiveLimit: "{{ args.inconclusiveLimit }}"
    provider:
      web:
        method: POST
        url: "https://api.newrelic.com/graphql"
        timeoutSeconds: 120
        headers:
          - key: Content-Type
            value: "application/json"
          - key: API-Key
            value: "{{ args.new-relic.personal-api-key }}"
        jsonPath: "{$.data.actor.account.nrql.results}"
        jsonBody:
          query: |
              {
                actor {
                  account(id: {{ args.new-relic.account-id }}) {
                    nrql(
                      timeout: 120
                      query: """
                              select
                                  average(abs(`canary` - `stable`) / (`stdev` + {{ args.responseTimeResolution }})) as `responseTimeDeviation`,
                                  average(`canary`) as `canary`, average(`stable`) as `stable`, average(`stdev`) as `stdev`
                              from
                                  (
                                      select
                                          (filter(
                                              average(`apm.service.transaction.duration`),
                                              where
                                                host like '{{ args.rolloutName }}-{{ args.stablePodHash }}%'
                                          ) or -1.0) as `stable`,
                                          (filter(
                                              average(`apm.service.transaction.duration`),
                                              where
                                                host like '{{ args.rolloutName }}-{{ args.latestPodHash }}%'
                                          ) or -1.0) as `canary`,
                                          filter(
                                              stddev(`apm.service.transaction.duration`),
                                              where
                                                host like '{{ args.rolloutName }}-{{ args.stablePodHash }}%'
                                          ) as `stdev`
                                      from
                                          Metric
                                      where
                                          appName = '{{ args.appName }}' {{ args.customConditionPrefix }}{{ args.customConditionValue }}{{ args.customConditionSuffix }}
                                          and transactionName in (
                                              select
                                                  transactionName
                                              from
                                                  (
                                                      FROM
                                                          Metric
                                                      SELECT
                                                          count(`apm.service.transaction.duration`) FACET transactionName
                                                      WHERE
                                                          appName = '{{ args.appName }}' {{ args.customConditionPrefix }}{{ args.customConditionValue }}{{ args.customConditionSuffix }}
                                                      LIMIT
                                                          {{ args.limitOfTransactionNames }}
                                                  )
                                          ) FACET transactionName
                                  ) FACET transactionName since {{ args.analysisTimeWindow }} ago
                      """
                    ) {
                      results
                    }
                  }
                }
              }
          variables: ""
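
To make the success condition above more concrete, here is a worked example for a single transaction name using the default arguments (responseTimeResolution 0.3, responseTimeDeviationThreshold 6.0). The stable, canary, and stdev values are hypothetical and assumed to be in the same unit as responseTimeResolution.

```latex
% Hypothetical inputs: stable = 120, canary = 150, stdev = 10
\[
\text{responseTimeDeviation}
  = \frac{\lvert \text{canary} - \text{stable} \rvert}{\text{stdev} + \text{responseTimeResolution}}
  = \frac{\lvert 150 - 120 \rvert}{10 + 0.3}
  = \frac{30}{10.3} \approx 2.91 < 6.0
\]
% Same stable and stdev, but canary = 200
\[
\text{responseTimeDeviation}
  = \frac{\lvert 200 - 120 \rvert}{10 + 0.3}
  = \frac{80}{10.3} \approx 7.77 > 6.0
\]
```

In the first case the transaction satisfies the success condition; in the second it does not. Since failureCondition is "false", a measurement that misses the success condition should, as far as I understand Argo Rollouts, count as inconclusive rather than failed, which is what inconclusiveLimit bounds. The resolution term keeps the ratio from blowing up when the stable standard deviation is close to zero.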

Error rate analysis template:

kind: ClusterAnalysisTemplate
apiVersion: argoproj.io/v1alpha1
metadata:
  name: new-relic.canary.error-rate.verification
spec:
  args:
  - name: appName
  - name: rolloutName
  - name: analysisInitialDelay
    value: "4m"  # Initial delay before the analysis starts (use a duration in the Argo Rollouts format)
  - name: analysisInterval
    value: "1m"  # Interval between subsequent analysis executions (use a duration in the Argo Rollouts format)
  - name: analysisTimeWindow
    value: "70 seconds"  # Time window for the data query to New Relic (use a duration in the NRQL format)
  - name: analysisCount
    value: "60"  # Number of times the analysis will run
  - name: errorRateDeviationThreshold
    value: "2.0"  # Threshold for acceptable error rate deviation (in multiples of existing error rate)
  - name: errorRateResolution
    value: "0.0001"  # Resolution component for the error rate calculation
  - name: limitOfTransactionNames
    value: "16"  # Maximum number of endpoints to analyze
  - name: inconclusiveLimit
    value: "3"  # Maximum allowed inconclusive executions before marking the entire analysis as inconclusive
  - name: customConditionPrefix
    value: ""  # Prefix of custom query condition, e.g. to filter by a custom tag: "and tags.DC = '"
  - name: customConditionValue
    value: ""  # Value of custom query condition, e.g. the value of the tag: "us-east1"
  - name: customConditionSuffix
    value: ""  # Suffix of custom query condition, usually a closing apostrophe: "'"
  - name: stablePodHash
  - name: latestPodHash
  - name: new-relic.personal-api-key
    valueFrom:
      secretKeyRef:
        name: newrelic
        key: personal-api-key
  - name: new-relic.account-id
    valueFrom:
      secretKeyRef:
        name: newrelic
        key: account-id
  metrics:
  - name: "NR Canary Error Rate"
    successCondition: "all(result, {.errorRateDeviation < {{ args.errorRateDeviationThreshold }}})"
    failureCondition: "false"
    initialDelay: "{{ args.analysisInitialDelay }}"
    interval: "{{ args.analysisInterval }}"
    count: "{{ args.analysisCount }}"
    inconclusiveLimit: "{{ args.inconclusiveLimit }}"
    provider:
      web:
        method: POST
        url: "https://api.newrelic.com/graphql"
        timeoutSeconds: 120
        headers:
          - key: Content-Type
            value: "application/json"
          - key: API-Key
            value: "{{ args.new-relic.personal-api-key }}"
        jsonPath: "{$.data.actor.account.nrql.results}"
        jsonBody:
          query: |
              {
                actor {
                  account(id: {{ args.new-relic.account-id }}) {
                    nrql(
                      timeout: 120
                      query: """
                              select
                                  average(`canary` / (`stable` + {{ args.errorRateResolution }})) as `errorRateDeviation`,
                                  average(`canary`) as `canary`, average(`stable`) as `stable`
                              from
                                  (
                                      select
                                          (filter(
                                              count(`apm.service.transaction.error.count`) / count(`apm.service.transaction.duration`),
                                              where
                                                host like '{{ args.rolloutName }}-{{ args.stablePodHash }}%'
                                          ) or 0) as `stable`,
                                          (filter(
                                              count(`apm.service.transaction.error.count`) / count(`apm.service.transaction.duration`),
                                              where
                                                host like '{{ args.rolloutName }}-{{ args.latestPodHash }}%'
                                          ) or 0) as `canary`
                                      from
                                          Metric
                                      where
                                          appName = '{{ args.appName }}' {{ args.customConditionPrefix }}{{ args.customConditionValue }}{{ args.customConditionSuffix }}
                                          and transactionName in (
                                              select
                                                  transactionName
                                              from
                                                  (
                                                      FROM
                                                          Metric
                                                      SELECT
                                                          count(`apm.service.transaction.duration`) FACET transactionName
                                                      WHERE
                                                          appName = '{{ args.appName }}' {{ args.customConditionPrefix }}{{ args.customConditionValue }}{{ args.customConditionSuffix }}
                                                      LIMIT
                                                          {{ args.limitOfTransactionNames }}
                                                  )
                                          ) FACET transactionName
                                  ) FACET transactionName since {{ args.analysisTimeWindow }} ago
                      """
                    ) {
                      results
                    }
                  }
                }
              }
          variables: ""
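
A similar worked example for the error rate query, again for a single transaction name and with the default arguments (errorRateResolution 0.0001, errorRateDeviationThreshold 2.0). The error rates are hypothetical.

```latex
% Hypothetical inputs: stable error rate = 0.01, canary error rate = 0.015
\[
\text{errorRateDeviation}
  = \frac{\text{canary}}{\text{stable} + \text{errorRateResolution}}
  = \frac{0.015}{0.01 + 0.0001}
  = \frac{0.015}{0.0101} \approx 1.49 < 2.0
\]
% Same stable error rate, but canary error rate = 0.03
\[
\text{errorRateDeviation}
  = \frac{0.03}{0.0101} \approx 2.97 > 2.0
\]
```

The first canary is within the threshold and the second is not. When the stable version has no errors at all, the resolution term becomes the whole denominator, so with the defaults a canary error rate of 0.0002 or more would already miss the success condition.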

Examples

Please find below an example of a Rollout definition:

metadata:
  labels:
    application: my-app
    cluster: cluster-name-1
    region: us-east1
spec:
  strategy:
    canary:
      steps:
        - setWeight: 1
        - analysis:
            analysisRunMetadata: {}
            args:
              - name: metricName
                value: 'Automatic canary verification'
              - name: appName
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['application']
              - name: rolloutName
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.name
              - name: customConditionPrefix
                value: "and tags.DC = '"
              - name: customConditionValue
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['region']
              - name: customConditionSuffix
                value: "'"
              - name: stablePodHash
                valueFrom:
                  podTemplateHashValue: Stable
              - name: latestPodHash
                valueFrom:
                  podTemplateHashValue: Latest
            templates:
              - clusterScope: true
                templateName: new-relic.canary.response-time.verification
              - clusterScope: true
                templateName: new-relic.canary.error-rate.verification
        - setWeight: 100
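
The tuning mentioned under Risks and Mitigations could be done from the Rollout itself by passing additional args that override the template defaults. The fragment below is a sketch of extra entries for the args list of the analysis step shown above. The specific values are hypothetical and would need to be chosen per application.

```yaml
# Fragment only: extra entries for the analysis step's args list.
# All values below are hypothetical examples, not recommendations.
args:
  - name: responseTimeDeviationThreshold
    value: "8.0"     # tolerate larger response time deviations (template default: "6.0")
  - name: errorRateDeviationThreshold
    value: "3.0"     # tolerate a higher error rate ratio (template default: "2.0")
  - name: limitOfTransactionNames
    value: "8"       # analyze fewer endpoints (template default: "16")
```

Arguments that are not overridden keep the default values declared in the templates.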

Upgrade/Downgrade Strategy

There is no impact for users who don't use this feature.

Drawbacks

Currently unknown ;)

Alternatives

A New Relic metric provider is available in Argo Rollouts, but it isn't used in this implementation, as the standard web provider will suffice.

Authors


Message from the maintainers:

Impacted by this missing feature? Give it a 👍. We prioritize the issues with the most 👍.

@akorzy-pl akorzy-pl added the enhancement New feature or request label Aug 5, 2024
@zachaller
Collaborator

Does this require any change to argo rollouts today?

@akorzy-pl
Copy link
Contributor Author

@zachaller Thanks for the response. This doesn't require any change to argo rollouts today 😀 What's the best way to share this code?

@meeech
Contributor

meeech commented Aug 8, 2024

Would it make sense as a community project with a collection of Analysis Templates? That could be useful both practically and as a set of examples.

@akorzynski

As discussed in the contributor meeting, I'm going to go with modifying the documentation.

akorzynski added a commit to akorzynski/argo-rollouts that referenced this issue Aug 15, 2024
closes: argoproj#3771

Co-authored-by: Abhishek Gaikwad <[email protected]>
@akorzynski akorzynski linked a pull request Aug 15, 2024 that will close this issue
@akorzynski

Please approve the PR
