Configuration Anomaly Detection

About

Configuration Anomaly Detection (CAD) is responsible for reducing manual SRE effort by pre-investigating alerts, detecting cluster anomalies and sending relevant communications to the cluster owner.

Overview

CAD consists of:

a tekton deployment including a custom tekton interceptor
the cadctl command line tool implementing alert remediations and pre-investigations

Workflow

PagerDuty webhooks trigger CAD when a PagerDuty incident is created.
The webhook routes to a Tekton EventListener.
Received webhooks are filtered by a Tekton Interceptor that uses the payload to evaluate whether the alert has an implemented handler function in cadctl or not. If there is no handler implemented, the alert is directly forwarded to a human SRE.
If cadctl implements a handler for the received payload/alert, a Tekton PipelineRun is started.
The pipeline runs cadctl, which selects the handler function itself based on the payload.
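
The filtering decision in the workflow above can be sketched in Go as follows. This is only an illustration under stated assumptions, not the interceptor's actual code: the `decision` type, the `filter` function, and the `hasHandler` predicate are hypothetical names introduced here.

```go
package main

import "fmt"

// decision is a simplified, hypothetical model of what the interceptor returns.
type decision struct {
	Continue bool   // true: start a PipelineRun; false: stop processing the webhook
	Reason   string // human-readable explanation of the outcome
}

// filter decides what happens to a webhook. hasHandler is an assumed predicate
// that reports whether cadctl implements a handler for the given alert.
func filter(alert string, hasHandler func(string) bool) decision {
	if hasHandler(alert) {
		return decision{Continue: true, Reason: "handler implemented, starting PipelineRun"}
	}
	return decision{Continue: false, Reason: "no handler, forwarding to a human SRE"}
}

func main() {
	// Assumed predicate for the example: only one alert has a handler.
	known := func(alert string) bool { return alert == "ClusterHasGoneMissing" }

	fmt.Println(filter("ClusterHasGoneMissing", known).Reason)
	fmt.Println(filter("SomeOtherAlert", known).Reason)
}
```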

Contributing

Building

For build targets, see make help.

Adding a new investigation

CAD investigations are triggered by PagerDuty webhooks. Currently, CAD supports the following two formats of webhooks:

WebhookV3
EventOrchestrationWebhook

The required investigation is identified by CAD based on the incident and its payload. As PagerDuty itself does not provide finer granularity for webhooks than service-based, CAD filters out the alerts it should investigate. For more information, please refer to https://support.pagerduty.com/docs/webhooks.

To add a new alert investigation:

create a mapping for the alert to the GetInvestigation function in mapping.go and write a corresponding CAD investigation (e.g. Investigate() in chgm.go).
if the alert is not yet routed to CAD, add a webhook to the service your alert fires on. For production, the service should also have an escalation policy that escalates to SRE on CAD automation timeout.
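
As a rough sketch of what an alert-to-investigation mapping can look like, consider the Go example below. The types and signatures are assumptions made for illustration: the real GetInvestigation function in mapping.go and Investigate() in chgm.go may be shaped differently, and `investigation`, `registry`, and `getInvestigation` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// investigation is a hypothetical stand-in for a CAD investigation;
// the real types and signatures in the repository may differ.
type investigation struct {
	name        string
	investigate func(clusterID string) string
}

// registry pairs an alert-title substring with its investigation.
// In this sketch, supporting a new alert means appending one entry.
var registry = []struct {
	titleContains string
	inv           investigation
}{
	{"ClusterHasGoneMissing", investigation{
		name:        "chgm",
		investigate: func(clusterID string) string { return "investigated " + clusterID },
	}},
}

// getInvestigation returns the investigation mapped to an alert title, if any.
func getInvestigation(title string) (investigation, bool) {
	for _, entry := range registry {
		if strings.Contains(title, entry.titleContains) {
			return entry.inv, true
		}
	}
	return investigation{}, false
}

func main() {
	if inv, ok := getInvestigation("ClusterHasGoneMissing CRITICAL (1)"); ok {
		fmt.Println(inv.name, "->", inv.investigate("example-cluster"))
	}
	if _, ok := getInvestigation("UnmappedAlert"); !ok {
		fmt.Println("no investigation mapped")
	}
}
```

Keeping the lookup in a single table means that an unmapped alert fails in one obvious place, which matches the workflow's rule that alerts without a handler go to a human SRE.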

Testing locally

Prerequisites

an existing cluster
an existing PagerDuty incident for the cluster and alert type that is being tested

To quickly create an incident for a cluster ID, you can run ./test/generate_incident.sh <alertname> <clusterid>. Example usage: ./test/generate_incident.sh ClusterHasGoneMissing 2b94brrrrrrrrrrrrrrrrrrhkaj.

Running cadctl for an incident ID

Export the required ENV variables (see Required ENV variables).
Create a payload file containing the incident ID

export INCIDENT_ID=
echo '{"__pd_metadata":{"incident":{"id":"'${INCIDENT_ID}'"}}}' > ./payload

Run cadctl using the payload file

./bin/cadctl investigate --payload-path payload
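
To make the shape of the payload file concrete, the following Go sketch reads the incident ID back out of it. It relies only on the JSON structure shown in the echo command above. The `payload` struct and `incidentID` function are hypothetical and are not taken from cadctl, and the ID `Q0EXAMPLE` is a made-up placeholder.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// payload mirrors the minimal JSON structure written to ./payload.
type payload struct {
	Metadata struct {
		Incident struct {
			ID string `json:"id"`
		} `json:"incident"`
	} `json:"__pd_metadata"`
}

// incidentID extracts the incident ID from raw payload bytes and
// returns an error if the JSON is invalid or the ID is missing.
func incidentID(raw []byte) (string, error) {
	var p payload
	if err := json.Unmarshal(raw, &p); err != nil {
		return "", fmt.Errorf("parsing payload: %w", err)
	}
	if p.Metadata.Incident.ID == "" {
		return "", errors.New("payload has no incident ID")
	}
	return p.Metadata.Incident.ID, nil
}

func main() {
	raw := []byte(`{"__pd_metadata":{"incident":{"id":"Q0EXAMPLE"}}}`)
	id, err := incidentID(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("incident ID:", id)
}
```

A check like this can catch an empty INCIDENT_ID before cadctl is run, since the echo command happily writes a payload with an empty ID.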

Documentation

Investigations

Every alert managed by CAD corresponds to an investigation, representing the executed code associated with the alert.

Investigation-specific documentation can be found in the corresponding investigation folder, e.g. for ClusterHasGoneMissing.

Integrations

AWS -- Logging into the cluster, retrieving instance info and AWS CloudTrail events.
PagerDuty -- Retrieving alert info, escalating or silencing incidents, and adding notes.
OCM -- Retrieving cluster info, sending service logs, and managing (post, delete) limited support reasons.
osd-network-verifier -- Tool to verify the pre-configured networking components for ROSA and OSD CCS clusters.

Templates

Update-Template -- Updating configuration-anomaly-detection-template.Template.yaml.
OpenShift -- Used by app-interface to deploy the CAD resources on a target cluster.

Dashboards

Grafana dashboard configmaps are stored in the Dashboards directory. See app-interface for further documentation on dashboards.

Deployment

Tekton -- Installation/configuration of Tekton and triggering pipeline runs.
Skip Webhooks -- Skipping the eventlistener and creating the pipelinerun directly.
Namespace -- Allowing the code to ignore the namespace.

Boilerplate

Boilerplate -- Conventions for OSD containers.

PipelinePruner

PipelinePruner -- Documentation about PipelineRun pruning.

Required ENV variables

CAD_OCM_CLIENT_ID: refers to the OCM client ID used by CAD to initialize the OCM client
CAD_OCM_CLIENT_SECRET: refers to the OCM client secret used by CAD to initialize the OCM client
CAD_OCM_URL: refers to the OCM URL used by CAD to initialize the OCM client
AWS_ACCESS_KEY_ID: refers to the access key id of the base AWS account used by CAD
AWS_SECRET_ACCESS_KEY: refers to the secret access key of the base AWS account used by CAD
CAD_AWS_CSS_JUMPROLE: refers to the arn of the RH-SRE-CCS-Access jumprole
CAD_AWS_SUPPORT_JUMPROLE: refers to the arn of the RH-Technical-Support-Access jumprole
CAD_ESCALATION_POLICY: refers to the escalation policy CAD should use to escalate the incident
CAD_PD_EMAIL: refers to the email for a login via mail/pw credentials
CAD_PD_PW: refers to the password for a login via mail/pw credentials
CAD_PD_TOKEN: refers to the generated private access token for token-based authentication
CAD_PD_USERNAME: refers to the username of CAD on PagerDuty
CAD_SILENT_POLICY: refers to the silent policy CAD should use if the incident should be silenced
PD_SIGNATURE: refers to the PagerDuty webhook signature (HMAC+SHA256)
X_SECRET_TOKEN: refers to our custom Secret Token for authenticating against our pipeline
CAD_PROMETHEUS_PUSHGATEWAY: refers to the URL CAD will push metrics to
BACKPLANE_URL: refers to the backplane url to use
BACKPLANE_INITIAL_ARN: refers to the initial ARN used for the isolated backplane jumprole flow
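
PD_SIGNATURE is described above as an HMAC+SHA256 webhook signature. For readers unfamiliar with that scheme, here is a minimal Go sketch of checking such a signature. It makes two assumptions that are not stated in this README: that the signature arrives as a comma-separated header of `v1=<hex digest>` entries, and that the digest is computed over the raw request body with a shared secret. The `sign` and `verify` functions are hypothetical and do not show CAD's own verification code.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// sign returns the hex-encoded HMAC-SHA256 of body under secret.
func sign(secret string, body []byte) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify reports whether any "v1=<hex>" entry in header matches the
// digest of body. The header format is an assumption of this sketch.
func verify(secret string, body []byte, header string) bool {
	want := []byte(sign(secret, body))
	for _, part := range strings.Split(header, ",") {
		part = strings.TrimSpace(part)
		if !strings.HasPrefix(part, "v1=") {
			continue
		}
		got := []byte(strings.TrimPrefix(part, "v1="))
		// hmac.Equal compares in constant time to avoid timing leaks.
		if hmac.Equal(got, want) {
			return true
		}
	}
	return false
}

func main() {
	body := []byte(`{"event":"demo"}`)
	header := "v1=" + sign("example-secret", body)
	fmt.Println("valid signature:", verify("example-secret", body, header))
	fmt.Println("wrong secret:", verify("other-secret", body, header))
}
```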

Optional ENV variables

BACKPLANE_PROXY: refers to the proxy CAD uses for the isolated backplane access flow.

Note: BACKPLANE_PROXY is required for local development, as the backplane API is only accessible through the proxy.

For Red Hat employees, these environment variables can be found in the SRE-P vault.
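
Before running cadctl locally, it can help to confirm that every required variable is set. The Go sketch below is one possible pre-flight check. The variable names come from the list above, while the `missingVars` helper is hypothetical and not part of cadctl.

```go
package main

import (
	"fmt"
	"os"
)

// required lists the variable names from the "Required ENV variables" section.
var required = []string{
	"CAD_OCM_CLIENT_ID", "CAD_OCM_CLIENT_SECRET", "CAD_OCM_URL",
	"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
	"CAD_AWS_CSS_JUMPROLE", "CAD_AWS_SUPPORT_JUMPROLE",
	"CAD_ESCALATION_POLICY", "CAD_PD_EMAIL", "CAD_PD_PW",
	"CAD_PD_TOKEN", "CAD_PD_USERNAME", "CAD_SILENT_POLICY",
	"PD_SIGNATURE", "X_SECRET_TOKEN", "CAD_PROMETHEUS_PUSHGATEWAY",
	"BACKPLANE_URL", "BACKPLANE_INITIAL_ARN",
}

// missingVars returns the names that are unset or empty according to lookup.
// Passing the lookup function in keeps the helper testable without touching
// the real environment.
func missingVars(names []string, lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range names {
		if value, ok := lookup(name); !ok || value == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	missing := missingVars(required, os.LookupEnv)
	if len(missing) > 0 {
		fmt.Println("missing required variables:", missing)
		return
	}
	fmt.Println("all required variables are set")
}
```

For local development, BACKPLANE_PROXY could be appended to the list, since the note above marks it as required in that setting.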
