diff --git a/README.md b/README.md
index d29860b..feb8232 100644
--- a/README.md
+++ b/README.md
@@ -21,327 +21,97 @@ The ACI-Monitoring-Stack integrates the following key components:
 - Pre-configured ACI data collections queries, alerts, and dashboards (Work In Progress): The ACI-Monitoring-Stack provides a solid foundation for monitoring an ACI fabric with its pre-defined queries, dashboards, and alerts. While these tools are crafted based on best practices to offer immediate insights into network performance, they are not exhaustive. The strength of the ACI-Monitoring-Stack lies in its community-driven approach. Users are invited to contribute their expertise by providing feedback, sharing custom solutions, and helping enhance the stack. Your input helps to refine and expand the stack's capabilities, ensuring it remains a relevant and powerful tool for network monitoring.
 
-# Demo Environment:
-
-Wanna take a look at the current Stack? Head to:
-
-https://64.104.255.11/
-
-user: guest
-password: guest
-
 # Your Stack
 
-Here you can see a high level diagram of the components used and how they interact together
-```mermaid
-flowchart-elk
-  subgraph ACI Monitoring Stack
-    G["Grafana"]
-    P[("Prometheus")]
-    L["Loki"]
-    PT["Promtail"]
-    SL["Syslog-ng"]
-    AM["Alertmanager"]
-    A["aci-exporter"]
-    G--"PromQL"-->P
-    G--"LogQL"-->L
-    P-->AM
-    PT-->L
-    SL-->PT
-    P--"Service Discovery"-->A
-  end
-  subgraph ACI
-    S["Switches"]
-    APIC["APIC"]
-  end
-  U["User"]
-  N["Notifications (Mail/Webex etc...)"]
-  V{Ver >= 6.1}
-  A--"API Queries"-->S
-  A--"API Queries"-->APIC
-  U-->G
-  AM-->N
-  S--"Syslog"-->V
-  APIC--"Syslog"-->V
-  V -->|Yes| PT
-  V -->|No| SL
-```
-
-# Stack Development
-If you want to contribute to this project star from [Here](docs/development.md)
-
-# Stack Deployment
-
-## Pre Requisites
-- Familiarity with Kubernetes: This installation guide is intended to assist with the setup of the ACI Monitoring stack and assumes prior familiarity with Kubernetes; it is not designed to provide instruction on Kubernetes itself.
-- A Kubernetes Cluster: Currently the stack has been tested on `Upstream Kubernetes 1.30.x` and `Minikube`.
-  - Persistent Volumes: 10G should be plenty for a small/demo environment. Many storage provisioner support Volume expansion so should be easy to increase this post installation.
-  - Ability to expose services for:
-    - Access to the Grafana/Prometheus and Alert Manager dashboards: This will be ideally achieved via an `Ingress Controller`
-      - (Optional) Wildcard DNS Entries for the ingress controller domain.
-    - Syslog ingestion from ACI: Since the syslog can be sent via `UDP` or `TCP` it is more flexible to use expose these service directly via either a `NodePort` or a `LoadBalancer` service Type
-  - Cluster Compute Resources: This stack has been tested against a 500 node ACI fabric and was consuming roughly 8GB of RAM, CPU resources didn't seem to play a major role and any modern CPU should suffice.
-  - 1 Dedicated Namespace per instance: One Instance can monitor at least 500 switches.
-    - This is not strictly required but is suggested to keep the HELM configuration simple so the default K8s service names can be re-used see the [Config Preparation](#config-preparation) section for more details.
-
-- Helm: This stack is distributed as a helm chart and relies on 3rd party helm charts as well
-- Connectivity from your Kubernetes Cluster to ACI either over Out Of Band or In Band
-
-# Installation
-
-If you are installing on Minikube please follow the [Minikube Preparation Steps](docs/minikube.md) and then **come back here.**
-
-## Config Preparation
+To gain a comprehensive understanding of the ACI Monitoring Stack and its components, it is helpful to break down the stack into separate functions. Each function focuses on a different aspect of monitoring the Cisco Application Centric Infrastructure (ACI) environment.
 
-The ACI Monitoring Stack is a combination of several [Charts](charts/aci-monitoring-stack/charts), if you are familiar with Helm you are aware of the struggle to propagate dynamic values to sub-charts. For example, it is not possible to pass to a sub-chart the name of a service in a dynamic way.
+## Fabric Discovery:
 
-In order to simplify the user experience the `chart` comes with a few pre-configured parameters that are populated in the configurations of the various sub-charts.
+The ACI monitoring stack uses Prometheus Service Discovery (HTTP SD) to dynamically discover and scrape targets by periodically querying a specified HTTP endpoint for a list of target configurations in JSON format.
 
-For example the aci-exporter Service Name is pre-configured as `aci-exporter-svc` and this value is then passed to Prometheus as service Discovery URL.
+The ACI Monitoring Stack needs only the IP addresses of the APICs; the switches will be auto-discovered. If switches are added or removed from the fabric, no action is required from the end user.
 
-All these values can be customized and if you need to you can refer to the [Values](charts/aci-monitoring-stack/values.yaml) file.
-
-*Note:* This is the first HELM char `camrossi` created, and he is sure it can be improved. If you have suggestions they are extremely welcome! :)
-
-### The aci-exporter
-
-The aci-exporter is the bridge between your Cisco ACI environment and the Prometheus monitoring ecosystem, for it to works it needs to know:
-- `fabrics`: A list of fabrics and how to connect to the APICs.
-  - Requires a **ReadOnly** **Admin** User
-- `service_discovery`: Configure if devices are reachable via Out Of Band (`oobMgmtAddr`) or InBand (`inbMgmtAddr`).
-
-*Note:* The switches are auto-discovered.
-
-This is done by setting the following Values in Helm:
-
-```yaml
-aci_exporter:
-  # Profiles for different fabrics
-  fabrics:
-    fab1:
-      username:
-      password:
-      apic:
-        - https://IP1
-        - https://IP2
-        - https://IP3
-      # service_discovery oobMgmtAddr|inbMgmtAddr
-      service_discovery: oobMgmtAddr
-    fab2:
-      username:
-      password:
-      apic:
-        - https://IP1
-        - https://IP2
-        - https://IP3
-      # service_discovery oobMgmtAddr|inbMgmtAddr
-      service_discovery: inbMgmtAddr
-```
-### Prometheus and Alert Manager
-
-Prometheus is installed via its [own Chart](https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus) the options you need to set are:
-
-- The `ingress` config and the baseURL: These most likely are the same URL which can access `prometheus` and `alertmanager`
-- Persistent Volume Capacity
-- (Optional) `retentionSize`: this is only needed if you want to limit the retention by size. Keep in mind that if you run out of disk space Prometheus WILL stop working.
-- (Optional) alertmanager `route`: these are used to send notifications via Mail/Webex etc...
the complete syntax is available [Here](https://prometheus.io/docs/alerting/latest/configuration/#receiver-integration-settings)
-Below an example:
-```yaml
-prometheus:
-  server:
-    ingress:
-      enabled: true
-      ingressClassName: "traefik"
-      hosts:
-        - aci-exporter-prom.apps.c1.cam.ciscolabs.com
-    baseURL: "http://aci-exporter-prom.apps.c1.cam.ciscolabs.com"
-    service:
-      retentionSize: 5GB
-    persistentVolume:
-      accessModes: ["ReadWriteOnce"]
-      size: 5Gi
+```mermaid
+  flowchart-elk RL
+    P[("Prometheus")]
+    A["aci-exporter"]
+    APIC["APIC"]
 
-  alertmanager:
-    baseURL: "http://aci-exporter-alertmanager.apps.c1.cam.ciscolabs.com"
-    ingress:
-      enabled: true
-      ingressClassName: "traefik"
-      hosts:
-        - host: aci-exporter-alertmanager.apps.c1.cam.ciscolabs.com
-          paths:
-            - path: /
-              pathType: ImplementationSpecific
-    config:
-      route:
-        group_by: ['alertname']
-        group_interval: 30s
-        repeat_interval: 30s
-        group_wait: 30s
-        receiver: 'webex'
-      receivers:
-        - name: webex
-          webex_configs:
-            - send_resolved: false
-              api_url: "https://webexapis.com/v1/messages"
-              room_id: ""
-              http_config:
-                authorization:
-                  credentials: ""
+    APIC -- "API Query" --> A
+    A -- "HTTP SD" --> P
 ```
-If you use Webex here some [config steps](docs/webex.md) for you!
-
-### Grafana
-
-Grafana is installed via its [own Chart](https://github.com/grafana/helm-charts/tree/main/charts/grafana) the main options you need to set are:
-
-- The `ingress` config: External URL which can access Grafana.
-- Persistent Volume Capacity
-- (Optional) `adminPassword`: If not set will be auto generated and can be found in the `grafana` secret
-- (Optional) `viewers_can_edit`: This allows users with a `view only` role to modify the dashboards and access `Explorer` to execute queries against `Pormetheus` and `Loki`. However, the user will not be able to save any changes.
-- (Optional) `deploymentStrategy`: if Grafana `Persistent Volume` is of type `ReadWriteOnce` rolling updates will get stuck as the new pod cannot start before the old one releases the PVC. Setting `deploymentStrategy.type` to `Recreate` destroy the original pod before starting the new one.
+## ACI Object Scraping:
 
-Below an example:
+`Prometheus` scraping is the process by which `Prometheus` periodically collects metrics data by sending HTTP requests to predefined endpoints on monitored targets. The `aci-exporter` translates ACI-specific metrics into a format that `Prometheus` can ingest, ensuring that all crucial data points are captured and monitored effectively.
 
-```yaml
-grafana:
-  grafana.ini:
-    users:
-      viewers_can_edit: "True"
-  adminPassword:
-  deploymentStrategy:
-    type: Recreate
-  ingress:
-    ingressClassName: "traefik"
-    enabled: true
-    hosts:
-      - aci-exporter-grafana.apps.c1.cam.ciscolabs.com
-  persistence:
-    enabled: true
-    size: 2Gi
+```mermaid
+  flowchart-elk RL
+    P[("Prometheus")]
+    A["aci-exporter"]
+  subgraph ACI
+    S["Switches"]
+    APIC["APIC"]
+  end
+    A--"Scraping"-->P
+    S--"API Queries"-->A
+    APIC--"API Queries"-->A
 ```
-### Syslog config
-
-The syslog config is the most complicated part as it relies on 3 components (`promtail`, `loki` and `syslog-ng`) with their own individual configs. Furthermore, there are two issues we need to overcome:
-
-- The Syslog messages don't contain the ACI Fabric name: to be able to distinguish the messaged from one fabric to another the only solution is to use dedicated `external services` with unique `IP:Port` pair per Fabric.
-
-- Until ACI 6.1 we need `syslog-ng` between `ACI` and `Promtail` to convert from RFC 3164 to 5424
-  *Note*: Promtail 3.1.0 adds support for RFC 3164 however this **DOES NOT** work for Cisco Switches and still requires syslog-ng. syslog-ng `syslog-parser` has extensive logic to handle all the complexities (and inconsistencies) of RFC 3164 messages.
-
-#### Loki
-
-Loki is deployed with the [Simple Scalable](https://grafana.com/docs/loki/latest/get-started/deployment-modes/#simple-scalable) Profile and is composed of a `backend`, `read` and `write` deployment with a replica of 3.
-
-The `backend` and `write` deployments requires persistent volumes. This chart is pre-configured to allocate 2Gi Volumes for each deployment (a total of 6 PVC will be created):
-- `3 x data-loki-backend-X`
-- `3 x data-loki-write-X`
-
-The PVC Size can be easily changed if required.
+## Syslog Ingestion:
 
-Loki also requires an `Object Store`. This chart is pre-configured to deploy [minio](https://min.io/). *Note:* Currently [Loki Chart](https://github.com/grafana/loki/tree/main/production/helm/loki) is deploying a very old version of `Minio` and there is a [PR open](https://github.com/grafana/loki/pull/11409) to address this already.
+The syslog config is composed of 3 components: `promtail`, `loki` and `syslog-ng`.
+Prior to ACI 6.1, `syslog-ng` is required between `ACI` and `Promtail` to convert syslog messages from RFC 3164 to RFC 5424 format.
-Loki also support `chunks-cache` via `memcached`. The default config allocates 8G of memory. I have decreased this to 1G by default.
-
-If you want to change any of these parameters check the `loki` section in the [Values](charts/aci-monitoring-stack/values.yaml) file.
-
-Assuming the default parameters are acceptable the only required config for loki is to set the `rulerConfig.external_url` to point to the Grafana `ingress` URL
-
-```yaml
-loki:
-  loki:
-    rulerConfig:
-      external_url: http://aci-exporter-grafana.apps.c1.cam.ciscolabs.com
+```mermaid
+  flowchart-elk LR
+    L["Loki"]
+    PT["Promtail"]
+    SL["Syslog-ng"]
+    PT-->L
+    SL-->PT
+  subgraph ACI
+    S["Switches"]
+    APIC["APIC"]
+  end
+    V{Ver >= 6.1}
+    S--"Syslog"-->V
+    APIC--"Syslog"-->V
+    V -->|Yes| PT
+    V -->|No| SL
 ```
-
-### Promtail and Syslog-ng
-
-These two components are tightly coupled together.
-
-- Syslog-ng translates logs from RFC 3164 to RFC 5424 and forwards them to Promtail.
-- Promtail is ingesting logs in RFC 5424 format and forwards them to Loki.
-
-Promtail is pre-configured with:
+## Data Visualization
 
-- Deployment Mode with 1 replica
-- Loki Push Gateway url: `loki-gateway` This is the Loki Gateway K8s service name.
-- Auto generated `scrapeConfigs` that will map a Fabric to a `IP:Port` Pair.
+Data visualization is handled by `Grafana`, an open-source analytics and monitoring platform that allows users to visualize, query, and analyze data from various sources through customizable and interactive dashboards. It supports a wide range of data sources, including `Prometheus` and `Loki`, enabling users to create real-time visualizations, alerts, and reports to monitor system performance and gain actionable insights.
 
-These setting can be easily changed if required check the `Promtail` section in the [Values](charts/aci-monitoring-stack/values.yaml) file for more details.
-
-Syslog-ng is pre-configured with:
-- Deployment Mode with 1 replica
-
-If you are happy with my defaults the only configs required are setting the `extraPorts` for Loki and `services` for Syslog-ng.
You will need one entry per fabric and the portsd needs to "match", see the diagram below for a visual representation.
-`Syslog-ng` is only needed for ACI < 6.1
-
-Below a diagram of what is our goal for an ACI 6.1 fabric and an ACI 5.2 one.
 ```mermaid
-flowchart-elk
-  subgraph K8s Cluster
-    subgraph Promtail
-      PT1513["TCP:1513 label:fab1"]
-      PT1514["TCP:1514 label:fab2"]
-    end
-    subgraph Syslog-ng
-      SL["UDP:1514"]
-    end
-    F1SVC["LoadBalancerIP TCP:1513"]
-    F2SVC["LoadBalancerIP UDP:1514"]
-
-    F1SVC --> PT1513
-    F2SVC --> SL
-  end
-  ACI61["ACI Fab1 Ver. 6.1"] --> F1SVC
-  ACI52["ACI Fab2 Ver. 5.2"] --> F2SVC
-  SL --> PT1514
-
+  flowchart-elk RL
+    G["Grafana"]
+    L["Loki"]
+    P[("Prometheus")]
+    U["User"]
+
+    P--"PromQL"-->G
+    L--"LogQL"-->G
+    G-->U
 ```
+## Alerting
-The above architecture can be achieved with the following config:
-
-- `name`: This will set the `fabric` labels for the logs received by Loki
-- `containerPort`: The port the container listen to. This is mapping a logs stream to a fabric
-- `service.type`: I would suggest to set this to either `NodePort` or `LoadBalancer`. Regardless this IP allocated MUST be reachable by all the Fabric Nodes.
-- `service.port`: The port the `LoadBalancer` service is listening to, this will be the port you set into the ACI Syslog config.
-- `service.nodePort`: The port the `NodePort` service is listening to, this will be the port you set into the ACI Syslog config.
-
+`Alertmanager` is a component of the `Prometheus` ecosystem designed to handle alerts generated by `Prometheus`. It manages the entire lifecycle of alerts, including deduplication, grouping, silencing, and routing notifications to various communication channels like email, `Webex`, `Slack`, and others, ensuring that alerts are delivered to the right people in a timely and organized manner.
-```yaml
-promtail:
-  extraPorts:
-    fab1:
-      name: fab1
-      containerPort: 1513
-      service:
-        type: LoadBalancer
-        port: 1513
-    fab2:
-      name: fab2
-      containerPort: 1516
-      service:
-        type: ClusterIP
-
-syslog:
-  services:
-    fab2:
-      name: fab2
-      containerPort: 1516
-      protocol: UDP
-      service:
-        type: LoadBalancer
-        port: 1516
+
+In the ACI Monitoring Stack both `Prometheus` and `Loki` are configured with alerting rules.
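+
+As a flavor of what such a rule can look like, here is a minimal, hypothetical sketch of a Prometheus alerting rule (the `aci_faults` metric and its `fabric` label exist in this stack; the `severity` label, threshold and timings are illustrative assumptions, not the rules shipped with the chart):
+
+```yaml
+groups:
+  - name: aci-fault-examples
+    rules:
+      - alert: AciCriticalFaultsPresent
+        # Hypothetical rule: fire when any fault carrying a critical
+        # severity label is reported for a fabric
+        expr: count by (fabric) (aci_faults{severity="critical"}) > 0
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Critical ACI faults present on fabric {{ $labels.fabric }}"
+```
+
+Alerts raised by either system are routed through `Alertmanager`, as shown below.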
+```mermaid
+flowchart-elk LR
+    L["Loki"]
+    P["Prometheus"]
+    AM["Alertmanager"]
+    N["Notifications (Mail/Webex etc...)"]
+    L --> AM
+    P --> AM
+    AM --> N
+```
+
+# [Demo Environment Access and Use](docs/demo-environment.md)
-### ACI Syslog Config
-If you need a reminder on how to configure ACI Syslog take a look [Here](docs/syslog.md)
-
-Here an [Example Config for 4 Fabrics](docs/4-fabric-example.yaml)
+# [Stack Deployment Guide](docs/deployment.md)
-## Chart Deployment
-
-- Create a file containing all your configs i.e.: `aci-mon-stack-config.yaml`
-
-```shell
-helm repo add aci-monitoring-stack https://datacenter.github.io/aci-monitoring-stack
-helm repo update
-helm -n aci-mon-stack upgrade --install --create-namespace aci-mon-stack aci-monitoring-stack/aci-monitoring-stack -f aci-mon-stack-config.yaml
-```
+# [Stack Development Guide](docs/development.md)
diff --git a/charts/aci-monitoring-stack/templates/loki/loki-configmap-alerts.yaml b/charts/aci-monitoring-stack/templates/loki/loki-configmap-alerts.yaml
index b0fbfbf..9796db9 100644
--- a/charts/aci-monitoring-stack/templates/loki/loki-configmap-alerts.yaml
+++ b/charts/aci-monitoring-stack/templates/loki/loki-configmap-alerts.yaml
@@ -1,3 +1,4 @@
+{{- if $.Values.loki.enabled }}
 {{- $files := .Files.Glob "alerts/loki/*.yaml" }}
 {{- if $files }}
 apiVersion: v1
@@ -20,4 +21,5 @@ items:
     data:
       {{ $dashboardName }}.yaml: {{ $.Files.Get $path | toYaml | indent 4 }}
 {{- end }}
+{{- end }}
 {{- end }}
\ No newline at end of file
diff --git a/charts/aci-monitoring-stack/templates/loki/loki-configmaps-datasources.yaml b/charts/aci-monitoring-stack/templates/loki/loki-configmaps-datasources.yaml
index 3fc4688..5931d36 100644
--- a/charts/aci-monitoring-stack/templates/loki/loki-configmaps-datasources.yaml
+++ b/charts/aci-monitoring-stack/templates/loki/loki-configmaps-datasources.yaml
@@ -1,4 +1,5 @@
 {{- if $.Values.grafana.sidecar.datasources.enabled }}
+{{- if $.Values.loki.enabled }}
 apiVersion: v1
 kind: ConfigMap
 metadata:
@@ -34,4 +35,5 @@ data:
 {{- end }}
 
 {{- end }}
+{{- end }}
 {{- end }}
\ No newline at end of file
diff --git a/docs/LABDCN-2620/README.md b/docs/LABDCN-2620/README.md
new file mode 100644
index 0000000..5c7988a
--- /dev/null
+++ b/docs/LABDCN-2620/README.md
@@ -0,0 +1,5 @@
+# LABDCN-2620: Open Source Monitoring for Cisco ACI - Cisco Live APJC 2024
+
+This section contains specific instructions on how to run the LABDCN-2620 Walk In Lab.
+This lab runs on a pre-existing Kubernetes cluster and can support up to 30 concurrent students.
+
diff --git a/docs/demo-environment.md b/docs/demo-environment.md
new file mode 100644
index 0000000..fcea891
--- /dev/null
+++ b/docs/demo-environment.md
@@ -0,0 +1,114 @@
+# Access
+
+The Demo environment is hosted in a DMZ and can be accessed with the following credentials:
+
+https://64.104.255.11/
+
+user: `guest`
+password: `guest`
+
+The guest user is able to modify the dashboards and run `Explore` queries; however, it can't save any of the configuration changes.
+
+# Exploring the ACI Monitoring Stack
+
+In this section I am going to guide you through the available dashboards and how to use them.
+
+*Note:* Grafana supports building dashboards with data coming from multiple data sources but, for the moment, the ACI Monitoring Stack does not make use of this capability.
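+
+If you want to poke around on your own, open `Explore`, select the `Prometheus` data source and start from a query like the sketch below (`aci_faults` and its `fabric` label are used by the bundled dashboards; `site1` is one of the demo fabrics referenced in the lab guide):
+
+```promql
+# All faults currently reported for a single fabric
+aci_faults{fabric="site1"}
+```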
+
+All the Dashboards are located in the `ACI` Folder in the `Dashboards` section of the UI:
+![dashboards](images/dashboards.png)
+
+## Prometheus backed Dashboards
+
+These dashboards use `Prometheus` as their data source, meaning the data we are visualizing comes from an ACI Managed Object and was translated by the `aci-exporter`.
+
+### ACI Faults
+This dashboard is a 1:1 copy of the faults that are present inside ACI. The main advantages compared to looking at the faults in the ACI UI are:
+- the ability to aggregate Faults from multiple Fabrics in a single table
+- advanced sorting and filtering
+
+![faults](images/faults.png)
+
+By using the `Fabric` drop down menu you can select different Fabrics (or All) and you can use the Column headers to filter/sort the data:
+
+This is a good dashboard to understand how Grafana dashboards are built; if you are interested in building your own dashboard you can take a look [here](labs/lab1.md).
+
+### EPG Explore
+
+The EPG Explore is composed of 2 tables:
+- EPG To Interface - VLANs: This table allows the user to map an EPG to a VLAN and port on a switch. This table can be filtered by:
+  - fabric
+  - tenant
+  - epg
+- V(x)LANs to EPG - Interface: This table allows the user to map a VLAN to an EPG and a port on a switch. This table can be filtered by:
+  - VLAN
+  - VXLAN
+
+*Limitations:* This has not yet been tested with overlapping VLANs.
+
+### EPG Stats
+
+This dashboard contains the following time series graphs:
+
+- EPG RX Gbits/s: This shows the traffic received in the EPG
+- EPG TX Gbits/s: This shows the traffic transmitted by the EPG
+- EPG Drops RX Pkts/s: This shows the number of packet drops in the ingress direction
+- EPG Drops TX Pkts/s: This shows the number of packet drops in the egress direction
+
+These dashboards are built with the same logic as the ACI EPG Stats dashboards, just in Grafana.
+
+### Fabric Capacity
+
+This dashboard contains the same info as the APIC Fabric Capacity dashboard but allows you to plot the resource usage over a time period to better monitor fabric utilization over time.
+
+### Node Capacity
+
+This dashboard contains the same info as the APIC Node Capacity dashboard but allows you to plot the resource usage over a time period to better monitor node utilization over time.
+
+### Node Details
+
+This dashboard contains the following time series graphs:
+
+- Node CPU Usage
+- Node Memory Usage
+- Node Health
+
+### Nodes Interfaces
+
+This dashboard contains the following graphs:
+
+- Node Interface status: This dashboard shows which interfaces are Up/Down
+- Interface RX/TX Usage: This dashboard shows the interface utilization in %; it is sorted by highest usage and will display the top 10 interfaces by usage.
+
+### Power Usage
+
+This dashboard displays a time series graph of the average power draw per switch.
+
+### Routing Protocols
+
+This dashboard contains the following graphs:
+
+- L3 Neighbours: For every BGP or OSPF neighbor we display the node-to-peer IP peering, the routing protocol used, the state of the connection, etc.
+- BGP Advertised/Received Paths: For every BGP peering we display the number of paths received/advertised
+- BGP Accepted Paths: Time series graph of **received** BGP prefixes
+
+### Vlans
+
+Displays the APIC config for VLAN Pools and VMM Custom Trunk Ports in filterable tables.
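+
+Under the hood, every panel in the dashboards above is a `PromQL` query against the metrics exposed by the `aci-exporter`. As a purely illustrative sketch (the metric name below is a placeholder, not necessarily the exact series used by the dashboards):
+
+```promql
+# Top 10 interfaces by receive utilization, in the spirit of the
+# Nodes Interfaces dashboard (metric name is an assumption)
+topk(10, aci_interface_rx_utilization)
+```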
+
+## Loki backed Dashboards
+
+These dashboards use `Loki` as their data source, meaning the data we are visualizing comes from ACI Syslog messages.
+
+### Contract Drops Logs
+
+This dashboard parses the logs received from the switches and extracts information about contract drops. This requires a specific [config](syslog.md) on ACI and is limited to 500 messages/s per switch.
+
diff --git a/docs/deployment.md b/docs/deployment.md
new file mode 100644
index 0000000..26ca780
--- /dev/null
+++ b/docs/deployment.md
@@ -0,0 +1,278 @@
+
+# Stack Deployment
+
+## Pre Requisites
+- Familiarity with Kubernetes: This installation guide is intended to assist with the setup of the ACI Monitoring stack and assumes prior familiarity with Kubernetes; it is not designed to provide instruction on Kubernetes itself.
+- A Kubernetes Cluster: Currently the stack has been tested on `Upstream Kubernetes 1.30.x`, `Minikube` and `k3s`.
+  - Persistent Volumes: A total of 10G should be plenty for a small/demo environment. Many storage provisioners support volume expansion, so it should be easy to increase this post installation.
+  - Ability to expose services for:
+    - Access to the `Grafana`, `Prometheus` and `Alertmanager` dashboards: This will be ideally achieved via an `Ingress Controller`
+      - (Optional) Wildcard DNS Entries for the ingress controller domain.
+    - Syslog ingestion from ACI: Since the syslog can be sent via `UDP` or `TCP` it is required to expose these services directly via either a `NodePort` or a `LoadBalancer` service type
+  - Cluster Compute Resources: This stack has been tested against a 500 node ACI fabric and was consuming roughly 8GB of RAM; CPU resources didn't seem to play a major role, and any modern CPU should suffice.
+  - 1 Dedicated Namespace per instance: One instance can monitor at least 500 switches.
+    - This is not strictly required but is suggested to keep the HELM configuration simple so the default K8s service names can be re-used; see the [Config Preparation](#config-preparation) section for more details.
+- Helm: This stack is distributed as a helm chart and relies on 3rd party helm charts as well
+- Connectivity from your Kubernetes Cluster to ACI either over Out Of Band or In Band
+
+## Installation
+
+If you are installing on Minikube please follow the [Minikube Preparation Steps](minikube.md) and then **come back here.**
+
+## Config Preparation
+
+The ACI Monitoring Stack is a combination of several [Charts](../charts/aci-monitoring-stack/charts); if you are familiar with Helm, you are aware of the struggle to propagate dynamic values to sub-charts. For example, it is not possible to pass to a sub-chart the name of a service in a dynamic way.
+
+In order to simplify the user experience the `chart` comes with a few pre-configured parameters that are populated in the configurations of the various sub-charts.
+
+For example the aci-exporter Service Name is pre-configured as `aci-exporter-svc` and this value is then passed to Prometheus as the service discovery URL.
+
+All these values can be customized and if you need to you can refer to the [Values](../charts/aci-monitoring-stack/values.yaml) file.
+
+*Note:* This is the first Helm chart `camrossi` created, and he is sure it can be improved. If you have suggestions they are extremely welcome! :)
+
+### aci-exporter
+
+The aci-exporter is the bridge between your Cisco ACI environment and the `Prometheus` monitoring ecosystem; for it to work, it needs to know:
+- `fabrics`: A list of fabrics and how to connect to the APICs.
+  - Requires a **ReadOnly** **Admin** User
+- `service_discovery`: Select if devices are reachable via Out Of Band (`oobMgmtAddr`) or InBand (`inbMgmtAddr`).
+
+*Note:* The switches are auto-discovered.
+
+This is done by setting the following Values in Helm:
+
+```yaml
+aci_exporter:
+  # Profiles for different fabrics
+  fabrics:
+    fab1:
+      username:
+      password:
+      apic:
+        - https://IP1
+        - https://IP2
+        - https://IP3
+      # service_discovery oobMgmtAddr|inbMgmtAddr
+      service_discovery: oobMgmtAddr
+    fab2:
+      username:
+      password:
+      apic:
+        - https://IP1
+        - https://IP2
+        - https://IP3
+      # service_discovery oobMgmtAddr|inbMgmtAddr
+      service_discovery: inbMgmtAddr
+```
+### Prometheus and Alertmanager
+
+Prometheus is installed via its [own Chart](https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus); the options you need to set are:
+
+- The `ingress` config and the baseURL: These are most likely the same URL, which is used to access `prometheus` and `alertmanager`
+- Persistent Volume Capacity
+- (Optional) `retentionSize`: this is only needed if you want to limit the retention by size. Keep in mind that if you run out of disk space Prometheus WILL stop working.
+- (Optional) alertmanager `route`: these are used to send notifications via Mail/Webex etc.; the complete syntax is available [Here](https://prometheus.io/docs/alerting/latest/configuration/#receiver-integration-settings)
+
+Below is an example:
+```yaml
+prometheus:
+  server:
+    ingress:
+      enabled: true
+      ingressClassName: "traefik"
+      hosts:
+        - aci-exporter-prom.apps.c1.cam.ciscolabs.com
+    baseURL: "http://aci-exporter-prom.apps.c1.cam.ciscolabs.com"
+    service:
+      retentionSize: 5GB
+    persistentVolume:
+      accessModes: ["ReadWriteOnce"]
+      size: 5Gi
+
+  alertmanager:
+    baseURL: "http://aci-exporter-alertmanager.apps.c1.cam.ciscolabs.com"
+    ingress:
+      enabled: true
+      ingressClassName: "traefik"
+      hosts:
+        - host: aci-exporter-alertmanager.apps.c1.cam.ciscolabs.com
+          paths:
+            - path: /
+              pathType: ImplementationSpecific
+    config:
+      route:
+        group_by: ['alertname']
+        group_interval: 30s
+        repeat_interval: 30s
+        group_wait: 30s
+        receiver: 'webex'
+      receivers:
+        - name: webex
+          webex_configs:
+            - send_resolved: false
+              api_url: "https://webexapis.com/v1/messages"
+              room_id: ""
+              http_config:
+                authorization:
+                  credentials: ""
+```
+
+If you use Webex, here are some [config steps](webex.md) for you!
+
+### Grafana
+
+`Grafana` is installed via its [own Chart](https://github.com/grafana/helm-charts/tree/main/charts/grafana); the main options you need to set are:
+
+- The `ingress` config: External URL which can access Grafana.
+- Persistent Volume Capacity
+- (Optional) `adminPassword`: If not set, it will be auto-generated and can be found in the `grafana` secret
+- (Optional) `viewers_can_edit`: This allows users with a `view only` role to modify the dashboards and access `Explore` to execute queries against `Prometheus` and `Loki`. However, the user will not be able to save any changes.
+- (Optional) `deploymentStrategy`: if the Grafana `Persistent Volume` is of type `ReadWriteOnce`, rolling updates will get stuck as the new pod cannot start before the old one releases the PVC. Setting `deploymentStrategy.type` to `Recreate` destroys the original pod before starting the new one.
+
+Below is an example:
+
+```yaml
+grafana:
+  grafana.ini:
+    users:
+      viewers_can_edit: "True"
+  adminPassword:
+  deploymentStrategy:
+    type: Recreate
+  ingress:
+    ingressClassName: "traefik"
+    enabled: true
+    hosts:
+      - aci-exporter-grafana.apps.c1.cam.ciscolabs.com
+  persistence:
+    enabled: true
+    size: 2Gi
+```
+### Syslog config
+
+The syslog config is the most complicated part as it relies on 3 components (`promtail`, `loki` and `syslog-ng`) with their own individual configs. Furthermore, there are two issues we need to overcome:
+
+- The Syslog messages don't contain the ACI Fabric name: to be able to distinguish the messages of one fabric from another, the only solution is to use dedicated `external services` with a unique `IP:Port` pair per Fabric.
+- Until ACI 6.1 we need `syslog-ng` between `ACI` and `Promtail` to convert from RFC 3164 to 5424
+  *Note*: Promtail 3.1.0 adds support for RFC 3164; however, this **DOES NOT** work for Cisco Switches and still requires syslog-ng. syslog-ng's `syslog-parser` has extensive logic to handle all the complexities (and inconsistencies) of RFC 3164 messages.
+
+### Loki
+
+Loki is deployed with the [Simple Scalable](https://grafana.com/docs/loki/latest/get-started/deployment-modes/#simple-scalable) profile and is composed of a `backend`, `read` and `write` deployment with a replica count of 3.
+
+The `backend` and `write` deployments require persistent volumes. This chart is pre-configured to allocate 2Gi Volumes for each deployment (a total of 6 PVCs will be created):
+- `3 x data-loki-backend-X`
+- `3 x data-loki-write-X`
+
+The PVC size can be easily changed if required.
+
+Loki also requires an `Object Store`. This chart is pre-configured to deploy [minio](https://min.io/). *Note:* Currently the [Loki Chart](https://github.com/grafana/loki/tree/main/production/helm/loki) is deploying a very old version of `Minio` and there is a [PR open](https://github.com/grafana/loki/pull/11409) to address this already.
+
+Loki also supports a `chunks-cache` via `memcached`. The default config allocates 8G of memory; I have decreased this to 1G by default.
+
+If you want to change any of these parameters check the `loki` section in the [Values](../charts/aci-monitoring-stack/values.yaml) file.
+
+Assuming the default parameters are acceptable, the only required config for Loki is to set `rulerConfig.external_url` to point to the Grafana `ingress` URL:
+
+```yaml
+loki:
+  loki:
+    rulerConfig:
+      external_url: http://aci-exporter-grafana.apps.c1.cam.ciscolabs.com
+```
+
+### Promtail and Syslog-ng
+
+These two components are tightly coupled together.
+
+- `Syslog-ng` translates logs from RFC 3164 to RFC 5424 and forwards them to `Promtail`.
+- `Promtail` ingests logs in RFC 5424 format and forwards them to `Loki`.
+
+`Promtail` is pre-configured with:
+
+- Deployment Mode with 1 replica
+- Loki Push Gateway URL: `loki-gateway`. This is the Loki Gateway K8s service name.
+- Auto generated `scrapeConfigs` that will map a Fabric to an `IP:Port` pair.
+
+These settings can be easily changed if required; check the `Promtail` section in the [Values](../charts/aci-monitoring-stack/values.yaml) file for more details.
+
+`Syslog-ng` is pre-configured with:
+- Deployment Mode with 1 replica
+
+If you are happy with my defaults, the only configs required are the `extraPorts` for `Promtail` and `services` for `Syslog-ng`. You will need one entry per fabric and the ports need to "match"; see the diagram below for a visual representation.
+
+`Syslog-ng` is only needed for ACI < 6.1
+
+Below is a diagram of our goal for an ACI 6.1 fabric and an ACI 5.2 one.
+```mermaid
+flowchart-elk LR
+  subgraph K8s Cluster
+    subgraph Promtail
+      PT1513["TCP:1513 label:fab1"]
+      PT1514["TCP:1514 label:fab2"]
+    end
+    subgraph Syslog-ng
+      SL["UDP:1514"]
+    end
+    F1SVC["LoadBalancerIP TCP:1513"]
+    F2SVC["LoadBalancerIP UDP:1514"]
+
+    F1SVC --> PT1513
+    F2SVC --> SL
+  end
+  subgraph ACI
+    ACI61["ACI Fab1 Ver. 6.1"] --> F1SVC
+    ACI52["ACI Fab2 Ver. 5.2"] --> F2SVC
+  end
+  SL --> PT1514
+
+```
+
+The above architecture can be achieved with the following config:
+
+- `name`: This will set the `fabric` label for the logs received by Loki
+- `containerPort`: The port the container listens on. This maps a log stream to a fabric
+- `service.type`: I would suggest setting this to either `NodePort` or `LoadBalancer`. Regardless, the allocated IP MUST be reachable by all the Fabric Nodes.
+- `service.port`: The port the `LoadBalancer` service is listening on; this will be the port you set in the ACI Syslog config.
+- `service.nodePort`: The port the `NodePort` service is listening on; this will be the port you set in the ACI Syslog config.
+
+```yaml
+promtail:
+  extraPorts:
+    fab1:
+      name: fab1
+      containerPort: 1513
+      service:
+        type: LoadBalancer
+        port: 1513
+    fab2:
+      name: fab2
+      containerPort: 1516
+      service:
+        type: ClusterIP
+
+syslog:
+  services:
+    fab2:
+      name: fab2
+      containerPort: 1516
+      protocol: UDP
+      service:
+        type: LoadBalancer
+        port: 1516
+```
+
+### ACI Syslog Config
+If you need a reminder on how to configure ACI Syslog take a look [Here](syslog.md)
+
+## Example Config for 4 Fabrics
+Here you can see an [Example Config for 4 Fabrics](4-fabric-example.yaml)
+
+## Chart Deployment
+
+Once the configuration file is generated, i.e. `aci-mon-stack-config.yaml`, Helm can be used to deploy the stack:
+
+```shell
+helm repo add aci-monitoring-stack https://datacenter.github.io/aci-monitoring-stack
+helm repo update
+helm -n aci-mon-stack upgrade --install --create-namespace aci-mon-stack aci-monitoring-stack/aci-monitoring-stack -f aci-mon-stack-config.yaml
+```
\ No newline at end of file
diff --git a/docs/images/column-filter.png b/docs/images/column-filter.png
new file mode 100644
index 0000000..cd631f5
Binary files /dev/null and b/docs/images/column-filter.png differ
diff --git a/docs/images/dashboards.png b/docs/images/dashboards.png
new file mode 100644
index 0000000..5bb65f2
Binary files /dev/null and b/docs/images/dashboards.png differ
diff --git a/docs/images/fabric-filter.png b/docs/images/fabric-filter.png
new file mode 100644
index 0000000..59e8118
Binary files /dev/null and b/docs/images/fabric-filter.png differ
diff --git a/docs/images/faults.png b/docs/images/faults.png
new file mode 100644
index 0000000..1b62edb
Binary files /dev/null and b/docs/images/faults.png differ
diff --git a/docs/labs/images/lab1/EmptyDashboard.png b/docs/labs/images/lab1/EmptyDashboard.png
new file mode 100644
index 0000000..5af6d54
Binary files /dev/null and b/docs/labs/images/lab1/EmptyDashboard.png differ
diff --git a/docs/labs/images/lab1/TableView1.png b/docs/labs/images/lab1/TableView1.png
new file mode 100644
index 0000000..ee9c975
Binary files /dev/null and b/docs/labs/images/lab1/TableView1.png differ
diff --git a/docs/labs/images/lab1/TimeSeries.png b/docs/labs/images/lab1/TimeSeries.png
new file mode 100644
index 0000000..8d4379e
Binary files /dev/null and b/docs/labs/images/lab1/TimeSeries.png differ
diff --git a/docs/labs/images/lab1/Visualization.png b/docs/labs/images/lab1/Visualization.png
new file mode 100644
index 0000000..ab57bd1
Binary files /dev/null and b/docs/labs/images/lab1/Visualization.png differ
diff --git a/docs/labs/images/lab1/label-filtering-1.png b/docs/labs/images/lab1/label-filtering-1.png
new file mode 100644
index 0000000..343c171
Binary files /dev/null and b/docs/labs/images/lab1/label-filtering-1.png differ
diff --git a/docs/labs/images/lab1/label-filtering-dropdown.png b/docs/labs/images/lab1/label-filtering-dropdown.png
new file mode 100644
index 0000000..09e8965
Binary files /dev/null and b/docs/labs/images/lab1/label-filtering-dropdown.png differ
diff --git a/docs/labs/images/lab1/multiply.png b/docs/labs/images/lab1/multiply.png
new file mode 100644
index 0000000..656ab85
Binary files /dev/null and b/docs/labs/images/lab1/multiply.png differ
diff --git a/docs/labs/images/lab1/oganize.png b/docs/labs/images/lab1/oganize.png
new file mode 100644
index 0000000..7afcea5
Binary files /dev/null and b/docs/labs/images/lab1/oganize.png differ
diff --git a/docs/labs/images/lab1/queryformat.png b/docs/labs/images/lab1/queryformat.png
new file mode 100644
index 0000000..614d3a2
Binary files /dev/null and b/docs/labs/images/lab1/queryformat.png differ
diff --git a/docs/labs/images/lab1/table-wrong-time.png b/docs/labs/images/lab1/table-wrong-time.png
new file mode 100644
index 0000000..451664f
Binary files /dev/null and b/docs/labs/images/lab1/table-wrong-time.png differ
diff --git a/docs/labs/lab1.md b/docs/labs/lab1.md
new file mode 100644
index 0000000..d7168d3
--- /dev/null
+++ b/docs/labs/lab1.md
@@ -0,0 +1,163 @@
+# Overview
+
+This is a simple lab that builds a minimal dashboard showing data in a table format.
+
+# Access
+
+The Demo environment is hosted in a DMZ and can be accessed with the following credentials:
+
+https://64.104.255.11/
+
+user: `guest`
+password: `guest`
+
+The guest user is able to modify the dashboards and run `Explore` queries; however, it can't save any of the configuration changes.
+
+# Recreate the ACI Faults Dashboard
+
+This dashboard is a 1:1 copy of the faults that are present inside ACI. The main advantages compared to looking at the faults in the ACI UI are:
+- the ability to aggregate Faults from multiple Fabrics in a single table
+- advanced sorting and filtering
+
+![faults](../images/faults.png)
+
+By using the `Fabric` drop down menu you can select different Fabrics (or All) and you can use the Column headers to filter/sort the data:
+
+This is a good dashboard to understand how Grafana dashboards are built, so let's re-build the `Fault By Last Transition` table.
+
+**Note:** In this example we are focusing on the Grafana dashboard; *someone* already configured the `aci-exporter` and `Prometheus` to populate `aci_faults` with data. If you want to learn how to configure the `aci-exporter` and `Prometheus` to work together you can check out the [development](../development.md) guide.
+
+## Dashboard Editing
+
+*Warning:* Since this is an environment open to the internet I have not allowed users to save any config changes, so DO NOT close or reload the browser or you will lose your work!
+
+- Select `Dashboards` --> `Tests` --> `Dashboard Test 1` --> Move your mouse over the empty dashboard --> press `e` on the keyboard. This should open up the editing mode.
+- Ensure that you have:
+  - `Prometheus` selected as Data Source
+  - Selected the `Builder` mode: this is a good way to learn, but we will also look at the code afterwards.
+
+![alt text](images/lab1/EmptyDashboard.png)
+
+- From the Metric drop down menu select `aci_faults` and click `Run Query`; this will display a graph. In the legend you can see that for each metric we have info about the fabric, cause, description, etc.; the metric value itself (the 1.7Bil) is the Unix timestamp of the fault's last transition time.
+
+![alt text](images/lab1/TimeSeries.png)
+However this is not a very good visualization for this type of data: we can see interesting data in the legend, but a time series is really not the right visualization, as we are interested in a list of faults, aka a table!
+
+To switch to a `Table` view we need two steps:
+- Select the `Table Format` for our query: Go to `Options` --> `Format` --> Select `Table`
+
+- Select the `Table` from the Visualization drop down menu by clicking on `Time Series` and then picking `Table` (take a moment to see how many options there are here)
+
+- With just these two simple changes the data should look much better already, however:
+  - The `Time` and `created` columns are not the last transition time for the Fault but when the fault was first received in `Prometheus`; for our use case this is useless.
+  - The table contains a few "useless" columns that would be nice to hide
+  - The `Value` (last column) that represents the last transition time for our `Fault` is a long number, not a date
+
+To solve all these issues we need to manipulate our data; in this example we are going to use these 3 Grafana transformations:
+
+ - `Organize fields by name`: This will allow us to rename, re-order and hide the table columns
+ - `Convert field type`: This will allow us to convert the `Value` from a Unix timestamp to an actual human-readable date
+ - `Sort by`: To sort our events by last transition time, i.e. `Value`
+ - click on the `Transform Data` tab and select `Add Transformation`
+
+## Organize fields by name:
+
+Select `Transform Data` --> `Add Transformation` --> `Organize fields by name`
+![alt text](images/lab1/oganize.png)
+Here you can:
+- Change the ordering of the fields, by dragging them by the vertical dots on the left
+- Hide them, by clicking on the `eye` symbol
+- Rename them by adding text in the empty box on the right of the field name
+
+You are free to sort things as you please but I would recommend you at least:
+- Hide:
+  - Time
+  - aci
+  - created
+  - instance
+  - job
+- Rename:
+  - `Value` to `Last Transition`
+  - Place `Last Transition` as the first item in the table
+
+## Convert field type:
+
+Select `Add another transformation` --> `Convert field type`
+- Field: `Last Transition`
+- Type: `Time`
+
+If you have placed `Last Transition` as the first column you should now see dates, but you probably also notice they are not quite right, as they show 1970.
+This is because the epoch is expected in milliseconds since 1970 but what we are getting is seconds; we will fix this later, for now ignore it.
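+
+To see why the dates land in 1970, here is a quick worked example (the timestamp is illustrative):
+
+```text
+1719792000          read as seconds       -> 2024-07-01 00:00:00 UTC
+1719792000          read as milliseconds  -> 1970-01-20 21:43:12 UTC
+1719792000 * 1000 = 1719792000000 ms      -> 2024-07-01 00:00:00 UTC
+```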
+
+## Sort by:
+
+Select `Add another transformation` --> `Sort By`
+- Field: `Last Transition`
+- Reverse: Enabled
+
+Depending on how you have organized the fields, your table should look something like this:
+
+![alt text](images/lab1/table-wrong-time.png)
+
+## Fix the `Last Transition` timestamp:
+
+All we have to do is multiply the `Last Transition` (aka the Value of our Metric) by 1000.
+
+- Click on `Query` --> `+ Operations`
+- In the `Search` tab enter `multiply` and select `Multiply by scalar`
+![alt text](images/lab1/multiply.png)
+- Set the `Value` to 1000
+- Click `Run Queries`
+
+Now the time should be reflected correctly.
+
+## Switch to Code
+
+The query Builder is a great learning tool, but as you start building more complex queries it will become too cumbersome to use, and some advanced capabilities are also not available, so it is a good idea to also learn the `PromQL` syntax. Try to click on `Code` and you should see that the same expression can be written as:
+
+`aci_faults * 1000`
+
+## Filter by Fabric
+
+We will do these steps in Code mode to learn a bit more. `PromQL` supports filtering your queries by labels; this is super easy, just open a `{` after the metric name and you should see a dropdown menu with all our labels!
+
+If you want, for example, to show Faults only from `site1` you can type the following query `aci_faults{fabric="site1"} * 1000` and now only faults from `site1` should appear. If you want to filter by using Regular Expressions (`RegEx`) you can replace `=` with `=~`; we will use this syntax in the next task.
+
+## Filter by Fabric with Dashboard Variables
+
+Variables in Grafana allow you to create dynamic and interactive dashboards by enabling you to define placeholders that can be replaced with different values at runtime.
+
+If you are still on the dashboard editing pane, let's modify our query to look like this:
+
+`aci_faults{fabric=~"$fabric"} * 1000`
+
+The `fabric=~"$fabric"` part simply tells `Grafana` to use the variable `$fabric` in this filter, and the `=~` also allows us to treat this filtering expression as a `RegEx` so that we can select 1 or more fabrics at the same time.
+
+Click apply; this will result in an *empty* dashboard. This is expected, since the variable `$fabric` does not exist yet!
+
+To create the `$fabric` variable to select our `sites`, follow these steps:
+
+- Click on the gear icon (settings in the top right) and select "Variables."
+- Click "New variable"
+  - Select variable type: `Query`
+  - Name: `fabric`
+  - Display Name: `Fabric`
+  - Show on dashboard: `Labels and Values`
+  - Data source: `Prometheus`
+  - Query
+    - Query type: `Labels Values`
+    - Label: `fabric` **Warning** select the `fabric` label **DO NOT** select `$fabric`
+  - Selection options: Enabled `Multi-Values` and `Include All option`
+- Click `Apply`
+- Click `Close`
+- If the Dashboard is still empty click the refresh button on the Top Right (the two spinning arrows)
+
+Now your dashboard will have a new drop down menu where you can dynamically select the fabric to display!
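+
+## Recap
+
+The whole table is driven by a single `PromQL` expression, with the table transformations layered on top of its result:
+
+```promql
+# Faults for the fabrics selected via the dashboard variable,
+# converted from seconds to milliseconds for the time column
+aci_faults{fabric=~"$fabric"} * 1000
+```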