
Incorrect Container Memory Consumption Graph Behavior When Pod is Restarted #2522

vladmalynych opened this issue on Sep 19, 2024

Problem:

The Grafana dashboards defined in grafana-dashboardDefinitions.yaml include per-pod memory consumption graphs. The query currently used for memory consumption is:

https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/grafana-dashboardDefinitions.yaml#L8300

                  "targets": [
                      {
                          "datasource": {
                              "type": "prometheus",
                              "uid": "${datasource}"
                          },
                          "expr": "sum(container_memory_working_set_bytes{job=\"kubelet\", metrics_path=\"/metrics/cadvisor\", cluster=\"$cluster\", namespace=\"$namespace\", pod=\"$pod\", container!=\"\", image!=\"\"}) by (container)",
                          "legendFormat": "__auto"
                      },
                      {
                          "datasource": {
                              "type": "prometheus",
                              "uid": "${datasource}"
                          },
                          "expr": "sum(\n    kube_pod_container_resource_requests{job=\"kube-state-metrics\", cluster=\"$cluster\", namespace=\"$namespace\", pod=\"$pod\", resource=\"memory\"}\n)\n",
                          "legendFormat": "requests"
                      },
                      {
                          "datasource": {
                              "type": "prometheus",
                              "uid": "${datasource}"
                          },
                          "expr": "sum(\n    kube_pod_container_resource_limits{job=\"kube-state-metrics\", cluster=\"$cluster\", namespace=\"$namespace\", pod=\"$pod\", resource=\"memory\"}\n)\n",
                          "legendFormat": "limits"
                      }
                  ],
                  "title": "Memory Usage (WSS)",
                  "type": "timeseries"
              },

When a pod is restarted, cAdvisor briefly exposes container_memory_working_set_bytes series for both the old and the new container (each with a distinct id label), and sum(...) by (container) adds them together. This causes temporary spikes in the displayed memory consumption: the dashboard can show usage that exceeds the container's memory limit even though actual consumption stayed within the limit.

(Screenshots from 2024-09-17: the Memory Usage (WSS) panel briefly reports usage above the pod's memory limit immediately after a restart.)
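One possible mitigation (not something the shipped dashboard does; just a sketch reusing the same label selectors as the query above) is to take max instead of sum across the per-container series, so the brief window in which both the old and the new container report values is not double-counted:

    # Sketch of an alternative expression: within a single pod each container
    # name normally has exactly one series, so max matches sum in steady state
    # but picks the larger value during the restart overlap instead of adding both.
    max by (container) (
      container_memory_working_set_bytes{
        job="kubelet", metrics_path="/metrics/cadvisor",
        cluster="$cluster", namespace="$namespace", pod="$pod",
        container!="", image!=""
      }
    )

The trade-off: max hides the old container's tail entirely, whereas grouping by (container, id), as in the reproduction step below, keeps the old and new containers visible as separate lines.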

Steps to Reproduce:

  • Trigger a pod restart (e.g., via an OOM kill or an eviction).
  • Compare the graph produced by the current expression, which groups only by container, with a graph whose expression groups by both container and id (see the diagnostic sketch after this list):

    "expr": "sum(container_memory_working_set_bytes{job=\"kubelet\", metrics_path=\"/metrics/cadvisor\", cluster=\"$cluster\", namespace=\"$namespace\", pod=\"$pod\", container!=\"\", image!=\"\"}) by (container, id)"