Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POD Networking metrics missing with crio-1.30 and kubelet-1.30 #3577

Open
rkojedzinszky opened this issue Aug 10, 2024 · 11 comments
Open

POD Networking metrics missing with crio-1.30 and kubelet-1.30 #3577

rkojedzinszky opened this issue Aug 10, 2024 · 11 comments

Comments

@rkojedzinszky
Copy link

It seems that eac1257 breaks network metric collection. At least, with crio-1.30 with defaults.

If I apply the simple diff, I can get container metrics again:

--- a/vendor/github.com/google/cadvisor/container/crio/handler.go
+++ b/vendor/github.com/google/cadvisor/container/crio/handler.go
@@ -155,6 +155,7 @@ func newCrioContainerHandler(
        // reported. This stops metrics being reported multiple times for each
        // container in a pod.
        metrics := common.RemoveNetMetrics(includedMetrics, cInfo.Labels["io.kubernetes.container.name"] != "POD")
+       metrics = includedMetrics
 
        libcontainerHandler := containerlibcontainer.NewHandler(cgroupManager, rootFs, cInfo.Pid, metrics)

The command used to test this:

curl -s -H "Authorization: bearer $TOKEN" -k https://192.168.111.122:10250/metrics/cadvisor | grep "^container_network_transmit_packets_total"

Perhaps, cri-o does not have or keep a POD named container?

@rkojedzinszky
Copy link
Author

It seems that, if I set drop_infra_ctr = false in crio.conf, it also solves the problem.

@kwilczynski
Copy link

Cc @kolyshkin, the author of the mentioned Pull Request.

@kolyshkin
Copy link
Contributor

From a cursory look, commit eac1257 does not change the metrics being reported. If you take a look at the (removed) needNet method and its usage in the code before the commit, you will see that the metrics were dropped (after collecting) using the same criterion (checking if "io.kubernetes.container.name" label is "POD").

@kolyshkin
Copy link
Contributor

To fix the issue, I guess the criteria used for excluding network metrics (i.e. the second argument of common.RemoveNetMetrics) should be different depending on whether cri-o uses infra container or not. Perhaps @haircommander can shed more light.

@sohankunkerkar
Copy link

I think this piece of code will not work for CRI-O. If the infra container has empty network metrics, the crio handler uses a running container in the pod to gather the metrics. Therefore, metrics must be collected from all containers to ensure that if there's a running container in the podthe necessary metrics are gathered.

@kolyshkin
Copy link
Contributor

I think this piece of code will not work for CRI-O. If the infra container has empty network metrics, the crio handler uses a running container in the pod to gather the metrics. Therefore, metrics must be collected from all containers to ensure that if there's a running container in the podthe necessary metrics are gathered.

This is what I meant in the comment above -- find a way to see if infra container is used, and fix the second argument to common.RemoveNetMetrics accordingly.

@haircommander
Copy link
Contributor

cri-o should still be creating an empty cgroup for the infra container so cadvisor is aware of it, and then reporting the PID of the infra container as being one of the other containers in the pod (so network metrics can be collected). It's possible something in that broke, we should make sure cadvisor is sees the infra container cgroup (and that cri-o create it)

@haircommander
Copy link
Contributor

I actually think ca820b6 would fix this. @iwankgb do you think it'd be possible to cherry-pick this to a 0.49.1 (that we create after the pick) so we can pull into kubernetes 1.30?

@iwankgb
Copy link
Collaborator

iwankgb commented Sep 7, 2024

@haircommander I would love to help, but I am not able to cut a release. Someone from Google (@bobbypage, is this still you?) needs to do this.

@rkojedzinszky
Copy link
Author

Any updates on this?

@haircommander
Copy link
Contributor

we still need a new release cut. @bobbypage are you able to do that or grant @iwankgb the ability?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants