diff --git a/gdi/opentelemetry/collector-kubernetes/collector-kubernetes-intro.rst b/gdi/opentelemetry/collector-kubernetes/collector-kubernetes-intro.rst index 8446d76e4..7ba732d95 100644 --- a/gdi/opentelemetry/collector-kubernetes/collector-kubernetes-intro.rst +++ b/gdi/opentelemetry/collector-kubernetes/collector-kubernetes-intro.rst @@ -22,8 +22,7 @@ Get started with the Collector for Kubernetes Default Kubernetes metrics Upgrade Uninstall - Troubleshoot - Troubleshoot containers + Troubleshoot Support Tutorial: Monitor your Kubernetes environment Tutorial: Configure the Collector for Kubernetes @@ -75,8 +74,7 @@ To upgrade or uninstall, see: If you have any installation or configuration issues, refer to: -* :ref:`otel-troubleshooting` -* :ref:`troubleshoot-k8s` +* :ref:`troubleshoot-k8s-landing` * :ref:`kubernetes-support` .. raw:: html diff --git a/gdi/opentelemetry/collector-kubernetes/k8s-infrastructure-tutorial/about-k8s-tutorial.rst b/gdi/opentelemetry/collector-kubernetes/k8s-infrastructure-tutorial/about-k8s-tutorial.rst index e549b8765..064917c44 100644 --- a/gdi/opentelemetry/collector-kubernetes/k8s-infrastructure-tutorial/about-k8s-tutorial.rst +++ b/gdi/opentelemetry/collector-kubernetes/k8s-infrastructure-tutorial/about-k8s-tutorial.rst @@ -16,7 +16,7 @@ Tutorial: Monitor your Kubernetes environment in Splunk Observability Cloud k8s-monitor-with-navigators k8s-activate-detector -Deploy the Splunk Distribution of OpenTelemetry Collector in a Kubernetes cluster and start monitoring your Kubernetes platform using Splunk Observability Cloud. +Deploy the Splunk Distribution of the OpenTelemetry Collector in a Kubernetes cluster and start monitoring your Kubernetes platform using Splunk Observability Cloud. .. raw:: html diff --git a/gdi/opentelemetry/collector-kubernetes/troubleshoot-k8s-container.rst b/gdi/opentelemetry/collector-kubernetes/k8s-troubleshooting/troubleshoot-k8s-container.rst similarity index 88% rename from gdi/opentelemetry/collector-kubernetes/troubleshoot-k8s-container.rst rename to gdi/opentelemetry/collector-kubernetes/k8s-troubleshooting/troubleshoot-k8s-container.rst index 1e45cf9d0..0061ae755 100644 --- a/gdi/opentelemetry/collector-kubernetes/troubleshoot-k8s-container.rst +++ b/gdi/opentelemetry/collector-kubernetes/k8s-troubleshooting/troubleshoot-k8s-container.rst @@ -1,41 +1,19 @@ .. _troubleshoot-k8s-container: *************************************************************** -Troubleshoot the Collector for Kubernetes containers +Troubleshoot Kubernetes and container runtime compatibility *************************************************************** .. meta:: - :description: Describes troubleshooting specific to the Collector for Kubernetes containers. + :description: Describes troubleshooting specific to Kubernetes and container runtime compatibility. -.. note:: For general troubleshooting, see :ref:`otel-troubleshooting` and :ref:`troubleshoot-k8s`. +.. note:: + + See also: -Verify if your container is running out of memory -======================================================================= - -Even if you didn't provide enough resources for the Collector containers, under normal circumstances the Collector doesn't run out of memory (OOM). This can only happen if the Collector is heavily throttled by the backend and exporter sending queue growing faster than collector can control memory utilization. In that case you should see ``429`` errors for metrics and traces or ``503`` errors for logs. - -For example: - -.. 
code-block::
-
-    2021-11-12T00:22:32.172Z info exporterhelper/queued_retry.go:325 Exporting failed. Will retry the request after interval. {"kind": "exporter", "name": "sapm", "error": "server responded with 429", "interval": "4.4850027s"}
-    2021-11-12T00:22:38.087Z error exporterhelper/queued_retry.go:190 Dropping data because sending_queue is full. Try increasing queue_size. {"kind": "exporter", "name": "sapm", "dropped_items": 1348}
-
-If you can't fix throttling by bumping limits on the backend or reducing amount of data sent through the Collector, you can avoid OOMs by reducing the sending queue of the failing exporter. For example, you can reduce ``sending_queue`` for the ``sapm`` exporter:
-
-.. code-block:: yaml
-
-  agent:
-    config:
-      exporters:
-        sapm:
-          sending_queue:
-            queue_size: 512
-
-You can apply a similar configuration to any other failing exporter.
-
-Kubernetes and container runtime compatibility
-=============================================================================================
+  * :ref:`troubleshoot-k8s-general`
+  * :ref:`troubleshoot-k8s-sizing`
+  * :ref:`troubleshoot-k8s-missing-metrics`
 
 Kubernetes requires you to install a container runtime on each node in the cluster so that pods can run there. The Splunk Distribution of the Collector for Kubernetes supports container runtimes such as containerd, CRI-O, Docker, and Mirantis Kubernetes Engine (formerly Docker Enterprise/UCP).
@@ -52,7 +30,7 @@ For more information about runtimes, see :new-page:`Container runtime
+
diff --git a/gdi/opentelemetry/collector-kubernetes/k8s-troubleshooting/troubleshoot-k8s-landing.rst b/gdi/opentelemetry/collector-kubernetes/k8s-troubleshooting/troubleshoot-k8s-landing.rst
new file mode 100644
index 000000000..ecf6519ab
--- /dev/null
+++ b/gdi/opentelemetry/collector-kubernetes/k8s-troubleshooting/troubleshoot-k8s-landing.rst
@@ -0,0 +1,28 @@
+.. _troubleshoot-k8s-landing:
+
+***************************************************************
+Troubleshoot the Collector for Kubernetes
+***************************************************************
+
+.. meta::
+    :description: Describes troubleshooting for the Collector for Kubernetes.
+
+.. toctree::
+    :hidden:
+    :maxdepth: 4
+
+    Troubleshooting
+    Sizing
+    Missing metrics
+    Container runtime compatibility
+
+
+To troubleshoot the Splunk Distribution of the OpenTelemetry Collector for Kubernetes, see:
+
+* :ref:`troubleshoot-k8s`
+* :ref:`troubleshoot-k8s-sizing`
+* :ref:`troubleshoot-k8s-missing-metrics`
+* :ref:`troubleshoot-k8s-container`
+
+
diff --git a/gdi/opentelemetry/collector-kubernetes/k8s-troubleshooting/troubleshoot-k8s-missing-metrics.rst b/gdi/opentelemetry/collector-kubernetes/k8s-troubleshooting/troubleshoot-k8s-missing-metrics.rst
new file mode 100644
index 000000000..c3c4df16a
--- /dev/null
+++ b/gdi/opentelemetry/collector-kubernetes/k8s-troubleshooting/troubleshoot-k8s-missing-metrics.rst
@@ -0,0 +1,90 @@
+.. _troubleshoot-k8s-missing-metrics:
+
+***************************************************************
+Troubleshoot missing metrics
+***************************************************************
+
+.. meta::
+    :description: Describes troubleshooting specific to missing metrics in the Collector for Kubernetes.
+
+.. note::
+
+    See also:
+
+    * :ref:`troubleshoot-k8s-general`
+    * :ref:`troubleshoot-k8s-sizing`
+    * :ref:`troubleshoot-k8s-container`
+
+The Collector for Kubernetes is missing metrics starting with ``k8s.pod.*`` and ``k8s.node.*``
+========================================================================================================
+
+After deploying the Splunk Distribution of the OpenTelemetry Collector for Kubernetes Helm chart version 0.87.0 or higher, either as a new install or as an upgrade, the following pod and node metrics are not collected:
+
+* ``k8s.(pod/node).cpu.time``
+* ``k8s.(pod/node).cpu.utilization``
+* ``k8s.(pod/node).filesystem.available``
+* ``k8s.(pod/node).filesystem.capacity``
+* ``k8s.(pod/node).filesystem.usage``
+* ``k8s.(pod/node).memory.available``
+* ``k8s.(pod/node).memory.major_page_faults``
+* ``k8s.(pod/node).memory.page_faults``
+* ``k8s.(pod/node).memory.rss``
+* ``k8s.(pod/node).memory.usage``
+* ``k8s.(pod/node).memory.working_set``
+* ``k8s.(pod/node).network.errors``
+* ``k8s.(pod/node).network.io``
+
+Confirm the metrics are missing
+--------------------------------------------------------------------
+
+To confirm that these metrics are missing, perform the following steps:
+
+1. Confirm that the metrics are missing with the following Splunk Search Processing Language (SPL) command:
+
+.. code-block::
+
+   | mstats count(_value) as "Val" where index="otel_metrics_0_93_3" AND metric_name IN (k8s.pod.*, k8s.node.*) by metric_name
+
+2. Check the Collector agent pod logs from the CLI of the Kubernetes node with the following command. Replace ``namespace`` and ``collector-agent-pod-name`` with the values from your environment:
+
+.. code-block::
+
+   kubectl -n {namespace} logs {collector-agent-pod-name}
+
+3. Look for a ``tls: failed to verify certificate`` error in the agent pod logs, similar to the following:
+
+.. code-block::
+
+   2024-02-28T01:11:24.614Z error scraperhelper/scrapercontroller.go:200 Error scraping metrics {"kind": "receiver", "name": "kubeletstats", "data_type": "metrics", "error": "Get \"https://10.202.38.255:10250/stats/summary\": tls: failed to verify certificate: x509: cannot validate certificate for 10.202.38.255 because it doesn't contain any IP SANs", "scraper": "kubeletstats"}
+   go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport
+   go.opentelemetry.io/collector/receiver@v0.93.0/scraperhelper/scrapercontroller.go:200
+   go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1
+   go.opentelemetry.io/collector/receiver@v0.93.0/scraperhelper/scrapercontroller.go:176
+
+Resolution
+--------------------------------------------------------------------
+
+The :ref:`kubelet-stats-receiver` collects ``k8s.pod.*`` and ``k8s.node.*`` metrics from the Kubernetes endpoint ``/stats/summary``. As of version 0.87.0 of the Splunk OTel Collector, the kubelet certificate is verified during this process to confirm that it's valid. If you are using a self-signed or invalid certificate, the Kubelet stats receiver cannot collect the metrics.
+
+You have two alternatives to resolve this error:
+
+1. Add a valid certificate to your Kubernetes cluster. To learn how, see :ref:`otel-kubernetes-config`. After updating the ``values.yaml`` file, use the Helm upgrade command to upgrade your Collector deployment.
+
+2. Disable certificate verification in the Kubelet stats receiver of the Collector agent by setting ``insecure_skip_verify: true`` for the receiver in the ``agent.config`` section of the ``values.yaml`` file.
+
+For example, use the following configuration to disable certificate verification:
+
+.. code-block::
+
+   agent:
+     config:
+       receivers:
+         kubeletstats:
+           insecure_skip_verify: true
+
+.. caution:: Consider your security requirements before disabling certificate verification.
+
+
diff --git a/gdi/opentelemetry/collector-kubernetes/k8s-troubleshooting/troubleshoot-k8s-sizing.rst b/gdi/opentelemetry/collector-kubernetes/k8s-troubleshooting/troubleshoot-k8s-sizing.rst
new file mode 100644
index 000000000..76f95ee2e
--- /dev/null
+++ b/gdi/opentelemetry/collector-kubernetes/k8s-troubleshooting/troubleshoot-k8s-sizing.rst
@@ -0,0 +1,68 @@
+.. _troubleshoot-k8s-sizing:
+
+***************************************************************
+Troubleshoot sizing for the Collector for Kubernetes
+***************************************************************
+
+.. meta::
+    :description: Describes troubleshooting specific to sizing the Collector for Kubernetes containers.
+
+.. note::
+
+    See also:
+
+    * :ref:`troubleshoot-k8s-general`
+    * :ref:`troubleshoot-k8s-missing-metrics`
+    * :ref:`troubleshoot-k8s-container`
+
+Size your Collector instance
+=============================================================================================
+
+Set the resources allocated to your Collector instance based on the amount of data you expect to handle. For more information, see :ref:`otel-sizing`.
+
+Use the following configuration to bump resource limits for the agent:
+
+.. code-block:: yaml
+
+   agent:
+     resources:
+       limits:
+         cpu: 500m
+         memory: 1Gi
+
+Set the resources allocated to your cluster receiver deployment based on the cluster size. For example, for a cluster with 100 nodes, allocate these resources:
+
+.. code-block:: yaml
+
+   clusterReceiver:
+     resources:
+       limits:
+         cpu: 1
+         memory: 2Gi
+
+
+Verify if your container is running out of memory
+=======================================================================
+
+Under normal circumstances, the Collector doesn't run out of memory (OOM), even if you didn't provide enough resources for the Collector containers. OOM can only happen if the Collector is heavily throttled by the backend and the exporter sending queue grows faster than the Collector can control its memory utilization. In that case, you see ``429`` errors for metrics and traces or ``503`` errors for logs.
+
+For example:
+
+.. code-block::
+
+   2021-11-12T00:22:32.172Z info exporterhelper/queued_retry.go:325 Exporting failed. Will retry the request after interval. {"kind": "exporter", "name": "sapm", "error": "server responded with 429", "interval": "4.4850027s"}
+   2021-11-12T00:22:38.087Z error exporterhelper/queued_retry.go:190 Dropping data because sending_queue is full. Try increasing queue_size. {"kind": "exporter", "name": "sapm", "dropped_items": 1348}
+
+If you can't fix throttling by raising limits on the backend or reducing the amount of data sent through the Collector, you can avoid OOMs by reducing the sending queue of the failing exporter. For example, you can reduce ``sending_queue`` for the ``sapm`` exporter:
+
+.. code-block:: yaml
+
+   agent:
+     config:
+       exporters:
+         sapm:
+           sending_queue:
+             queue_size: 512
+
+You can apply a similar configuration to any other failing exporter.
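+For instance, if the ``503`` errors come from your logs pipeline, the same ``sending_queue`` setting applies to the logs exporter. The following sketch assumes the logs pipeline uses the ``splunk_hec`` exporter and that a queue size of 512 fits your workload; substitute the exporter name and queue size from your own configuration:
+
+.. code-block:: yaml
+
+   agent:
+     config:
+       exporters:
+         # Illustrative exporter name; match the exporter that reports the throttling errors
+         splunk_hec:
+           sending_queue:
+             queue_size: 512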
+

diff --git a/gdi/opentelemetry/collector-kubernetes/troubleshoot-k8s.rst b/gdi/opentelemetry/collector-kubernetes/k8s-troubleshooting/troubleshoot-k8s.rst
similarity index 60%
rename from gdi/opentelemetry/collector-kubernetes/troubleshoot-k8s.rst
rename to gdi/opentelemetry/collector-kubernetes/k8s-troubleshooting/troubleshoot-k8s.rst
index dac2f70c9..d9d161976 100644
--- a/gdi/opentelemetry/collector-kubernetes/troubleshoot-k8s.rst
+++ b/gdi/opentelemetry/collector-kubernetes/k8s-troubleshooting/troubleshoot-k8s.rst
@@ -1,19 +1,23 @@
 .. _tshoot-k8s-container-runtimes:
 .. _troubleshoot-k8s:
+.. _troubleshoot-k8s-general:
 
 ***************************************************************
-Troubleshoot the Collector for Kubernetes
+General troubleshooting for the Collector for Kubernetes
 ***************************************************************
 
 .. meta::
     :description: Describes troubleshooting specific to the Collector for Kubernetes.
 
 .. note::
-
-  For general troubleshooting, see :ref:`otel-troubleshooting`.
-  To troubleshoot issues with your Kubernetes containers, see :ref:`troubleshoot-k8s-container`.
+
+  See also:
 
-Debug logging for the Splunk Otel Collector in Kubernetes
+  * :ref:`troubleshoot-k8s-sizing`
+  * :ref:`troubleshoot-k8s-missing-metrics`
+  * :ref:`troubleshoot-k8s-container`
+
+Debug logging for the Splunk OpenTelemetry Collector in Kubernetes
 =============================================================================================
 
 You can change the logging level of the Collector from ``info`` to ``debug`` to help you troubleshoot.
 
@@ -55,27 +59,4 @@ To view logs, use:
 
     kubectl logs {splunk-otel-collector-agent-pod}
 
-Size your Collector instance
-=============================================================================================
-
-Set the resources allocated to your Collector instance based on the amount of data you expecte to handle. For more information, see :ref:`otel-sizing`.
-
-Use the following configuration to bump resource limits for the agent:
-
-.. code-block:: yaml
-
-  agent:
-    resources:
-      limits:
-        cpu: 500m
-        memory: 1Gi
-
-Set the resources allocated to your cluster receiver deployment based on the cluster size. For example, for a cluster with 100 nodes alllocate these resources:
-
-.. code-block:: yaml
-  clusterReceiver:
-    resources:
-      limits:
-        cpu: 1
-        memory: 2Gi
diff --git a/gdi/opentelemetry/components/kubelet-stats-receiver.rst b/gdi/opentelemetry/components/kubelet-stats-receiver.rst
index f404178bb..5a88630bc 100644
--- a/gdi/opentelemetry/components/kubelet-stats-receiver.rst
+++ b/gdi/opentelemetry/components/kubelet-stats-receiver.rst
@@ -198,6 +198,8 @@ For example, to collect only node and pod metrics from the receiver:
     - node
     - pod
 
+.. _kubelet-stats-receiver-optional-parameters:
+
 Configure optional parameters
 --------------------------------------------------------------