diff --git a/docs/other/monitor-kubernetes-kuantifier.md b/docs/other/monitor-kubernetes-kuantifier.md
new file mode 100644
index 000000000..b872a2ace
--- /dev/null
+++ b/docs/other/monitor-kubernetes-kuantifier.md
@@ -0,0 +1,191 @@
+title: Monitor Kubernetes Workloads with Kuantifier
+DateReviewed: 2024-08-16
+
+Monitoring Kubernetes Workloads with Kuantifier
+===============================================
+
+Workload jobs run via Kubernetes do not integrate with [Gratia accounting](./troubleshooting-gratia/) by default.
+To report contributions to OSG made via Kubernetes, install the [Kuantifier][kuantifier-github] Helm chart
+into your cluster.
+
+Before Starting
+---------------
+
+### Confirm access to a running Kubernetes cluster
+
+All subsequent instructions assume you have access to a running Kubernetes cluster and can run [kubectl][kubectl]
+against that cluster.
+
+### Install the Helm command line tools
+
+Kuantifier itself, and several of its prerequisites, are installed via [Helm charts](https://helm.sh/). The Helm
+command line tools are used to install Helm charts into a running Kubernetes cluster, and can be installed
+as follows:
+
+1. Download the latest [helm release](helm-release)
+1. Unpack the release archive (e.g. `tar -zxvf helm-v3.0.0-linux-amd64.tar.gz`)
+1. Move the `helm` binary from the archive into a location on your `$PATH` (e.g. `mv linux-amd64/helm ~/.local/bin`)
+
+### Install Prometheus and kube-state-metrics in your Kubernetes cluster
+
+Kuantifier relies on [Prometheus][prometheus] with [kube-state-metrics][kube-state-metrics] to account for pod resource usage.
+There are a number of ways to install both, such as via the [Prometheus community Helm charts][prometheus-community]:
+
+1. Add the Prometheus community Helm repository to your local Helm installation
+
+        :::console
+        helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+        helm repo update
+
+1. 
Install the kube-state-metrics and Prometheus Helm charts into your Kubernetes cluster.
+
+        :::console
+        helm install kube-state-metrics prometheus-community/kube-state-metrics
+        helm install prometheus prometheus-community/prometheus
+
+    !!! note
+        These commands install into the `default` namespace. To install into a
+        different namespace such as `monitoring`, pass the `-n monitoring` flag to `helm install`.
+
+1. Ensure that the namespace where your workload pods run is properly configured.
+
+    - Kuantifier relies on the `spec.containers[].resources.requests.cpu` field in workload pods
+      to determine processor count for GRACC reporting. Ensure a CPU request is set for pods in
+      your namespace.
+
+    - Kuantifier relies on the Prometheus pod completion time metric to calculate workload job run times.
+      This metric is sometimes missed for pods that are spontaneously deleted, such as those created by
+      Deployments. For best results, run workload pods via Kubernetes Jobs.
+
+    - (Known issue) Kuantifier currently doesn't support calculating usage metrics for workload pods
+      running multiple containers. Ensure that workload pods in your namespace have only one container.
+
+Installation
+------------
+
+Kuantifier itself is also installed via a Helm chart, hosted at hub.opensciencegrid.org/iris-hep/kuantifier.
+
+
+### Configuring Kuantifier's Values File
+
+Several instance-specific modifications to the default [Values File][values-file] provided with the chart
+must be made prior to installation. For full documentation of the values in the values file, see the
+[Helm chart README on GitHub][helm-values-readme].
+
+1. Fetch the default values.yaml for Kuantifier. This file can be obtained in several ways.
+    - Via the helm CLI:
+
+            :::console
+            helm show values oci://hub.opensciencegrid.org/iris-hep/kuantifier
+
+    - Via the [Kuantifier GitHub repository](values-github).
+
+        !!! 
note
+            Ensure that the release tag for values.yaml in the git repo corresponds to the version of the chart you're installing.
+
+
+1. Update the top-level `.outputFormat` in values.yaml to output records to [GRACC](https://gracc.opensciencegrid.org/)
+
+        :::yaml
+        outputFormat: "gratia"
+
+1. Update the `.processor.config` map with the details of your deployment.
+    - All of the following need to be set:
+        - `NAMESPACE`: The namespace of the pods for which Kuantifier will collect and report metrics.
+
+            !!! note
+                Each installation of Kuantifier only reports on pods in a single namespace. You must
+                install multiple instances of the chart to support reporting on multiple namespaces.
+
+        - `SITE_NAME`: The name of the site being reported.
+        - `SUBMIT_HOST`: Uniquely identifying name for the Kubernetes cluster where your workload pods run, in FQDN format.
+        - `VO_NAME`: Virtual Organization (VO) of jobs.
+
+    - Additionally, the following may need to be set:
+        - `PROMETHEUS_SERVER`: The DNS name of the Prometheus server installed in your Kubernetes cluster.
+            - If Prometheus was installed in your cluster via the Prometheus community Helm chart in the `monitoring`
+              namespace, the DNS name will be `prometheus-server.monitoring.svc.cluster.local`
+            - Otherwise, [construct](https://kubernetes.io/docs/concepts/services-networking/service/#dns) the DNS name based on the standard Kubernetes service discovery mechanism (i.e. service name and namespace).
+
+    - A fully configured `.processor.config` might look like:
+
+            :::yaml
+            processor:
+              config:
+                NAMESPACE: workload-namespace
+                SITE_NAME: CHTC
+                VO_NAME: University of Wisconsin
+                SUBMIT_HOST: tiger-cluster.chtc.wisc.edu
+                PROMETHEUS_SERVER: prometheus-server.monitoring.svc.cluster.local
+
+1. 
(Optional) If Prometheus in your cluster is configured to require authentication, an
+   authentication header can be specified via a key within an already-existing [Secret][kubernetes-secret] in the namespace:
+
+        :::yaml
+        processor:
+          prometheus_auth:
+            secret:
+            key:
+
+1. (Optional) Update the frequency of the Kuantifier reporting job. This may be useful for debugging.
+
+        :::yaml
+        cronJob:
+          schedule: "@daily"
+
+### Installing Kuantifier
+
+After configuring an appropriate values file for your instance, install the chart via Helm:
+
+    :::console
+    helm install -f -n kuantifier oci://hub.opensciencegrid.org/iris-hep/kuantifier
+
+Validation
+----------
+
+After running `helm install`, ensure that the expected Kubernetes objects have been created. The following commands assume
+that Kuantifier has been installed in the `monitoring` namespace.
+
+1. Check that a CronJob was created for running the Kuantifier processor:
+
+        :::console
+        kubectl -n monitoring get cronjob kuantifier-cronjob
+
+1. Check that a ConfigMap was created to configure processor jobs, and that the values in the ConfigMap
+   align with the values set in `.processor.config` in the values file:
+
+        :::console
+        kubectl -n monitoring get configmap kuantifier-processor-config -o yaml
+
+
+If the Helm chart artifacts are present as expected, run a test instance of the CronJob and inspect its output.
+
+1. Create a new job from the CronJob, then find the Pod created by the job
+
+        :::console
+        kubectl -n monitoring create job --from=cronjob/kuantifier-cronjob kuantifier-test-job
+        kubectl -n monitoring get pod | grep kuantifier-test-job
+
+1. Inspect the logs from the processor initContainer of that Pod, which queries Prometheus to generate output records.
+
+        :::console
+        kubectl -n monitoring logs <pod-name> -c processor
+
+1. Inspect the logs from the gratia-output container, which sends the output records to GRACC.
+
+        :::console
+        kubectl -n monitoring logs <pod-name> -c gratia-output
+
+If both the processor initContainer and the gratia-output container run to completion without error, the next step
+is to confirm with a member of the OSG technology team that the results are visible in GRACC.
+
+
+[kuantifier-github]:
+[helm-values-readme]:
+[values-yaml]:
+[values-file]:
+[prometheus-community]:
+[kubectl]:
+[prometheus]:
+[kube-state-metrics]:
+[kubernetes-secret]:
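+
+As a rough illustration of the workload requirements listed under "Install Prometheus and kube-state-metrics in your
+Kubernetes cluster", a pod that Kuantifier can account for might be submitted as a Kubernetes Job like the sketch below.
+The Job name, namespace, image, and command are hypothetical placeholders and are not taken from the Kuantifier chart.
+Only three things reflect the requirements in this document: the pod is created by a Job, it has a single container,
+and that container sets `resources.requests.cpu`.
+
+    :::yaml
+    apiVersion: batch/v1
+    kind: Job
+    metadata:
+      name: example-workload            # hypothetical name
+      namespace: workload-namespace     # assumed to match NAMESPACE in .processor.config
+    spec:
+      template:
+        spec:
+          restartPolicy: Never          # Jobs require Never or OnFailure
+          containers:                   # exactly one container per workload pod
+            - name: worker              # hypothetical container name
+              image: busybox:1.36       # placeholder image
+              command: ["sh", "-c", "echo working; sleep 60"]
+              resources:
+                requests:
+                  cpu: "1"              # CPU request used as the processor count
+
+Running the workload as a Job rather than a Deployment means the pod reaches a completed state, which is the condition
+under which the pod completion time metric is expected to be recorded.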