title: Monitor Kubernetes Workloads with Kuantifier
DateReviewed: 2024-08-16

Monitoring Kubernetes Workloads with Kuantifier
===============================================

Workload jobs run via Kubernetes will not integrate with [Gratia accounting](./troubleshooting-gratia/) by default.
To report contributions to OSG made via Kubernetes, the [Kuantifier][kuantifier-github] helm chart can be installed
into your cluster.

Before Starting
---------------

### Confirm access to a running Kubernetes cluster

All subsequent instructions assume you have access to a running Kubernetes cluster, and can run [kubectl][kubectl]
against that cluster.
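
A quick way to confirm both, assuming `kubectl` is already configured with a context for the cluster, is to ask
the cluster for its control plane endpoints and node list:

:::console
kubectl cluster-info
kubectl get nodes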

### Install the Helm command line tools

Kuantifier itself, and several of its prerequisites, are installed via [helm chart](https://helm.sh/). The helm
command line tools are used to install helm charts against a running Kubernetes cluster, and can be installed
as follows:

1. Download the latest [helm release][helm-release]
1. Unpack the release archive (e.g. `tar -zxvf helm-v3.0.0-linux-amd64.tar.gz`)
1. Move the `helm` binary from the archive into a location on your `$PATH` (e.g. `mv linux-amd64/helm ~/.local/bin`)
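
To confirm that the command line tools are ready to use, check the reported client version:

:::console
helm version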

### Install Prometheus and kube-state-metrics in your Kubernetes cluster

Kuantifier relies on [Prometheus][prometheus] with [kube-state-metrics][kube-state-metrics] to account for pod resource usage.
There are a number of ways to install both, such as via the [prometheus community helm charts][prometheus-community]:

1. Add the prometheus community helm repository to your local helm installation

:::console
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

1. Install the kube-state-metrics and prometheus helm charts into your Kubernetes cluster.

:::console
helm install kube-state-metrics prometheus-community/kube-state-metrics
helm install prometheus prometheus-community/prometheus

!!! note
These commands will install into the `default` namespace by default. To install into a
non-default namespace such as `monitoring`, use the `-n monitoring` flag with `helm install`.

1. Ensure that the namespace where your workload pods run is properly configured.

- Kuantifier relies on the `spec.containers[].resources.requests.cpu` field in workload pods
to determine processor count for GRACC reporting. Ensure a CPU request is set for pods in
your namespace (see the example Job after this list).

- Kuantifier relies on the Prometheus pod completion time metric to calculate workload job run times.
This metric is sometimes missed for pods that are spontaneously deleted, such as those created by
Deployments. For best results, run workload pods via Kubernetes Jobs.

- (Known issue) Kuantifier currently doesn't support calculating usage metrics for workload pods
running multiple containers. Ensure that workload pods in your namespace have only one container.
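
As an example of the CPU request requirement above, a minimal workload Job that satisfies it might look like the
following sketch (the Job name, image, and namespace are illustrative and should be replaced with your own):

:::yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-workload          # illustrative name
  namespace: workload-namespace   # the namespace Kuantifier reports on
spec:
  template:
    spec:
      restartPolicy: Never
      # A single container, per the known issue above
      containers:
        - name: payload
          image: busybox:1.36
          command: ["sh", "-c", "echo running workload"]
          resources:
            requests:
              cpu: "1"            # required for processor-count reporting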

Installation
------------

Kuantifier itself is also installed via a helm chart, hosted at `hub.opensciencegrid.org/iris-hep/kuantifier`.
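
Before configuring anything, you can optionally confirm that the chart registry is reachable from your environment
by querying the chart metadata (this assumes a helm version with OCI support, 3.8 or later):

:::console
helm show chart oci://hub.opensciencegrid.org/iris-hep/kuantifier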


### Configuring Kuantifier's Values File

Several instance-specific modifications to the default [Values File][values-file] provided with the chart
must be made prior to installation. For full documentation of the values in the values file, see the
[helm chart README on GitHub][helm-values-readme].

1. Fetch the default values.yaml for kuantifier. This file can be obtained in several ways.
- Via the helm cli:

:::console
helm show values oci://hub.opensciencegrid.org/iris-hep/kuantifier

- Via the [Kuantifier GitHub repository][values-yaml].

!!! note
Ensure that the release tag for values.yaml in the git repo corresponds to the version of the chart you're installing.


1. Update the top-level `.outputFormat` in values.yaml to output records to [GRACC](https://gracc.opensciencegrid.org/)

:::yaml
outputFormat: "gratia"

1. Update the `.processor.config` map with the details of your deployment.
- All of the following need to be set:
- `NAMESPACE`: The namespace of the pods for which Kuantifier will collect and report metrics.

!!! note
Each installation of Kuantifier reports only on pods in a single namespace. You must
install multiple instances of the chart to report on multiple namespaces.

- `SITE_NAME`: The name of the site being reported.
- `SUBMIT_HOST`: Uniquely identifying name for the Kubernetes cluster where your workload pods run, in FQDN format.
- `VO_NAME`: Virtual Organization (VO) of the jobs being reported.

- Additionally, the following may need to be set:
- `PROMETHEUS_SERVER`: The DNS name of the Prometheus server installed in your Kubernetes cluster.
- If Prometheus was installed in your cluster via the prometheus community helm chart in the `monitoring`
namespace, the DNS name will be `prometheus-server.monitoring.svc.cluster.local`.
- Otherwise, [construct](https://kubernetes.io/docs/concepts/services-networking/service/#dns) the DNS name from the service name and namespace of your Prometheus installation (i.e. `<service-name>.<namespace>.svc.cluster.local`).

- A fully configured `.processor.config` might look like:

:::yaml
processor:
  config:
    NAMESPACE: workload-namespace
    SITE_NAME: CHTC
    VO_NAME: University of Wisconsin
    SUBMIT_HOST: tiger-cluster.chtc.wisc.edu
    PROMETHEUS_SERVER: prometheus-server.monitoring.svc.cluster.local

1. (Optional) If Prometheus in your cluster is configured to require authentication, an
authentication header can be specified via a key within an already-existing [Secret][kubernetes-secret] in the installation namespace (see the sketch after this list for creating such a Secret):

:::yaml
processor:
  prometheus_auth:
    secret: <secret name>
    key: <key in secret containing auth header>

1. (Optional) Update the schedule of the Kuantifier reporting CronJob. A more frequent schedule may be useful for debugging.

:::yaml
cronJob:
  schedule: "@daily"
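
For the optional Prometheus authentication step above, the referenced Secret must already exist in the installation
namespace. As a hypothetical sketch (the Secret name, key, and token are placeholders to substitute into
`.processor.prometheus_auth`):

:::console
kubectl -n monitoring create secret generic prometheus-auth \
    --from-literal=auth-header='Bearer <token>'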

### Installing Kuantifier

After configuring an appropriate values file for your instance, install the chart via helm:

:::console
helm install -f <values.yaml> -n <install namespace> kuantifier oci://hub.opensciencegrid.org/iris-hep/kuantifier
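
Once the command completes, helm should list the new release in the target namespace (shown here for `monitoring`):

:::console
helm -n monitoring list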

Validation
----------

After running `helm install`, ensure that the expected Kubernetes objects have been created. The following commands assume
that Kuantifier has been installed in the `monitoring` namespace.

1. Check that a CronJob was created for running the kuantifier processor:

:::console
kubectl -n monitoring get cronjob kuantifier-cronjob

1. Check that a ConfigMap was created to configure processor jobs, and that the values in the ConfigMap
align with the values set in `.processor.config` in the values file:

:::console
kubectl -n monitoring get configmap kuantifier-processor-config -o yaml


If the helm chart artifacts are present as expected, run a test instance of the CronJob and inspect its output.

1. Create a new Job from the CronJob, then find the Pod created by the Job:

:::console
kubectl -n monitoring create job --from=cronjob/kuantifier-cronjob kuantifier-test-job
kubectl -n monitoring get pod | grep kuantifier-test-job

1. Inspect the logs from the processor initContainer, which queries Prometheus to generate output records.

:::console
kubectl -n monitoring logs <test-job-pod-name> -c processor

1. Inspect the logs from the gratia-output container, which sends the output records to GRACC.

:::console
kubectl -n monitoring logs <test-job-pod-name> -c gratia-output

If both the processor initContainer and the gratia-output container run to completion without error, the next step
is to confirm with a member of the OSG technology team that the results are visible in GRACC.
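
Once validation is complete, the test Job (and the Pod it created) can be removed:

:::console
kubectl -n monitoring delete job kuantifier-test-job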


[kuantifier-github]: <https://github.com/rptaylor/kapel/>
[helm-release]: <https://github.com/helm/helm/releases>
[helm-values-readme]: <https://github.com/rptaylor/kapel/blob/master/chart/README.md>
[values-yaml]: <https://github.com/rptaylor/kapel/blob/master/chart/values.yaml>
[values-file]: <https://helm.sh/docs/chart_template_guide/values_files/>
[prometheus-community]: <https://github.com/prometheus-community/helm-charts/tree/main>
[kubectl]: <https://kubernetes.io/docs/reference/kubectl/>
[prometheus]: <https://prometheus.io/>
[kube-state-metrics]: <https://github.com/kubernetes/kube-state-metrics>
[kubernetes-secret]: <https://kubernetes.io/docs/concepts/configuration/secret/>