
SMA: Metrics and Dashboards and Charts, Oh My!

This is the documentation - and executable code! - for "Metrics and Dashboards and Charts, oh my!" The easiest way to use this file is to execute it with demosh.

Things in Markdown comments are safe to ignore when reading this later. When executing this with demosh, things after the horizontal rule below (which is just before a commented @SHOW directive) will get displayed.

Make sure that the cluster exists and has Linkerd, Faces, Emissary, and Emojivoto installed.

BAT_STYLE="grid,numbers"
title "Lesson 3.4"

set -e

if [ $(kubectl get ns | grep -c linkerd) -eq 0 ]; then \
    echo "Linkerd is not installed in the cluster" >&2; \
    exit 1 ;\
fi

if [ $(kubectl get ns | grep -c faces) -eq 0 ]; then \
    echo "Faces is not installed in the cluster" >&2; \
    exit 1 ;\
fi

if [ $(kubectl get ns | grep -c emissary) -eq 0 ]; then \
    echo "Emissary is not installed in the cluster" >&2; \
    exit 1 ;\
fi

if [ $(kubectl get ns | grep -c emojivoto) -eq 0 ]; then \
    echo "Emojivoto is not installed in the cluster" >&2; \
    exit 1 ;\
fi
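
If you prefer, those four checks collapse into a single loop. This sketch should be equivalent (it checks each namespace by exact name rather than grepping):

for ns in linkerd faces emissary emojivoto; do \
    if ! kubectl get ns "$ns" >/dev/null 2>&1; then \
        echo "$ns is not installed in the cluster" >&2; \
        exit 1; \
    fi ;\
done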

Then it's off to the real work.

#@start_livecast

Metrics and Dashboards and Charts, Oh My!

We're starting out with our cluster already set up with

  • Linkerd
  • Grafana
  • Linkerd Viz (pointing to our Grafana)
  • Faces (using Emissary)
  • Emojivoto

(If you need to set up your cluster, RESET.sh can do it for you!)

All these things are meshed, and we have some Routes installed too:

kubectl get httproute.gateway.networking.k8s.io -A
kubectl get grpcroute.gateway.networking.k8s.io -A

Let's start by looking at the metrics stored in the control plane:

linkerd diagnostics controller-metrics | more

So that already looks kinda like a mess! Let's check out what the proxy stores for us. For this we need to specify which proxy, so we'll tell it to look at some Pod in the face Deployment in the faces namespace:

linkerd diagnostics proxy-metrics -n faces deploy/face | more

This seems like it just goes on forever! Let's try to make a bit more sense of this with promtool. We'll start with a brutal hack to get a port-forward running so that we can talk to Prometheus:

kubectl port-forward -n linkerd-viz svc/prometheus 9090:9090 &
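
Since that port-forward is running in the background, it's worth hanging onto its PID so we can clean it up when we're done. A minimal sketch -- the variable name is just for illustration:

# Remember the background port-forward's PID...
PROM_PF_PID=$!

# ...and later, once we're done talking to Prometheus:
# kill $PROM_PF_PID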

Now we can use promtool to check the metrics. We'll ask it to pull all the time series it can find with a namespace="faces" label as a representative sample:

promtool query series http://localhost:9090 \
         --match '{namespace="faces"}' \
    | more

That's still a massive amount, but suppose we pare down that output to just the names of the metrics?

promtool query series http://localhost:9090 \
         --match '{namespace="faces"}' \
    | sed -e 's/^.*__name__="\([^"][^"]*\)".*$/\1/' | sort -u | more
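
If you'd rather skip the sed gymnastics, the Prometheus HTTP API can hand back the metric names directly. Here's an alternative using the label-values endpoint over the same port-forward:

curl -G http://localhost:9090/api/v1/label/__name__/values \
     --data-urlencode 'match[]={namespace="faces"}' \
     | jq | more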

That's somewhat more manageable, and it actually gives us a place to stand for talking about the metrics in broad classes.

OK, that was a lot. So how can we actually use this stuff?

Remember: you're going to start by figuring out what information you need, then tailoring everything to that.

For this demo, we'll look at HTTP retries -- that's interesting and it's new to Linkerd 2.16, so that should be fun.

The most basic retry info is the outbound_http_route_retry_requests_total metric: that's a counter of total retries, and it has a bunch of labels that we can use to slice and dice the data. We only need four of them, though:

  • deployment and namespace identify the source of the request
  • parent_name and parent_namespace identify the destination
    • (in Gateway API for service mesh, the parent is always the Service to which the request is being sent)
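
To make that concrete, a single sample of this counter might look roughly like the following -- the value and any labels beyond the four above are purely illustrative:

outbound_http_route_retry_requests_total{
    deployment="emissary", namespace="emissary",
    parent_name="face", parent_namespace="faces", ...
} 42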

So let's start by trying to get a sense for how many retries are going from Emissary, by using curl to run raw queries against our Prometheus. Specifically we'll first do an 'instantaneous' query, which will return only values for a single moment, and we'll just filter to the emissary namespace and deployment. The query here is actually

outbound_http_route_retry_requests_total{
    namespace="emissary", deployment="emissary"
}

just all on one line.

curl -G http://localhost:9090/api/v1/query \
     --data-urlencode 'query=outbound_http_route_retry_requests_total{namespace="emissary", deployment="emissary"}' \
     | jq | bat -l json
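
The response comes back in Prometheus's standard instant-vector shape, roughly like this (heavily trimmed -- the real result carries many more labels, and the timestamp and value here are made up):

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "outbound_http_route_retry_requests_total",
          "deployment": "emissary",
          "namespace": "emissary",
          "parent_name": "face",
          "parent_namespace": "faces"
        },
        "value": [ 1727000000, "17" ]
      }
    ]
  }
}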

There are a lot of labels in there, and they're kind of getting in our way. We can use the sum function to get rid of most of them -- let's keep just parent_name and parent_namespace:

sum by (parent_name, parent_namespace) (
    outbound_http_route_retry_requests_total{
        namespace="emissary", deployment="emissary"
    }
)
curl -G http://localhost:9090/api/v1/query \
     --data-urlencode 'query=sum by (parent_name, parent_namespace) (outbound_http_route_retry_requests_total{namespace="emissary", deployment="emissary"})' \
     | jq | bat -l json

MUCH better. From this, we can see that the emissary deployment is retrying things only to face in the faces namespace, so let's add more labels to focus on that:

sum by (parent_name, parent_namespace) (
    outbound_http_route_retry_requests_total{
        namespace="emissary", deployment="emissary",
        parent_name="face", parent_namespace="faces"
    }
)
curl -G http://localhost:9090/api/v1/query \
     --data-urlencode 'query=sum by (parent_name, parent_namespace) (outbound_http_route_retry_requests_total{namespace="emissary", deployment="emissary",parent_name="face", parent_namespace="faces"})' \
     | jq | bat -l json

Now let's turn that into a rate, instead of an instantaneous count, using the rate function to get a rate calculated over a one-minute window:

sum by (parent_name, parent_namespace) (
    rate(
        outbound_http_route_retry_requests_total{
            namespace="emissary", deployment="emissary",
            parent_name="face", parent_namespace="faces"
        }[1m]
    )
)
curl -G http://localhost:9090/api/v1/query \
     --data-urlencode 'query=sum by (parent_name, parent_namespace) (rate(outbound_http_route_retry_requests_total{namespace="emissary", deployment="emissary",parent_name="face", parent_namespace="faces"}[1m]))' \
     | jq | bat -l json
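
As an aside, if all that JSON gets tiresome, promtool can evaluate the same expression and print a more compact result. This should be equivalent:

promtool query instant http://localhost:9090 \
    'sum by (parent_name, parent_namespace) (rate(outbound_http_route_retry_requests_total{namespace="emissary", deployment="emissary", parent_name="face", parent_namespace="faces"}[1m]))'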

Finally, we can ask for a time series of that rate by wrapping the whole query in a PromQL subquery range. In this case we use [5m:1m] to get five minutes of rates, spaced one minute apart:

sum by (parent_name, parent_namespace) (
    rate(
        outbound_http_route_retry_requests_total{
            namespace="emissary", deployment="emissary",
            parent_name="face", parent_namespace="faces"
        }[1m]
    )
)[5m:1m]
curl -G http://localhost:9090/api/v1/query \
     --data-urlencode 'query=sum by (parent_name, parent_namespace) (rate(outbound_http_route_retry_requests_total{namespace="emissary", deployment="emissary",parent_name="face", parent_namespace="faces"}[1m]))[5m:1m]' \
     | jq | bat -l json
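
For what it's worth, the more conventional way to ask for a time series like this is Prometheus's dedicated range-query endpoint, which takes explicit start, end, and step parameters instead of a subquery. A sketch that should be equivalent to the query above:

# Last five minutes of the retry rate, one point per minute.
end=$(date +%s)
start=$((end - 300))

curl -G http://localhost:9090/api/v1/query_range \
     --data-urlencode 'query=sum by (parent_name, parent_namespace) (rate(outbound_http_route_retry_requests_total{namespace="emissary", deployment="emissary",parent_name="face", parent_namespace="faces"}[1m]))' \
     --data-urlencode "start=${start}" \
     --data-urlencode "end=${end}" \
     --data-urlencode "step=60" \
     | jq | bat -l json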

This is the basis of anything we want to do. Let's finish this by flipping over to Grafana and building this into the dashboard... which we'll do basically the exact same way.
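
In the dashboard, that expression just becomes the target of a panel. As a very rough sketch -- not a complete dashboard, and the datasource UID is a placeholder you'd fill in from your own Grafana -- a time-series panel wired to this query looks something like:

{
  "type": "timeseries",
  "title": "Emissary retries to face",
  "datasource": { "type": "prometheus", "uid": "YOUR_PROMETHEUS_UID" },
  "targets": [
    {
      "expr": "sum by (parent_name, parent_namespace) (rate(outbound_http_route_retry_requests_total{namespace=\"emissary\", deployment=\"emissary\", parent_name=\"face\", parent_namespace=\"faces\"}[1m]))",
      "legendFormat": "{{parent_namespace}}/{{parent_name}}"
    }
  ]
}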

One More Thing!

Remember I said that there's nothing special about Viz -- it's just a Prometheus client? Just to prove that, in our directory here is a Python program called promq.py that displays a running breakdown of some of the gRPC metrics for Emojivoto, without doing any math itself -- it's all just Prometheus queries.

(promq.py also deliberately does everything the hard way instead of using the Python Prometheus client package.)

#@immed
set +e
python promq.py

We're not going to go over the code in detail, but it's worth looking quickly at the queries it's running.

bat promq.py

Summary

There's a lot of useful information in the metrics, and even though they look complex, they're actually pretty easy to work with. The key is to start with what you want to know, and then build up from there.

Finally, feedback is always welcome! You can reach me at [email protected] or as @flynn on the Linkerd Slack (https://slack.linkerd.io).