The purpose of this service is to aggregate the healthchecks from services and pods in the Kubernetes cluster.
This section describes the functionality of aggregate-healthcheck.
A service is considered healthy if all of its pods are healthy. To determine which pods are healthy, the aggregate-healthcheck service checks each pod's __health endpoint.
Note that services are grouped into categories, so it is possible to query aggregate-healthcheck for only a certain list of categories. If no category is provided, the health status of all services will be displayed.
The health of each pod is evaluated by querying the __health endpoint of the app inside the pod. If at least one check fails, the pod's health is considered warning or critical, based on the severity level of the failing check.
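The rule above can be sketched in a few lines of Go. This is only an illustration: the `State`, `Check`, `podState` and `serviceHealthy` names are hypothetical, and taking the worst severity when several checks fail is an assumption, since the text does not say how multiple failures are combined.

```go
package main

import "fmt"

// State is a hypothetical three-level health state used for this sketch.
type State int

const (
	Healthy State = iota
	Warning
	Critical
)

func (s State) String() string {
	return [...]string{"healthy", "warning", "critical"}[s]
}

// Check is one entry reported by an app's __health endpoint.
// The field names are assumptions made for illustration.
type Check struct {
	Name     string
	OK       bool
	Severity State // state the pod takes on if this check fails
}

// podState returns the worst severity among the failing checks,
// or Healthy when every check passes (worst-wins is an assumption).
func podState(checks []Check) State {
	worst := Healthy
	for _, c := range checks {
		if !c.OK && c.Severity > worst {
			worst = c.Severity
		}
	}
	return worst
}

// serviceHealthy reports whether every pod of a service is healthy.
func serviceHealthy(pods [][]Check) bool {
	for _, checks := range pods {
		if podState(checks) != Healthy {
			return false
		}
	}
	return true
}

func main() {
	pod1 := []Check{{"db", true, Critical}, {"queue", false, Warning}}
	pod2 := []Check{{"db", true, Critical}}
	fmt.Println(podState(pod1), podState(pod2), serviceHealthy([][]Check{pod1, pod2}))
}
```

Here the first pod has one failing warning-level check, so the service as a whole is reported as not healthy even though the second pod passes every check.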
When a service is unhealthy, the warning can be acknowledged. Once all unhealthy services have been acknowledged, the general status of aggregate-healthcheck becomes healthy (it will also mention that there are 'n' services acknowledged).
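A minimal sketch of how acknowledgements could feed into the general status is shown below. The `ServiceStatus` type and the `overall` function are hypothetical, and counting every service that carries an acknowledgement message is an assumption about what 'n' means.

```go
package main

import "fmt"

// ServiceStatus is a hypothetical, simplified view of one service's result.
type ServiceStatus struct {
	Name    string
	Healthy bool
	Ack     string // acknowledgement message; empty when not acknowledged
}

// overall reports whether the aggregate status is healthy and how many
// services are acknowledged. An unhealthy service only counts against the
// aggregate when it has no acknowledgement.
func overall(services []ServiceStatus) (healthy bool, acked int) {
	healthy = true
	for _, s := range services {
		switch {
		case s.Ack != "":
			acked++
		case !s.Healthy:
			healthy = false
		}
	}
	return healthy, acked
}

func main() {
	services := []ServiceStatus{
		{"api-policy-component", false, "known issue, being fixed"},
		{"another-service", true, ""},
	}
	healthy, acked := overall(services)
	fmt.Printf("healthy=%v, %d services acknowledged\n", healthy, acked)
}
```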
Categories can be sticky: if one of the services in a sticky category becomes unhealthy, the category is disabled, and it is reported as unhealthy until it is manually re-enabled. There is an endpoint for enabling a category.
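The sticky behaviour amounts to a small state machine, sketched below under stated assumptions. The `Category` type and its `update` and `enable` methods are hypothetical names chosen for this example, not the service's actual code.

```go
package main

import "fmt"

// Category is a hypothetical, simplified model of a healthcheck category.
type Category struct {
	Name     string
	IsSticky bool
	Enabled  bool
}

// update applies one health evaluation to the category and returns whether
// the category is considered healthy afterwards.
func (c *Category) update(servicesHealthy bool) bool {
	if c.IsSticky && !servicesHealthy {
		c.Enabled = false // stays disabled until enable is called
	}
	return c.Enabled && servicesHealthy
}

// enable models the manual re-enabling step.
func (c *Category) enable() { c.Enabled = true }

func main() {
	read := Category{Name: "read", IsSticky: true, Enabled: true}
	fmt.Println(read.update(false)) // a service failed: category is disabled
	fmt.Println(read.update(true))  // services recovered, category still disabled
	read.enable()
	fmt.Println(read.update(true)) // healthy again after manual re-enabling
}
```

A non-sticky category never flips `Enabled`, so its result simply follows the health of its services.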
To run the service locally, you will need to run the following commands first to get the vendored dependencies for this project:
go build
Only a limited set of functionality can be used locally, because the service queries the apps inside the pods and there is currently no way to access them from outside the cluster without port-forwarding. The functionality that can be used outside the cluster is:
- Add/Remove acknowledge
- Enable/Disable sticky categories
To build Docker images for this service, use the following repo: coco/upp-aggregate-healthcheck
For a service to be taken into consideration by aggregate-healthcheck it needs to have the following:
- The Kubernetes service should have the hasHealthcheck: "true" label.
- The container should have a Kubernetes readinessProbe configured to check the __gtg endpoint of the app.
- The app should have __gtg and __health endpoints.
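To illustrate the last requirement, here is a minimal sketch of an app that exposes both endpoints. The `newAppMux` and `statusOf` helpers are hypothetical, and the JSON body returned by __health is a simplified placeholder, since the expected response schema is not described here. The example exercises the handlers in-process with `net/http/httptest` so that it runs without binding a port.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newAppMux returns a mux exposing __gtg and __health.
// ready is a hypothetical callback reporting whether the app can serve traffic.
func newAppMux(ready func() bool) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/__gtg", func(w http.ResponseWriter, r *http.Request) {
		if !ready() {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, "OK")
	})
	mux.HandleFunc("/__health", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		// Placeholder body: the real healthcheck schema is not shown in this document.
		json.NewEncoder(w).Encode(map[string]bool{"ok": ready()})
	})
	return mux
}

// statusOf performs an in-process GET and returns the response status code.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newAppMux(func() bool { return true })
	fmt.Println(statusOf(mux, "/__gtg"), statusOf(mux, "/__health"))
}
```

In a real app the mux would be passed to `http.ListenAndServe`, and the readinessProbe would point at the /__gtg path.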
Categories are stored in Kubernetes ConfigMaps. The template of a ConfigMap for a category is shown below:
kind: ConfigMap
apiVersion: v1
metadata:
  name: category.CATEGORY-NAME # name of the category
  labels:
    healthcheck-categories-for: aggregate-healthcheck # this label is used by the aggregate-healthcheck service to pick up only ConfigMaps that store categories
data:
  category.name: CATEGORY-NAME # name of the category
  category.services: serviceName1, serviceName2, serviceName3 # services that belong to this category
  category.refreshrate: "60" # refresh rate in seconds for the cache (by default it is 60)
  category.issticky: "false" # boolean flag that marks the category as sticky. By default, this flag is set to false.
  category.enabled: "true" # boolean flag that marks the category as enabled. By default, this flag is set to true.
In the following section, aggregate-healthcheck endpoints are described. Note that this app has two options for retrieving healthchecks:
- JSON format - to get the results in JSON format, provide the "Accept: application/json" header
- HTML format - this is the default format for displaying healthchecks.
Note that there is a configurable pathPrefix which will be the prefix of each endpoint's path (e.g. if the prefix is __health, the endpoint path for add-ack is __health/add-ack). The default value for pathPrefix is the empty string. In the provided examples, it is assumed that the pathPrefix is __health.
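The prefix and format rules above can be sketched as two small helpers. Both `endpointPath` and `wantsJSON` are hypothetical names, and the Accept-header matching shown here (any listed media type equal to application/json) is an assumption about how the header is interpreted.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// endpointPath joins the configurable prefix and an endpoint name into an
// absolute request path. An empty prefix is simply skipped.
func endpointPath(pathPrefix, endpoint string) string {
	return path.Join("/", pathPrefix, endpoint)
}

// wantsJSON reports whether the Accept header asks for JSON.
// Anything else falls back to the default HTML format.
func wantsJSON(accept string) bool {
	for _, part := range strings.Split(accept, ",") {
		// Drop parameters such as ";q=0.9" before comparing.
		mediaType := strings.TrimSpace(strings.SplitN(part, ";", 2)[0])
		if strings.EqualFold(mediaType, "application/json") {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(endpointPath("__health", "add-ack"))
	fmt.Println(endpointPath("", "add-ack"))
	fmt.Println(wantsJSON("application/json"), wantsJSON("text/html"))
}
```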
- __gtg - the GoodToGo endpoint
  - params:
    - categories - the healthcheck will be performed on the services belonging to the provided categories.
    - cache - if set to false, the healthchecks will be performed without the help of cache. By default, the cache is used.
  - returns a 503 Service Unavailable status code in the following cases:
    - if at least one of the provided categories is disabled (see sticky functionality)
    - if at least one of the checked services is unhealthy
  - returns a 200 OK status code otherwise
  - example: localhost:8080/__gtg?cache=false&categories=read,publish
- <pathPrefix>/__health or simply <pathPrefix> - Perform services healthcheck.
  - params:
    - categories - the healthcheck will be performed on the services belonging to the provided categories.
    - cache - if set to false, the healthchecks will be performed without the help of cache. By default, the cache is used.
  - example: localhost:8080/__health?cache=false&categories=read,publish
- <pathPrefix>/__pods-health - Perform pods healthcheck for a service.
  - params:
    - service-name - The healthcheck will be performed only for pods belonging to the provided service.
  - example: localhost:8080/__health/__pods-health?service-name=api-policy-component
- <pathPrefix>/__pod-individual-health - Retrieves the healthchecks of the app running inside the pod.
  - params:
    - pod-name - The name of the pod for which the healthchecks will be retrieved.
  - example: localhost:8080/__health/__pod-individual-health?pod-name=api-policy-component2912-12341
- <pathPrefix>/add-ack - (POST) Acknowledges a service
  - params:
    - service-name - The service to be acknowledged.
  - example: localhost:8080/__health/add-ack?service-name=api-policy-component (request body: ack-msg=this is the message for ack)
  - request body:
    - ack-msg - the acknowledge message.
- <pathPrefix>/rem-ack - Removes the acknowledge of a service
  - params:
    - service-name - The service to be updated.
  - example: localhost:8080/__health/rem-ack?service-name=api-policy-component
- <pathPrefix>/enable-category - Enables a category. This is used for sticky categories which are unhealthy.
  - params:
    - category-name - The category to be enabled.
  - example: localhost:8080/__health/enable-category?category-name=read
- <pathPrefix>/disable-category - Disables a category. This is useful when doing a failover.
  - params:
    - category-name - The category to be disabled.
  - example: localhost:8080/__health/disable-category?category-name=read
- __health
- __gtg
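The status-code rules described for __gtg can be sketched as a handler over in-memory data. Everything here is illustrative: the `categoryStatus` type, `gtgHandler` and `gtgStatus` are hypothetical, unknown category names are silently ignored by assumption, and the cache parameter is left out because the sketch has no cache to bypass.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// categoryStatus is a hypothetical summary of one category.
type categoryStatus struct {
	Enabled bool // false when the category has been disabled
	Healthy bool // true when every service in the category is healthy
}

// gtgHandler answers 503 if any selected category is disabled or has an
// unhealthy service, and 200 otherwise. With no categories parameter,
// all categories are checked.
func gtgHandler(categories map[string]categoryStatus) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var selected []string
		if raw := r.URL.Query().Get("categories"); raw != "" {
			selected = strings.Split(raw, ",")
		} else {
			for name := range categories {
				selected = append(selected, name)
			}
		}
		for _, name := range selected {
			c, known := categories[strings.TrimSpace(name)]
			if known && (!c.Enabled || !c.Healthy) {
				http.Error(w, "category "+name+" is not good to go", http.StatusServiceUnavailable)
				return
			}
		}
		fmt.Fprintln(w, "OK")
	}
}

// gtgStatus issues an in-process GET and returns the status code.
func gtgStatus(h http.Handler, target string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code
}

func main() {
	h := gtgHandler(map[string]categoryStatus{
		"read":    {Enabled: true, Healthy: true},
		"publish": {Enabled: false, Healthy: true}, // disabled sticky category
	})
	fmt.Println(gtgStatus(h, "/__gtg?categories=read"))
	fmt.Println(gtgStatus(h, "/__gtg?categories=read,publish"))
	fmt.Println(gtgStatus(h, "/__gtg"))
}
```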
main.go call sequence
- controller.go: initializeController
  - calls service.go: initializeHealthCheckService
- prometheusFeeder.go: newPrometheusFeeder & feed
- listen starts HTTP server
  - path / -> handler.go:handleServicesHealthCheck
  - path /__pods-health -> handler.go:handlePodsHealthCheck
  - path /__pod-individual-health -> handler.go:handleIndividualPodHealthCheck
httpHandler methods
- handler.go
  - handleServicesHealthCheck
    - controller.go: buildServicesHealthResult
    - sort, format and return healthcheck results buildServicesCheckHTMLResponse (format result of buildServicesHealthResult for output)
healthCheckController methods
- controller.go
  - buildServicesHealthResult
    - getMatchingCategories (filter categories)
    - if useCache -> cachingController.go: collectChecksFromCachesFor
    - if NOT useCache -> runServiceChecksFor
      - get services list from k8sHealthcheckService.getServicesMapByNames
      - runServiceChecksByServiceNames
        - k8sHealthcheckService.getDeployments
        - checks health for all services using go-fthealth.RunCheck
        - loops through the result in parallel and calculates the severity of each service using severityController.go: getSeverityForService
        - loops through the services list and updates acks using updateHealthCheckWithAckMsg
        - cachingController.go: updateCachedHealth
          - schedules recurring checks for services not already in measuredServices
          - in practice this means scheduling checks only on startup
    - disableStickyFailingCategories
- cachingController.go
- severityController.go
  - getSeverityForService
    - k8sHealthcheckService.getPodsForService
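The step that loops through the results in parallel and calculates a severity per service can be sketched with goroutines and a `sync.WaitGroup`. This is not the controller's actual code: `result`, `severityFor` and `fillSeverities` are hypothetical, and `severityFor` only stands in for getSeverityForService by taking the highest value from a map of per-pod severities.

```go
package main

import (
	"fmt"
	"sync"
)

// result is a hypothetical, simplified check result for one service.
type result struct {
	Service  string
	Severity int
}

// severityFor is a stand-in for the real severity lookup: it returns the
// highest severity found among the service's pods.
func severityFor(service string, podSeverities map[string][]int) int {
	worst := 0
	for _, s := range podSeverities[service] {
		if s > worst {
			worst = s
		}
	}
	return worst
}

// fillSeverities computes the severity of every result concurrently.
// Each goroutine writes to its own slice element and only reads the map,
// so no extra locking is needed.
func fillSeverities(results []result, podSeverities map[string][]int) {
	var wg sync.WaitGroup
	for i := range results {
		wg.Add(1)
		go func(r *result) {
			defer wg.Done()
			r.Severity = severityFor(r.Service, podSeverities)
		}(&results[i])
	}
	wg.Wait()
}

func main() {
	results := []result{{Service: "api-policy-component"}, {Service: "another-service"}}
	fillSeverities(results, map[string][]int{
		"api-policy-component": {0, 2},
		"another-service":      {1},
	})
	fmt.Println(results)
}
```

Passing a pointer to each slice element keeps the goroutines independent of the loop variable, which matters on Go versions before 1.22.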
k8sHealthcheckService methods
- service.go
  - initializeHealthCheckService
    - starts as a goroutine watchAcks (load and update the service acks)
    - starts as a goroutine watchServices (load and update the service list)
  - watchServices
    - using the k8s API gets all services matching kubectl get services -l hasHealthcheck=true
    - prepares them as service structures and saves into the k8sHealthcheckService.services.m map
    - after all services are processed it logs "Services watching terminated. Reconnecting..." and invokes itself again
  - watchAcks
    - using the k8s API gets all (should be only one currently) configmaps matching kubectl get configmaps -l healthcheck-acknowledgements-for=aggregate-healthcheck
    - updates the service.ack key of the k8sHealthcheckService.services.m map
    - after all acks are processed it logs "Acks configMap watching terminated. Reconnecting..." and invokes itself again
  - getCategories
    - using the k8s API gets all configmaps matching kubectl get configmaps -l healthcheck-categories-for=aggregate-healthcheck
  - getDeployments
    - using the k8s API gets all deployment and statefulset names along with their desired replica count
  - getPodsForService
    - using the k8s API gets all pods matching kubectl get pods -l app=%s
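The watch-and-reconnect pattern described for watchServices can be sketched without a cluster by replacing the Kubernetes watch with a plain channel. All names here (`event`, `serviceMap`, `watch`, the `connect` callback) are hypothetical, and the `Deleted` flag is an assumption about how removals might be reported. The text says the real method invokes itself again, whereas this sketch uses a bounded loop so that the example terminates.

```go
package main

import (
	"fmt"
	"sync"
)

// event is a hypothetical stand-in for a Kubernetes watch event.
type event struct {
	Name    string
	Deleted bool
}

// serviceMap mirrors the idea of a mutex-guarded services map.
type serviceMap struct {
	sync.RWMutex
	m map[string]bool
}

// watch consumes events until the channel returned by connect is closed,
// logs the termination, and then connects again. maxConnections bounds the
// number of connections for this demo; a long-running service would not stop.
func watch(services *serviceMap, connect func() <-chan event, maxConnections int) {
	for i := 0; i < maxConnections; i++ {
		for ev := range connect() {
			services.Lock()
			if ev.Deleted {
				delete(services.m, ev.Name)
			} else {
				services.m[ev.Name] = true
			}
			services.Unlock()
		}
		fmt.Println("Services watching terminated. Reconnecting...")
	}
}

func main() {
	batches := [][]event{
		{{Name: "api-policy-component"}, {Name: "another-service"}},
		{{Name: "another-service", Deleted: true}},
	}
	next := 0
	// connect replays one pre-recorded batch per connection, then closes the channel.
	connect := func() <-chan event {
		ch := make(chan event, len(batches[next]))
		for _, ev := range batches[next] {
			ch <- ev
		}
		close(ch)
		next++
		return ch
	}
	services := &serviceMap{m: map[string]bool{}}
	watch(services, connect, len(batches))
	fmt.Println(len(services.m), "service(s) tracked")
}
```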