upp-aggregate-healthcheck

The purpose of this service is to aggregate the healthchecks from services and pods in the Kubernetes cluster.

Introduction

This section describes the functionality of aggregate-healthcheck.

Get services health

A service is considered healthy if all of its pods are healthy. To determine which pods are healthy, the Aggregate Healthcheck service checks each pod's __health endpoint.

Note that services are grouped into categories, so it is possible to query aggregate-healthcheck for a specific list of categories only. If no category is provided, the health status of all services is displayed.
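The two rules above can be sketched as follows. This is a minimal illustration, not the code of this service: the function names, the service names and the choice to treat a service without pods as unhealthy are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// serviceHealthy reports whether every pod of a service is healthy.
// Treating a service with no pods as unhealthy is an assumption of this sketch.
func serviceHealthy(podHealthy []bool) bool {
	if len(podHealthy) == 0 {
		return false
	}
	for _, ok := range podHealthy {
		if !ok {
			return false
		}
	}
	return true
}

// servicesFor returns the sorted, de-duplicated services of the requested
// categories. With no categories requested, every category is used.
func servicesFor(categories map[string][]string, requested []string) []string {
	if len(requested) == 0 {
		for name := range categories {
			requested = append(requested, name)
		}
	}
	seen := map[string]bool{}
	out := []string{}
	for _, c := range requested {
		for _, s := range categories[c] {
			if !seen[s] {
				seen[s] = true
				out = append(out, s)
			}
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical categories and service names.
	categories := map[string][]string{
		"read":    {"service-a", "service-b"},
		"publish": {"service-b", "service-c"},
	}
	fmt.Println(servicesFor(categories, []string{"read"}))
	fmt.Println(servicesFor(categories, nil))
	fmt.Println(serviceHealthy([]bool{true, false}))
}
```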

Get pods health for a service

The health of a pod is evaluated by querying the __health endpoint of the app inside the pod. If at least one check fails, the pod's health is considered warning or critical, depending on the severity level of the failing check.
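A minimal sketch of this rule is shown below. The check type and the podStatus function are hypothetical, and the mapping of severity 1 to critical and every other severity to warning is an assumption of this example rather than something stated in this README.

```go
package main

import "fmt"

// check is a hypothetical, simplified view of one entry returned by __health.
type check struct {
	Name     string
	OK       bool
	Severity int
}

// podStatus returns "healthy" when all checks pass. Otherwise it returns
// "critical" if any failing check has severity 1 (assumed mapping) and
// "warning" for failing checks of any other severity.
func podStatus(checks []check) string {
	status := "healthy"
	for _, c := range checks {
		if c.OK {
			continue
		}
		if c.Severity == 1 {
			return "critical"
		}
		status = "warning"
	}
	return status
}

func main() {
	checks := []check{
		{Name: "database reachable", OK: true, Severity: 1},
		{Name: "queue lag", OK: false, Severity: 2},
	}
	fmt.Println(podStatus(checks))
}
```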

Acknowledge a service

When a service is unhealthy, the warning can be acknowledged. Once all unhealthy services are acknowledged, the general status of aggregate-healthcheck becomes healthy (it also mentions that there are 'n' services acknowledged).
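The effect of acknowledgements on the general status can be sketched like this. The service type and the overall function are hypothetical names chosen for the illustration.

```go
package main

import "fmt"

// service is a hypothetical, simplified view of a checked service.
type service struct {
	Name    string
	Healthy bool
	Ack     string // acknowledgement message, empty when not acknowledged
}

// overall reports the general status and the number of acknowledged services.
// An unhealthy service only breaks the general status if it is not acknowledged.
func overall(services []service) (healthy bool, acked int) {
	healthy = true
	for _, s := range services {
		if s.Healthy {
			continue
		}
		if s.Ack != "" {
			acked++
			continue
		}
		healthy = false
	}
	return healthy, acked
}

func main() {
	services := []service{
		{Name: "service-a", Healthy: true},
		{Name: "service-b", Healthy: false, Ack: "known issue, being fixed"},
	}
	healthy, acked := overall(services)
	fmt.Printf("healthy=%v, %d services acknowledged\n", healthy, acked)
}
```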

Sticky categories

Categories can be sticky: if one of the services in a sticky category becomes unhealthy, the category is disabled, meaning that it is reported as unhealthy until it is manually re-enabled. There is an endpoint for enabling a category.
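The sticky behaviour amounts to a small state machine, sketched below. The category type and its methods are hypothetical and only illustrate the rule described above.

```go
package main

import "fmt"

// category is a hypothetical, simplified category with sticky support.
type category struct {
	Name    string
	Sticky  bool
	Enabled bool
}

// observe records one health observation. A sticky category is disabled
// as soon as its services are seen unhealthy.
func (c *category) observe(allServicesHealthy bool) {
	if c.Sticky && !allServicesHealthy {
		c.Enabled = false
	}
}

// healthy reports the category health: a disabled category stays unhealthy
// even after its services have recovered.
func (c *category) healthy(allServicesHealthy bool) bool {
	return c.Enabled && allServicesHealthy
}

func main() {
	c := category{Name: "read", Sticky: true, Enabled: true}
	c.observe(false)             // a service fails, the category is disabled
	fmt.Println(c.healthy(true)) // services recovered, category still unhealthy
	c.Enabled = true             // manual re-enable
	fmt.Println(c.healthy(true))
}
```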

Running locally

To run the service locally, first run the following command to get the vendored dependencies for this project:

go build

Only a limited amount of functionality can be used locally, because the service queries all the apps inside the pods and there is currently no way to access them from outside the cluster without port-forwarding. The functionality that can be used outside of the cluster is:

Add/Remove acknowledge
Enable/Disable sticky categories

Build and deployment

To build Docker images for this service, use the following repo: coco/upp-aggregate-healthcheck

How to configure services for aggregate-healthcheck

For a service to be taken into consideration by aggregate-healthcheck it needs to have the following:

The Kubernetes service should have the hasHealthcheck: "true" label.
The container should have a Kubernetes readinessProbe configured to check the __gtg endpoint of the app.
The app should have __gtg and __health endpoints.

How to configure categories for aggregate-healthcheck

Categories are stored in Kubernetes ConfigMaps. The template of a ConfigMap for a category is shown below:

  kind: ConfigMap
  apiVersion: v1
  metadata:
    name: category.CATEGORY-NAME # name of the category
    labels:
      healthcheck-categories-for: aggregate-healthcheck # this label is used by the aggregate-healthcheck service to pick up only ConfigMaps that store categories
  data:
    category.name: CATEGORY-NAME # name of the category
    category.services: serviceName1, serviceName2, serviceName3 # services that belong to this category
    category.refreshrate: "60" # refresh rate in seconds for cache (by default it is 60)
    category.issticky: "false" # boolean flag that marks the category as sticky. By default, this flag is set to false.
    category.enabled: "true" # boolean flag that marks the category as enabled; set it to "false" to disable the category. By default, this flag is set to true.
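Reading such a ConfigMap comes down to parsing a map of strings. The sketch below shows one way to do it with the defaults listed in the template; the category type and parseCategory function are hypothetical and are not taken from this repository.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// category is a hypothetical parsed form of the ConfigMap data section.
type category struct {
	Name        string
	Services    []string
	RefreshRate time.Duration
	IsSticky    bool
	Enabled     bool
}

// parseCategory applies the defaults from the template (60 seconds refresh,
// not sticky, enabled) and overrides them with the values that are present.
func parseCategory(data map[string]string) (category, error) {
	c := category{Name: data["category.name"], RefreshRate: 60 * time.Second, Enabled: true}
	for _, s := range strings.Split(data["category.services"], ",") {
		if s = strings.TrimSpace(s); s != "" {
			c.Services = append(c.Services, s)
		}
	}
	if v, ok := data["category.refreshrate"]; ok {
		secs, err := strconv.Atoi(v)
		if err != nil {
			return c, fmt.Errorf("category.refreshrate: %w", err)
		}
		c.RefreshRate = time.Duration(secs) * time.Second
	}
	var err error
	if v, ok := data["category.issticky"]; ok {
		if c.IsSticky, err = strconv.ParseBool(v); err != nil {
			return c, fmt.Errorf("category.issticky: %w", err)
		}
	}
	if v, ok := data["category.enabled"]; ok {
		if c.Enabled, err = strconv.ParseBool(v); err != nil {
			return c, fmt.Errorf("category.enabled: %w", err)
		}
	}
	return c, nil
}

func main() {
	c, err := parseCategory(map[string]string{
		"category.name":     "read",
		"category.services": "serviceName1, serviceName2, serviceName3",
		"category.issticky": "true",
	})
	fmt.Println(c, err)
}
```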

Endpoints

This section describes the aggregate-healthcheck endpoints. Note that this app can return healthchecks in two formats:

JSON format - to get the results in JSON format, provide the "Accept: application/json" header
HTML format - this is the default format of displaying healthchecks.
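As an illustration of the two formats, the sketch below requests the JSON format by setting the Accept header. The fetchHealthJSON helper is hypothetical, and the fake server only stands in for aggregate-healthcheck so that the example runs on its own: its response bodies are placeholders and do not reflect the real output of the service.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchHealthJSON requests the healthcheck in JSON format by sending
// the "Accept: application/json" header.
func fetchHealthJSON(client *http.Client, url string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Accept", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// newFakeServer is a stand-in for the service: JSON when asked for it,
// HTML by default. The bodies are placeholders, not the real responses.
func newFakeServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Accept") == "application/json" {
			w.Header().Set("Content-Type", "application/json")
			fmt.Fprint(w, `{"placeholder":true}`)
			return
		}
		fmt.Fprint(w, "<html>placeholder</html>")
	}))
}

func main() {
	srv := newFakeServer()
	defer srv.Close()
	body, err := fetchHealthJSON(srv.Client(), srv.URL)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(body)
}
```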

Service endpoints

Note that there is a configurable pathPrefix which is prepended to each endpoint's path (e.g. if the prefix is __health, the endpoint path for add-ack is __health/add-ack). The default value for pathPrefix is the empty string. In the provided examples, it is assumed that the pathPrefix is __health.
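To make the effect of pathPrefix concrete, here is a small sketch that builds endpoint URLs from a host, a prefix, an endpoint name and query parameters. The endpointURL helper is hypothetical and exists only for this example.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// endpointURL joins the pathPrefix and the endpoint name into the request
// path and appends the query parameters. An empty prefix is simply skipped.
func endpointURL(host, pathPrefix, endpoint string, params map[string]string) string {
	query := url.Values{}
	for k, v := range params {
		query.Set(k, v)
	}
	u := url.URL{
		Scheme:   "http",
		Host:     host,
		Path:     path.Join("/", pathPrefix, endpoint),
		RawQuery: query.Encode(),
	}
	return u.String()
}

func main() {
	params := map[string]string{"service-name": "api-policy-component"}
	fmt.Println(endpointURL("localhost:8080", "__health", "add-ack", params))
	fmt.Println(endpointURL("localhost:8080", "", "add-ack", params))
}
```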

__gtg - the GoodToGo endpoint
- params:
  - categories - the healthcheck will be performed on the services belonging to the provided categories.
  - cache - if set to false, the healthchecks will be performed without the help of cache. By default, the cache is used.
- returns a 503 Service Unavailable status code in the following cases:
  - if at least one of the provided categories is disabled (see sticky functionality)
  - if at least one of the checked services is unhealthy
- returns a 200 OK status code otherwise
- example: localhost:8080/__gtg?cache=false&categories=read,publish
<pathPrefix>/__health or simply <pathPrefix> - Perform services healthcheck.
- params:
  - categories - the healthcheck will be performed on the services belonging to the provided categories.
  - cache - if set to false, the healthchecks will be performed without the help of cache. By default, the cache is used.
- example: localhost:8080/__health?cache=false&categories=read,publish
<pathPrefix>/__pods-health - Perform pods healthcheck for a service.
- params:
  - service-name - The healthcheck will be performed only for pods belonging to the provided service.
- example: localhost:8080/__health/__pods-health?service-name=api-policy-component
<pathPrefix>/__pod-individual-health - Retrieves the healthchecks of the app running inside the pod.
- params:
  - pod-name - The name of the pod for which the healthchecks will be retrieved.
- example: localhost:8080/__health/__pod-individual-health?pod-name=api-policy-component2912-12341
<pathPrefix>/add-ack - (POST) Acknowledges a service
- params:
  - service-name - The service to be acknowledged.
- request body: ack-msg - the acknowledgement message.
- example: localhost:8080/__health/add-ack?service-name=api-policy-component (request body: ack-msg=this is the message for ack)
<pathPrefix>/rem-ack - Removes the acknowledge of a service
- params:
  - service-name - The service to be updated.
- example: localhost:8080/__health/rem-ack?service-name=api-policy-component
<pathPrefix>/enable-category - Enables a category. This is used for sticky categories which are unhealthy.
- params:
  - category-name - The category to be enabled.
- example: localhost:8080/__health/enable-category?category-name=read
<pathPrefix>/disable-category - Disables a category. This is useful when doing a failover.
- params:
  - category-name - The category to be disabled.
- example: localhost:8080/__health/disable-category?category-name=read
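The status code rules listed for __gtg can be summarised in a few lines. This is a sketch under stated assumptions: gtgStatus and its two map parameters are hypothetical and only mirror the rules above (503 if any provided category is disabled or any checked service is unhealthy, 200 otherwise).

```go
package main

import (
	"fmt"
	"net/http"
)

// gtgStatus returns the HTTP status code for a __gtg request, given whether
// each requested category is enabled and whether each checked service is healthy.
func gtgStatus(categoryEnabled, serviceHealthy map[string]bool) int {
	for _, enabled := range categoryEnabled {
		if !enabled {
			return http.StatusServiceUnavailable
		}
	}
	for _, healthy := range serviceHealthy {
		if !healthy {
			return http.StatusServiceUnavailable
		}
	}
	return http.StatusOK
}

func main() {
	categories := map[string]bool{"read": true, "publish": false}
	services := map[string]bool{"api-policy-component": true}
	fmt.Println(gtgStatus(categories, services))
}
```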

Admin endpoints

__health
__gtg

Call sequence

main.go call sequence

controller.go: initializeController
- calls service.go: initializeHealthCheckService
prometheusFeeder.go: newPrometheusFeeder & feed
listen starts HTTP server
- path / -> handler.go: handleServicesHealthCheck
- path /__pods-health -> handler.go: handlePodsHealthCheck
- path /__pod-individual-health -> handler.go: handleIndividualPodHealthCheck
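The routing step can be pictured with the standard library alone. The sketch below registers the three paths listed above on an http.ServeMux; the handlers are placeholders that only write their name, and newMux, placeholder and respond are hypothetical helpers, not functions from handler.go or main.go.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// placeholder returns a handler that only writes its name. It stands in for
// the real handlers, whose behaviour is not reproduced here.
func placeholder(name string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, name)
	}
}

// newMux registers the paths from the call sequence. In http.ServeMux the
// "/" pattern also catches every path that has no more specific match.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/", placeholder("handleServicesHealthCheck"))
	mux.HandleFunc("/__pods-health", placeholder("handlePodsHealthCheck"))
	mux.HandleFunc("/__pod-individual-health", placeholder("handleIndividualPodHealthCheck"))
	return mux
}

// respond sends an in-memory GET request to the handler and returns the body.
func respond(h http.Handler, target string) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Body.String()
}

func main() {
	mux := newMux()
	fmt.Println(respond(mux, "/"))
	fmt.Println(respond(mux, "/__pods-health?service-name=api-policy-component"))
}
```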

httpHandler methods

handler.go
- handleServicesHealthCheck
  - controller.go: buildServicesHealthResult
  - sort, format and return healthcheck results
  - buildServicesCheckHTMLResponse (format result of buildServicesHealthResult for output)

healthCheckController methods

controller.go
- buildServicesHealthResult
  - getMatchingCategories (filter categories)
  - if useCache -> cachingController.go: collectChecksFromCachesFor
  - if NOT useCache -> runServiceChecksFor
  - runServiceChecksByServiceNames
    - runServiceChecksByServiceNames
      - get services list from k8sHealthcheckService.getServicesMapByNames
      - runServiceChecksByServiceNames
        - k8sHealthcheckService.getDeployments
        - checks health for all services using go-fthealth.RunCheck
        - loops through the result in parallel and calculates the severity of each service using severityController.go: getSeverityForService
        - loops through the services list and updates acks using updateHealthCheckWithAckMsg
        - cachingController.go: updateCachedHealth
        - schedules recurring checks for services not already in measuredServices
          - in practice this means scheduling checks only on startup
  - disableStickyFailingCategories
cachingController.go
severityController.go
- getSeverityForService
  - k8sHealthcheckService.getPodsForService

k8sHealthcheckService methods

service.go
- initializeHealthCheckService
  - starts watchAcks as a goroutine (loads and updates the service acks)
  - starts watchServices as a goroutine (loads and updates the service list)
- watchServices
  - using the k8s API gets all services matching kubectl get services -l hasHealthcheck=true
  - prepares them as service structures and saves into the k8sHealthcheckService.services.m map
  - after all services are processed it logs Services watching terminated. Reconnecting... and invokes itself again
- watchAcks
  - using the k8s API gets all (should be only one currently) configmaps matching kubectl get configmaps -l healthcheck-acknowledgements-for=aggregate-healthcheck
  - updates the service.ack key of the k8sHealthcheckService.services.m map
  - after all acks are processed it logs Acks configMap watching terminated. Reconnecting... and invokes itself again
- getCategories
  - using the k8s API gets all configmaps matching kubectl get configmaps -l healthcheck-categories-for=aggregate-healthcheck
- getDeployments
  - using the k8s API gets all deployment and statefulset names along with their desired replica count
- getPodsForService
  - using the k8s API gets all pods matching kubectl get pods -l app=%s
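The reconnect behaviour of watchServices and watchAcks can be illustrated without the Kubernetes client. In the sketch below a watch is modelled as a channel of events: when the channel is closed the loop logs the message and opens a new watch. The names watchLoop and fakeWatch are hypothetical, a loop is used in place of the function invoking itself, and the number of watches is bounded only so that the example terminates.

```go
package main

import "log"

// watchLoop consumes events from successive watches. When one watch ends
// (its channel is closed), it logs and opens the next one.
func watchLoop(open func() <-chan string, handle func(string), maxWatches int) {
	for i := 0; i < maxWatches; i++ {
		for event := range open() {
			handle(event)
		}
		log.Println("Services watching terminated. Reconnecting...")
	}
}

// fakeWatch stands in for a watch on the k8s API: it delivers two
// hypothetical service names and then terminates.
func fakeWatch() <-chan string {
	ch := make(chan string, 2)
	ch <- "service-a"
	ch <- "service-b"
	close(ch)
	return ch
}

func main() {
	watchLoop(fakeWatch, func(name string) {
		log.Println("updating service", name)
	}, 2)
}
```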
