Healthchecks #7686

planetf1 · 2023-05-18T17:44:15Z

planetf1
May 18, 2023
Maintainer

In the workgroup call yesterday we talked briefly about healthchecks.

I started off capturing some output and summarizing the behaviour. Since this was a first pass to add into the documentation, I worked in markdown. I haven't spent long on it yet.

There's a PR at odpi/egeria-docs#775

However, it immediately reminded me of the significant issue we have with our current status calls - even before we consider what 'ready' means, and how to roll up status.

One of the most used Kubernetes healthcheck approaches is to make an HTTP request to a specific endpoint. If it returns a status >=200 and <400, life is good - pass. If not, or on timeout - fail. This is simple to define, and would usually work well.
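As a rough illustration of that rule, the shell sketch below mimics what an HTTP probe decides: it passes for any status code from 200 to 399 and fails otherwise. The function names `probe_passes` and `http_probe` are hypothetical, and the curl invocation (including `--insecure`, mirroring the `--verify=no` used later in this thread) is only one assumed way to obtain the status code.

```shell
#!/bin/sh
# Sketch only: mimics the pass/fail rule of a Kubernetes HTTP probe.
# probe_passes and http_probe are hypothetical names used for illustration.

# Succeeds (exit 0) when the status code is >= 200 and < 400.
probe_passes() {
    code="$1"
    case "$code" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: treat as a failure
    esac
    [ "$code" -ge 200 ] && [ "$code" -lt 400 ]
}

# Requests the URL and applies the rule to the returned status code.
# --max-time stands in for the probe timeout.
http_probe() {
    code=$(curl --silent --insecure --output /dev/null \
                --max-time 5 --write-out '%{http_code}' "$1") || return 1
    probe_passes "$code"
}
```

Only the status line matters to such a check; the response body is never inspected.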

Egeria though -- assuming the platform is active -- will always return HTTP/200, and inspection of the body is required to determine the real result.

Healthchecks can be defined to issue a request within the container. One could imagine this being a script, which would do the interpretation, and return the simpler result we are looking for, but this is (imo) ugly.

A much better approach would be to support simpler http requests aligned with the way many applications work, but this is different to our standard style. Perhaps it could be more flexible in our simple server launcher (vs the chassis)

We could potentially take a dual-track approach here:
a) Define the appropriate command which could - however ugly - be used within an 'exec' k8s healthcheck. This could be used today and just needs a little experimentation. I'll take a look, and see if I can add it into these docs.
a1) Extension: do this via a simple CLI tool (script or Java). This could also be done fairly quickly.
b) Implement simple HTTP requests which are k8s healthcheck friendly - at least in our server launcher. This is cleaner, but could take significantly longer.

Note - There is a feature request open against Kubernetes to add some body matching - kubernetes/kubernetes#55405 - which also refers to an example in healthcare, and to the same workaround I mention here.

cc: @juergenhemelt @davidradl

planetf1 · 2023-05-18T17:47:46Z

planetf1
May 18, 2023
Maintainer Author

Here's one example -- in this case the server is not known. This particular issue is less relevant with a server launcher - but other errors like permissions would be the same.

Note that the real info we need is in relatedHTTPCode.

➜  ~ http --verify=no --pretty=format GET "https://44623abc-eu-gb.lb.appdomain.cloud:9443/open-metadata/admin-services/users/admin/servers/cocoMDS99/instance/status"
HTTP/1.1 200
Connection: keep-alive
Content-Type: application/json
Date: Thu, 18 May 2023 17:08:15 GMT
Keep-Alive: timeout=60
Transfer-Encoding: chunked

{
    "actionDescription": "getActiveServerStatus",
    "class": "OMAGServerStatusResponse",
    "exceptionClassName": "org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException",
    "exceptionErrorMessage": "OMAG-MULTI-TENANT-404-001 The OMAG Server cocoMDS99 is not available to service a request from user admin",
    "exceptionErrorMessageId": "OMAG-MULTI-TENANT-404-001",
    "exceptionErrorMessageParameters": [
        "cocoMDS99",
        "admin"
    ],
    "exceptionProperties": {
        "parameterName": "serverName",
        "serverName": "cocoMDS99"
    },
    "exceptionSystemAction": "The system is unable to process the request because the server is not running on the called platform.",
    "exceptionUserAction": "Verify that the correct server is being called on the correct platform and that this server is running. Retry the request when the server is available.",
    "relatedHTTPCode": 404
}

From a script we can get this out with

➜  ~ cat > /tmp/t
{
    "actionDescription": "getActiveServerStatus",
    "class": "OMAGServerStatusResponse",
    "exceptionClassName": "org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException",
    "exceptionErrorMessage": "OMAG-MULTI-TENANT-404-001 The OMAG Server cocoMDS99 is not available to service a request from user admin",
    "exceptionErrorMessageId": "OMAG-MULTI-TENANT-404-001",
    "exceptionErrorMessageParameters": [
        "cocoMDS99",
        "admin"
    ],
    "exceptionProperties": {
        "parameterName": "serverName",
        "serverName": "cocoMDS99"
    },
    "exceptionSystemAction": "The system is unable to process the request because the server is not running on the called platform.",
    "exceptionUserAction": "Verify that the correct server is being called on the correct platform and that this server is running. Retry the request when the server is available.",
    "relatedHTTPCode": 404
}
➜  ~ cat /tmp/t | jq .relatedHTTPCode
404

which may be sufficient (though jq is likely not on our base image, nor python, so we may need to add it - or do this in script).


planetf1 · 2023-05-18T18:03:20Z

planetf1
May 18, 2023
Maintainer Author

Adding 'jq' (I mostly use this, or python) is probably the most pragmatic and understandable option. An awk version could work for this case, but would be less readable and likely more buggy.
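For comparison, a jq-free extraction might look like the sketch below. It uses sed rather than awk, assumes `relatedHTTPCode` appears as a plain integer field as in the response shown earlier, and is pattern matching rather than real JSON parsing. The helper name `related_http_code` is hypothetical.

```shell
#!/bin/sh
# Sketch only: pulls relatedHTTPCode out of a response body without jq.
# This is pattern matching, not JSON parsing; related_http_code is a
# hypothetical helper name.

# Reads a response body on stdin and prints the first relatedHTTPCode value.
related_http_code() {
    sed -n 's/.*"relatedHTTPCode"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' |
        head -n 1
}

# Example with a shortened copy of the response shown earlier; prints 404.
printf '%s\n' '{' \
    '    "class": "OMAGServerStatusResponse",' \
    '    "relatedHTTPCode": 404' \
    '}' | related_http_code
```

The regex is noticeably harder to read than `jq .relatedHTTPCode`, which is the trade-off described above.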


planetf1 · 2023-05-19T07:50:07Z

planetf1
May 19, 2023
Maintainer Author

If we assume that a relatedHTTPCode of 200 is both necessary and sufficient for a good status, then a very simple check can be done with an exec that issues the above request with curl and pipes the result to a simple regex. In the test case this might be:

cat /tmp/t | grep '"relatedHTTPCode": 200'

This does not solve the more general case, or even begin to address aggregated results, but it will work today with the existing platform to determine that a server is at least running. In general an external check is also much better than relying on exec. However, I will add this note to the docs.


planetf1 · 2023-05-19T10:13:45Z

planetf1
May 19, 2023
Maintainer Author

HOWEVER ...

A server that is starting up will also return 200 -- yet it is not ready:

{
    "class": "OMAGServerStatusResponse",
    "relatedHTTPCode": 200,
    "serverStatus": {
        "serverName": "cocoMDS2",
        "serverType": "Metadata Server",
        "serverActiveStatus": "STARTING",
        "services": [
            {
                "serviceName": "Open Metadata Repository Services (OMRS)",
                "serviceStatus": "STARTING"
            }
        ]
    }
}

Therefore the relatedHTTPCode is insufficient.

Using jq seems inevitable for any check to be possible without adding new REST API calls. A better solution would be to provide clean status check calls.
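A jq-based check along those lines might look like the following sketch. It treats a server as ready only when `relatedHTTPCode` is 200 and `serverStatus.serverActiveStatus` is `RUNNING`. The `RUNNING` value is an assumption (this thread only shows `STARTING`), and the helper name `server_ready` is hypothetical.

```shell
#!/bin/sh
# Sketch only: readiness decided from the response body using jq.
# Assumes a ready server reports serverActiveStatus "RUNNING" (the thread
# only shows "STARTING"); server_ready is a hypothetical helper name.

# Reads a status response on stdin; exits 0 only if the server looks ready.
# --exit-status makes jq exit non-zero when the expression is false or null.
server_ready() {
    jq --exit-status '
        .relatedHTTPCode == 200
        and .serverStatus.serverActiveStatus == "RUNNING"
    ' >/dev/null 2>&1
}
```

In an exec healthcheck this would be fed from the status request, for example `curl --silent --insecure "$STATUS_URL" | server_ready`, where `STATUS_URL` is a placeholder for the instance status endpoint shown earlier. It does not address per-service status or aggregated results.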


davidradl · 2023-05-19T11:05:55Z

davidradl
May 19, 2023
Maintainer

As discussed with Nigel, I think a new healthcheck API would make sense, one which returns 200 or not as per Kubernetes expectations. I envisage that for each server type it checks the status of its capabilities - initially this could use the existing status calls. If these are not adequate we could add more logic in this server-based healthcheck API shim layer. For example, it might issue a call to a downstream OMAS to see if it is working. The glossary author UI issues a get-all-glossaries request with a page size of 1 to see if the backend is active and the UI is usable; we could do something similar in this layer if we need to.


planetf1 · 2023-05-19T17:45:50Z

planetf1
May 19, 2023
Maintainer Author

I'll take a look at prototyping this


planetf1 · 2023-05-24T22:32:11Z

planetf1
May 24, 2023
Maintainer Author

I've merged the first version of some prototype code into the egeria-cloudnative repository

This is very rudimentary but offers a basic API to see if a server is running.

Next steps may include:

- Test this chassis in a container, with real k8s healthchecks
- Consider what resources should provide health check endpoints
- Decide on what the URLs should be
- Look at status in more detail & propose a more complete algorithm
- Decide on HTTP status codes & messages to return

As well as some tech omissions like:

- Remove hardcoded userid
- Remove commented code
- Consolidate gradle file
- Sort out certificates

However I thought it worth checking in as-is since:

- This sets up the repo with a gradle build
- This demonstrates one way to build a customized chassis external to egeria (at least 3 different orgs wanted to do something like this)
- It provides a basic endpoint

