Observability #3393

sbrunner · 2024-09-03T09:31:55Z

Introduction

Currently, the observability of the application is lagging behind, so this issue defines how it should work.

Note that this defines the general framework, not every specific case, even though we list the first implementations we want.

We primarily target the Kubernetes/Docker environment, so some of the vocabulary comes from that world.

Usage of the health checks.

This will update the result of the /metrics/healthcheck endpoint.

Examples of the responses

When we call the `healthy` method:

{
    "application": {
        "healthy": true,
        "message": "sbr test.",
        "duration": 0,
        "timestamp": "2024-08-29T13:44:58.398Z"
    }
}

Response code: HTTP code 200.

When we call the `unhealthy` method:

{
    "application": {
        "healthy": false,
        "message": "sbr test.",
        "duration": 0,
        "timestamp": "2024-08-29T13:44:58.398Z"
    }
}

Response code: HTTP status code 200 (or 500 when we set JAVA_OPTS to -DhttpStatusIndicator=true)

If we raise an exception:

{
    "application": {
        "healthy": false,
        "message": "sbr test.",
        "error": {
            "type": "java.lang.RuntimeException",
            "message": "sbr test.",
            "stack": [
                "org.mapfish.print.metrics.ApplicationStatus.check(ApplicationStatus.java:15)", 
                "com.codahale.metrics.health.HealthCheck.execute(HealthCheck.java:374)", 
                "com.codahale.metrics.health.HealthCheckRegistry.runHealthChecks(HealthCheckRegistry.java:184)", 
                "com.codahale.metrics.servlets.HealthCheckServlet.runHealthChecks(HealthCheckServlet.java:177)", 
                "com.codahale.metrics.servlets.HealthCheckServlet.doGet(HealthCheckServlet.java:146)", 
                "javax.servlet.http.HttpServlet.service(HttpServlet.java:529)", 
                "javax.servlet.http.HttpServlet.service(HttpServlet.java:623)", 
                "com.codahale.metrics.servlets.AdminServlet.service(AdminServlet.java:153)", 
                "javax.servlet.http.HttpServlet.service(HttpServlet.java:623)", 
                "org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:199)", 
                "org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:144)", 
                "org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:51)", 
                "org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:168)", 
                "org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:144)", 
                "com.thetransactioncompany.cors.CORSFilter.doFilter(CORSFilter.java:209)", 
                "com.thetransactioncompany.cors.CORSFilter.doFilter(CORSFilter.java:244)", 
                "org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:168)", 
                "org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:144)", 
                "com.codahale.metrics.servlet.AbstractInstrumentedFilter.doFilter(AbstractInstrumentedFilter.java:112)", 
                "org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:168)", 
                "org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:144)", 
                "org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:352)", 
                "org.springframework.security.web.access.intercept.FilterSecurityInterceptor.invoke(FilterSecurityInterceptor.java:117)", 
                "org.springframework.security.web.access.intercept.FilterSecurityInterceptor.doFilter(FilterSecurityInterceptor.java:83)", 
                "org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:361)", 
                "org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:126)", 
                "org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:120)", 
                "org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:361)", 
                "org.springframework.security.web.session.SessionManagementFilter.doFilter(SessionManagementFilter.java:131)", 
                "org.springframework.security.web.session.SessionManagementFilter.doFilter(SessionManagementFilter.java:85)", 
                "org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:361)",
                "org.springframework.security.web.authentication.AnonymousAuthenticationFilter.doFilter(AnonymousAuthenticationFilter.java:100)",
                "org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:361)", 
                "org.springframework.security.web.servletapi.SecurityContextHolderAwareRequestFilter.doFilter(SecurityContextHolderAwareRequestFilter.java:164)", 
                "org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:361)",
                "org.springframework.security.web.savedrequest.RequestCacheAwareFilter.doFilter(RequestCacheAwareFilter.java:63)",
                "org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:361)", 
                "org.springframework.security.web.authentication.www.BasicAuthenticationFilter.doFilterInternal(BasicAuthenticationFilter.java:168)",
                "org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117)", 
                "org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:361)", 
                "org.springframework.security.web.header.HeaderWriterFilter.doHeadersAfter(HeaderWriterFilter.java:90)", 
                "org.springframework.security.web.header.HeaderWriterFilter.doFilterInternal(HeaderWriterFilter.java:75)", 
                "org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117)", 
                "org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:361)", 
                "org.springframework.security.web.context.request.async.WebAsyncManagerIntegrationFilter.doFilterInternal(WebAsyncManagerIntegrationFilter.java:62)", 
                "org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117)", 
                "org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:361)", 
                "org.springframework.security.web.context.SecurityContextPersistenceFilter.doFilter(SecurityContextPersistenceFilter.java:117)", 
                "org.springframework.security.web.context.SecurityContextPersistenceFilter.doFilter(SecurityContextPersistenceFilter.java:87)", 
                "org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:361)", 
                "org.springframework.security.web.access.channel.ChannelProcessingFilter.doFilter(ChannelProcessingFilter.java:133)", 
                "org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:361)", 
                "org.springframework.security.web.session.DisableEncodeUrlFilter.doFilterInternal(DisableEncodeUrlFilter.java:42)", 
                "org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117)", 
                "org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:361)", 
                "org.springframework.security.web.FilterChainProxy.doFilterInternal(FilterChainProxy.java:225)", 
                "org.springframework.security.web.FilterChainProxy.doFilter(FilterChainProxy.java:190)", 
                "org.springframework.web.filter.DelegatingFilterProxy.invokeDelegate(DelegatingFilterProxy.java:354)", 
                "org.springframework.web.filter.DelegatingFilterProxy.doFilter(DelegatingFilterProxy.java:267)", 
                "org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:168)", 
                "org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:144)", 
                "org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201)", 
                "org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:117)", 
                "org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:168)", 
                "org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:144)", 
                "org.mapfish.print.servlet.RequestSizeFilter.doFilter(RequestSizeFilter.java:40)", 
                "org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:168)", 
                "org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:144)", 
                "org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:168)", 
                "org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:90)", 
                "org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:482)", 
                "org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:130)", 
                "org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:93)", 
                "org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)", 
                "org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:346)", 
                "org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:388)", 
                "org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:63)", 
                "org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:936)", 
                "org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1791)", 
                "org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52)", 
                "org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1190)", 
                "org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)", 
                "org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:63)", 
                "java.base/java.lang.Thread.run(Thread.java:829)"
            ]                
        },
        "duration": 0,
        "timestamp": "2024-08-29T13:46:38.000Z"
    }
}

Response code: HTTP code 200.

Proposed usage

Use this endpoint only for the Kubernetes probe, so that the Pod is restarted automatically.

Use the `healthy` and `unhealthy` methods to change the reported status.

Add the option httpStatusIndicator=true in the file core/src/main/resources/mapfish-spring.properties.

As a first step, we should report an error when the queue is not being consumed even though it is not empty. In the long term, I think this check should look something like the following (it needs a time window):

If the queue is empty during the time window => healthy
If a print job ends during the time window => healthy
Otherwise => unhealthy
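
The three rules above could be sketched as follows. This is only an illustration under stated assumptions: `QueueWindowCheck` and its methods are hypothetical names, not existing MapFish Print classes, and "empty during the time window" is read here as "observed empty at least once during the window". The start time is treated as an empty observation so that a freshly started container is not reported as unhealthy.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical tracker for the time-window rule; not part of the code base. */
final class QueueWindowCheck {
    private final Duration window;
    private Instant lastEmpty;   // last time the queue was observed empty
    private Instant lastJobEnd;  // last time a print job ended

    QueueWindowCheck(Duration window, Instant start) {
        this.window = window;
        // Assumption: at start-up the queue counts as empty.
        this.lastEmpty = start;
        this.lastJobEnd = start;
    }

    /** To be called whenever the queue size is sampled. */
    synchronized void queueObserved(int size, Instant now) {
        if (size == 0) {
            lastEmpty = now;
        }
    }

    /** To be called whenever a print job ends. */
    synchronized void jobEnded(Instant now) {
        lastJobEnd = now;
    }

    /** Healthy if the queue was empty or a job ended inside the window. */
    synchronized boolean isHealthy(Instant now) {
        Instant windowStart = now.minus(window);
        boolean emptyInWindow = !lastEmpty.isBefore(windowStart);
        boolean jobEndedInWindow = !lastJobEnd.isBefore(windowStart);
        return emptyInWindow || jobEndedInWindow;
    }

    public static void main(String[] args) {
        Instant start = Instant.now();
        QueueWindowCheck check = new QueueWindowCheck(Duration.ofMinutes(5), start);
        check.queueObserved(3, start.plusSeconds(60));
        System.out.println(check.isHealthy(start.plusSeconds(120)));
        System.out.println(check.isHealthy(start.plusSeconds(400)));
    }
}
```

The result of `isHealthy` would then be passed to the `healthy` or `unhealthy` method of the health check. Passing the current time explicitly keeps the rule easy to test without waiting for the window to elapse.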

We may also need a check that tests the building of an EPSG code; in the past, some containers raised this exception:

java.lang.RuntimeException: EPSG:2056 was not recognized as a crs code
	at org.mapfish.print.output.Values.populateFromAttributes(Values.java:229)
	at org.mapfish.print.output.Values.<init>(Values.java:153)
	at org.mapfish.print.output.Values.<init>(Values.java:110)
	at org.mapfish.print.output.AbstractJasperReportOutputFormat.getJasperPrint(AbstractJasperReportOutputFormat.java:137)
	at org.mapfish.print.output.AbstractJasperReportOutputFormat.print(AbstractJasperReportOutputFormat.java:94)
	at org.mapfish.print.MapPrinter.print(MapPrinter.java:133)
	at org.mapfish.print.servlet.job.PrintJob.lambda$call$0(PrintJob.java:148)
	at org.mapfish.print.servlet.job.PrintJob.withOpenOutputStream(PrintJob.java:118)
	at org.mapfish.print.servlet.job.PrintJob.call(PrintJob.java:147)
	at org.mapfish.print.servlet.job.PrintJob.call(PrintJob.java:54)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.RuntimeException: EPSG:2056 was not recognized as a crs code
	at org.mapfish.print.attribute.map.GenericMapAttribute.parseProjection(GenericMapAttribute.java:93)
	at org.mapfish.print.attribute.map.GenericMapAttribute$GenericMapAttributeValues.parseProjection(GenericMapAttribute.java:516)
	at org.mapfish.print.attribute.map.MapAttribute$MapAttributeValues.parseBounds(MapAttribute.java:164)
	at org.mapfish.print.attribute.map.MapAttribute$MapAttributeValues.postConstruct(MapAttribute.java:160)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.mapfish.print.parser.MapfishParser.parse(MapfishParser.java:138)
	at org.mapfish.print.attribute.ReflectiveAttribute.getValue(ReflectiveAttribute.java:428)
	at org.mapfish.print.output.Values.populateFromAttributes(Values.java:203)
	... 13 common frames omitted
Caused by: org.opengis.referencing.NoSuchAuthorityCodeException: No code "EPSG:2056" from authority "European Petroleum Survey Group" found for object of type "IdentifiedObject".
	at org.geotools.referencing.factory.AbstractAuthorityFactory.noSuchAuthorityCode(AbstractAuthorityFactory.java:874)
	at org.geotools.referencing.factory.PropertyAuthorityFactory.getWKT(PropertyAuthorityFactory.java:289)
	at org.geotools.referencing.factory.PropertyAuthorityFactory.createCoordinateReferenceSystem(PropertyAuthorityFactory.java:358)
	at org.geotools.referencing.factory.BufferedAuthorityFactory.createCoordinateReferenceSystem(BufferedAuthorityFactory.java:731)
	at org.geotools.referencing.factory.AuthorityFactoryAdapter.createCoordinateReferenceSystem(AuthorityFactoryAdapter.java:779)
	at org.geotools.referencing.factory.FallbackAuthorityFactory.createCoordinateReferenceSystem(FallbackAuthorityFactory.java:624)
	at org.geotools.referencing.factory.AuthorityFactoryAdapter.createCoordinateReferenceSystem(AuthorityFactoryAdapter.java:779)
	at org.geotools.referencing.factory.ThreadedAuthorityFactory.createCoordinateReferenceSystem(ThreadedAuthorityFactory.java:635)
	at org.geotools.referencing.DefaultAuthorityFactory.createCoordinateReferenceSystem(DefaultAuthorityFactory.java:176)
	at org.geotools.referencing.CRS.decode(CRS.java:517)
	at org.geotools.referencing.CRS.decode(CRS.java:433)
	at org.mapfish.print.attribute.map.GenericMapAttribute.parseProjection(GenericMapAttribute.java:88)
	... 23 common frames omitted
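
A check for this situation could look like the sketch below. It is an assumption-laden illustration: `CrsDecoder` and `CrsHealthCheck` are hypothetical names, and the decoder is kept abstract so that the example stays self-contained. In the application it would presumably delegate to `org.geotools.referencing.CRS.decode`, which appears in the stack trace above.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical abstraction over the CRS lookup; throws when a code is unknown. */
interface CrsDecoder {
    void decode(String code) throws Exception;
}

/** Hypothetical check that tries to build each configured EPSG code. */
final class CrsHealthCheck {
    private final CrsDecoder decoder;
    private final List<String> codes;

    CrsHealthCheck(CrsDecoder decoder, List<String> codes) {
        this.decoder = decoder;
        this.codes = codes;
    }

    /** Returns an empty Optional when healthy, otherwise the failure message. */
    Optional<String> check() {
        for (String code : codes) {
            try {
                decoder.decode(code);
            } catch (Exception e) {
                return Optional.of(code + " was not recognized as a crs code: " + e.getMessage());
            }
        }
        return Optional.empty();
    }
}
```

An empty result would map to the `healthy` method and a message to the `unhealthy` method. Which codes to test (for example EPSG:2056) would have to come from configuration.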

Usage of the metrics.

The metrics should be reviewed and documented; currently they are a bit of a mess...

At first, we should add a gauge to observe the queue length and a timer to observe the total print duration.
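
To make the two measurements concrete, here is a minimal sketch that uses only the JDK. `PrintJobMetrics` is a hypothetical name; in the application these values would more likely be registered as a gauge and a timer in the metrics registry that is already in use, rather than kept in a hand-written class like this one. The timing here is assumed to cover only the execution of the job, not the time spent waiting in the queue.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical holder for the queue-length gauge and the print-duration timer. */
final class PrintJobMetrics {
    private final AtomicInteger queueLength = new AtomicInteger();
    private final LongAdder finishedJobs = new LongAdder();
    private final LongAdder totalNanos = new LongAdder();

    /** A job was added to the queue. */
    void jobQueued() {
        queueLength.incrementAndGet();
    }

    /** Takes a job out of the queue, runs it and records its duration. */
    <T> T runJob(Supplier<T> job) {
        queueLength.decrementAndGet();
        long start = System.nanoTime();
        try {
            return job.get();
        } finally {
            // Recorded for failed jobs too, so the timer covers every run.
            totalNanos.add(System.nanoTime() - start);
            finishedJobs.increment();
        }
    }

    /** Value a gauge would expose. */
    int queueLength() {
        return queueLength.get();
    }

    long finishedJobs() {
        return finishedJobs.sum();
    }

    /** Mean run time in milliseconds, or 0 when no job has finished yet. */
    double meanDurationMillis() {
        long count = finishedJobs.sum();
        return count == 0 ? 0.0 : totalNanos.sum() / 1_000_000.0 / count;
    }
}
```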

Then we should review all the metrics, check that they work, update or remove them as needed, and add documentation.

Relevant metrics:

Around print jobs:
- Number of waiting/running/success/failed jobs with argument: application / template (new)
- Time to process a print job with argument: application / template (new)
Around processors:
- Time to process a processor with argument: application? / template? / processor type or class (new)
Around requests:
- Time to process a request with argument: host name (verify that these work, then rename and document them)

Current metrics:

HttpRequestFetcher:
- timer on download
- timer on read by host
- counter on error by host
AbstractSingleImageLayer:
- counter on request error by host
- another counter on request error by host
- a third counter on request error by host
- a counter on image read error (same name as the previous one) by host
- timer on request by host
CoverageTask:
- timer on download by host
- counter on error by host
- another counter on error by host
- a third counter on error by host

Cluster check

If we need a check, e.g. to notify that the print job queue is too long, we probably need to create a custom endpoint.

Summary

Use health checks only for health concerns that are local to the current container.

Identify and add missing metrics to be able to better monitor the application with tools like Prometheus/Grafana.

If needed, add a new endpoint for more specific checks, such as a queue that is too long.

sbrunner · 2024-09-20T12:43:56Z

It should also be easy to enable the access logs.
See: https://tomcat.apache.org/tomcat-9.0-doc/config/valve.html#Access_Logging

sbrunner mentioned this issue Sep 3, 2024

Application health status check #3379

Merged