log: if there's a crash, final telemetry logs might not be sent #1203

Open
huntergregory opened this issue Jan 8, 2025 · 4 comments · May be fixed by #1227

@huntergregory
Contributor

huntergregory commented Jan 8, 2025

EDIT: there is a ShutdownAppInsights() method which is deferred in a few places in the legacy agent and the Hubble-based agent, and it is also called via TrackPanic(), e.g. here. However, panics in goroutines that do not defer TrackPanic() can still result in missing telemetry.

For Retina's Application Insights implementation, the operator should also call ShutdownAppInsights() so that buffered telemetry is flushed before the process exits.
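
To make the goroutine gap concrete, here is a minimal sketch (standard library only, not Retina's code) of the pattern: every goroutine defers a panic-tracking hook so a panic is reported and telemetry flushed before the process dies. trackPanic and goSafe below are hypothetical stand-ins for TrackPanic()/ShutdownAppInsights().

```go
// Minimal sketch (standard library only, not Retina's code) of deferring a
// panic-tracking hook in every goroutine.
package main

import (
	"fmt"
	"time"
)

// trackPanic recovers, reports the panic (in Retina this is where it would be
// sent to App Insights and the client flushed), then re-panics so the process
// still crashes with the original value.
func trackPanic() {
	if r := recover(); r != nil {
		fmt.Printf("panic captured for telemetry: %v\n", r)
		panic(r)
	}
}

// goSafe starts fn in a goroutine that always defers trackPanic, so a panic in
// fn is reported instead of silently killing the process with no telemetry.
func goSafe(fn func()) {
	go func() {
		defer trackPanic()
		fn()
	}()
}

func main() {
	goSafe(func() { panic("boom") })
	time.Sleep(time.Second) // let the goroutine run before main returns
}
```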

Stretch Goal

Ensure that ShutdownAppInsights() is called before/after all panics. Relevant links:

@timraymond
Member

A slightly better way I came up with for doing this is a variant of panicwrap (but with no dependencies to deal with). The container runtime execs Retina with a flag (for example, --supervise) that tells it to supervise itself; on startup, Retina then execs itself with the same arguments (after removing --supervise). The supervising parent Retina process monitors STDERR from the child, as panicwrap does, and forwards any panics it finds to App Insights with the appropriate metadata.

Care needs to be taken to handle the executable path properly (including injecting it into the child's args) and to propagate exit codes correctly so that the orchestrator isn't confused by what we're doing.

This is a PoC that shows the technique described: https://github.com/timraymond/dontpanic/blob/master/main.go
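
For reference, a condensed sketch of the same technique (separate from the PoC above; the --supervise flag comes from the description, while the reporting hook and the child's behavior are assumptions, not Retina's implementation):

```go
// Condensed sketch of the self-supervision technique: the parent re-execs
// itself without --supervise, scans the child's stderr for panics, and
// propagates the child's exit code.
package main

import (
	"bufio"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	if len(os.Args) > 1 && os.Args[1] == "--supervise" {
		supervise(os.Args[2:]) // parent: re-exec ourselves without --supervise
		return
	}
	run() // child: normal agent startup would go here
}

func supervise(childArgs []string) {
	exe, err := os.Executable() // resolve our own path to exec the child
	if err != nil {
		fmt.Fprintln(os.Stderr, "resolve executable:", err)
		os.Exit(1)
	}
	cmd := exec.Command(exe, childArgs...)
	cmd.Stdout = os.Stdout
	stderr, err := cmd.StderrPipe()
	if err != nil {
		fmt.Fprintln(os.Stderr, "stderr pipe:", err)
		os.Exit(1)
	}
	if err := cmd.Start(); err != nil {
		fmt.Fprintln(os.Stderr, "start child:", err)
		os.Exit(1)
	}
	var captured strings.Builder
	sc := bufio.NewScanner(stderr)
	for sc.Scan() {
		line := sc.Text()
		fmt.Fprintln(os.Stderr, line) // pass the child's stderr through unchanged
		captured.WriteString(line + "\n")
	}
	waitErr := cmd.Wait()
	if strings.Contains(captured.String(), "panic:") {
		// This is where the captured panic text would be forwarded to App
		// Insights, with the apiserver/podname/nodename dimensions attached.
		fmt.Fprintln(os.Stderr, "supervisor: panic detected, would report it to App Insights")
	}
	if exitErr, ok := waitErr.(*exec.ExitError); ok {
		os.Exit(exitErr.ExitCode()) // propagate the child's exit code to the orchestrator
	}
}

// run simulates the real agent; it panics so the supervisor path can be observed.
func run() {
	panic("simulated crash")
}
```

Running the binary with --supervise shows the parent detecting the child's panic and exiting with the same non-zero code.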

@timraymond
Member

Also, from @matmerr:

...we use the apiserver, podname and nodename dimensions... we'd need that information to be attached to the panic telemetry as well

@mereta
Contributor

mereta commented Jan 13, 2025

@timraymond @huntergregory I've had a quick look and I have a question about the legacy operator.
This is the setup for it below:

```go
if oconfig.EnableTelemetry && buildinfo.ApplicationInsightsID != "" {
```

It's littered with os.Exit calls, which will skip the defers.
What's the best approach here? For a quick win, should we defer all the os.Exit calls too?

@timraymond
Member

@mereta yeah, that's... not great. Replacing those os.Exit calls with return will probably work as a quick win. However, I'm not sure whether it will exit with code 1 to keep parity with here... some testing is needed. This is why it's nice to just return errors: it keeps the semantics pretty straightforward (but that ship has sailed at this point for this logic). Unfortunately, deferring os.Exit just puts that invocation on the defer stack, so anything behind it in the stack still won't run.
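
A minimal sketch of that quick win, under the assumptions stated in the comments (shutdownAppInsights below is a hypothetical stand-in for Retina's ShutdownAppInsights()): moving the body into a run() function that returns an error lets the defers fire, and main still exits with code 1.

```go
// Minimal sketch: run() returns an error instead of calling os.Exit, so
// deferred cleanup runs, and main sets the exit code afterwards.
package main

import (
	"errors"
	"fmt"
	"os"
)

func shutdownAppInsights() { fmt.Println("telemetry flushed") } // stand-in

func run() error {
	defer shutdownAppInsights() // runs on every return path, unlike with os.Exit
	// ... operator setup ...
	return errors.New("failed to start manager") // formerly an os.Exit(1) site
}

func main() {
	if err := run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // exit code 1 preserved, after the defers in run() have executed
	}
}
```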

@mereta mereta linked a pull request Jan 16, 2025 that will close this issue
@ibezrukavyi ibezrukavyi added this to the 1.0 milestone Jan 22, 2025