log: if there's a crash, final telemetry logs might not be sent #1203

Open
huntergregory opened this issue Jan 8, 2025 · 4 comments · May be fixed by #1227

@huntergregory
Contributor

huntergregory commented Jan 8, 2025

EDIT: there is a ShutdownAppInsights() method which is deferred in a few places in the legacy agent and the Hubble-based agent, and it is also called via TrackPanic(), e.g. here. However, panics in goroutines that do not defer TrackPanic() can still result in missing telemetry.

For Retina's Application Insights implementation, the operator should also call ShutdownAppInsights() so that buffered telemetry is flushed before the process exits.
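
To make the goroutine gap concrete, here is a minimal sketch (standard library only, not Retina's code) of the pattern: every goroutine defers a panic-tracking hook so a panic is reported and telemetry flushed before the process dies. trackPanic and goSafe below are hypothetical stand-ins for TrackPanic()/ShutdownAppInsights().

```go
// Minimal sketch (standard library only, not Retina's code) of deferring a
// panic-tracking hook in every goroutine.
package main

import (
	"fmt"
	"time"
)

// trackPanic recovers, reports the panic (in Retina this is where it would be
// sent to App Insights and the client flushed), then re-panics so the process
// still crashes with the original value.
func trackPanic() {
	if r := recover(); r != nil {
		fmt.Printf("panic captured for telemetry: %v\n", r)
		panic(r)
	}
}

// goSafe starts fn in a goroutine that always defers trackPanic, so a panic in
// fn is reported instead of silently killing the process with no telemetry.
func goSafe(fn func()) {
	go func() {
		defer trackPanic()
		fn()
	}()
}

func main() {
	goSafe(func() { panic("boom") })
	time.Sleep(time.Second) // let the goroutine run before main returns
}
```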

Stretch Goal

Ensure that ShutdownAppInsights() is called before/after all panics. Relevant links:

@timraymond
Member

A slightly better way I came up with for doing this is a variant of panicwrap (but with no dependencies to deal with). The container runtime execs Retina with a flag (for example, --supervise) that tells it to supervise itself; on startup, Retina then execs itself with the same arguments (after removing --supervise). The supervising parent Retina process monitors STDERR from the child, as panicwrap does, and forwards any panics it finds to App Insights with the appropriate metadata.

Care needs to be taken to handle the executable path properly (including injecting it into the child's args) and to propagate exit codes correctly so that the orchestrator isn't confused by what we're doing.

This is a PoC that shows the technique described: https://github.com/timraymond/dontpanic/blob/master/main.go
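
For reference, a condensed sketch of the same technique (separate from the PoC above; the --supervise flag comes from the description, while the reporting hook and the child's behavior are assumptions, not Retina's implementation):

```go
// Condensed sketch of the self-supervision technique: the parent re-execs
// itself without --supervise, scans the child's stderr for panics, and
// propagates the child's exit code.
package main

import (
	"bufio"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	if len(os.Args) > 1 && os.Args[1] == "--supervise" {
		supervise(os.Args[2:]) // parent: re-exec ourselves without --supervise
		return
	}
	run() // child: normal agent startup would go here
}

func supervise(childArgs []string) {
	exe, err := os.Executable() // resolve our own path to exec the child
	if err != nil {
		fmt.Fprintln(os.Stderr, "resolve executable:", err)
		os.Exit(1)
	}
	cmd := exec.Command(exe, childArgs...)
	cmd.Stdout = os.Stdout
	stderr, err := cmd.StderrPipe()
	if err != nil {
		fmt.Fprintln(os.Stderr, "stderr pipe:", err)
		os.Exit(1)
	}
	if err := cmd.Start(); err != nil {
		fmt.Fprintln(os.Stderr, "start child:", err)
		os.Exit(1)
	}
	var captured strings.Builder
	sc := bufio.NewScanner(stderr)
	for sc.Scan() {
		line := sc.Text()
		fmt.Fprintln(os.Stderr, line) // pass the child's stderr through unchanged
		captured.WriteString(line + "\n")
	}
	waitErr := cmd.Wait()
	if strings.Contains(captured.String(), "panic:") {
		// This is where the captured panic text would be forwarded to App
		// Insights, with the apiserver/podname/nodename dimensions attached.
		fmt.Fprintln(os.Stderr, "supervisor: panic detected, would report it to App Insights")
	}
	if exitErr, ok := waitErr.(*exec.ExitError); ok {
		os.Exit(exitErr.ExitCode()) // propagate the child's exit code to the orchestrator
	}
}

// run simulates the real agent; it panics so the supervisor path can be observed.
func run() {
	panic("simulated crash")
}
```

Running the binary with --supervise shows the parent detecting the child's panic and exiting with the same non-zero code.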

@timraymond
Member

Also, from @matmerr:

...we use the apiserver, podname and nodename dimensions... we'd need that information to be attached to the panic telemetry as well

@mereta
Contributor

mereta commented Jan 13, 2025

@timraymond @huntergregory I've had a quick look and I have a question about the legacy operator.
This is the setup for it below:

```go
if oconfig.EnableTelemetry && buildinfo.ApplicationInsightsID != "" {
```

It's littered with os.Exit calls, which will skip the defers.
What's the best approach here? For a quick win, should we defer all the os.Exit calls too?

@timraymond
Member

@mereta yeah, that's... not great. Replacing those os.Exit calls with return will probably work as a quick win. However, I'm not sure whether it will exit with code 1 to keep parity with here... some testing is needed. This is why it's nice to just return errors: it keeps the semantics pretty straightforward (but that ship has sailed at this point for this logic). Unfortunately, deferring os.Exit just puts that invocation on the defer stack, so anything behind it in the stack still won't run.
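
A minimal sketch of that quick win, under the assumptions stated in the comments (shutdownAppInsights below is a hypothetical stand-in for Retina's ShutdownAppInsights()): moving the body into a run() function that returns an error lets the defers fire, and main still exits with code 1.

```go
// Minimal sketch: run() returns an error instead of calling os.Exit, so
// deferred cleanup runs, and main sets the exit code afterwards.
package main

import (
	"errors"
	"fmt"
	"os"
)

func shutdownAppInsights() { fmt.Println("telemetry flushed") } // stand-in

func run() error {
	defer shutdownAppInsights() // runs on every return path, unlike with os.Exit
	// ... operator setup ...
	return errors.New("failed to start manager") // formerly an os.Exit(1) site
}

func main() {
	if err := run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // exit code 1 preserved, after the defers in run() have executed
	}
}
```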

@mereta mereta linked a pull request Jan 16, 2025 that will close this issue
@ibezrukavyi ibezrukavyi added this to the 1.0 milestone Jan 22, 2025