
Add a way to introspect failing/restarting services #104

Open
benhoyt opened this issue Feb 11, 2022 · 5 comments
Labels
Feature (A feature request), Low Priority (The opposite of "Priority"), Needs Design (Needs a design or spec)

Comments


benhoyt commented Feb 11, 2022

Per discussion at #86 (comment), it wasn't obvious when ServiceInfo.Restarts was reset to 0 (if at all), and whether this was the right thing to expose. Knowing whether Pebble has restarted 1000 times vs 1001 isn't particularly useful, for example.

On the other hand, when debugging a f(l)ailing service, we want the user to be able to see how often it's restarting. We kind of want a differential, like "restarts per minute" or "restarts in the last 10 minutes". But that might be just as hard to explain, so maybe the absolute number is simplest and the user can see how fast it's going up by querying over time, or by knowing when Pebble was started.

I lean towards keeping this field -- it seems quite useful to me -- and having it never reset (until Pebble restarts). It would also be exposed in pebble services.


benhoyt commented Mar 9, 2022

At the Frankfurt sprint, we discussed a few potential options, leaning towards using the existing "change" system so that things are tracked:

  1. Add Restarts and RestartsSpan, giving the number of restarts in the last N minutes (e.g., 6 restarts in the last 1 minute). Implementation: a sliding window of restart times, keeping only those in the last N minutes (see the sketch at the end of this comment).
  2. Add a LastStart timestamp (or perhaps Uptime) field. Trivial implementation: just store time.Now() whenever a service is started.
  3. Add a Pebble "change" object whenever we restart the service (viewable with pebble changes). Restarting a service is a fairly invasive operation, so perhaps this isn't too heavy. Implementation: a new auto-restart change type with a single auto-restart task. In servstate/handlers.go, backoffTimeElapsed would create the auto-restart change and immediately start the task, instead of calling startInternal directly.

Or some combination of the above; for example, I currently lean towards doing both 2 and 3: adding LastStart as well as recording a change.
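
For illustration only, here's a minimal sketch of the sliding-window bookkeeping behind option 1. The names here (restartWindow, Record, Count) are hypothetical, not part of Pebble:

    package restarts

    import (
        "sync"
        "time"
    )

    // restartWindow keeps the timestamps of recent restarts and reports how
    // many fall within a configured span (e.g. the last minute).
    type restartWindow struct {
        mu    sync.Mutex
        span  time.Duration
        times []time.Time
    }

    // Record notes that a restart happened at the given time.
    func (w *restartWindow) Record(now time.Time) {
        w.mu.Lock()
        defer w.mu.Unlock()
        w.times = append(w.times, now)
        w.prune(now)
    }

    // Count returns the number of restarts within the last span.
    func (w *restartWindow) Count(now time.Time) int {
        w.mu.Lock()
        defer w.mu.Unlock()
        w.prune(now)
        return len(w.times)
    }

    // prune drops timestamps older than span; callers must hold mu.
    func (w *restartWindow) prune(now time.Time) {
        cutoff := now.Add(-w.span)
        i := 0
        for i < len(w.times) && w.times[i].Before(cutoff) {
            i++
        }
        w.times = w.times[i:]
    }

Reporting Restarts would then be w.Count(time.Now()), with RestartsSpan exposing w.span.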


benhoyt commented Mar 29, 2022

Notes from meeting just now:

  • Have a change that's "recovering this service" when it enters the backoff loop, then one task per restart? (Sketched below.)
  • But the last task would stay open (in Doing state), meaning "I'm waiting and monitoring the service". Until the backoff is reset after backoff-limit time and you go Done.
  • Alternatively: a single task with a number of restarts and a log? Maybe try them both and see what looks/works better.
  • StartTime on the services API would be useful too.
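
As a rough sketch of the first two bullets (hypothetical types, not Pebble's actual change/task API): one "recover" change is opened when the service enters the backoff loop, each auto-restart appends a task, the latest task stays in Doing while we wait and monitor, and everything goes Done once the backoff resets:

    package recovery

    import "time"

    // Status mirrors the Doing/Done states discussed above.
    type Status string

    const (
        Doing Status = "Doing"
        Done  Status = "Done"
    )

    // Task and Change are stand-ins for Pebble's change/task objects, used
    // only to illustrate the shape of the idea.
    type Task struct {
        Summary string
        Status  Status
        Spawn   time.Time
    }

    type Change struct {
        Kind    string
        Summary string
        Tasks   []*Task
    }

    // RecordRestart is called each time the service is auto-restarted while in
    // the backoff loop: the previous restart task goes Done and a new task is
    // added that stays Doing ("waiting and monitoring the service").
    func (c *Change) RecordRestart(service string, now time.Time) {
        if n := len(c.Tasks); n > 0 {
            c.Tasks[n-1].Status = Done
        }
        c.Tasks = append(c.Tasks, &Task{
            Summary: `Restart service "` + service + `"`,
            Status:  Doing,
            Spawn:   now,
        })
    }

    // BackoffReset is called once the service has run for backoff-limit time
    // without failing; the open task (and hence the change) can then go Done.
    func (c *Change) BackoffReset() {
        if n := len(c.Tasks); n > 0 {
            c.Tasks[n-1].Status = Done
        }
    }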

benhoyt added a commit to benhoyt/pebble that referenced this issue May 2, 2022
This will enable debugging/introspection on whether a service is
failing and being auto-restarted. You can introspect it with the
changes API or the "pebble changes" and "pebble tasks" CLI commands.

The kind (type) of the changes is "recover" and the kind of the tasks
is "restart".

Example output while a service is being recovered (change 3 is not yet ready):

$ pebble changes
ID   Status  Spawn                Ready                Summary
1    Done    today at 16:08 NZST  today at 16:08 NZST  Autostart service "test2"
2    Done    today at 16:08 NZST  today at 16:09 NZST  Recover service "test2"
3    Doing   today at 16:09 NZST  -                    Recover service "test2"

After it's recovered, the change is marked done/Ready:

$ pebble changes
ID   Status  Spawn                Ready                Summary
1    Done    today at 16:08 NZST  today at 16:08 NZST  Autostart service "test2"
2    Done    today at 16:08 NZST  today at 16:09 NZST  Recover service "test2"
3    Done    today at 16:09 NZST  today at 16:10 NZST  Recover service "test2"

$ pebble tasks 2
Status  Spawn                Ready                Summary
Done    today at 16:08 NZST  today at 16:08 NZST  Restart service "test2"
Done    today at 16:08 NZST  today at 16:08 NZST  Restart service "test2"
Done    today at 16:09 NZST  today at 16:09 NZST  Restart service "test2"

$ pebble tasks 3
Status  Spawn                Ready                Summary
Done    today at 16:09 NZST  today at 16:09 NZST  Restart service "test2"

Fixes canonical#104
@benhoyt benhoyt changed the title from "Add something similar to "Restarts" back into Services API" to "Add a way to introspect failing/restarting services" on Jun 20, 2022

benhoyt commented Jun 20, 2022

Per a comment on #117, we're planning to do this with events (a new concept) instead, and have started spec JU048 to design that.

niemeyer pushed a commit that referenced this issue Dec 5, 2022
This PR adds a current-since field to the API, meaning the time at which the "current" field last changed, for example when the service started (current changed from inactive to active), and so on. It's related to #104 (though in a previous iteration it was "start time", not "current-since").

Here's how it looks in the CLI:

    $ pebble services
    Service  Startup  Current  Since
    test2    enabled  active   today at 17:08 NZST

    $ pebble services --abs-time
    Service  Startup  Current  Since
    test2    enabled  active   2022-04-28T17:08:45+12:00

This PR also removes the "restarts" field, which isn't being used and was internal-only. Per #104, we're going to use changes/tasks or events instead.
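
For completeness, the same information can be read straight from the services endpoint over Pebble's API socket. The socket path below assumes the default $PEBBLE directory, and the JSON is illustrative rather than copied from the PR:

    $ curl --unix-socket /var/lib/pebble/default/.pebble.socket \
          http://localhost/v1/services
    {"type": "sync", "status-code": 200, "status": "OK", "result": [
        {"name": "test2", "startup": "enabled", "current": "active",
         "current-since": "2022-04-28T17:08:45+12:00"}]}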

benhoyt commented Jan 25, 2023

Similar to what we're planning with check failures (see that comment), where we came full circle back to Changes and Tasks: can we figure out a way to do this with Changes and Tasks (but without creating huge numbers of changes in case of a bad failure)?

@benhoyt benhoyt added the 24.04 label Mar 13, 2024
@benhoyt benhoyt assigned benhoyt and unassigned benhoyt Mar 13, 2024
@benhoyt benhoyt added the Feature and Needs Design labels and removed the 24.04 label Mar 13, 2024

benhoyt commented Mar 13, 2024

We're going to start with doing this for health checks (spec here). This still seems useful but is lower priority.

@benhoyt benhoyt added the Low Priority label Sep 30, 2024