
Add a way to introspect failing/restarting services #104

Open
benhoyt opened this issue Feb 11, 2022 · 5 comments
Labels
Feature (A feature request), Low Priority (The opposite of "Priority"), Needs Design (Needs a design or spec)

Comments


benhoyt commented Feb 11, 2022

Per discussion at #86 (comment), it wasn't obvious when ServiceInfo.Restarts was reset to 0 (if at all), and whether this was the right thing to expose. Knowing whether Pebble has restarted 1000 times vs 1001 isn't particularly useful, for example.

On the other hand, when debugging a f(l)ailing service, we want the user to be able to see how often it's restarting. We kind of want a differential, like "restarts per minute" or "restarts in the last 10 minutes". But that might be just as hard to explain, so maybe the absolute number is simplest and the user can see how fast it's going up by querying over time, or by knowing when Pebble was started.

I lean towards keeping this field -- it seems quite useful to me -- and having it never reset (until Pebble restarts). It would also be exposed in pebble services.


benhoyt commented Mar 9, 2022

At the Frankfurt sprint, we discussed a few potential options, leaning towards using the existing "change" system so that things are tracked:

  1. Add Restarts and RestartsSpan, giving the number of restarts in the last N minutes (e.g., 6 restarts in the last 1 minute). Implementation: a sliding window of restart times, keeping only those in the last N minutes (see the sketch at the end of this comment).
  2. Add a LastStart timestamp (or perhaps Uptime) field. Trivial implementation: just store time.Now() whenever a service is started.
  3. Add a Pebble "change" object whenever we restart the service (viewable with pebble changes). Restarting a service is a fairly invasive operation, so perhaps this isn't too heavy. Implementation: a new auto-restart change type with a single auto-restart task. In servstate/handlers.go, backoffTimeElapsed would create the auto-restart change and immediately start the task, instead of calling startInternal directly.

Or some combination of the above; for example, I currently lean towards doing both 2 and 3: adding LastStart as well as recording a change.
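
For illustration only, here's a minimal sketch of the sliding-window bookkeeping behind option 1. The names here (restartWindow, Record, Count) are hypothetical, not part of Pebble:

    package restarts

    import (
        "sync"
        "time"
    )

    // restartWindow keeps the timestamps of recent restarts and reports how
    // many fall within a configured span (e.g. the last minute).
    type restartWindow struct {
        mu    sync.Mutex
        span  time.Duration
        times []time.Time
    }

    // Record notes that a restart happened at the given time.
    func (w *restartWindow) Record(now time.Time) {
        w.mu.Lock()
        defer w.mu.Unlock()
        w.times = append(w.times, now)
        w.prune(now)
    }

    // Count returns the number of restarts within the last span.
    func (w *restartWindow) Count(now time.Time) int {
        w.mu.Lock()
        defer w.mu.Unlock()
        w.prune(now)
        return len(w.times)
    }

    // prune drops timestamps older than span; callers must hold mu.
    func (w *restartWindow) prune(now time.Time) {
        cutoff := now.Add(-w.span)
        i := 0
        for i < len(w.times) && w.times[i].Before(cutoff) {
            i++
        }
        w.times = w.times[i:]
    }

Reporting Restarts would then be w.Count(time.Now()), with RestartsSpan exposing w.span.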


benhoyt commented Mar 29, 2022

Notes from meeting just now:

  • Have a change that's "recovering this service" when it enters the backoff loop, then one task per restart? (Sketched below.)
  • But the last task would stay open (in Doing state), meaning "I'm waiting and monitoring the service". Until the backoff is reset after backoff-limit time and you go Done.
  • Alternatively: a single task with a number of restarts and a log? Maybe try them both and see what looks/works better.
  • StartTime on the services API would be useful too.
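
As a rough sketch of the first two bullets (hypothetical types, not Pebble's actual change/task API): one "recover" change is opened when the service enters the backoff loop, each auto-restart appends a task, the latest task stays in Doing while we wait and monitor, and everything goes Done once the backoff resets:

    package recovery

    import "time"

    // Status mirrors the Doing/Done states discussed above.
    type Status string

    const (
        Doing Status = "Doing"
        Done  Status = "Done"
    )

    // Task and Change are stand-ins for Pebble's change/task objects, used
    // only to illustrate the shape of the idea.
    type Task struct {
        Summary string
        Status  Status
        Spawn   time.Time
    }

    type Change struct {
        Kind    string
        Summary string
        Tasks   []*Task
    }

    // RecordRestart is called each time the service is auto-restarted while in
    // the backoff loop: the previous restart task goes Done and a new task is
    // added that stays Doing ("waiting and monitoring the service").
    func (c *Change) RecordRestart(service string, now time.Time) {
        if n := len(c.Tasks); n > 0 {
            c.Tasks[n-1].Status = Done
        }
        c.Tasks = append(c.Tasks, &Task{
            Summary: `Restart service "` + service + `"`,
            Status:  Doing,
            Spawn:   now,
        })
    }

    // BackoffReset is called once the service has run for backoff-limit time
    // without failing; the open task (and hence the change) can then go Done.
    func (c *Change) BackoffReset() {
        if n := len(c.Tasks); n > 0 {
            c.Tasks[n-1].Status = Done
        }
    }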

benhoyt added a commit to benhoyt/pebble that referenced this issue May 2, 2022
This will enable debugging/introspection on whether a service is
failing and being auto-restarted. You can introspect it with the
changes API or the "pebble changes" and "pebble tasks" CLI commands.

The kind (type) of the changes is "recover" and the kind of the tasks
is "restart".

Example output while a service is being recovered (change 3 is not yet ready):

$ pebble changes
ID   Status  Spawn                Ready                Summary
1    Done    today at 16:08 NZST  today at 16:08 NZST  Autostart service "test2"
2    Done    today at 16:08 NZST  today at 16:09 NZST  Recover service "test2"
3    Doing   today at 16:09 NZST  -                    Recover service "test2"

After it's recovered, the change is marked done/Ready:

$ pebble changes
ID   Status  Spawn                Ready                Summary
1    Done    today at 16:08 NZST  today at 16:08 NZST  Autostart service "test2"
2    Done    today at 16:08 NZST  today at 16:09 NZST  Recover service "test2"
3    Done    today at 16:09 NZST  today at 16:10 NZST  Recover service "test2"

$ pebble tasks 2
Status  Spawn                Ready                Summary
Done    today at 16:08 NZST  today at 16:08 NZST  Restart service "test2"
Done    today at 16:08 NZST  today at 16:08 NZST  Restart service "test2"
Done    today at 16:09 NZST  today at 16:09 NZST  Restart service "test2"

$ pebble tasks 3
Status  Spawn                Ready                Summary
Done    today at 16:09 NZST  today at 16:09 NZST  Restart service "test2"

Fixes canonical#104
@benhoyt benhoyt changed the title from "Add something similar to "Restarts" back into Services API" to "Add a way to introspect failing/restarting services" on Jun 20, 2022

benhoyt commented Jun 20, 2022

Per a comment on #117, we're planning to do this with events (a new concept) instead, and have started spec JU048 to design that.

niemeyer pushed a commit that referenced this issue Dec 5, 2022
This PR adds a current-since field to the API, meaning the time at which the "current" field last changed, for example when the service started (current changed from inactive to active), and so on. It's related to #104 (though in a previous iteration it was "start time", not "current-since").

Here's how it looks in the CLI:

    $ pebble services
    Service  Startup  Current  Since
    test2    enabled  active   today at 17:08 NZST

    $ pebble services --abs-time
    Service  Startup  Current  Since
    test2    enabled  active   2022-04-28T17:08:45+12:00

This PR also removes the "restarts" field, which isn't being used and was internal-only. Per #104, we're going to use changes/tasks or events instead.
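
For completeness, the same information can be read straight from the services endpoint over Pebble's API socket. The socket path below assumes the default $PEBBLE directory, and the JSON is illustrative rather than copied from the PR:

    $ curl --unix-socket /var/lib/pebble/default/.pebble.socket \
          http://localhost/v1/services
    {"type": "sync", "status-code": 200, "status": "OK", "result": [
        {"name": "test2", "startup": "enabled", "current": "active",
         "current-since": "2022-04-28T17:08:45+12:00"}]}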

benhoyt commented Jan 25, 2023

Similar to what we're planning with check failures (see that comment), where we came full circle back to Changes and Tasks: can we figure out a way to do this with Changes and Tasks (but without creating huge numbers of changes in case of a bad failure)?

@benhoyt benhoyt added the 24.04 label Mar 13, 2024
@benhoyt benhoyt assigned benhoyt and unassigned benhoyt Mar 13, 2024
@benhoyt benhoyt added the Feature and Needs Design labels and removed the 24.04 label Mar 13, 2024

benhoyt commented Mar 13, 2024

We're going to start with doing this for health checks (spec here). This still seems useful but is lower priority.

@benhoyt benhoyt added the Low Priority label Sep 30, 2024