Skip to content

Commit

Permalink
[service/internal] Allow components to transition from PermanentError…
Browse files Browse the repository at this point in the history
… to Stopping (#10958)

#### Description
In #10058 I mentioned:

> There is a tangentially related issue with PermanentErrors and the
underlying finite state machine that governs transitions between
statuses. Currently, a PermanentError is a final state. That is, once a
component enters this state, no further transitions are allowed. In
light of the work I did on the alternative health check extension, I
believe we should allow a transition from PermanentError to Stopping to
consistently prioritize lifecycle events for components. This transition
also make sense from a practical perspective. A component in a
PermanentError state is one that has been started and is running,
although in a likely degraded state. The collector will call shutdown on
the component (when the collector is shutting down) and we should allow
the status to reflect that.

This PR makes the suggested change and updates the documentation to
reflect that. As this is an internal change, I have not included a
changelog. Also note, we can close #10058 after this as we've already
removed status aggregation from core during the recent component status
refactor.

<!-- Issue number if applicable -->
#### Link to tracking issue
Fixes #10058

<!--Describe what testing was performed and which tests were added.-->
#### Testing
units

<!--Describe the documentation added.-->
#### Documentation
Updated docs/component-status.md and associated diagram.

<!--Please delete paragraphs that you did not use before submitting.-->

Co-authored-by: Tyler Helmuth <[email protected]>
Co-authored-by: Antoine Toulme <[email protected]>
  • Loading branch information
3 people authored Dec 17, 2024
1 parent 4fc50a8 commit 58a5ffc
Show file tree
Hide file tree
Showing 5 changed files with 16 additions and 7 deletions.
3 changes: 2 additions & 1 deletion docs/component-status.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ There is a finite state machine underlying the status reporting API that governs

![State Diagram](img/component-status-state-diagram.png)

The finite state machine ensures that components progress through the lifecycle properly and it manages transitions through runtime states so that components do not need to track their state internally. Only changes in status result in new events being generated; repeat reports of the same status are ignored. PermanentError and FatalError are permanent runtime states. A component in these states cannot make any further state transitions.
The finite state machine ensures that components progress through the lifecycle properly and it manages transitions through runtime states so that components do not need to track their state internally. Only changes in status result in new events being generated; repeat reports of the same status are ignored. PermanentError is a permanent runtime state. A component in a PermanentError state cannot transtion to OK or RecoverableError, but it can transition to Stopping. FatalError is a final state. A component in a FatalError state cannot make any further state transitions.

![Status Event Generation](img/component-status-event-generation.png)

Expand All @@ -61,6 +61,7 @@ Under most circumstances, a component does not need to report explicit status du
**Runtime**

![Runtime State Diagram](img/component-status-runtime-states.png)

During runtime a component should not have to keep track of its state. A component should report status as operations succeed or fail and the finite state machine will handle the rest. Changes in status will result in new status events being emitted. Repeat reports of the same status will no-op. Similarly, attempts to make an invalid state transition, such as PermanentError to OK, will have no effect.

We intend to define guidelines to help component authors distinguish between recoverable and permanent errors on a per-component type basis and we'll update this document as we make decisions. See [this issue](https://github.com/open-telemetry/opentelemetry-collector/issues/9957) for current thoughts and discussions.
Expand Down
Binary file modified docs/img/component-status-state-diagram.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
4 changes: 4 additions & 0 deletions service/internal/graph/lifecycle_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -351,6 +351,8 @@ func TestStatusReportedOnStartupShutdown(t *testing.T) {
instanceIDs[eStErr]: {
componentstatus.NewEvent(componentstatus.StatusStarting),
componentstatus.NewPermanentErrorEvent(assert.AnError),
componentstatus.NewEvent(componentstatus.StatusStopping),
componentstatus.NewEvent(componentstatus.StatusStopped),
},
},
startupErr: assert.AnError,
Expand All @@ -362,6 +364,8 @@ func TestStatusReportedOnStartupShutdown(t *testing.T) {
instanceIDs[rStErr]: {
componentstatus.NewEvent(componentstatus.StatusStarting),
componentstatus.NewPermanentErrorEvent(assert.AnError),
componentstatus.NewEvent(componentstatus.StatusStopping),
componentstatus.NewEvent(componentstatus.StatusStopped),
},
instanceIDs[eNoErr]: {
componentstatus.NewEvent(componentstatus.StatusStarting),
Expand Down
6 changes: 4 additions & 2 deletions service/internal/status/status.go
Original file line number Diff line number Diff line change
Expand Up @@ -70,8 +70,10 @@ func newFSM(onTransition onTransitionFunc) *fsm {
componentstatus.StatusFatalError: {},
componentstatus.StatusStopping: {},
},
componentstatus.StatusPermanentError: {},
componentstatus.StatusFatalError: {},
componentstatus.StatusPermanentError: {
componentstatus.StatusStopping: {},
},
componentstatus.StatusFatalError: {},
componentstatus.StatusStopping: {
componentstatus.StatusRecoverableError: {},
componentstatus.StatusPermanentError: {},
Expand Down
10 changes: 6 additions & 4 deletions service/internal/status/status_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -75,17 +75,19 @@ func TestStatusFSM(t *testing.T) {
expectedErrorCount: 2,
},
{
name: "PermanentError is terminal",
name: "PermanentError is stoppable",
reportedStatuses: []componentstatus.Status{
componentstatus.StatusStarting,
componentstatus.StatusOK,
componentstatus.StatusPermanentError,
componentstatus.StatusOK,
componentstatus.StatusStopping,
},
expectedStatuses: []componentstatus.Status{
componentstatus.StatusStarting,
componentstatus.StatusOK,
componentstatus.StatusPermanentError,
componentstatus.StatusStopping,
},
expectedErrorCount: 1,
},
Expand Down Expand Up @@ -154,7 +156,7 @@ func TestValidSeqsToStopped(t *testing.T) {
}

for _, ev := range events {
name := fmt.Sprintf("transition from: %s to: %s invalid", ev.Status(), componentstatus.StatusStopped)
name := fmt.Sprintf("transition from: %s to: %s", ev.Status(), componentstatus.StatusStopped)
t.Run(name, func(t *testing.T) {
fsm := newFSM(func(*componentstatus.Event) {})
if ev.Status() != componentstatus.StatusStarting {
Expand All @@ -165,9 +167,9 @@ func TestValidSeqsToStopped(t *testing.T) {
err := fsm.transition(componentstatus.NewEvent(componentstatus.StatusStopped))
require.ErrorIs(t, err, errInvalidStateTransition)

// stopping -> stopped is allowed for non-fatal, non-permanent errors
// stopping -> stopped is allowed for non-fatal errors
err = fsm.transition(componentstatus.NewEvent(componentstatus.StatusStopping))
if ev.Status() == componentstatus.StatusPermanentError || ev.Status() == componentstatus.StatusFatalError {
if ev.Status() == componentstatus.StatusFatalError {
require.ErrorIs(t, err, errInvalidStateTransition)
} else {
require.NoError(t, err)
Expand Down

0 comments on commit 58a5ffc

Please sign in to comment.