Skip to content
This repository has been archived by the owner on Jun 4, 2024. It is now read-only.

fix leaking go-routines in event-handler watcher #963

Conversation

miguelvalerio
Copy link
Contributor

Description

We had a misconfiguration that broke the communication between the event-handler and teleport, which in our case lead to the memory to quickly grow to unreasonable values.

Upon further inspection, we believe that this was due to the fact that the Events call is creating a go-routine that is not terminating when the last event on a page was processed and calling fetch returns an error. Since in our case the error returned to the EventsJob was considered a connetion problem, this means that Events would be called yet again, leading to go-routines that would not exit until the connetion issue was fixed being continuously spawned.

This PR attempts to fix that

Additional Info

Memory:
image

Logs:

2023-10-12T21:35:46+01:00       Stack Trace:
2023-10-12T21:35:46+01:00       Original Error: *trace.TraceErr breaker is tripped
2023-10-12T21:35:46+01:00       ERROR REPORT:
2023-10-12T21:35:46+01:00       ERRO   Failed to connect to Teleport Auth server. Reconnecting... error:[
2023-10-12T21:35:46+01:00       ERRO   Error ingesting Audit Log err:breaker is tripped event-handler/events_job.go:96
2023-10-12T21:35:46+01:00       User Message: breaker is tripped] event-handler/events_job.go:69
2023-10-12T21:35:46+01:00               /usr/local/go/src/runtime/asm_amd64.s:1598 runtime.goexit
2023-10-12T21:35:46+01:00               /workspace/event-handler/teleport_events_watcher.go:281 main.(*TeleportEventsWatcher).Events.func1
2023-10-12T21:35:46+01:00               /workspace/event-handler/teleport_events_watcher.go:141 main.(*TeleportEventsWatcher).fetch
2023-10-12T21:35:46+01:00               /workspace/event-handler/teleport_events_watcher.go:213 main.(*TeleportEventsWatcher).getEvents
2023-10-12T21:35:46+01:00               /go/pkg/mod/github.com/gravitational/teleport/[email protected]/client/client.go:2227 github.com/gravitational/teleport/api/client.(*Client).SearchUnstructuredEvents
2023-10-12T21:35:46+01:00       Stack Trace:
2023-10-12T21:35:46+01:00       Original Error: *trace.TraceErr breaker is tripped
2023-10-12T21:35:46+01:00       ERROR REPORT:
2023-10-12T21:35:46+01:00       ERRO   Failed to connect to Teleport Auth server. Reconnecting... error:[
2023-10-12T21:35:46+01:00       ERRO   Error ingesting Audit Log err:breaker is tripped event-handler/events_job.go:96
2023-10-12T21:35:46+01:00       User Message: breaker is tripped] event-handler/events_job.go:69
2023-10-12T21:35:46+01:00               /usr/local/go/src/runtime/asm_amd64.s:1598 runtime.goexit
2023-10-12T21:35:46+01:00               /workspace/event-handler/teleport_events_watcher.go:281 main.(*TeleportEventsWatcher).Events.func1
2023-10-12T21:35:46+01:00               /workspace/event-handler/teleport_events_watcher.go:141 main.(*TeleportEventsWatcher).fetch
2023-10-12T21:35:46+01:00               /workspace/event-handler/teleport_events_watcher.go:213 main.(*TeleportEventsWatcher).getEvents
2023-10-12T21:35:46+01:00               /go/pkg/mod/github.com/gravitational/teleport/[email protected]/client/client.go:2227 github.com/gravitational/teleport/api/client.(*Client).SearchUnstructuredEvents
2023-10-12T21:35:46+01:00       Stack Trace:
2023-10-12T21:35:46+01:00       Original Error: *trace.TraceErr breaker is tripped
2023-10-12T21:35:46+01:00       ERROR REPORT:
2023-10-12T21:35:46+01:00       ERRO   Failed to connect to Teleport Auth server. Reconnecting... error:[
2023-10-12T21:35:46+01:00       ERRO   Error ingesting Audit Log err:breaker is tripped event-handler/events_job.go:96
2023-10-12T21:35:46+01:00       User Message: breaker is tripped] event-handler/events_job.go:69
2023-10-12T21:35:46+01:00               /usr/local/go/src/runtime/asm_amd64.s:1598 runtime.goexit
2023-10-12T21:35:46+01:00               /workspace/event-handler/teleport_events_watcher.go:281 main.(*TeleportEventsWatcher).Events.func1
2023-10-12T21:35:46+01:00               /workspace/event-handler/teleport_events_watcher.go:141 main.(*TeleportEventsWatcher).fetch
2023-10-12T21:35:46+01:00               /workspace/event-handler/teleport_events_watcher.go:213 main.(*TeleportEventsWatcher).getEvents
2023-10-12T21:35:46+01:00               /go/pkg/mod/github.com/gravitational/teleport/[email protected]/client/client.go:2227 github.com/gravitational/teleport/api/client.(*Client).SearchUnstructuredEvents
2023-10-12T21:35:46+01:00       Stack Trace:
2023-10-12T21:35:46+01:00       Original Error: *trace.TraceErr breaker is tripped
2023-10-12T21:35:46+01:00       ERROR REPORT:
2023-10-12T21:35:46+01:00       ERRO   Failed to connect to Teleport Auth server. Reconnecting... error:[
2023-10-12T21:35:46+01:00       ERRO   Error ingesting Audit Log err:breaker is tripped event-handler/events_job.go:96
2023-10-12T21:35:46+01:00       User Message: breaker is tripped] event-handler/events_job.go:69
2023-10-12T21:35:46+01:00               /usr/local/go/src/runtime/asm_amd64.s:1598 runtime.goexit
2023-10-12T21:35:46+01:00               /workspace/event-handler/teleport_events_watcher.go:281 main.(*TeleportEventsWatcher).Events.func1
2023-10-12T21:35:46+01:00               /workspace/event-handler/teleport_events_watcher.go:141 main.(*TeleportEventsWatcher).fetch
2023-10-12T21:35:46+01:00               /workspace/event-handler/teleport_events_watcher.go:213 main.(*TeleportEventsWatcher).getEvents
2023-10-12T21:35:46+01:00               /go/pkg/mod/github.com/gravitational/teleport/[email protected]/client/client.go:2227 github.com/gravitational/teleport/api/client.(*Client).SearchUnstructuredEvents
2023-10-12T21:35:46+01:00       Stack Trace:
2023-10-12T21:35:46+01:00       Original Error: *trace.TraceErr breaker is tripped
2023-10-12T21:35:46+01:00       ERROR REPORT:
2023-10-12T21:35:46+01:00       ERRO   Failed to connect to Teleport Auth server. Reconnecting... error:[
2023-10-12T21:35:46+01:00       ERRO   Error ingesting Audit Log err:breaker is tripped event-handler/events_job.go:96
2023-10-12T21:35:46+01:00       User Message: breaker is tripped] event-handler/events_job.go:69
2023-10-12T21:35:46+01:00               /usr/local/go/src/runtime/asm_amd64.s:1598 runtime.goexit
2023-10-12T21:35:46+01:00               /workspace/event-handler/teleport_events_watcher.go:281 main.(*TeleportEventsWatcher).Events.func1
2023-10-12T21:35:46+01:00               /workspace/event-handler/teleport_events_watcher.go:141 main.(*TeleportEventsWatcher).fetch
2023-10-12T21:35:46+01:00               /workspace/event-handler/teleport_events_watcher.go:213 main.(*TeleportEventsWatcher).getEvents
2023-10-12T21:35:46+01:00               /go/pkg/mod/github.com/gravitational/teleport/[email protected]/client/client.go:2227 github.com/gravitational/teleport/api/client.(*Client).SearchUnstructuredEvents
2023-10-12T21:35:46+01:00       Stack Trace:
2023-10-12T21:35:46+01:00       Original Error: *trace.TraceErr breaker is tripped
2023-10-12T21:35:46+01:00       ERROR REPORT:
2023-10-12T21:35:46+01:00       ERRO   Failed to connect to Teleport Auth server. Reconnecting... error:[
2023-10-12T21:35:46+01:00       ERRO   Error ingesting Audit Log err:breaker is tripped event-handler/events_job.go:96
2023-10-12T21:35:46+01:00       User Message: breaker is tripped] event-handler/events_job.go:69
2023-10-12T21:35:46+01:00               /usr/local/go/src/runtime/asm_amd64.s:1598 runtime.goexit
2023-10-12T21:35:46+01:00               /workspace/event-handler/teleport_events_watcher.go:281 main.(*TeleportEventsWatcher).Events.func1
2023-10-12T21:35:46+01:00               /workspace/event-handler/teleport_events_watcher.go:141 main.(*TeleportEventsWatcher).fetch
2023-10-12T21:35:46+01:00               /workspace/event-handler/teleport_events_watcher.go:213 main.(*TeleportEventsWatcher).getEvents
2023-10-12T21:35:46+01:00               /go/pkg/mod/github.com/gravitational/teleport/[email protected]/client/client.go:2227 github.com/gravitational/teleport/api/client.(*Client).SearchUnstructuredEvents

@miguelvalerio miguelvalerio marked this pull request as ready for review October 21, 2023 16:51
Copy link
Contributor

@zmb3 zmb3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! A few suggestions - let me know if you're interested in addressing them or if you want us to take it over.

func (c *errMockTeleportEventWatcher) SearchUnstructuredEvents(ctx context.Context, fromUTC, toUTC time.Time, namespace string, eventTypes []string, limit int, order types.EventOrder, startKey string) ([]*auditlogpb.EventUnstructured, string, error) {
if !c.called {
e := c.events
c.events = make([]events.AuditEvent, 0) // nullify events
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
c.events = make([]events.AuditEvent, 0) // nullify events
c.events = nil

event-handler/teleport_events_watcher_test.go Show resolved Hide resolved

// SearchEvents is mock SearchEvents method which returns events
func (c *errMockTeleportEventWatcher) SearchUnstructuredEvents(ctx context.Context, fromUTC, toUTC time.Time, namespace string, eventTypes []string, limit int, order types.EventOrder, startKey string) ([]*auditlogpb.EventUnstructured, string, error) {
if !c.called {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would invert this check.

Suggested change
if !c.called {
if c.called {
return errors.New("...")
}

return e, "test", nil
}

// StreamSessionEvents returns session events stream
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// StreamSessionEvents returns session events stream

I don't think we need godocs on the no-op implementations.

event-handler/teleport_events_watcher_test.go Outdated Show resolved Hide resolved
Comment on lines 233 to 217
defer cancel()

t.Cleanup(cancel)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to call cancel twice. Either one of these is fine, so pick one and get rid of the other.

chEvt, _ := client.Events(ctx)
// we're assuming that a closed channel == closed goroutine
for range chEvt {
require.Fail(t, "received unexpected event")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't fail a test in a goroutine like this. It's only safe to fail the test from the test's main goroutine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I thought I was using assert. I just ended up removing this call, don't think we need it.

event-handler/teleport_events_watcher_test.go Show resolved Hide resolved
}()
}

select {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably simplify this and remove the waitgroup entirely. Each goroutine can write to allDone when it finishes, and then here we just wait for all 5.

	for i := 0; i < 5; i++ {
		select {
		case <-allDone:
		case <-ctx.Done():
			require.Fail(...)
		}
	}

If you decide to take this route, I would ignore my prior comments about removing the constant for nItrs since it would be used in multiple places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ended up going with this option, let me know what you think.

@miguelvalerio miguelvalerio force-pushed the fix-event-handler-leaking-go-routines branch from 7c42516 to d8e4fa6 Compare October 22, 2023 00:14
Copy link
Contributor

@zmb3 zmb3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking much better - a couple more comments.

require.NoError(t, err)
case e := <-chEvt:
require.NotNil(t, e.Event)
require.Equal(t, e.ID, "1")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
require.Equal(t, e.ID, "1")
require.Equal(t, "1", e.ID)

Comment on lines 212 to 214
const (
nIters = 5
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this right before the loop where it is first used, and declare it on a single line.

const nIters = 5


chEvt, chErr := client.Events(ctx)

select {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one of these cases will be hit - which one do we expect? I would write the test in a way that it fails if an unexpected case is executed, rather than writing it so that it can succeed in multiple ways.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code, I think we don't ever receive a nil error in the chErr, so I think probably changing the corresponding case to just be a require.Fail should be fine. Do you agree? If so I wouldn't mind doing this change for the existing test from which I copied this boilerplate as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me.

@miguelvalerio miguelvalerio force-pushed the fix-event-handler-leaking-go-routines branch from d8e4fa6 to fdee29d Compare October 23, 2023 15:37
@zmb3
Copy link
Contributor

zmb3 commented Oct 28, 2023

@Joerger can you have a look and buddy this next week?

@Joerger
Copy link
Contributor

Joerger commented Nov 3, 2023

Hi @miguelvalerio, thank you for your contribution!

I've opened a buddy PR to get this merged after internal review - #966

@Joerger Joerger closed this Nov 3, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants