-
Notifications
You must be signed in to change notification settings - Fork 78
fix leaking go-routines in event-handler watcher #963
fix leaking go-routines in event-handler watcher #963
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! A few suggestions - let me know if you're interested in addressing them or if you want us to take it over.
func (c *errMockTeleportEventWatcher) SearchUnstructuredEvents(ctx context.Context, fromUTC, toUTC time.Time, namespace string, eventTypes []string, limit int, order types.EventOrder, startKey string) ([]*auditlogpb.EventUnstructured, string, error) { | ||
if !c.called { | ||
e := c.events | ||
c.events = make([]events.AuditEvent, 0) // nullify events |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
c.events = make([]events.AuditEvent, 0) // nullify events | |
c.events = nil |
|
||
// SearchEvents is mock SearchEvents method which returns events | ||
func (c *errMockTeleportEventWatcher) SearchUnstructuredEvents(ctx context.Context, fromUTC, toUTC time.Time, namespace string, eventTypes []string, limit int, order types.EventOrder, startKey string) ([]*auditlogpb.EventUnstructured, string, error) { | ||
if !c.called { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would invert this check.
if !c.called { | |
if c.called { | |
return errors.New("...") | |
} |
return e, "test", nil | ||
} | ||
|
||
// StreamSessionEvents returns session events stream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// StreamSessionEvents returns session events stream |
I don't think we need godocs on the no-op implementations.
defer cancel() | ||
|
||
t.Cleanup(cancel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to call cancel twice. Either one of these is fine, so pick one and get rid of the other.
chEvt, _ := client.Events(ctx) | ||
// we're assuming that a closed channel == closed goroutine | ||
for range chEvt { | ||
require.Fail(t, "received unexpected event") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't fail a test in a goroutine like this. It's only safe to fail the test from the test's main goroutine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, I thought I was using assert
. I just ended up removing this call, don't think we need it.
}() | ||
} | ||
|
||
select { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could probably simplify this and remove the waitgroup entirely. Each goroutine can write to allDone
when it finishes, and then here we just wait for all 5.
for i := 0; i < 5; i++ {
select {
case <-allDone:
case <-ctx.Done():
require.Fail(...)
}
}
If you decide to take this route, I would ignore my prior comments about removing the constant for nItrs
since it would be used in multiple places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ended up going with this option, let me know what you think.
7c42516
to
d8e4fa6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking much better - a couple more comments.
require.NoError(t, err) | ||
case e := <-chEvt: | ||
require.NotNil(t, e.Event) | ||
require.Equal(t, e.ID, "1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
require.Equal(t, e.ID, "1") | |
require.Equal(t, "1", e.ID) |
const ( | ||
nIters = 5 | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move this right before the loop where it is first used, and declare it on a single line.
const nIters = 5
|
||
chEvt, chErr := client.Events(ctx) | ||
|
||
select { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only one of these cases will be hit - which one do we expect? I would write the test in a way that it fails if an unexpected case is executed, rather than writing it so that it can succeed in multiple ways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code, I think we don't ever receive a nil error in the chErr
, so I think probably changing the corresponding case
to just be a require.Fail
should be fine. Do you agree? If so I wouldn't mind doing this change for the existing test from which I copied this boilerplate as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable to me.
d8e4fa6
to
fdee29d
Compare
@Joerger can you have a look and buddy this next week? |
Hi @miguelvalerio, thank you for your contribution! I've opened a buddy PR to get this merged after internal review - #966 |
Description
We had a misconfiguration that broke the communication between the
event-handler
andteleport
, which in our case lead to the memory to quickly grow to unreasonable values.Upon further inspection, we believe that this was due to the fact that the
Events
call is creating a go-routine that is not terminating when the last event on a page was processed and callingfetch
returns an error. Since in our case the error returned to theEventsJob
was considered a connetion problem, this means thatEvents
would be called yet again, leading to go-routines that would not exit until the connetion issue was fixed being continuously spawned.This PR attempts to fix that
Additional Info
Memory:
Logs: