Potential goroutine leak when sending events to listeners (subscription.go) #208

sybrandy · 2016-09-27T12:54:53Z

This is different from #199 and was partially discussed in #198 .

The second of code in question currently looks like this:

// step: check if anyone is listen for this event
for channel, filter := range r.listeners {
    // step: check if this listener wants this event type
    if event.ID&filter != 0 {
        go func(ch EventsChannel, e *Event) {
            ch <- e
        }(channel, event)
    }
}

For each event, a new goroutine is created for the sole purpose of sending an event to a channel that a consumer is listening on. #198 describes a race condition that will be fixed, but there's still the issue of if the consumer is slow or stop consuming, idle goroutines will be created and consume resources. While this may not be a major issue, it's still something that should be prevented if possible as it's a poor practice to spawn goroutines that may/will never exit.

An initial attempt to fix #198 in a local project was to remove the use of a goroutine and simply rely on the consumer to ensure that the channel is buffered and properly drained. The current code using this library looks similar to this:

update := make(marathon.EventsChannel, 5)
go func(stop chan bool, client.marathon.Marathon, update marathon.EventsChannel) {
    <-stop
    client.RemoveEventsListener(update)
    close(update)
}
for event := range update {
    // Handle the event here.
}

After running for several days, this is working very well. Unfortunately, it suffers from a significant problem: if a consumer is very slow or stopped consuming, all listeners are blocked.

One thought to resolve this is to create one goroutine per listener. Each of these goroutines would have an internal buffer that would contain unsent events. Some management would be needed to keep the buffer in check. E.g. when full, drop the newest or oldest message. Now, we're down to 1 goroutine per listener vs. N goroutines per event.

Another solution that can work whether goroutines are used or not is to use a timeout when sending. This would ensure that goroutines are not permanent, but this can still be problematic because if the timeout is too short, events will be dropped if the consumer is a bit slow, but still consuming. If it's too long, then everything can slow down significantly if the events are not being consumed. Below is a snippet illustrating this idea:

select {
case eventChan <- event:
    fmt.Println("Event sent!")
case <-time.After(time.Second):
    fmt.Println("Event wasn't sent due to timeout sending the message.")
}

I've also toyed with the idea of having the consumer pass in a function/closure to the a method that will manage reading the events and calling the function, but I'm not sure where I sit on that. While I like the benefit of it simplifying things for the consumer, it may be overkill and I have a nagging feeling that it may result in consumers jumping through hoops under some circumstances.

In my case, the first solution is working very well, but not knowing all use cases, I don't believe it's ideal. The second solution that uses goroutines with buffers is probably the best all-around solution that I can think of right now, but it does introduce some complexity and may be overkill for many cases. Perhaps there's a better solution. I hope this sparks a discussion for the best solution possible.

One last note: in #198, I believe both @timoreimann and myself agree that perhaps the user should not pass in a channel but instead the subscription code should create the channel that will be used. This would help ensure that regardless of the solution used, the channel is correctly configured.

The text was updated successfully, but these errors were encountered:

timoreimann · 2016-09-28T11:15:03Z

As alluded to in the originating issue, I'm not exactly sure how much of a problem it is that we create goroutines without any limits. To me, Marathon brings a natural boundary in the sense that the number of events are somewhat proportional with the number of applications deployed/running. It probably takes a fairly big cluster to reach a level comparable to, say, web traffic.

The other aspect is who should be responsibility for limiting resource usage. Putting this on the go-marathon library is one way; another is to say that it's the duty of the client user to do buffering/throttling according to his/her specific needs. go-marathon would need some input from the user anyway (because it can't judge how big a buffer or how long a timeout should be), so one could argue that instead of pulling this kind of information from the user into the library and designing a sufficiently general solution, we should leave it with the client in the first place.

Speaking of design: We'd also need to decide if the limit should be defined globally, per subscriber, both, or either. The task of throttling could also be pushed back to Marathon by stopping to accept requests/read events, although I'm not sure how well Marathon would handle that. (I suppose it should be capable of that in order to protect itself from uncooperative subscribers.) Finally, note that goroutines are also spawned when using the older callback-based subscription model; if we wanted to limit things properly, we'd have to think of that as well. (A LimitListener would probably help here.) So there are a few things to be considered and weighted against to make sure the level of complexity introduced is really kept in balance with the value we want to get out of this.

That said, a common approach to the problem at hand found in the wild (1)(2) seems to be using channels as semaphores: Send an empty struct to a buffered channel of structs before you want to start a new goroutine, and receive a value once the goroutine has completed. This way, you are never allowed to have more goroutines than what the channel buffer size gives. Of course, it could be combined with discarding events if the goroutine pool is depleted.

sybrandy · 2016-09-29T12:36:39Z

First, understood that this may not be a big problem, but not knowing all use cases and the fact that consumers could stop receiving events suggests to me that things could get out of hand with the current design.

Second, agreed. That's one reason why I proposed having one goroutine per consumer with a fixed-sized buffer in them. With proper rules for dropping events, resource usage should be constant. However, you are right: what is a good size.

Third, I can't really comment on having Marathon do the throttling. I have no idea if it's even possible and I'm not sure if throttling one listener would affect other listeners.

Fourth, I didn't look at the callback code, so good catch there. Whatever solution we come up with probably should be applied there as well.

Fifth, that's an interesting approach. So, if we modify the current design to have a 10 goroutine limit, we'd have at most 10 goroutines running at any time. That's not bad. However, you still have the possibility of one consumer preventing other listeners from getting events if it is slow or dead. Unfortunately, I can't think of a good solution that prevents one consumer from blocking the others, but also doesn't require us mucking around with buffering internal to the library.

Perhaps we need to answer a few questions:

Is it important to get every event? Or, can we drop events until the listener is ready for a new event?
Do we care if listeners are blocked?
If we can come up with a good solution for knowing if a listener is "stopped", should we drop it so that other running listeners can continue to work?

IMHO, I'm currently thinking that if we isolate the listeners from each other and used a reasonably sized buffered channel per listener, with the ability to discard events if the channel is full, that would work very well. If it's a reasonable default size, I don't think people will care too much. 10? 20? 50? Regardless, the consumer reads from the channel and does it's thing. If the consumer is truly that slow, then perhaps it should be on the implementer to buffer if they truly care about every event. Having a small buffer just ensures that if they're a little slow, they won't necessarily miss much. This would also work if the channel isn't buffered. It just means that more events would be dropped.

timoreimann · 2016-10-06T09:57:18Z

Sorry for the long delay. Life in the closed-source world was keeping me busy. :-)

You raised a number of good question. I can extend the list by a few more:

What strategy should be followed if the buffer is full? Options coming to my mind are drop oldest, drop youngest, or drop any.
Do we want to prescribe buffers to all users? What about those cases where users don't want or need buffers? (This could be containerized environments where resource consumption is usually constrained and enforced by an orchestration system anyway)
is one buffer per listener desirable for all users? While it's certainly helpful in isolating users, it also makes limiting on a global level across all users harder.
Do we want to enable users to get notified when the buffer is full? What about when it reaches a certain percentage? If we don't, then users might not be aware they are running out of buffer space.

That's a good number of design choices to be made. We'd either have to build a pretty flexible system or make assumptions. Going with the first option means we add a fair amount of complexity, while the latter means we may not hit the right use cases.

I think that we shouldn't go down either route. It seems much simpler and generic to just make sure that go-marathon can deliver all events as quickly as possible (which we hopefully do already by spawning one goroutine per deliverable event) and leave it to the consumer to build in further reliability and/or throttling measures. A better, more accurate assessment could possibly be made if we had a good number of concrete use cases to work with. We don't have these, however, and thus my feeling tells me not to rush into action but stick with the status quo.

That's my 2 cents. Happy to hear your further thoughts on this. Maybe @gambol99 wants to chime in as well.

sybrandy · 2016-10-06T12:35:15Z

I understand your concern. Unfortunately, I only have the one use case and I see a lot of events coming through our system. It's in batches and a lot of them are health checks, but we do run over 2K containers that do get restarted periodically and at different rates.

Thinking about this, I think I may have a relatively sane solution. Let's say we have an interface that looks something like this (please forgive any typoes/naming mistakes):

type EventBuffer interface {
    Send(e Event) error
    Get() (Event, error)
    Done() bool
}

So, when an Event listener is added, an EventBuffer is created that handles the buffering of the events. Events are sent to the EventBuffer and the client gets events from the EventBuffer. E.g.

// Send an event
for buffer, filter := range r.listeners {
    // step: check if this listener wants this event type
    if event.ID&filter != 0 {
        buffer.Send(event)
    }
}

// Get an event
for {
    if eb.Done() {
        break
    }
    event, err := eb.Get()
    // Do something with the event here...
}

And yes, I do think more thought needs to be put into the design of the interface. It's just a quick example.

So, what does that give us? A reasonable default as well as a few alternatives can be provided. For example, perhaps the default the status quo: every time an event comes in, the buffer creates a new goroutine that attempts to write to a channel that is read from by the Get() method. A version that uses a buffer could also be created that would allow a user to configure the size of the buffer and which messages to drop when it's full or simply block until there's room.

I think that using an interface allows for a reasonable default to be provided while still allowing a user to easily override it. Not everyone is going to care too much about how the buffering works, but there are going to be people who need to be able to tune their code.

This doesn't necessarily solve the issue you have with global resource usage unless you think that we use one EventBuffer "object" to handle multiple listeners. E.g. when I create a new subscription, I define what type of event buffer to use and all listeners use the same instance of that EventBuffer.

Thoughts? I'm trying to find that middle-ground where things are kept simple, but not too simple where any bottlenecks or undesirable behaviour cannot be fixed or overridden without forking the code.

timoreimann · 2016-10-26T19:20:51Z

Your design sketch does seem reasonable on first sight. I am still concerned though that we invest time and energy into something which might not be a problem for too many users. Maybe that's because people are normally not running that many containers, or they have found alternatives to deal with the problem (like doing the throttling on the consumer side). I wouldn't even mind if we build out this feature to a certain point if we knew about a concrete user suffering from the problem and reasoning why go-marathon should help alleviate the pain. The only thing we seem to have though is vague ideas of what users may want.

You mention that you are running 2,000 containers in your environment. May I ask if the problem described here has hit you already? If so, to what extent? You also mention there's a fair number of health check events involved. Is it feasible for you to filter those and any other events you're not interested in out using the filter parameter in go-marathon in order to reduce the overall load?

Please don't get me wrong: I do see the potential risk you're outlining and the value that a well-designed solution integrated into go-marathon could bring. It's the uncertainty of what requirements we would need to build against which makes me think that holding our feet still might be the better choice, at least for now.

Would love to hear what @gambol99 has to say on this matter. Having some kind of majority would help moving into either direction.

sybrandy · 2016-10-27T11:46:25Z

Hello,

I have seen the number of goroutines climb significantly in the code that I'm running, however I don't know if it got to the point of causing any issues as there was another error that I had fixed later that was the root cause of the problems I was seeing. As for filtering out the health checks, that's one of the things the code looks at.

Unfortunately, that's all I have. I don't know anyone else who has a significantly large Mesos cluster, so I can't say that this is a problem elsewhere. However, I'm still of the opinion that it's better to fix this now than to get burned by it later.

Lastly, I think you're right: we should see what @gambol99 has to say about this. Ultimately this is his project.

Add goroutine leak testing to tests to probe for goroutine leakage, as mentioned in gambol99#208

sybrandy mentioned this issue Sep 27, 2016

Race condition when sending events to listeners #198

Closed

msabramo added a commit to msabramo/go-marathon that referenced this issue May 30, 2018

Test for goroutine leaks

7da8f94

Add goroutine leak testing to tests to probe for goroutine leakage, as mentioned in gambol99#208

msabramo mentioned this issue May 30, 2018

Test for goroutine leaks #342

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Potential goroutine leak when sending events to listeners (subscription.go) #208

Potential goroutine leak when sending events to listeners (subscription.go) #208

sybrandy commented Sep 27, 2016

timoreimann commented Sep 28, 2016

sybrandy commented Sep 29, 2016

timoreimann commented Oct 6, 2016 •

edited

Loading

sybrandy commented Oct 6, 2016

timoreimann commented Oct 26, 2016

sybrandy commented Oct 27, 2016

Potential goroutine leak when sending events to listeners (subscription.go) #208

Potential goroutine leak when sending events to listeners (subscription.go) #208

Comments

sybrandy commented Sep 27, 2016

timoreimann commented Sep 28, 2016

sybrandy commented Sep 29, 2016

timoreimann commented Oct 6, 2016 • edited Loading

sybrandy commented Oct 6, 2016

timoreimann commented Oct 26, 2016

sybrandy commented Oct 27, 2016

timoreimann commented Oct 6, 2016 •

edited

Loading