Race condition when sending events to listeners #198
Looking at the code a bit more, another option would be to have some recovery code in the goroutine that would report what happened, but not crash.
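For illustration, a minimal sketch of that "recover and report" idea; `Event`, `EventsChannel`, and `safeSend` are assumed names for this sketch, not the library's actual code:

```go
import "log"

type Event struct{ Name string }
type EventsChannel chan *Event

// safeSend delivers an event but recovers if the channel has already been
// closed, reporting the dropped event instead of crashing the process.
func safeSend(ch EventsChannel, e *Event) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("dropping event %q, channel already closed: %v", e.Name, r)
		}
	}()
	ch <- e
}
```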
@sybrandy thanks for reporting the issue. I wouldn't even say that this represents an atypical use case necessarily. There are possibly a number of consumers out there that dynamically hook on and off to event listeners over a longer period of (run-)time, so the risk of running into the problem you're describing is a function of time as well. And personally, I don't feel very content with race conditions knowingly lingering in the code base, no matter how rarely they may surface. At the very least, it's bad publicity. :-) I'd argue that recovering is probably not the best solution to the problem at hand as it mostly serves to conceal the bug we're talking about. A common pattern instead is to have a separate done channel per listener that signals the sending goroutine when the listener goes away. Again, thanks for bringing this to our attention!
No problem. I agree with you on the recovery solution: it's not my favourite either, but I didn't want to discount it entirely.
I did some preliminary analysis. Here are my findings:
At this point, I think we need a bit of communication, channel draining, or similar inside go-marathon in order to make sure we finish goroutines properly. On the consumer end though, calling RemoveEventsListener should ideally be all that's needed. WDYT?
You hit every nail on the head. When I suggested using buffered channels, I meant that you wouldn't send events via a goroutine. That would prevent the leak entirely at the expense of making sure that your clients used buffered channels. In that case, I was able to remove the subscription, drain the events, and then close the channel. So far, it's working well, but I'm not sure if that's the road you want to go down.

I believe you're on the right track. If go-marathon handled the creation/destruction of the channels itself, that would be best. So, here's a potentially crazy and rough thought: what if the listeners map is a map of Listener, where Listener looks similar to this:
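The snippet from the original comment isn't reproduced here; a rough sketch of the idea, reusing the Event/EventsChannel stand-ins from the sketch above, might look like:

```go
import "sync"

// Listener bundles the consumer's channel with the bookkeeping needed to
// close it only after all in-flight sends have finished.
type Listener struct {
	filter int            // event mask the consumer subscribed with
	events EventsChannel  // channel the consumer reads from
	wg     sync.WaitGroup // counts event sends still in flight
}
```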
Now, sending an event to the channel could look like this:
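Again only a sketch under the same assumptions, the send could then be wrapped like this:

```go
// dispatch hands one event to the listener without blocking the caller.
// The WaitGroup entry is taken before the goroutine starts, so a later
// wg.Wait() is guaranteed to cover this send.
func (l *Listener) dispatch(e *Event) {
	l.wg.Add(1)
	go func() {
		defer l.wg.Done()
		l.events <- e
	}()
}

// remove is the shutdown side described in the next paragraph: the caller
// deletes the listener from the listeners map first (under the lock, so no
// new dispatches can start), then waits for in-flight sends and closes.
func (l *Listener) remove() {
	l.wg.Wait()
	close(l.events)
}
```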
Now, when you remove the subscription, you'll have it remove the listener from listeners, wait until all of the goroutines have finished, and then close the channel. The reader should handle a closed channel properly. Obviously, this will need to be tested.

Also, while I do understand the use of goroutines, there's also a cost that I'm not sure is worth it. If you control the channel creation, you can ensure it is buffered to minimize the risk of blocking, thus potentially eliminating the use of a goroutine. Now you can't leak goroutines because you're not creating them, and you don't incur the overhead of spawning them, which could be a concern in very busy systems.

Another thought is to perhaps have an event buffer that can be used to queue up events before they are sent to the channel. If that's part of the listener, each listener could have a reader that periodically checks for new messages in the buffer and pushes them onto a channel. A function would be used to put messages onto the buffer, and if it's full, you could either drop the new message or drop the oldest message. It'll involve a lock, but that would prevent slow consumers from holding up other event listeners and prevent you from having to spawn a goroutine per event.

I'll leave it up to you to choose which direction to go. I personally would go with the simplest solution, but it really depends on the risk of slow/non-responsive consumers. I hope this helps.
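To make the buffered-channel alternative above concrete, here is a sketch under the same assumptions; the drop-newest policy and all names are illustrative only:

```go
// newListener has the library create the buffered channel itself, so no
// goroutine per event is needed for typical bursts.
func newListener(filter, bufferSize int) *Listener {
	return &Listener{filter: filter, events: make(EventsChannel, bufferSize)}
}

// offer performs a non-blocking send and drops the newest event when the
// buffer is full; dropping the oldest instead would require a small
// lock-guarded queue as described above.
func (l *Listener) offer(e *Event) bool {
	select {
	case l.events <- e:
		return true
	default:
		return false // consumer too slow, event dropped
	}
}
```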
First of all, thanks for exploring the solution space together with me and assessing the various possibilities. I really appreciate it!

I spent some time reading the Go blog post on pipelines and cancellation again. It took me a bit to realize that it's describing the exact problem we're experiencing. The canonical solution is to introduce a separate done channel per listener and select on it in the sending goroutine, so the send can be abandoned once the listener has been removed.

Another advantage is that consumers don't need to drain the channel. While the goroutine should probably close the sending channel on reception of the done signal to account for a usage model where consumption of the events and deleting the event listener are decoupled (i.e., the user for-loops over the event channel and expects it to stop when there's nothing more to send for whatever reason), there's no need for users to strictly drain a channel.

All of this is probably more complex than asking users to maintain a buffered channel and eventually drain it. I'm still inclined not to follow that path though, as it would add additional responsibilities on the client end and require users to pick an appropriate buffer size and/or drop events at some point. That's quite something if our primary intention is to "just" fix a race condition.

I still need to try out my approach, but I think it should work. Any thoughts?
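A sketch of what that could look like per listener, following the pipelines/cancellation pattern; the identifiers are assumptions, not the actual change:

```go
// listenerState pairs the events channel with a done channel that the
// library closes when the listener is removed.
type listenerState struct {
	events EventsChannel
	done   chan struct{}
}

// send abandons the delivery as soon as done is closed, so the goroutine
// can neither leak nor keep writing to a removed listener.
func (l *listenerState) send(e *Event) {
	go func() {
		select {
		case l.events <- e:
		case <-l.done:
			// listener removed; drop the event
		}
	}()
}
```

Closing the events channel afterwards (for consumers that simply range over it) would still require waiting for these goroutines to finish, for example via a WaitGroup, since a send racing against an already-closed channel would panic.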
I should mention that with the approach described above, we probably can't have the consumer pass a channel of his own to AddEventsListener anymore. That said, I don't think that passing in a buffered channel makes a lot of sense with the current implementation since we effectively achieve buffering through goroutines.
Morning,

Apologies... my weekends are busy, so I'm just getting to this now. First, everything you stated is correct in terms of using a done channel. Second, if possible, I'm all for ensuring we have the proper design vs. "just" fixing a race condition.

That being said, my current hangup with the original design is the spawning of a goroutine for each event. To me, it is not a good design, and I think it could cause problems in the future. For example, what if the consumer simply stops reading from the channel? Sends will block and the code will continue to create more goroutines. That's the primary reason I was leaning towards the use of buffered channels. While we could still end up in a similar situation, it would be less catastrophic as we can handle the blocked sends with, IIRC, a select block with a timeout. E.g.:
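A rough sketch of that timeout idea, under the same assumed types; the concrete deadline is a placeholder:

```go
import "time"

// deliverWithTimeout gives up if the consumer does not read within the
// deadline; false signals that the listener may be considered dead.
func deliverWithTimeout(ch EventsChannel, e *Event, d time.Duration) bool {
	select {
	case ch <- e:
		return true
	case <-time.After(d):
		return false
	}
}
```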
The code could then add some logic to remove the listener if it is deemed "dead" and continue to work. I have no idea what an appropriate timeout is, so that's something that needs to be figured out. Ultimately, I think that regardless of what we do, the user of the library needs to be aware of what can happen if their consumer is slow so that they can handle it appropriately.

As for consumers draining the messages, I don't believe it's all that uncommon a practice. Looking at what I patched in, yes, it could be cleaner, but I was being fast. (I'll be fixing that this morning now that I realize how ugly it is.) However, in most cases, this is handled by a loop with a select in it or a for loop over a range.
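For reference, the usual consumer-side shape of that loop (names are placeholders, not code from this thread):

```go
// events is the channel obtained from the library, handle is the
// consumer's own callback; both are placeholders.
for event := range events {
	handle(event)
}
// the loop ends once the library closes the channel, so no separate
// draining step is left for the consumer
```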
No need to apologize for not working on the weekends. I'm the strange one here. ;-)

Thanks for your feedback. I fully agree that the solution at hand does not help bound resource consumption in any regard as far as event handling is concerned. The goal of my PR is explicitly limited to fixing the race condition, however -- it should make things slightly better (stop the memory leaking) but not any worse. So I'd rather not conflate this problem with the risk of overflowing memory due to slow/halted event consumption. A solution to that problem probably involves comparing different approaches -- apart from channel buffering, there's also the possibility of creating and managing a pool of worker goroutines. We could even integrate the timeout channel you have outlined into the current approach, with the downside that events would no longer necessarily be delivered in stable order. It's not exactly clear to me which route we'd want to go, so I think it should be up to a dedicated issue to discuss the matter if it's deemed important enough.

With regards to draining, the primary concern I see is that users will need to be aware of and educated on the circumstance that they do need to drain. Ignoring/forgetting to do so may lead to some hard-to-debug cases, making me lean towards doing the hard work in the library up front as opposed to offloading it onto the user. But again, that's really a decision only to be made once we decide that the lack of resource constraining is a problem. (For my organization, it hasn't been, since the rate at which Marathon produces events is several times slower than our consumption rate. But who am I, that's just me. 😃 )

Feel free to file a new issue if you think it should be addressed. Thanks again!
O.K. That's reasonable. When I get a chance, I'll file a new issue regarding the resource usage. Even here, I'm not seeing too much usage, but that probably won't always hold true, and even then, it's probably a best practice to keep that under control just to make sure things don't go haywire.
I created #208 to discuss the resource usage.
Some background: I modified Traefik, which uses this library, to periodically shut down the Marathon client and create a new one. This is due to a bug in Marathon where Traefik will stop receiving events when an instance crashes. AFAIK, this hasn't been fixed yet, and I don't know when it will be, hence the work-around.
So, after getting this code working better, I started seeing panics from the code below, where it tries to send on a closed channel:
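A simplified sketch of the dispatch pattern being described; this is condensed for illustration, not the library's exact source:

```go
// every matching event is pushed to each listener from a freshly spawned
// goroutine so that one slow consumer cannot hold up the others
for ch := range listeners {
	go func(ch EventsChannel, e *Event) {
		ch <- e // panics if the consumer has already closed ch
	}(ch, event)
}
```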
The code I'm working on, when it is told to stop, removes the listener and then closes the channel it was using for updates. Looking at your code, this appeared like it would work fine, until I realized that, because you're sending the events to the channel within a goroutine, what must be happening is the following:

1. An event arrives and go-marathon spawns a goroutine to send it to my channel.
2. Before that goroutine gets to (or completes) the send, my code removes the listener and closes the channel.
3. The goroutine then sends on the closed channel and panics.
Now, I do understand why the code is designed this way: you don't want one full channel to block other listeners from receiving events, which is why the (potentially blocking) send happens in a goroutine. However, I'm thinking it may be best to not use a goroutine in this case and to document that buffered channels should be used instead. Perhaps a function could be added to create the correct channel vs. relying on the user of the API to make sure it's buffered?
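Such a helper could be as small as the following sketch; the name and buffer size are hypothetical:

```go
// hypothetical helper: the library, not the caller, decides the buffer size
const defaultEventsChannelSize = 64

func NewEventsChannel() EventsChannel {
	return make(EventsChannel, defaultEventsChannelSize)
}
```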
I understand this may be an atypical use case, but I wanted to report this bug so that it is known before it affects production systems.