Expected Behavior of Cluster Providers with Inaccessible Members #2034

sfmskywalker · 2023-08-26T09:12:21Z

Hello Proto.Actor community,

I'm currently working with the Azure Container Apps cluster member provider and have encountered a scenario I'd like to clarify.

When the cluster provider fetches members from its internal member store, there's a possibility that some of these members might no longer be accessible (e.g., a revision/pod that has been terminated). In my observations, the Proto.Remote.ServerConnector repeatedly attempts to establish a connection to such members, but eventually discontinues due to connection failures.

This leads me to my main questions:

Is it the inherent responsibility of the cluster provider to ensure that it only returns members that are currently online?
Should the cluster provider actively monitor these members and update the cluster member list if any of them go offline?

I'm trying to understand whether this behavior is by design or whether the cluster provider is expected to manage the member list more proactively.

Thank you for your insights!

sfmskywalker · 2023-08-26T09:44:23Z

After delving deeper into the Proto.Actor ecosystem, I took a closer look at the Kubernetes Cluster Provider. I observed that its member monitor appears to be functioning as expected: it provides a list of cluster members that are actively running. Furthermore, it seems to rely on Kubernetes itself to furnish information regarding the status of the pods, whether they are running or not.

Drawing a parallel, it might be beneficial for the Azure Container Apps cluster provider to adopt a similar approach. Perhaps, in addition to relying on Azure's internal member store, the provider could implement a mechanism to ping each member to verify its reachability. This would ensure a more accurate representation of active members and potentially mitigate issues with attempting connections to inaccessible members.
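
To make the "ping each member" idea concrete, here is a minimal sketch in Python. It is not Proto.Actor code: `is_reachable` and `filter_reachable` are hypothetical helper names, and members are assumed to be plain `(host, port)` pairs. It only checks that a TCP connection can be opened, which is a weaker signal than a real health check.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def filter_reachable(members, timeout: float = 1.0):
    """Keep only the (host, port) pairs that currently accept connections.

    `members` stands in for whatever the provider reads from its member store.
    """
    return [m for m in members if is_reachable(m[0], m[1], timeout)]
```

A provider could run such a filter over the entries it reads from its store before publishing the member list. The trade-off is one connection attempt per member per refresh, so the timeout needs to stay short.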

Would love to hear thoughts on this approach or if there are other considerations I might be overlooking.

2 replies

sfmskywalker Aug 26, 2023
Author

Building on my previous observations, I've been contemplating alternative approaches to ensure the accuracy of the cluster member list. One potential solution that comes to mind involves leveraging the EndpointTerminatedEvent.

Instead of verifying a member's availability during metadata retrieval from sources like Azure Resource Tags or Redis, we could have the cluster provider actively observe the EndpointTerminatedEvent. This event is published to the system event stream whenever the ServerConnector encounters a connection failure to a specific member.

By monitoring this event, the ACA cluster provider could promptly identify and remove any inaccessible members from its list. Subsequently, it can update the cluster member list to reflect these changes, ensuring a more real-time and accurate representation of active members.
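
The flow described above can be sketched roughly as follows. This is a language-agnostic illustration written in Python, not the Proto.Actor API: `EventStream`, `EndpointTerminated`, and `MemberList` are hypothetical stand-ins for the system event stream, the EndpointTerminatedEvent, and the provider's internal member list.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EndpointTerminated:
    """Stand-in for the terminated-endpoint event; carries the failed address."""
    address: str


class EventStream:
    """Tiny in-process publish/subscribe hub (hypothetical)."""

    def __init__(self) -> None:
        self._handlers: List[Callable[[object], None]] = []

    def subscribe(self, handler: Callable[[object], None]) -> None:
        self._handlers.append(handler)

    def publish(self, event: object) -> None:
        for handler in list(self._handlers):
            handler(event)


class MemberList:
    """Holds known members and drops one when its endpoint is reported terminated."""

    def __init__(self, stream: EventStream, on_changed: Callable[[List[str]], None]) -> None:
        self._members: Dict[str, str] = {}  # address -> member id
        self._on_changed = on_changed
        stream.subscribe(self._handle)

    def add(self, member_id: str, address: str) -> None:
        self._members[address] = member_id

    def member_ids(self) -> List[str]:
        return sorted(self._members.values())

    def _handle(self, event: object) -> None:
        # Ignore unrelated events and addresses we do not know about.
        if isinstance(event, EndpointTerminated) and event.address in self._members:
            del self._members[event.address]
            self._on_changed(self.member_ids())
```

In this sketch the `on_changed` callback is where a real provider would push the updated member list to the cluster. One caveat of a purely event-driven approach is that a single transient connection failure removes the member, so a real implementation would likely also need a way to re-add it.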

Although I think this might work, I'm wondering if it would be possible to leverage some of the existing APIs to try and connect to a member imperatively.

I'd be keen to hear feedback on this idea and whether there are potential pitfalls or challenges I might not have considered.

sfmskywalker Aug 26, 2023
Author

I updated my local version of the ACA cluster provider to show what I mean: https://github.com/elsa-workflows/elsa-core/blob/v3/src/modules/Elsa.ProtoActor.Cluster.AzureContainerApps/AzureContainerAppsProvider.cs#L141-L153

sfmskywalker · 2023-08-28T19:05:14Z

Proto.Actor Cluster Providers - Insights

I had a very insightful conversation on the Slack channel regarding the behavior of cluster providers, specifically around managing inaccessible members. I thought it would be beneficial to share a summarized version here for the broader community's benefit.

Summary:

Expected Behavior: Cluster providers currently don't actively monitor services for liveness. Evictions are based on members that don't check in within a certain period.
Gossip Protocol: This protocol encompasses liveness checks but operates independently of the cluster provider.
Membership Management: Cluster providers should maintain the member set. Discussions about the ACA provider highlighted the need to determine if retrieved member metadata is stale. A suggestion arose to possibly monitor members through separate actors.
Staleness & Timestamping: Adding a timestamp to each member record makes stale entries distinguishable. A record that has not been updated within a given period can be evicted.
Gossip Features: When enabled, HeartbeatExpiration and ClusterHealthcheck offer various tools to manage cluster liveness and handle blocked nodes.
Heartbeats: There's a broader understanding of the importance of monitoring liveness. The ACA provider could potentially utilize an API to check if a revision is still active.
Kubernetes vs. ACA: The need to maintain an external state store arose because ACA, layered on Kubernetes, doesn't natively offer certain features.
ECS Provider: Their approach brought attention to the issue of stale member data.
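Solution: One suggested method is for each member to regularly update the timestamp on its own entry through the cluster provider. Entries whose timestamp is too old are then treated as stale, which keeps the member list limited to active members.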
PR & Implementation: Based on our discussions, I plan to refactor and implement these changes, followed by a PR.
Reentrancy in Actors: For actors interacting with external services where operations might exceed the typical response time, using context.ReenterAfter was suggested.
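
The timestamp-based approach from the summary can be sketched as below. This is a minimal in-memory illustration in Python under stated assumptions: `MemberStore` is a hypothetical stand-in for the external store (resource tags, Redis, and so on), the TTL value is arbitrary, and time is passed in explicitly so the behavior is easy to follow. It is not the implementation planned for the PR.

```python
from typing import Dict, List


class MemberStore:
    """In-memory stand-in for the external member store (hypothetical)."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._last_seen: Dict[str, float] = {}  # member id -> last heartbeat time

    def heartbeat(self, member_id: str, now: float) -> None:
        """Called periodically by each member to refresh its own entry."""
        self._last_seen[member_id] = now

    def active_members(self, now: float) -> List[str]:
        """Members whose entry was refreshed within the TTL."""
        return sorted(m for m, t in self._last_seen.items() if now - t <= self._ttl)

    def evict_stale(self, now: float) -> List[str]:
        """Remove entries older than the TTL and return the evicted member ids."""
        stale = sorted(m for m, t in self._last_seen.items() if now - t > self._ttl)
        for member_id in stale:
            del self._last_seen[member_id]
        return stale


store = MemberStore(ttl_seconds=30)
store.heartbeat("a", now=0)
store.heartbeat("b", now=0)
store.heartbeat("a", now=20)         # only "a" keeps checking in
print(store.active_members(now=40))  # ['a']
print(store.evict_stale(now=40))     # ['b']
```

The TTL should be comfortably larger than the heartbeat interval so that a single delayed update does not evict a healthy member.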

A big shoutout to everyone on Slack for the insights and feedback. Super helpful stuff. For those keen on diving deeper, the Slack thread has all the nitty-gritty details.

Cheers
