Akka.Cluster.Sharding: Shard can fail to HandOff indefinitely #7500

Open
Aaronontheweb opened this issue Feb 11, 2025 · 1 comment

@Aaronontheweb
Member

Version Information
Version of Akka.NET? v1.5.37
Which Akka.NET Modules? Akka.Cluster.Sharding

Describe the bug

This is a pretty rare bug as far as I can tell - today was the first time I've seen this log message get logged in 12 years of working with Akka.NET:

Log.Warning("{0}: Shard [{1}] deallocation didn't complete within [{2}].",
    TypeName,
    m.Shard,
    Settings.TuningParameters.HandOffTimeout);

Looking more closely at the issue, we see A LOT of unhandled HandOff messages over the course of 10-30 minutes:

2025-02-11 12:49:24.376 [INFO][02/11/2025 18:49:24.376Z][Thread 0003][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [86] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:48:14.376 [INFO][02/11/2025 18:48:14.376Z][Thread 0007][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [44] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:47:54.367 [INFO][02/11/2025 18:47:54.367Z][Thread 0033][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [74] dead letters encountered. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:46:44.371 [INFO][02/11/2025 18:46:44.371Z][Thread 0033][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [46] dead letters encountered. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:45:34.369 [INFO][02/11/2025 18:45:34.369Z][Thread 0033][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [7] dead letters encountered. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:44:34.369 [INFO][02/11/2025 18:44:34.368Z][Thread 0016][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [48] dead letters encountered. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:43:24.370 [INFO][02/11/2025 18:43:24.370Z][Thread 0025][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [17] dead letters encountered. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:42:24.367 [INFO][02/11/2025 18:42:24.367Z][Thread 0003][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [62] dead letters encountered. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:39:44.364 [INFO][02/11/2025 18:39:44.364Z][Thread 0010][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [83] dead letters encountered. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:38:34.373 [INFO][02/11/2025 18:38:34.373Z][Thread 0023][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to [akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]#[REDACTED_ACTOR_ID]] was unhandled. [47] dead letters encountered. Message content: HandOff([REDACTED_SESSION_ID])
2025-02-11 12:37:34.361 [INFO][02/11/2025 18:37:34.361Z][Thread 0024][akka://[REDACTED_SYSTEM]/system/sharding/clientsessions/[REDACTED_SESSION_ID]] Message [HandOff] from [akka.tcp://[REDACTED_SYSTEM]@[REDACTED_HOST]:[REDACTED_PORT]/system/sharding/clientsessionsCoordinator/singleton/coordinator/[REDACTED_ACTOR_ID]] to 

This continues indefinitely.
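
For reference, a sketch of where the settings named above live in HOCON. The two dead-letter keys are quoted directly in the log lines; `akka.cluster.sharding.handoff-timeout` is, as far as I know, the key behind `Settings.TuningParameters.HandOffTimeout`. The values shown are illustrative only, not recommendations, and changing them would only affect logging volume and how long the coordinator waits, not the unhandled HandOff itself.

```hocon
akka {
  # Number of dead letters / unhandled messages to log before going quiet
  # (illustrative value; accepts on, off, or a count)
  log-dead-letters = 10
  log-dead-letters-during-shutdown = off

  cluster.sharding {
    # Timeout reported in the "deallocation didn't complete within" warning
    # (illustrative value)
    handoff-timeout = 60s
  }
}
```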

To Reproduce

Not sure how to reproduce it yet.

Expected behavior

Shards should terminate their entities during a handoff and deallocate all entity actors.

Actual behavior

Not only did the shard not deallocate, but it looks like it didn't attempt to kill off any of its entity actors - otherwise the fail-safe from the HandOffStopper would have kicked in:

Receive<StopTimeout>(_ =>
{
    Log.Warning(
        "{0}: HandOffStopMessage[{1}] is not handled by some of the entities in shard [{2}] after [{3}], " +
        "stopping the remaining [{4}] entities.",
        typeName, stopMessage.GetType().Name, shard, handoffTimeout, remaining.Count);
    foreach (var r in remaining)
        Context.Stop(r);
});

This didn't happen, so it makes me think that the Shard got behavior-switched to a state where it couldn't receive HandOff messages long before actually attempting to hand off.

Additional context

  • Happened when scaling the sharding system up to double its original node count
  • Custom entity handoff message was used
@Aaronontheweb
Member Author

In my call notes with the affected user I point out that this might be the "poisoned" behavior:

Context.Become(message =>
{
    switch (message)
    {
        case Terminated t:
            ReceiveTerminated(t.ActorRef);
            return true;
    }
    return false;
});

But again, the fail-safe from the HandOffStopper should have kicked in if that were the case.
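
To illustrate the mechanism being suspected here, below is a minimal sketch, not the actual Shard code, of how a `Context.Become` to a Terminated-only handler leaves every later message unhandled. `FakeHandOff`, `PoisonableActor`, and the `"poison"` trigger are hypothetical names made up for this example (the real HandOff message is internal to Akka.Cluster.Sharding), and the sketch assumes only the core Akka NuGet package.

```csharp
using System;
using Akka.Actor;
using Akka.Event;

// Hypothetical stand-in for the sharding HandOff message.
public sealed record FakeHandOff(string Shard);

// Hypothetical actor; it only mimics the suspected behavior switch.
public sealed class PoisonableActor : ActorBase
{
    protected override bool Receive(object message)
    {
        switch (message)
        {
            case FakeHandOff h:
                // Normal behavior: the hand-off request is handled.
                Sender.Tell($"handing off {h.Shard}");
                return true;
            case "poison":
                // Same shape as the suspected behavior: from here on,
                // only Terminated is reported as handled.
                Context.Become(m => m is Terminated);
                return true;
        }
        return false;
    }
}

public static class Program
{
    public static void Main()
    {
        using var system = ActorSystem.Create("demo");
        var inbox = Inbox.Create(system);

        // Messages an actor does not handle are published to the
        // event stream wrapped in UnhandledMessage.
        system.EventStream.Subscribe(inbox.Receiver, typeof(UnhandledMessage));

        var actor = system.ActorOf(Props.Create<PoisonableActor>(), "shard-like");
        actor.Tell("poison");                   // switch behavior first
        actor.Tell(new FakeHandOff("shard-1")); // now falls through as unhandled

        // Throws TimeoutException if nothing arrives within the window.
        var unhandled = (UnhandledMessage)inbox.Receive(TimeSpan.FromSeconds(3));
        Console.WriteLine($"Unhandled: {unhandled.Message}");
    }
}
```

If this runs as expected, the inbox receives an `UnhandledMessage` wrapping the `FakeHandOff`. That appears to be the same kind of event behind the "was unhandled" log lines above. It does not explain why the real Shard would enter that state without the HandOffStopper's timeout ever firing.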

@Aaronontheweb Aaronontheweb modified the milestones: 1.5.38, 1.5.39 Feb 17, 2025