Spikes of exceptions with "error getting actor state: context deadline exceeded" #1175

Tarikan · 2023-10-24T12:13:53Z

Hello, we are using Dapr.Actors.AspNetCore version 1.11.0 and Dapr version 1.10.5, running in an AKS cluster on Kubernetes 1.25.5. We use Azure CosmosDB as the state store.
Recently, we started receiving alerts for two types of exceptions: DaprApiException and ActorInvocationException. All of them carry the same message: "error getting actor state: context deadline exceeded". Each burst of exceptions lasts 15 minutes or close to it.
A spike only affects one pod at a time: we can have 3-5 pods with daprd and BE containers, but only one of them suffers from DaprApiException.
ActorInvocationException can be thrown from many different pods during one spike, but the services that throw it are stateless and have no request affinity. The exception is thrown in the actor client object.
Daprd container logs do not contain any errors, except for timer invocation/execution failures.

CosmosDB metrics show that the DB was 100% available throughout each 15-minute window. However, we observed max RU consumption spiking from 30-40% up to 100% in the middle of the window, and throttled requests were recorded at the moment of the RU spikes.

The issue was observed every day from Oct 18 to Oct 20.

Could you please provide more info on what "context deadline exceeded" means and whether it is somehow related to CosmosDB? If it is indeed related to CosmosDB, why did it only happen to one particular pod and not all of them?
Is "context deadline exceeded" just a timeout, or is it something more complex?

halspang · 2023-11-30T00:05:27Z

@Tarikan - That error comes from the actual Dapr runtime, so you may get a better answer there. Essentially, all it means is that the request to the state store timed out. This could be related to the RU capacity or even just general networking slowness. Given that actors are transactional, if one of the requests/pods is handling a larger workload at that time, it seems plausible that it could get delayed and suffer from a timeout.

WhitWaldo added area/actor area/actor/runtime and removed area/actor labels Nov 4, 2024
