
Deadline exceeded errors are increasing during grpc client deployment #11860

Closed
cengozge opened this issue Jan 29, 2025 · 5 comments


cengozge commented Jan 29, 2025

Versions on client side
io.grpc:grpc-netty-shaded: '1.68.1'
io.grpc:grpc-netty: '1.68.1'
io.grpc:grpc-stub: '1.68.1'
io.grpc:grpc-protobuf: '1.68.1'

Setting
timeout = 2000 ms, used for the deadline below

@Bean
  AuthGrpc.AuthBlockingStub authBlockingStub(GrpcClientChannelBuilderFactory factory) throws GeneralSecurityException, IOException {
    ManagedChannel managedChannel = factory
            .channel(hostName)
            .intercept(new LogInterceptor(), new GrpcTimeLimiterClientInterceptor(Duration.ofMillis(timeout)))
            .build();
    return AuthGrpc.newBlockingStub(managedChannel);
  }
GrpcTimeLimiterClientInterceptor: 
  public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
    return next.newCall(method, callOptions.withDeadline(Deadline.after(this.timeout.toMillis(), TimeUnit.MILLISECONDS)));
  }

Environment
One app running on JDK 23 as the gRPC client on OpenShift
One app running as the gRPC server on OpenShift

Approximate load
~100 calls per second

Problem
During client app deployments, deadline exceeded errors spike and then subside, returning to normal.

Expected
No deadline exceeded errors; calls are sent and receive responses from the server.

Findings

  1. Calls are reaching the server side with a ~2-3 second delay. As far as we investigated, there is no network delay.
  2. The load balancer (not the K8s one) on the server side is not loading pods equally, but that shouldn't be a problem only during client deployment; if it were, the client app would see errors at other times too, not just during deployments. (I eliminated this one.)
  3. I suspect this delay is what causes the deadline exceeded errors.

client log before the call is sent: 12:50:45.895
client log after the response: 12:50:49.103
client log "CallOptions deadline exceeded": 12:50:49.011 -> supposed to be 12:50:47?
server log when the call was received: 12:50:48.844 -> ~+3 s
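The gaps in the timeline above can be checked with plain `java.time` arithmetic over the timestamps quoted from the logs (a quick sanity check, nothing gRPC-specific):

```java
import java.time.Duration;
import java.time.LocalTime;

public class TimelineCheck {
  public static void main(String[] args) {
    LocalTime clientSend  = LocalTime.parse("12:50:45.895"); // client log before the call is sent
    LocalTime serverRecv  = LocalTime.parse("12:50:48.844"); // server log when the call was received
    LocalTime deadlineLog = LocalTime.parse("12:50:49.011"); // "CallOptions deadline exceeded"

    // Delay between the client-side "before send" log and server receipt:
    System.out.println(Duration.between(clientSend, serverRecv).toMillis() + " ms"); // 2949 ms
    // Deadline fired this long after the "before send" log (expected ~2000 ms):
    System.out.println(Duration.between(clientSend, deadlineLog).toMillis() + " ms"); // 3116 ms
  }
}
```

So the deadline fired ~3.1 s after the "before send" log rather than ~2 s; one plausible reading is that the `Deadline.after(...)` clock started later than that log line, or that expiry was detected and logged late.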

Question
What could be causing the increase in deadline exceeded errors? Could it be related to HTTP/2 connection pooling, max concurrent streams, or something else I couldn't find any clue about on the web? Please comment if you need more code pieces/configs.

Thank you in advance.

@cengozge (Author)

Additional info: Errors are increasing on the new pods that are created during deployment.

@kannanjgithub (Contributor)

When connections are first established there are name-resolution and connection-establishment delays, which may include a TLS handshake, so the initial set of RPCs faces additional latency. Once connections are established they stay up, and further RPCs don't have to wait for connection establishment and will be faster. You can try increasing the timeout to a more practical value.
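One way to avoid paying the connection-setup cost on a new pod's first RPCs is to warm the channel eagerly at startup, before traffic arrives. This is a sketch, not an official recipe: `ChannelWarmer`/`warmUp` are illustrative names, but `getState(true)` and `notifyWhenStateChanged` are real `ManagedChannel` APIs.

```java
import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative helper: eagerly connect and wait (bounded) for READY,
// so the first real RPCs don't pay the TCP/TLS setup cost.
final class ChannelWarmer {
  static boolean warmUp(ManagedChannel channel, long timeoutMillis) throws InterruptedException {
    CountDownLatch ready = new CountDownLatch(1);
    watch(channel, ready);
    return ready.await(timeoutMillis, TimeUnit.MILLISECONDS);
  }

  private static void watch(ManagedChannel channel, CountDownLatch ready) {
    // getState(true) asks an idle channel to start connecting.
    ConnectivityState state = channel.getState(true);
    if (state == ConnectivityState.READY) {
      ready.countDown();
      return;
    }
    // Re-check whenever the connectivity state changes.
    channel.notifyWhenStateChanged(state, () -> watch(channel, ready));
  }
}
```

Calling something like `ChannelWarmer.warmUp(managedChannel, 10_000)` from the bean that builds the channel would move the handshake cost out of the 2 s RPC deadline budget.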

You can also try the round-robin load balancer on the gRPC client instead of the default pick-first load balancer, which always sends RPCs over the first available connection.
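Switching the policy is a one-liner on the channel builder. A sketch, reusing `hostName` from the original snippet; note that round-robin only helps if the resolver returns multiple addresses (e.g. a headless Kubernetes service), not a single load-balancer VIP:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// round_robin spreads RPCs across all resolved backend addresses
// instead of pinning everything to the first connection (pick_first).
ManagedChannel channel = ManagedChannelBuilder
    .forTarget(hostName)
    .defaultLoadBalancingPolicy("round_robin")
    .build();
```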

Also, if you are using maxConcurrentCallsPerConnection on the NettyServerBuilder, it may be a factor during client startup while connections are being created. You can try increasing this setting if you have it set.
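For reference, the server-side setting looks like this (a sketch with illustrative values; `AuthServiceImpl` and the port are placeholders, not taken from the issue):

```java
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;

// maxConcurrentCallsPerConnection caps concurrent HTTP/2 streams per
// connection; a low value can queue RPCs while a new client pod ramps up.
Server server = NettyServerBuilder
    .forPort(8443)                          // illustrative port
    .maxConcurrentCallsPerConnection(1000)  // illustrative value
    .addService(new AuthServiceImpl())      // hypothetical service impl
    .build();
```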

@cengozge (Author)

I thought about increasing the timeout to 3 s, but this affects deadlines while the app is running, and a 50% increase means the circuit breaker opens after a longer time than with 2 s deadlines.
But the question is: isn't there a way to have a separate timeout for the first connection versus regular deadlines?
I will check the maxConcurrentCalls setting; we don't set it, so it should be the default.

@kannanjgithub (Contributor) commented Jan 31, 2025

isn't there a way to have a separate timeout for first connection and deadlines?

It is ultimately an RPC deadline, regardless of whether it was exceeded because the connection didn't establish in time or because the RPC itself took a long time after successfully connecting to the server.
If you didn't set a value for maxConcurrentCallsPerConnection, it defaults to Integer.MAX_VALUE, so that is not the cause of your issue.
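Since there is no separate "first connection" timeout, one workaround sketch is to keep the 2 s deadline for normal traffic but issue a single throwaway warm-up RPC with a generous deadline at startup, so connection setup is paid before real requests arrive. `healthCheck` and its request type are hypothetical placeholders for any cheap RPC on your service; `withDeadlineAfter` is the real stub API:

```java
// Warm-up call with a generous deadline; production calls keep 2 s.
AuthGrpc.AuthBlockingStub warmUpStub =
    authBlockingStub.withDeadlineAfter(10, TimeUnit.SECONDS);
try {
  // Hypothetical cheap RPC used only to force connection establishment.
  warmUpStub.healthCheck(HealthCheckRequest.getDefaultInstance());
} catch (StatusRuntimeException e) {
  // Warm-up failure is non-fatal; the channel has still started connecting.
}
```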

@kannanjgithub (Contributor)

I hope your question is answered. Please comment to reopen if required.
