
Deadline exceeded errors are increasing during grpc client deployment #11860

Closed
cengozge opened this issue Jan 29, 2025 · 5 comments


cengozge commented Jan 29, 2025

Versions on client side
io.grpc:grpc-netty-shaded: '1.68.1'
io.grpc:grpc-netty: '1.68.1'
io.grpc:grpc-stub: '1.68.1'
io.grpc:grpc-protobuf: '1.68.1'

Setting
timeout = 2000 ms, used for the deadline below

@Bean
  AuthGrpc.AuthBlockingStub authBlockingStub(GrpcClientChannelBuilderFactory factory) throws GeneralSecurityException, IOException {
    ManagedChannel managedChannel = factory
            .channel(hostName)
            .intercept(new LogInterceptor(), new GrpcTimeLimiterClientInterceptor(Duration.ofMillis(timeout)))
            .build();
    return AuthGrpc.newBlockingStub(managedChannel);
  }
GrpcTimeLimiterClientInterceptor: 
  public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
    return next.newCall(method, callOptions.withDeadline(Deadline.after(this.timeout.toMillis(), TimeUnit.MILLISECONDS)));
  }

Environment
One app running on JDK 23 as the gRPC client on OpenShift
One app running as the gRPC server on OpenShift

Approximate load
~100 calls per second

Problem
During client app deployments, deadline exceeded errors spike and then subside, returning to normal.

Expected
No deadline exceeded errors; calls are sent and receive responses from the server.

Findings

  1. Calls are reaching the server side with a ~2-3 second delay. As far as we investigated, there is no network delay.
  2. The load balancer (not the K8s one) on the server side is not loading pods equally, but that shouldn't be a problem only during client deployment; if it were, the client app would see errors at other times too, not just during deployments. (I eliminated this one.)
  3. I suspect this delay is what causes the deadline exceeded errors.

client log before the call is sent: 12:50:45.895
client log after the response: 12:50:49.103
client log "CallOptions deadline exceeded": 12:50:49.011 -> supposed to be 12:50:47?
server log when the call was received: 12:50:48.844 -> ~+3 s
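The gaps in the timeline above can be checked with plain `java.time` arithmetic over the timestamps quoted from the logs (a quick sanity check, nothing gRPC-specific):

```java
import java.time.Duration;
import java.time.LocalTime;

public class TimelineCheck {
  public static void main(String[] args) {
    LocalTime clientSend  = LocalTime.parse("12:50:45.895"); // client log before the call is sent
    LocalTime serverRecv  = LocalTime.parse("12:50:48.844"); // server log when the call was received
    LocalTime deadlineLog = LocalTime.parse("12:50:49.011"); // "CallOptions deadline exceeded"

    // Delay between the client-side "before send" log and server receipt:
    System.out.println(Duration.between(clientSend, serverRecv).toMillis() + " ms"); // 2949 ms
    // Deadline fired this long after the "before send" log (expected ~2000 ms):
    System.out.println(Duration.between(clientSend, deadlineLog).toMillis() + " ms"); // 3116 ms
  }
}
```

So the deadline fired ~3.1 s after the "before send" log rather than ~2 s; one plausible reading is that the `Deadline.after(...)` clock started later than that log line, or that expiry was detected and logged late.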

Question
What could be causing the increase in deadline exceeded errors? Could it be related to HTTP/2 connection pooling, max concurrent streams, or something else I couldn't find any clue about on the web? Please comment if you need more code pieces/configs.

Thank you in advance.

@cengozge (Author)

Additional info: Errors are increasing on the new pods that are created during deployment.

@kannanjgithub (Contributor)

When connections are first established there are name-resolution and connection-establishment delays, which may include a TLS handshake, so the initial set of RPCs faces additional latency. Once connections are established they stay up, and further RPCs don't have to wait for connection establishment and will be faster. You can try increasing the timeout to a more practical value.
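One way to avoid paying the connection-setup cost on a new pod's first RPCs is to warm the channel eagerly at startup, before traffic arrives. This is a sketch, not an official recipe: `ChannelWarmer`/`warmUp` are illustrative names, but `getState(true)` and `notifyWhenStateChanged` are real `ManagedChannel` APIs.

```java
import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative helper: eagerly connect and wait (bounded) for READY,
// so the first real RPCs don't pay the TCP/TLS setup cost.
final class ChannelWarmer {
  static boolean warmUp(ManagedChannel channel, long timeoutMillis) throws InterruptedException {
    CountDownLatch ready = new CountDownLatch(1);
    watch(channel, ready);
    return ready.await(timeoutMillis, TimeUnit.MILLISECONDS);
  }

  private static void watch(ManagedChannel channel, CountDownLatch ready) {
    // getState(true) asks an idle channel to start connecting.
    ConnectivityState state = channel.getState(true);
    if (state == ConnectivityState.READY) {
      ready.countDown();
      return;
    }
    // Re-check whenever the connectivity state changes.
    channel.notifyWhenStateChanged(state, () -> watch(channel, ready));
  }
}
```

Calling something like `ChannelWarmer.warmUp(managedChannel, 10_000)` from the bean that builds the channel would move the handshake cost out of the 2 s RPC deadline budget.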

You can also try the round-robin load balancer on the gRPC client instead of the default pick-first load balancer, which always sends RPCs over the first available connection.
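Switching the policy is a one-liner on the channel builder. A sketch, reusing `hostName` from the original snippet; note that round-robin only helps if the resolver returns multiple addresses (e.g. a headless Kubernetes service), not a single load-balancer VIP:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// round_robin spreads RPCs across all resolved backend addresses
// instead of pinning everything to the first connection (pick_first).
ManagedChannel channel = ManagedChannelBuilder
    .forTarget(hostName)
    .defaultLoadBalancingPolicy("round_robin")
    .build();
```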

Also, if you are using maxConcurrentCallsPerConnection on the NettyServerBuilder, it may be a factor during client startup while connections are being created. You can try increasing this setting if you have it set.
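For reference, the server-side setting looks like this (a sketch with illustrative values; `AuthServiceImpl` and the port are placeholders, not taken from the issue):

```java
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;

// maxConcurrentCallsPerConnection caps concurrent HTTP/2 streams per
// connection; a low value can queue RPCs while a new client pod ramps up.
Server server = NettyServerBuilder
    .forPort(8443)                          // illustrative port
    .maxConcurrentCallsPerConnection(1000)  // illustrative value
    .addService(new AuthServiceImpl())      // hypothetical service impl
    .build();
```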

@cengozge (Author)

I thought about increasing the timeout to 3 s, but this affects deadlines while the app is running, and a 50% increase means the circuit breaker opens after a longer time than with 2 s deadlines.
But the question is: isn't there a way to have a separate timeout for the first connection versus regular deadlines?
I will check the maxConcurrentCalls setting; we don't set it, so it should be the default.

@kannanjgithub (Contributor) commented Jan 31, 2025

isn't there a way to have a separate timeout for first connection and deadlines?

It is ultimately an RPC deadline, regardless of whether it was exceeded because the connection didn't establish in time or because the RPC itself took a long time after successfully connecting to the server.
If you didn't set a value for maxConcurrentCallsPerConnection, it defaults to Integer.MAX_VALUE, so that is not the cause of your issue.
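Since there is no separate "first connection" timeout, one workaround sketch is to keep the 2 s deadline for normal traffic but issue a single throwaway warm-up RPC with a generous deadline at startup, so connection setup is paid before real requests arrive. `healthCheck` and its request type are hypothetical placeholders for any cheap RPC on your service; `withDeadlineAfter` is the real stub API:

```java
// Warm-up call with a generous deadline; production calls keep 2 s.
AuthGrpc.AuthBlockingStub warmUpStub =
    authBlockingStub.withDeadlineAfter(10, TimeUnit.SECONDS);
try {
  // Hypothetical cheap RPC used only to force connection establishment.
  warmUpStub.healthCheck(HealthCheckRequest.getDefaultInstance());
} catch (StatusRuntimeException e) {
  // Warm-up failure is non-fatal; the channel has still started connecting.
}
```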

@kannanjgithub (Contributor)

I hope your question is answered. Please comment to reopen if required.
