-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: invoke reconnect when readInternal return error #188
Conversation
Could you please merge this PR? Let me know if this works for you! |
Hey @itgram, thanks for contribution. Why that change is needed in the first place? Could you extend on the reason a little bit? |
In a cluster mode setup of EventStoreDB, when one of the cluster nodes drops and we attempt to use This issue occurs here: EventStore-Client-Go/esdb/client.go Line 291 in 8206ac8
If we call ReadStream again, it fails repeatedly because client.grpcClient.handleError is not invoked in case of an error, which prevents the reconnection process from being triggered. |
Could you come up with a test that shows your patch is fixing the behavior you are talking about? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution.
As-is, I don't think your test reproduces the scenario you described earlier. I'd really like to replicate all the steps that lead to your issue before applying your patch.
I couldn’t find a way to stop the leader node of EventStoreDB programmatically from the code. |
If you already know the leader node of the cluster, what you can do is to call those endpoints on it, the order matters:
This will not stop the leader per se but rather trigger a new election cycle and guarantee (in the context of your test) that a new a leader is elected. |
Interesting, but as you know, I still haven't been able to determine the leader node programmatically yet. |
The best way to achieve this is to leverage the {
"members": [
{
"instanceId": "cf2e423a-a664-43ed-806f-10745c6d0ea0",
"timeStamp": "2024-10-28T00:53:05.55955Z",
"state": "Leader",
"isAlive": true,
"internalTcpIp": "127.0.0.1",
"internalTcpPort": 1112,
"internalSecureTcpPort": 0,
"externalTcpIp": "127.0.0.1",
"externalTcpPort": 0,
"externalSecureTcpPort": 0,
"internalHttpEndPointIp": "127.0.0.1",
"internalHttpEndPointPort": 2113,
"httpEndPointIp": "127.0.0.1",
"httpEndPointPort": 2113,
"lastCommitPosition": 891,
"writerCheckpoint": 1063,
"chaserCheckpoint": 1063,
"epochPosition": 0,
"epochNumber": 0,
"epochId": "ded879b8-d7e6-45a6-87b2-2f3ef8c1f47b",
"nodePriority": 0,
"isReadOnlyReplica": false,
"esVersion": "24.10.0-prerelease"
},
{...}
],
} The leader node will have its |
Hey @itgram I came up with a test case that I believe replicates the steps that you described earlier: func readStreamNotLeaderExceptionButWorkAfterRetry(t *testing.T) {
db := CreateClient("esdb://admin:changeit@localhost:2111,localhost:2112,localhost:2113?nodepreference=leader&tlsverifycert=false", t)
defer db.Close()
ctx := context.Background()
streamID := NAME_GENERATOR.Generate()
_, err := db.AppendToStream(ctx, streamID, esdb.AppendToStreamOptions{}, createTestEvent())
assert.Nil(t, err)
members, err := db.Gossip(ctx)
assert.Nil(t, err)
for _, member := range members {
if member.State != gossip.MemberInfo_Leader {
continue
}
url := fmt.Sprintf("https://%s:%d/admin/shutdown", member.HttpEndPoint.Address, member.HttpEndPoint.Port)
req, err := http.NewRequest("POST", url, nil)
assert.Nil(t, err)
req.SetBasicAuth("admin", "changeit")
client := &http.Client{
Transport: &http.Transport{
TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
},
}
resp, err := client.Do(req)
assert.Nil(t, err)
resp.Body.Close()
break
}
_, err = db.ReadStream(ctx, streamID, esdb.ReadStreamOptions{
From: esdb.Start{},
RequiresLeader: true,
}, math.MaxUint64)
assert.Nil(t, err)
} The func (client *Client) Gossip(ctx context.Context) ([]*gossip.MemberInfo, error) {
handle, err := client.grpcClient.getConnectionHandle()
if err != nil {
return nil, err
}
gossipClient := gossip.NewGossipClient(handle.Connection())
clusterInfo, err := gossipClient.Read(ctx, &shared.Empty{})
if err != nil {
return nil, err
}
return clusterInfo.Members, nil
} I wasn’t able to reproduce the issue on my end. While I can’t rule out the possibility of a bug, it may help to explore other potential causes. Do you have any additional leads or context that could help narrow down the source? |
@YoEight |
Hello @YoEight, do you have any concerns? |
Hey @itgram, I haven’t forgotten about your PR! I’m finishing up a few other tasks first, then I’ll jump into reviewing it. Apologies for the delay, my time has been tighter than usual as we’re gearing up for the upcoming release. Thanks for your patience! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirm your patch fixed the issue, nice work!
Co-authored-by: Yo Eight <[email protected]>
Fixed: Trigger a reconnect when readInternal returns an error.