v7.5.0 After the "pd remove" operation, remnants persist. #8051
The IP address 10.25.248.131:2380 (VMS584328) previously belonged to the tikv-oversea cluster and was scaled in on 2024/04/08 at 10:19:26. `tiup cluster display tikv-oversea` showed that 10.25.248.131:2380 had been removed, and the server VMS584328 was then taken offline. However, pd.log shows that the tikv-oversea cluster kept trying to connect to 10.25.248.131:2380 and reporting connection errors, which persisted until 2024/04/10.

On 2024/04/10, a new server VMS602679 was brought online and the IP address 10.25.248.131 was reused. At 13:47 the same day, 10.25.248.131:2380 (VMS602679) was scaled out into the tikv-dal-test cluster, turning tikv-dal-test into a 3+1 configuration. During this time, the six PD nodes of the tikv-oversea cluster also reconnected to 10.25.248.131:2380, forming a 6+1 configuration. Through this shared node (3+1+6), all ten PD nodes became connected, creating a single ten-node PD cluster. At this point, data confusion occurred.

(Attached: `tiup cluster display` output for tikv-oversea and tikv-dal-test, and the tikv-oversea pd log; not reproduced here.)
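For anyone checking for the same symptom: one way to see whether a scaled-in member still lingers is to compare what tiup reports against what PD and its embedded etcd still track. A minimal sketch, assuming `pd-ctl` and `etcdctl` are on PATH and a surviving PD member listens on 10.25.248.132:2379 (hypothetical address):

```shell
# What tiup believes the topology is:
tiup cluster display tikv-oversea

# What PD itself reports as its members:
pd-ctl -u http://10.25.248.132:2379 member

# What the embedded etcd still tracks (PD serves etcd's client API
# on its client URL), in case a removed peer lingers there:
etcdctl --endpoints=http://10.25.248.132:2379 member list
```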
Subsequently, I tried to find a way to stop the cluster from reconnecting to the removed PD node. Changing the PD leader did not work; only reloading PD eliminated the error. (reload pd log attached; not reproduced here.)
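For reference, the reload that cleared the errors would look roughly like this with tiup (cluster name taken from the report; the `-R` flag restricts the operation to the PD role):

```shell
# Rolling-reload only the PD components of the cluster:
tiup cluster reload tikv-oversea -R pd
```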
Why do you need that many PDs?
I reproduced this. What's the solution so far? (Reproduction steps were attached; a rough sketch of the scenario follows.)
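The attached steps are not reproduced above; purely as an illustration of the shape of the problem described in this thread, a reproduction might look like the following, assuming a local three-member cluster with hypothetical names pd1/pd2/pd3:

```shell
# 1. Start a three-member PD cluster (pd1, pd2, pd3).
# 2. Remove one member through pd-ctl:
pd-ctl -u http://pd1:2379 member delete name pd3
# 3. Stop pd3, then later start a brand-new PD on pd3's old IP:port
#    as part of a different cluster.
# 4. Watch pd1/pd2 logs: per this issue, they may still attempt to
#    reconnect to the old peer address.
```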
As I understand it, these two nodes have already been scaled in, and there should be no attempt to reconnect to a scaled-in node.
Are pd2 and pd3 still in the output of …?
Neither pd2 nor pd3 is shown in …
Which tool do you use to manage the cluster, tiup or tidb-operator?
Does your start script still contain those deleted members?
Instead of using tiup to deploy, I just deployed manually using docker. |
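For context on the start-script question: with a manual docker deployment, the PD start command typically pins the initial membership. A hypothetical example (names, IPs, and image tag are illustrative; `--initial-cluster` only matters at first bootstrap, but a stale list can cause trouble if a node is re-initialized from an empty data dir):

```shell
# Hypothetical manual start of one PD member under docker.
docker run -d --name pd1 pingcap/pd:v7.5.0 \
  --name=pd1 \
  --data-dir=/data/pd1 \
  --client-urls=http://0.0.0.0:2379 \
  --peer-urls=http://0.0.0.0:2380 \
  --initial-cluster="pd1=http://10.0.1.1:2380,pd2=http://10.0.1.2:2380,pd3=http://10.0.1.3:2380"
```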
We have a TiKV cluster with a 2:2:2 architecture. The PD group lost its leader. Here are the relevant logs from the PD leader; the probable cause appears to be that metadata could not be written to etcd. Is there any other way to troubleshoot the issue?
- 12:04:40: the network experienced a temporary outage.
- 13:48:55: writing metadata failed.
- TSO gap.
- Except for the leader, the other nodes logged very little, mostly "region sync with leader meet error".
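On the question of other ways to troubleshoot: beyond the leader's log, it may help to check PD and embedded-etcd health and leadership directly. A minimal sketch, assuming PD endpoints at 10.0.1.1:2379 and 10.0.1.2:2379 (hypothetical addresses):

```shell
# Health of every PD member as PD sees it:
pd-ctl -u http://10.0.1.1:2379 health

# Current PD leader:
pd-ctl -u http://10.0.1.1:2379 member leader show

# Status of the embedded etcd endpoints (raft term, leader, db size):
etcdctl --endpoints=http://10.0.1.1:2379,http://10.0.1.2:2379 endpoint status -w table
```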