Unsafe recovery partially fills key range hole #6859

Open
overvenus opened this issue Jul 29, 2023 · 3 comments

@overvenus
Member

overvenus commented Jul 29, 2023

Bug Report

On a 4-node TiKV cluster, we stopped two nodes and then started unsafe recovery using pd-ctl.
After unsafe recovery, we found lots of PD server timeouts, and it turned out that a region
had failed to be created.

Failed TiKV: tikv-0 and tikv-1
Alive TiKV: tikv-2 and tikv-3
Original region ID: 1965
New region ID: 2991

Timeline:

  1. 1965 on tikv-3 sends a snapshot to tikv-2.
  2. Starts unsafe recovery.
  3. Snapshot sent.
  4. 1965 on tikv-3 becomes tombstone.
  5. A peer of 1965 is created on tikv-2.
  6. PD sends a request to tikv-2 to create 2991 to cover the key range of 1965.
  7. 2991 fails to be created because 1965 has been created on tikv-3 (see the sketch after this list).
  8. PD considers unsafe recovery is finished.
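
To make the failure in step 7 concrete, here is a minimal Go sketch (PD is written in Go). It is not PD or TiKV code, and every type and name in it is hypothetical; it only illustrates the behavior implied above: creating a new region is refused when an existing peer still covers the new region's key range.

```go
package main

import (
	"bytes"
	"fmt"
)

// Region is a hypothetical stand-in for a region's key range metadata.
type Region struct {
	ID       uint64
	StartKey []byte
	EndKey   []byte // an empty EndKey means "up to +inf"
}

// overlaps reports whether the key ranges of a and b intersect.
func overlaps(a, b Region) bool {
	aEndsBeforeB := len(a.EndKey) > 0 && bytes.Compare(a.EndKey, b.StartKey) <= 0
	bEndsBeforeA := len(b.EndKey) > 0 && bytes.Compare(b.EndKey, a.StartKey) <= 0
	return !aEndsBeforeB && !bEndsBeforeA
}

// createRegion refuses to create a region whose range overlaps an existing
// peer on the store, mirroring the failure observed in step 7.
func createRegion(existing []Region, newRegion Region) error {
	for _, r := range existing {
		if overlaps(r, newRegion) {
			return fmt.Errorf("region %d overlaps existing region %d, refusing to create",
				newRegion.ID, r.ID)
		}
	}
	return nil
}

func main() {
	// Step 5: a peer of 1965 appears on the store via the in-flight snapshot.
	existing := []Region{{ID: 1965, StartKey: []byte("a"), EndKey: []byte("z")}}
	// Step 6: PD asks the store to create 2991 over the same key range.
	err := createRegion(existing, Region{ID: 2991, StartKey: []byte("a"), EndKey: []byte("z")})
	fmt.Println(err) // step 7: creation fails, so the key range hole is never filled
}
```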

There are actually two questions:

  1. Why does PD finish unsafe recovery while there is a key range hole?
  2. Why does PD tombstone 1965 in the first place? Stopping two nodes out of a
    four-node cluster should not lose replica data completely.

Note: the issue was found on a multi-rocksdb cluster, but I think it may affect single-rocksdb clusters too.

Log:

What did you do?

See above.

What version of PD are you using (pd-server -V)?

v7.1.0

overvenus added the type/bug label (The issue is confirmed as a bug.) on Jul 29, 2023
@v01dstar
Contributor

Maybe not relevant, just for reference: region 1965 received one vote from the dead store 1.

[2023/07/28 07:54:59.375 +00:00] [INFO] [raft.rs:2230] ["received votes response"] [term=9] [type=MsgRequestVoteResponse] [approvals=2] [rejections=0] [from=1967] [vote=true] [raft_id=1968] [peer_id=1968] [region_id=1965]

Members:

region_epoch { conf_ver: 59 version: 109 } peers { id: 1967 store_id: 1 } peers { id: 1968 store_id: 216 } peers { id: 2783 store_id: 45 }"] [legacy=false] [changes="[change_type: AddLearnerNode peer { id: 2990 store_id: 4 role: Learner }]"] [peer_id=1968] [region_id=1965]

@v01dstar
Contributor

I can't find any clue from the log.

I think the snapshot-related stuff was "ok" in this case. The key is to find out why PD decided to tombstone 1965 on store 216 (tikv-3); this only happens when another, newer region covers the range of 1965, but I could not find such a region in the log.

@overvenus I suggest we add some info logs in PD to print out any overlapping regions while building the range tree, and then wait for this problem to occur again.
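
A rough Go sketch of what such a diagnostic could look like; the package, types, and function names are assumptions for illustration and do not reflect PD's actual range-tree implementation.

```go
package unsaferecovery

import (
	"bytes"
	"log"
)

// regionMeta is a hypothetical stand-in for the region metadata PD collects
// from store reports during unsafe recovery.
type regionMeta struct {
	ID       uint64
	StoreID  uint64
	StartKey []byte
	EndKey   []byte // empty means "up to +inf"
}

func overlaps(a, b regionMeta) bool {
	aEndsBeforeB := len(a.EndKey) > 0 && bytes.Compare(a.EndKey, b.StartKey) <= 0
	bEndsBeforeA := len(b.EndKey) > 0 && bytes.Compare(b.EndKey, a.StartKey) <= 0
	return !aEndsBeforeB && !bEndsBeforeA
}

// buildRangeTree inserts regions one by one and logs every overlap it sees,
// so an operator can tell which region caused another one to be dropped
// (and eventually tombstoned). The conflict resolution itself is elided.
func buildRangeTree(regions []regionMeta) []regionMeta {
	var kept []regionMeta
	for _, r := range regions {
		for _, existing := range kept {
			if overlaps(existing, r) {
				log.Printf("unsafe recovery: region %d (store %d) overlaps region %d (store %d)",
					r.ID, r.StoreID, existing.ID, existing.StoreID)
			}
		}
		kept = append(kept, r) // the real range tree would resolve the conflict here
	}
	return kept
}
```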

@overvenus
Member Author

Besides adding logs, can we check whether all regions still have a quorum of replicas alive before exiting unsafe recovery?
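
For illustration, a rough Go sketch of such an exit check; the types and names are hypothetical and this is not PD's actual unsafe recovery code. The idea is simply to refuse to report "finished" while any region lacks a majority of live replicas.

```go
package unsaferecovery

// replicaInfo is a hypothetical summary of where a region's replicas live.
type replicaInfo struct {
	RegionID uint64
	StoreIDs []uint64 // stores that hold a voter replica of this region
}

// regionsWithoutQuorum returns the regions whose surviving replicas no longer
// form a majority; unsafe recovery should not be considered finished while
// this list is non-empty.
func regionsWithoutQuorum(regions []replicaInfo, aliveStores map[uint64]bool) []uint64 {
	var missingQuorum []uint64
	for _, r := range regions {
		alive := 0
		for _, s := range r.StoreIDs {
			if aliveStores[s] {
				alive++
			}
		}
		// A quorum requires strictly more than half of the voters.
		if alive*2 <= len(r.StoreIDs) {
			missingQuorum = append(missingQuorum, r.RegionID)
		}
	}
	return missingQuorum
}
```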

ti-chi-bot bot pushed a commit that referenced this issue Sep 8, 2023
…6959)

ref #6859

Add log for overlapping regions in unsafe recovery.

We were unable to find the root cause of #6859; adding this log may help us better identify the issue by printing out the regions that overlap with each other, which causes some of them to be marked as tombstone.

Signed-off-by: Yang Zhang <[email protected]>