[fix](cloud) Fix cloud decommission causing FE to fail to start (#46783)
Fix an issue with the SQL node decommissioning process.

The SQL node decommissioning process does not wait for transactions at the watermark level to complete before setting the backend's `isDecommissioned` status to true. As a result, `show backends` reports `isDecommissioned` immediately, regardless of any ongoing transactions initiated via SQL.

When a user calls `drop be` to remove a backend while it is the only backend in the cluster, the drop-backend operation is written to the edit log and removes the cluster information from memory. After the backend is dropped, the earlier transaction-watermark process finishes its work and tries to modify the backend status, which requires accessing the cluster information. Because that information has already been deleted, the lookup in the FE in-memory map returns null, and the resulting NullPointerException (NPE) crashes the FE.

In addition, the edit log persists these operations in the following order, so the same NPE is hit again during journal replay:

1. The edit log records the drop-backend operation.
2. The edit log records the modify-backend operation.
3. The FE fails to start up.

```
2025-01-10 05:46:15,070 ERROR (replayer|15) [EditLog.loadJournal():1251] replay Operation Type 91, log id: 10578
java.lang.NullPointerException: Cannot invoke "org.apache.doris.system.Backend.getCloudClusterName()" because "memBe" is null
	at org.apache.doris.cloud.system.CloudSystemInfoService.replayModifyBackend(CloudSystemInfoService.java:461) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.persist.EditLog.loadJournal(EditLog.java:432) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.catalog.Env.replayJournal(Env.java:2999) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.catalog.Env$4.runOneCycle(Env.java:2761) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.common.util.Daemon.run(Daemon.java:119) ~[doris-fe.jar:1.2-SNAPSHOT]
```
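The replay failure can be illustrated with a minimal, self-contained sketch. It models the FE in-memory backend map, replays the journal in the order listed above (drop backend, then modify backend), and shows one defensive option: skipping a modify-backend entry whose backend is no longer in memory instead of dereferencing a null `memBe`. All class and method names here (`ReplaySketch`, `SimpleBackend`, `BackendRegistry`, `replayDropBackend`, `replayModifyBackend`) are hypothetical simplifications, not the Doris classes, and this sketch does not reproduce the actual change made in this commit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of FE backend metadata replay.
// None of these names are the real Doris classes or methods.
public class ReplaySketch {

    // Minimal stand-in for a backend record.
    static final class SimpleBackend {
        final long id;
        final String cloudClusterName;
        boolean decommissioned;

        SimpleBackend(long id, String cloudClusterName) {
            this.id = id;
            this.cloudClusterName = cloudClusterName;
        }
    }

    // Minimal stand-in for the FE in-memory backend map.
    static final class BackendRegistry {
        private final Map<Long, SimpleBackend> idToBackend = new HashMap<>();

        void addBackend(SimpleBackend be) {
            idToBackend.put(be.id, be);
        }

        // Replays a "drop backend" journal entry: the backend leaves memory.
        void replayDropBackend(long beId) {
            idToBackend.remove(beId);
        }

        // Replays a "modify backend" journal entry.
        // Returns true if applied, false if skipped because the backend is gone.
        boolean replayModifyBackend(long beId, boolean decommissioned) {
            SimpleBackend memBe = idToBackend.get(beId); // null after a drop
            if (memBe == null) {
                // Without this guard, reading memBe.cloudClusterName below would
                // throw a NullPointerException, as in the stack trace above.
                return false;
            }
            // Reading the cluster name mirrors the dereference that failed.
            System.out.println("modify backend " + beId + " in cluster " + memBe.cloudClusterName);
            memBe.decommissioned = decommissioned;
            return true;
        }
    }

    public static void main(String[] args) {
        BackendRegistry registry = new BackendRegistry();
        registry.addBackend(new SimpleBackend(10001L, "cluster_a"));

        // Journal order from the report: drop backend, then modify backend.
        registry.replayDropBackend(10001L);
        boolean applied = registry.replayModifyBackend(10001L, true);

        // With the null guard, replay continues instead of crashing the FE.
        System.out.println("modify applied: " + applied); // prints "modify applied: false"
    }
}
```

A guard like this only keeps replay from crashing on an already-persisted journal; it does not address the ordering problem itself, which stems from marking the backend decommissioned before watermark-level transactions have finished.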