branch-3.0: [fix](cloud) Fix cloud decomission lead to fe cant start #46783 #46863

Open
wants to merge 1 commit into
base: branch-3.0

Conversation

github-actions[bot]
Contributor

Cherry-picked from #46783

Fix issue with SQL node decommissioning process

The SQL node decommissioning process sets the backend's
isDecommissioned status to true without first waiting for the
transactions at the watermark level to complete.

As a result, show backends reports isDecommissioned as true immediately,
even while transactions initiated via SQL are still running.
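
To illustrate the ordering described above, here is a minimal sketch in which the decommissioned flag is flipped only after in-flight transactions have drained. Every name in it (`DecommissionSketch`, `Backend`, `decommissionAfterDrain`, the `runningTxns` supplier) is hypothetical and is not taken from the Doris code base. It shows one way to express "wait, then mark", and is not necessarily how #46783 implements the fix.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

// Hypothetical sketch: none of these names are Doris APIs.
class DecommissionSketch {
    // Minimal stand-in for a backend record.
    static final class Backend {
        private volatile boolean decommissioned = false;

        boolean isDecommissioned() { return decommissioned; }

        void setDecommissioned(boolean value) { decommissioned = value; }
    }

    /**
     * Polls runningTxns up to maxPolls times and marks the backend as
     * decommissioned only once it reports zero running transactions.
     * Returns false (leaving the flag untouched) if transactions are still
     * running after the last poll or if the thread is interrupted.
     */
    static boolean decommissionAfterDrain(Backend be, IntSupplier runningTxns,
                                          int maxPolls, long pollMillis) {
        for (int i = 0; i < maxPolls; i++) {
            if (runningTxns.getAsInt() == 0) {
                be.setDecommissioned(true);
                return true;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Backend be = new Backend();
        // Simulated transaction counter: each poll sees one fewer transaction.
        AtomicInteger txns = new AtomicInteger(2);
        boolean done = decommissionAfterDrain(
                be, () -> Math.max(0, txns.getAndDecrement()), 5, 1L);
        System.out.println("drained=" + done + " decommissioned=" + be.isDecommissioned());
    }
}
```

With this ordering, a reader of the flag never sees the backend as decommissioned while the simulated transactions are still running.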

If a user then runs drop be on the only backend in a cluster, the drop
backend operation is written to the edit log and the cluster information
is removed from FE memory.

Afterwards, the earlier transaction watermark process finishes its tasks
and tries to modify the backend status, which requires looking up the
cluster information. Because that information has already been deleted,
the lookup in the FE memory map fails with a null pointer exception
(NPE), and the FE crashes.

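Additionally, the edit log has already persisted the operations in the
following order, so replaying the journal on restart fails as well:

  1. The edit log records drop backend.
  2. The edit log records modify backend.
  3. The FE fails to start up.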


```
2025-01-10 05:46:15,070 ERROR (replayer|15) [EditLog.loadJournal():1251] replay Operation Type 91, log id: 10578
java.lang.NullPointerException: Cannot invoke "org.apache.doris.system.Backend.getCloudClusterName()" because "memBe" is null
        at org.apache.doris.cloud.system.CloudSystemInfoService.replayModifyBackend(CloudSystemInfoService.java:461) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.persist.EditLog.loadJournal(EditLog.java:432) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.catalog.Env.replayJournal(Env.java:2999) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.catalog.Env$4.runOneCycle(Env.java:2761) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.common.util.Daemon.run(Daemon.java:119) ~[doris-fe.jar:1.2-SNAPSHOT]
```
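
The failing replay order can be reproduced in miniature. The sketch below assumes a simplified in-memory backend map. The class `ReplaySketch` and its fields are hypothetical stand-ins; only the method name `replayModifyBackend` and the variable name `memBe` are borrowed from the stack trace. The null check is one possible defensive measure and is not claimed to be the change made in #46783.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of journal replay; not the Doris implementation.
class ReplaySketch {
    // Minimal stand-in for a backend held in FE memory.
    static final class Backend {
        final long id;
        final String clusterName;
        boolean decommissioned;

        Backend(long id, String clusterName) {
            this.id = id;
            this.clusterName = clusterName;
        }
    }

    private final Map<Long, Backend> idToBackend = new HashMap<>();

    void replayAddBackend(Backend be) {
        idToBackend.put(be.id, be);
    }

    void replayDropBackend(long id) {
        idToBackend.remove(id);
    }

    /**
     * Applies a modify-backend journal entry.
     * Returns false and skips the entry when the backend was already
     * removed by an earlier drop entry, instead of dereferencing null.
     */
    boolean replayModifyBackend(long id, boolean decommissioned) {
        Backend memBe = idToBackend.get(id);
        if (memBe == null) {
            return false;
        }
        memBe.decommissioned = decommissioned;
        return true;
    }

    public static void main(String[] args) {
        ReplaySketch fe = new ReplaySketch();
        fe.replayAddBackend(new Backend(10001L, "cluster0"));
        // Same order as in the edit log: drop first, then modify.
        fe.replayDropBackend(10001L);
        boolean applied = fe.replayModifyBackend(10001L, true);
        System.out.println("modify applied: " + applied);
    }
}
```

Without the null check, the modify entry would dereference a missing map entry, which matches the shape of the exception in the trace.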
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring reopened this on Jan 13, 2025
@hello-stephen
Contributor

run buildall

@doris-robot

TPC-H: Total hot run time: 40929 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit c3b6cf1c8b7184d93c159082fa1a2d8a9ed7fae7, data reload: false

------ Round 1 ----------------------------------
q1	17584	7391	7218	7218
q2	2063	181	166	166
q3	10656	1071	1182	1071
q4	10555	737	718	718
q5	7742	2851	2838	2838
q6	237	154	156	154
q7	997	614	597	597
q8	9346	1959	2040	1959
q9	6564	6407	6380	6380
q10	7044	2288	2307	2288
q11	467	265	265	265
q12	409	217	221	217
q13	18032	2964	3051	2964
q14	251	223	207	207
q15	581	525	512	512
q16	707	615	607	607
q17	965	578	602	578
q18	7267	6746	6707	6707
q19	1405	1051	1017	1017
q20	492	221	203	203
q21	4081	3282	3304	3282
q22	1138	1023	981	981
Total cold run time: 108583 ms
Total hot run time: 40929 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7248	7205	7252	7205
q2	333	232	246	232
q3	2943	2939	2942	2939
q4	2138	1863	1869	1863
q5	5770	5707	5737	5707
q6	231	150	147	147
q7	2207	1837	1805	1805
q8	3329	3554	3495	3495
q9	8846	8928	8795	8795
q10	3600	3583	3519	3519
q11	606	510	504	504
q12	804	625	620	620
q13	9611	3178	3185	3178
q14	303	288	280	280
q15	576	532	514	514
q16	710	676	694	676
q17	1845	1605	1609	1605
q18	8301	7776	7443	7443
q19	1665	1483	1533	1483
q20	2116	1876	1886	1876
q21	5645	5492	5523	5492
q22	1172	1003	1044	1003
Total cold run time: 69999 ms
Total hot run time: 60381 ms

@doris-robot

TPC-DS: Total hot run time: 198798 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c3b6cf1c8b7184d93c159082fa1a2d8a9ed7fae7, data reload: false

query1	1301	922	910	910
query2	6227	2076	2106	2076
query3	10959	4407	4519	4407
query4	66709	29106	23469	23469
query5	4989	444	464	444
query6	396	188	169	169
query7	5561	309	303	303
query8	307	233	235	233
query9	8891	2681	2671	2671
query10	465	286	270	270
query11	17166	15337	15694	15337
query12	167	110	105	105
query13	1479	447	427	427
query14	10918	7617	7414	7414
query15	210	187	191	187
query16	7175	501	535	501
query17	1080	599	614	599
query18	1978	338	322	322
query19	217	165	164	164
query20	120	118	112	112
query21	203	105	102	102
query22	4957	4695	4700	4695
query23	34650	34155	34671	34155
query24	6113	2979	2956	2956
query25	505	398	407	398
query26	696	166	173	166
query27	1848	350	358	350
query28	4189	2467	2428	2428
query29	689	466	458	458
query30	253	167	169	167
query31	1012	814	826	814
query32	65	58	58	58
query33	421	327	303	303
query34	926	503	525	503
query35	841	735	734	734
query36	1095	972	973	972
query37	122	80	69	69
query38	4182	4099	3946	3946
query39	1510	1481	1503	1481
query40	213	103	103	103
query41	50	54	48	48
query42	127	100	103	100
query43	559	510	507	507
query44	1198	809	829	809
query45	184	171	167	167
query46	1136	736	723	723
query47	2038	1931	1952	1931
query48	471	408	382	382
query49	735	387	388	387
query50	858	432	437	432
query51	7291	7201	6978	6978
query52	99	88	89	88
query53	254	178	183	178
query54	552	452	457	452
query55	79	75	78	75
query56	269	231	259	231
query57	1229	1096	1116	1096
query58	220	223	208	208
query59	3373	3098	3245	3098
query60	279	261	254	254
query61	107	114	106	106
query62	775	671	675	671
query63	219	191	205	191
query64	1380	698	646	646
query65	3248	3255	3236	3236
query66	706	305	305	305
query67	15977	15738	15830	15738
query68	3976	593	587	587
query69	434	268	260	260
query70	1208	1142	1128	1128
query71	358	267	252	252
query72	6389	4000	4006	4000
query73	762	350	349	349
query74	10121	9035	9125	9035
query75	3362	2712	2694	2694
query76	1803	1042	971	971
query77	473	274	274	274
query78	10563	9763	9611	9611
query79	1492	592	596	592
query80	880	430	417	417
query81	524	242	241	241
query82	1276	123	119	119
query83	243	144	144	144
query84	279	86	75	75
query85	918	309	295	295
query86	337	308	264	264
query87	4569	4350	4475	4350
query88	3918	2412	2399	2399
query89	412	296	288	288
query90	2002	185	188	185
query91	191	171	152	152
query92	69	47	49	47
query93	1948	554	561	554
query94	797	285	283	283
query95	359	265	259	259
query96	621	278	289	278
query97	3349	3206	3223	3206
query98	230	208	197	197
query99	1584	1279	1277	1277
Total cold run time: 319665 ms
Total hot run time: 198798 ms

@doris-robot

ClickBench: Total hot run time: 32.56 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c3b6cf1c8b7184d93c159082fa1a2d8a9ed7fae7, data reload: false

query1	0.03	0.03	0.03
query2	0.07	0.02	0.03
query3	0.23	0.07	0.07
query4	1.62	0.10	0.10
query5	0.51	0.50	0.54
query6	1.14	0.72	0.73
query7	0.03	0.02	0.02
query8	0.05	0.03	0.03
query9	0.57	0.52	0.49
query10	0.56	0.56	0.56
query11	0.14	0.11	0.11
query12	0.15	0.11	0.11
query13	0.61	0.60	0.60
query14	2.93	2.92	3.01
query15	0.88	0.83	0.82
query16	0.38	0.38	0.39
query17	1.02	1.04	1.05
query18	0.23	0.20	0.22
query19	1.84	1.90	1.99
query20	0.02	0.01	0.01
query21	15.36	0.60	0.58
query22	2.70	2.79	1.28
query23	17.06	1.13	0.84
query24	3.36	0.88	1.05
query25	0.29	0.08	0.08
query26	0.54	0.14	0.13
query27	0.03	0.06	0.04
query28	10.37	1.12	1.09
query29	12.56	3.25	3.21
query30	0.25	0.06	0.06
query31	2.87	0.39	0.38
query32	3.24	0.47	0.48
query33	3.01	3.05	2.99
query34	17.03	4.57	4.60
query35	4.62	4.56	4.57
query36	0.66	0.51	0.48
query37	0.10	0.06	0.06
query38	0.04	0.03	0.04
query39	0.03	0.03	0.02
query40	0.17	0.12	0.12
query41	0.08	0.03	0.02
query42	0.03	0.02	0.02
query43	0.04	0.03	0.02
Total cold run time: 107.45 s
Total hot run time: 32.56 s

4 participants