branch-3.0: [improve](cloud)(transaction) do not execute afterVisible if commit transaction fail in cloud mode #48576 #48774

Open
wants to merge 1 commit into
base: branch-3.0

Conversation

github-actions[bot]
Contributor

@github-actions github-actions bot commented Mar 6, 2025

Cherry-picked from #48576

…ransaction fail in cloud mode (#48576)

### What problem does this PR solve?

When committing a transaction fails, the error message is difficult to
understand:

```
mysql> show routine load\G;

*************************** 1. row ***************************
Id: 1737888537744
Name: lineitem_mow_persistent_label
CreateTime: 2025-01-27 10:08:54
PauseTime: NULL
EndTime: NULL
DbName: regression_test_stress_load_release_routine_load
TableName: lineitem_mow_persistent
IsMultiTable: false
State: RUNNING
DataSourceType: KAFKA
CurrentTaskNum: 16
JobProperties: {"max_batch_rows":"300000","timezone":"Asia/Shanghai","send_batch_parallelism":"1","load_to_single_tablet":"false","column_separator":"','","line_delimiter":"\n","current_concurrent_number":"16","delete":"","partial_columns":"false","merge_type":"APPEND","exec_mem_limit":"2147483648","strict_mode":"false","jsonpaths":"","max_batch_interval":"5","max_batch_size":"209715200","fuzzy_parse":"false","escape":"0","enclose":"0","partitions":"","columnToColumnExpr":"","whereExpr":"","desired_concurrent_number":"256","precedingFilter":"","format":"csv","max_error_number":"0","max_filter_ratio":"1.0","json_root":"","strip_outer_array":"false","num_as_string":"false"}
DataSourceProperties: {"topic":"test-release-mow-topic-persistent","currentKafkaPartitions":"0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15","brokerList":"172.20.48.94:9092"}
CustomProperties: {"group.id":"test-consumer-group","kafka_default_offsets":"OFFSET_BEGINNING","client.id":"test-client-id"}
Statistic:

{"receivedBytes":2354664774,"runningTxns":[429816543801345,429918945479680,429959949536256,429837189331968,429918980583432,429888224712705,429837030172683,429898466490368,429888240318465,429959934614528,429853033613312,429816559634432,429844701694976,429908791357441,429826474905600,429908706769920,429959919939585,429826798930944,429888254461953,429908722637824,429856848327680,429908772813824,429816574341120,429846679480320,429826783170560,429929185043456,429908737931264,429908754749440,429898495263744,429888269763584,429837131176960,429918961759242,429860108133376,429849127204864,429837081730048,429959905823745],"errorRows":0,"committedTaskNum":140,"loadedRows":12320899,"loadRowsRate":2204,"abortedTaskNum":120,"errorRowsAfterResumed":0,"totalRows":12320899,"unselectedRows":0,"receivedBytesRate":421293,"taskExecuteTimeMs":5589130}

Progress: {"0":"781664","1":"786801","2":"628823","3":"871308","4":"780769","5":"839093","6":"613342","7":"708207","8":"783686","9":"692121","10":"783018","11":"853209","12":"732847","13":"845036","14":"848547","15":"772412"}
Lag: {"0":10282,"1":16796,"2":256,"3":12460,"4":15437,"5":11465,"6":1346,"7":11647,"8":12285,"9":15834,"10":12916,"11":12750,"12":13712,"13":16845,"14":11884,"15":9546}
ReasonOfStateChanged:
ErrorLogUrls:
OtherMsg: 2025-01-27 11:41:50:[INTERNAL_ERROR]TStatus: Cannot invoke "org.apache.doris.transaction.TransactionState.getTransactionId()" because "txnState" is null
 
0# doris::Status doris::Status::create<true>(doris::TStatus const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187
1# doris::StreamLoadExecutor::commit_txn(doris::StreamLoadContext*) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/common/status.h:511
2# doris::CloudStreamLoadExecutor::commit_txn(doris::StreamLoadContext*) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/common/status.h:390
3# doris::RoutineLoadTaskExecutor::exec_task(std::shared_ptr<doris::StreamLoadContext>, doris::DataConsumerPool*, std::function<void (std::shared_ptr<doris::StreamLoadContext>)>) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/common/status.h:511
4# std::_Function_handler<void (), std::_Bind_result<void, void (doris::RoutineLoadTaskExecutor::(doris::RoutineLoadTaskExecutor, std::shared_ptr<doris::StreamLoadContext>, doris::DataConsumerPool*, doris::RoutineLoadTaskExecutor::submit_task(doris::TRoutineLoadTask const&)::$_0))(std::shared_ptr<doris::StreamLoadContext>, doris::DataConsumerPool*, std::function<void (std::shared_ptr<doris::StreamLoadContext>)>)> >::_M_invoke(std::_Any_data const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:244
5# doris::ThreadPool::dispatch_thread() at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/util/threadpool.cpp:0
6# doris::Thread::supervise_thread(void*) at /var/local/ldb-toolchain/bin/../usr/include/pthread.h:562
7# ?
8# ?
 
User: root
Comment:
1 row in set (0.01 sec)
```

The cause is that `afterVisible` calls `txnState.getTransactionId()`
while `txnState` is null. A null check before calling
`txnState.getTransactionId()` would avoid the exception, but it is more
reasonable to skip `afterVisible` entirely when committing the
transaction fails in cloud mode.
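
As a rough illustration of the control flow described above, here is a minimal Java sketch. It is not the Doris implementation: `TxnStateSketch`, `CommitGuardSketch`, `commit`, and the `committed` flag are all hypothetical names invented for this example. It only shows the general idea of running `afterVisible` when the commit succeeded, so the callback never dereferences a null state and the original commit error reaches the caller unchanged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a transaction state object; not the Doris class.
class TxnStateSketch {
    private final long transactionId;

    TxnStateSketch(long transactionId) {
        this.transactionId = transactionId;
    }

    long getTransactionId() {
        return transactionId;
    }
}

// Hypothetical commit path that records which callbacks ran.
class CommitGuardSketch {
    final List<String> events = new ArrayList<>();

    // Simulated commit: throws on failure, so no state object is produced.
    TxnStateSketch commit(long txnId, boolean shouldFail) {
        if (shouldFail) {
            throw new IllegalStateException("commit failed for txn " + txnId);
        }
        return new TxnStateSketch(txnId);
    }

    // Dereferences txnState, so it must never be called with null.
    void afterVisible(TxnStateSketch txnState) {
        events.add("afterVisible:" + txnState.getTransactionId());
    }

    void commitTransaction(long txnId, boolean shouldFail) {
        TxnStateSketch txnState = null;
        boolean committed = false;
        try {
            txnState = commit(txnId, shouldFail);
            committed = true;
        } finally {
            if (committed) {
                afterVisible(txnState);
            } else {
                // Skip the callback; the commit error propagates unchanged.
                events.add("afterVisible skipped");
            }
        }
    }

    public static void main(String[] args) {
        CommitGuardSketch ok = new CommitGuardSketch();
        ok.commitTransaction(42L, false);
        System.out.println(ok.events);

        CommitGuardSketch failed = new CommitGuardSketch();
        try {
            failed.commitTransaction(43L, true);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage() + " -> " + failed.events);
        }
    }
}
```

With a guard like this, a failed commit surfaces its own error message rather than a secondary null pointer error raised inside the callback.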

### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
    - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->
@github-actions github-actions bot requested a review from dataroaring as a code owner March 6, 2025 11:03
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring dataroaring closed this Mar 6, 2025
@dataroaring dataroaring reopened this Mar 6, 2025
@hello-stephen
Contributor

run buildall

@doris-robot
TPC-H: Total hot run time: 39939 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e5baf42c7a70ac123f4cbff46ebaa1dc691adf45, data reload: false

------ Round 1 ----------------------------------
q1	17582	6723	6595	6595
q2	2082	173	156	156
q3	10657	1105	1139	1105
q4	10547	746	733	733
q5	7738	2849	2789	2789
q6	220	135	133	133
q7	970	620	612	612
q8	9360	1932	2062	1932
q9	6581	6445	6380	6380
q10	7005	2241	2260	2241
q11	459	266	263	263
q12	409	224	220	220
q13	17801	2974	3006	2974
q14	225	209	205	205
q15	487	470	460	460
q16	665	582	583	582
q17	964	537	560	537
q18	7257	6561	6643	6561
q19	1394	1126	984	984
q20	470	204	195	195
q21	4064	3287	3326	3287
q22	1106	995	999	995
Total cold run time: 108043 ms
Total hot run time: 39939 ms

----- Round 2, with runtime_filter_mode=off -----
q1	6561	6602	6562	6562
q2	332	243	237	237
q3	2919	2746	2875	2746
q4	2070	1796	1767	1767
q5	5788	5702	5674	5674
q6	204	126	128	126
q7	2234	1852	1841	1841
q8	3348	3566	3498	3498
q9	8955	8808	8899	8808
q10	3586	3514	3534	3514
q11	604	502	498	498
q12	813	593	608	593
q13	9099	3189	3198	3189
q14	302	272	289	272
q15	531	464	478	464
q16	690	647	639	639
q17	1834	1608	1602	1602
q18	8191	7865	7711	7711
q19	1672	1519	1502	1502
q20	2070	1846	1885	1846
q21	5525	5284	5284	5284
q22	1165	1076	1050	1050
Total cold run time: 68493 ms
Total hot run time: 59423 ms
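
The result rows above carry no column headers. Judging only from the numbers, each row appears to hold the query name, one cold run, two hot runs, and the smaller of the two hot runs, and the reported totals match the sums of the first and last value columns. The Java sketch below recomputes totals under that assumed layout. `BenchTotalsSketch` and `totals` are hypothetical names, and the three sample rows are copied from Round 1.

```java
import java.util.List;

// Hypothetical helper; the column layout is inferred from the data, not documented.
class BenchTotalsSketch {
    // Returns {coldTotal, hotTotal} in milliseconds for tab- or space-separated rows.
    static long[] totals(List<String> rows) {
        long cold = 0;
        long hot = 0;
        for (String row : rows) {
            String[] fields = row.trim().split("\\s+");
            cold += Long.parseLong(fields[1]);
            // Hot time per query is taken as the faster of the two hot runs.
            hot += Math.min(Long.parseLong(fields[2]), Long.parseLong(fields[3]));
        }
        return new long[] {cold, hot};
    }

    public static void main(String[] args) {
        List<String> rows = List.of(
                "q1\t17582\t6723\t6595\t6595",
                "q2\t2082\t173\t156\t156",
                "q3\t10657\t1105\t1139\t1105");
        long[] t = totals(rows);
        System.out.println("cold=" + t[0] + " ms, hot=" + t[1] + " ms");
    }
}
```

Feeding all 22 rows of a round through `totals` should give the two totals printed at the end of that round, provided the assumed layout holds.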

@doris-robot
TPC-DS: Total hot run time: 197565 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e5baf42c7a70ac123f4cbff46ebaa1dc691adf45, data reload: false

query1	1288	930	881	881
query2	6226	2110	2163	2110
query3	10965	4425	4483	4425
query4	61228	28584	23670	23670
query5	5177	462	455	455
query6	418	185	171	171
query7	5476	312	307	307
query8	307	224	218	218
query9	8616	2611	2584	2584
query10	465	261	250	250
query11	17756	15199	15798	15199
query12	157	102	103	102
query13	1450	455	424	424
query14	10633	7447	7291	7291
query15	194	181	176	176
query16	7214	480	482	480
query17	1160	601	599	599
query18	1908	330	311	311
query19	210	167	160	160
query20	122	115	114	114
query21	209	104	114	104
query22	4647	4329	4718	4329
query23	34483	33891	33857	33857
query24	6173	2898	2965	2898
query25	554	428	436	428
query26	654	173	176	173
query27	1886	362	362	362
query28	4150	2509	2423	2423
query29	724	491	462	462
query30	252	166	165	165
query31	1000	834	859	834
query32	69	59	61	59
query33	415	286	294	286
query34	958	516	522	516
query35	840	729	745	729
query36	1085	951	966	951
query37	122	67	66	66
query38	4102	3961	4083	3961
query39	1496	1534	1457	1457
query40	204	102	98	98
query41	49	47	48	47
query42	110	99	110	99
query43	531	508	517	508
query44	1178	824	828	824
query45	193	171	171	171
query46	1170	761	736	736
query47	2028	1930	1924	1924
query48	476	401	403	401
query49	741	394	389	389
query50	854	434	432	432
query51	7470	7176	7294	7176
query52	100	91	87	87
query53	267	190	191	190
query54	577	462	474	462
query55	84	83	85	83
query56	265	268	254	254
query57	1282	1160	1154	1154
query58	224	200	220	200
query59	3237	2964	2921	2921
query60	275	251	249	249
query61	131	108	109	108
query62	754	682	678	678
query63	217	187	195	187
query64	1408	661	661	661
query65	3291	3164	3185	3164
query66	703	305	299	299
query67	15830	15510	15473	15473
query68	4235	599	583	583
query69	425	270	273	270
query70	1212	1072	1076	1072
query71	339	263	255	255
query72	6348	4039	4175	4039
query73	758	358	357	357
query74	10290	9186	8921	8921
query75	3391	2643	2668	2643
query76	1966	1011	1098	1011
query77	498	289	276	276
query78	10708	9617	9643	9617
query79	1347	624	618	618
query80	839	435	438	435
query81	517	244	235	235
query82	1259	94	88	88
query83	250	143	146	143
query84	283	80	76	76
query85	896	322	298	298
query86	328	304	294	294
query87	4368	4239	4247	4239
query88	3621	2413	2392	2392
query89	423	299	293	293
query90	2015	187	188	187
query91	185	148	149	148
query92	64	49	52	49
query93	1579	552	552	552
query94	753	295	292	292
query95	364	271	260	260
query96	610	278	284	278
query97	3294	3225	3209	3209
query98	211	208	208	208
query99	1527	1305	1265	1265
Total cold run time: 313341 ms
Total hot run time: 197565 ms

@doris-robot
ClickBench: Total hot run time: 32.24 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit e5baf42c7a70ac123f4cbff46ebaa1dc691adf45, data reload: false

query1	0.03	0.03	0.03
query2	0.07	0.03	0.03
query3	0.23	0.06	0.06
query4	1.64	0.10	0.10
query5	0.53	0.50	0.51
query6	1.13	0.73	0.73
query7	0.02	0.01	0.01
query8	0.04	0.03	0.04
query9	0.56	0.50	0.49
query10	0.55	0.56	0.56
query11	0.14	0.10	0.10
query12	0.14	0.12	0.12
query13	0.60	0.59	0.59
query14	2.76	2.73	2.76
query15	0.89	0.83	0.82
query16	0.42	0.38	0.37
query17	1.07	1.06	0.99
query18	0.23	0.22	0.21
query19	1.98	1.77	1.99
query20	0.01	0.01	0.01
query21	15.35	0.61	0.57
query22	2.29	2.27	1.97
query23	16.87	0.98	0.74
query24	3.62	1.16	1.20
query25	0.17	0.05	0.16
query26	0.60	0.14	0.14
query27	0.04	0.06	0.04
query28	9.85	0.58	0.50
query29	12.63	3.23	3.25
query30	0.25	0.05	0.06
query31	2.85	0.38	0.38
query32	3.26	0.46	0.46
query33	2.99	3.02	2.94
query34	16.94	4.48	4.49
query35	4.57	4.51	4.53
query36	0.67	0.50	0.47
query37	0.10	0.06	0.05
query38	0.04	0.03	0.04
query39	0.04	0.03	0.02
query40	0.16	0.14	0.13
query41	0.08	0.03	0.02
query42	0.04	0.02	0.02
query43	0.04	0.03	0.03
Total cold run time: 106.49 s
Total hot run time: 32.24 s

4 participants